Proposal: SynthDefs could eliminate duplicate UGens

Hello,

Here (3.11.2):

LPF.ar.isKindOf(PureUGen) == true
LPF.superclass.superclass == PureUGen

Is the use of a random number generator the only way for a UGen to be non-deterministic?

Demand UGens also behave differently if shared, i.e.:

// Dseq ; shared dseq, different patterns
{var a = Dseq.new(list: [1,3,2,7,8], repeats: inf)
;var t = Impulse.kr(freq: 5, phase: 0)
;var f = Demand.kr(trig: t, reset: 0, demandUGens: [a,a]) * 30 + 340
;SinOsc.ar(freq: f, phase:  0) * 0.1}

// Dseq ; distinct dseq, equal patterns
{var a = Dseq.new(list: [1,3,2,7,8], repeats: inf)
;var b = Dseq.new(list: [1,3,2,7,8], repeats: inf)
;var t = Impulse.kr(freq: 5, phase: 0)
;var f = Demand.kr(trig: t, reset: 0, demandUGens: [a,b]) * 30 + 340
;SinOsc.ar(freq: f, phase:  0) * 0.1}

Best,
Rohan

P.S. Some lists.

Here (3.11.2) the list of UGens marked Pure is:

A2K APF AllpassC AllpassL AllpassN AmpComp AmpCompA BAllPass BBandPass BBandStop BHiPass BHiShelf BLowPass BLowShelf BPF BPZ2 BPeakEQ BRF BRZ2 COsc CombC CombL CombN Decay Decay2 DegreeToKey Delay1 Delay2 DelayC DelayL DelayN DetectIndex DetectSilence FOS Formant Formlet FreeVerb HPF HPZ1 HPZ2 Impulse Index IndexInBetween IndexL Integrator K2A LFCub LFPar LFPulse LFSaw LFTri LPF LPZ1 LPZ2 Lag Lag2 Lag2UD Lag3 Lag3UD LagUD LeakDC LinExp Median MidEQ MoogFF OnePole OneZero Osc OscN RHPF RLPF Ramp Resonz Ringz SOS Select Shaper SinOsc SinOscFB Slew Slope SyncSaw T2A T2K TwoPole TwoZero VOsc VOsc3 VarLag VarSaw Vibrato WrapIndex

(Vibrato is an error, I think. The rateVariation and depthVariation are noise level controls.)
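
A list like the one above can be generated from the class tree, e.g. (a sketch; note it also picks up abstract intermediate classes such as Filter, which the list above omits):

PureUGen.allSubclasses
.collect { |c| c.name.asString }
.sort
.join(" ")
.postln;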

But it’s less work to annotate things the other way around, i.e. to list the UGens that are not pure, which I have as:

BrownNoise ClipNoise CoinGate Dbrown Dbufrd Dbufwr Dconst Dgeom Dibrown Diwhite Dpoll Drand Dreset Dseq Dser Dseries Dshuf Dstutter Dswitch Dswitch1 Dunique Dust Dust2 Dwhite Dwrand Dxrand ExpRand Gendy1 Gendy2 Gendy3 GrayNoise IRand LFClipNoise LFDClipNoise LFDNoise0 LFDNoise1 LFDNoise3 LFNoise0 LFNoise1 LFNoise2 LinRand LocalBuf NRand PV_BinScramble PV_RandComb PV_RandWipe PinkNoise Rand TExpRand TIRand TRand TWindex Vibrato WhiteNoise

Which then gives the list of un-annotated UGens (i.e. the candidates for Pure) as:

Amplitude Balance2 Ball BeatTrack BeatTrack2 BiPanB2 Blip BlockSize BufAllpassC BufAllpassL BufAllpassN BufChannels BufCombC BufCombL BufCombN BufDelayC BufDelayL BufDelayN BufDur BufFrames BufRateScale BufRd BufSampleRate BufSamples BufWr CheckBadValues Clip Compander CompanderD ControlDur ControlRate Convolution Convolution2 Convolution2L Convolution3 Crackle CuspL CuspN DC DecodeB2 DelTapRd DelTapWr Demand DemandEnvGen DiskIn DiskOut Done Duty EnvGen FBSineC FBSineL FBSineN FFT FSinOsc Fold Free FreeSelf FreeSelfWhenDone FreeVerb2 FreqShift GVerb Gate GbmanL GbmanN GrainBuf GrainFM GrainIn GrainSin Hasher HenonC HenonL HenonN Hilbert IEnvGen IFFT In InFeedback InRange InRect InTrig InfoUGenBase KeyState KeyTrack Klang Klank LFGauss LagIn LastValue Latch LatoocarfianC LatoocarfianL LatoocarfianN LeastChange Limiter LinCongC LinCongL LinCongN LinPan2 LinXFade2 Line Linen LocalIn LocalOut Logistic LorenzL Loudness MFCC MantissaMask ModDif MostChange MouseButton MouseX MouseY NodeID Normalizer NumAudioBuses NumBuffers NumControlBuses NumInputBuses NumOutputBuses NumRunningSynths OffsetOut Onsets Out PSinGrain PV_Add PV_BinShift PV_BinWipe PV_BrickWall PV_ConformalMap PV_Conj PV_Copy PV_CopyPhase PV_Diffuser PV_Div PV_HainsworthFoote PV_JensenAndersen PV_LocalMax PV_MagAbove PV_MagBelow PV_MagClip PV_MagDiv PV_MagFreeze PV_MagMul PV_MagNoise PV_MagShift PV_MagSmear PV_MagSquared PV_Max PV_Min PV_Mul PV_PhaseShift PV_PhaseShift270 PV_PhaseShift90 PV_RectComb PV_RectComb2 Pan2 Pan4 PanAz PanB PanB2 PartConv Pause PauseSelf PauseSelfWhenDone Peak PeakFollower Phasor Pitch PitchShift PlayBuf Pluck Poll Pulse PulseCount PulseDivider QuadC QuadL QuadN RadiansPerSample RandID RandSeed RecordBuf ReplaceOut Rotate2 RunningMax RunningMin RunningSum SampleDur SampleRate Sanitize Saw Schmidt SendTrig SetResetFF SpecCentroid SpecFlatness SpecPcile Spring StandardL StandardN Stepper StereoConvolution2L SubsampleOffset Sum3 Sum4 Sweep TBall TDelay TDuty TGrains Timer ToggleFF Trig Trig1 VDiskIn Warp1 Wrap XFade2 XLine XOut ZeroCrossing MaxLocalBufs MulAdd SetBuf

Ah… I remembered it incorrectly. I looked at PureUGens a few years ago and found a lot of UGens, e.g. Pulse, that AFAICS should be pure but aren’t.

Somehow I thought Filter did not inherit from PureUGen but now I see that was faulty memory.

Notes for the “should-be-pure” list:

  • PV_ units should not be pure because they modify buffer contents.
  • Buf* delays modify buffer contents, as do DiskIn and DiskOut.
  • Chaos and noise units should (probably?) not be pure because they change a random number generator’s state. RandSeed can’t be pure for the same reason.
  • *Out units modify buses.
  • SendReply / SendTrig send OSC.
  • Probably other types of side effects should be considered as well.

IMO PureUGen should be reserved for units that have absolutely no side effects outside of their direct output (and “side effects” should be broadly, liberally considered because it would be very bad if the user expects x and y side effects, but gets only x). Of course UGens in general update their member variables, but member variables are private so that’s not a side effect. Writing to buses or buffers, sending messages, etc are different from that.

For that reason, PureUGen probably is a valid check – even if two “side-effect-y” UGens are behaviorally deterministic and have the same input, if there are two of them, the user expects two side effects, so we can’t drop one.

hjh

Hello,

I think the Chaos UGens can be considered similarly to SinOsc; they’re just ordinary equations over their parameters?

Also, I mentioned Pure because it’s nice that it already exists!

But also, it’s not very clear what it means.

The help for Pure says it’s for both “common subexpression elimination” and “dead code elimination”.

In:

{SinOsc.ar(freq: 440, phase: 0) * 0.1
;SinOsc.ar(freq: 220, phase: 0) * 0.1}

the latter happens (very quietly!).
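
To confirm, wrap it in a SynthDef with an explicit Out (as play would add) and print the compiled graph; dumpUGens shows the unused 440 Hz branch has been removed:

SynthDef(\dce, {
	SinOsc.ar(freq: 440, phase: 0) * 0.1;
	Out.ar(0, SinOsc.ar(freq: 220, phase: 0) * 0.1);
}).dumpUGens;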

In:

{var a = SinOsc.ar(freq: 220, phase: 0)
;var b = SinOsc.ar(freq: 220, phase: 0)
;(a + b)* 0.05}

the former doesn’t.
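
Checking the same way, both 220 Hz SinOsc units survive in the compiled graph, i.e. the common subexpression is not eliminated:

SynthDef(\cse, {
	var a = SinOsc.ar(freq: 220, phase: 0);
	var b = SinOsc.ar(freq: 220, phase: 0);
	Out.ar(0, (a + b) * 0.05);
}).dumpUGens;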

In:

{var a = SinOsc.ar(freq: 220, phase: 0) * 0.1
;ReplaceOut.ar(bus: 0, channelsArray: a)
;ReplaceOut.ar(bus: 0, channelsArray: a)}

the latter doesn’t (although the “effect” is “dead”?)

As an aside, I’d be curious about interesting graphs that have multiple “buffer-side-effecting” UGens with equal parameters, if anyone reading has examples?

Where “interesting” is meant to exclude things like:

{var a = SinOsc.ar(freq: 220, phase: 0) * 0.05
;Out.ar(bus: 0, channelsArray: a)
;Out.ar(bus: 0, channelsArray: a)}

Best,
Rohan

AFAIK there is no common subexpression elimination logic currently, so the helpfile is … uhm… optimistic :laughing:

I tend to think that a better starting place is not to drop any side effects, even if (as in ReplaceOut) one side effect renders the other moot.

And yes, chaos units without RNGs could be pure. (Should be all of them.)

hjh

… wrote the code in the back of a tour bus as the battery ran out on my laptop and just wanted to share something I was working on with another user.

This also underscores: SC is a livecoding tool. A lot of times the goal really is to get something out as soon as possible, because people on the dance floor want to hear something new. Often there’s no time for elegance or code review.

I think it’s more than just new/inexperienced users who’d benefit from this, and I love the idea of at least getting some metrics on how many real-world SynthDefs would benefit from CSE.

A very very quick first step (procrastinating… but really just about 30 minutes).

+ SynthDef {
	checkCSE {
		// two-level lookup: class name (or operator) -> [rate, inputs...] -> unit
		var units = IdentityDictionary.new;
		var duplicates;
		var lookup, key;
		children.do { |unit|
			if(unit.isKindOf(PureUGen)) {
				// include the rate in the key, so that e.g. SinOsc.ar and
				// SinOsc.kr with equal inputs don't count as duplicates
				key = [unit.rate] ++ unit.inputs;
				lookup = units[unit.lookupKey];
				if(lookup.isNil) {
					// first unit of this class: start a new table
					lookup = Dictionary.new
					.put(key, unit);
					units.put(unit.lookupKey, lookup);
				} {
					if(lookup[key].notNil) {
						// same class, same rate, same inputs: a common subexpression
						duplicates = duplicates.add(unit);
					} {
						lookup.put(key, unit);
					};
				}
			}
		};
		duplicates.do { |unit|
			"SynthDef '%' duplicate unit: %(%)"
			.format(name, unit.dumpName,
				unit.inputs.collect { |in|
					if(in.respondsTo(\dumpName)) { in.dumpName } { in }
				}.join(", ")
			)
			.postln;
		};
		^duplicates
	}
}

+ UGen {
	lookupKey { ^this.class.name }
}

+ BasicOpUGen {
	// unary/binary math ops share a class per arity; key on the operator instead
	lookupKey { ^operator }
}

Here, I found a catch: BasicOpUGen is not pure! That’s possibly because some of the operators use random number generators. So the test will have to be changed.
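
Eventually the test could perhaps be changed along these lines (an untested sketch; randomOps is a hand-maintained and surely incomplete list of the random operators):

+ SynthDef {
	isPureForCSE { |unit|
		var randomOps = #[\rand, \rand2, \linrand, \bilinrand, \sum3rand, \coin, \exprand];
		if(unit.isKindOf(BasicOpUGen)) {
			^randomOps.includes(unit.operator).not
		};
		^unit.isKindOf(PureUGen)
	}
}

with checkCSE testing this.isPureForCSE(unit) instead of unit.isKindOf(PureUGen).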

But, just for the sake of a proof of concept, change it to BasicOpUGen : PureUGen. Then:

(
SynthDef(\test, {
	var a = SinOsc.ar(freq: 220, phase: 0)
	;var b = SinOsc.ar(freq: 220, phase: 0)
	;var c = a + b
	;Out.ar(0, (a + b) * 0.05)
	;Out.ar(100, c)
}).checkCSE;
)

SynthDef 'test' duplicate unit: 1_SinOsc(220, 0)
SynthDef 'test' duplicate unit: 4_+(0_SinOsc, 1_SinOsc)

That’s not going as far as eliminating the common subexpressions, but it does find them (even with UGens as inputs).
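
The elimination step itself might look roughly like this (an untested sketch; real code would also need to handle multi-output units, re-sort synth indices and so on):

+ SynthDef {
	removeDuplicate { |dup, original|
		children.do { |unit|
			// repoint any input that references the duplicate
			unit.inputs.do { |in, i|
				if(in === dup) { unit.inputs[i] = original };
			};
		};
		children.remove(dup);
	}
}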

How costly is it? This is my worst case, of a large number of same-class units with different inputs.

(
f = { |n = 100|
	bench {
		d = SynthDef(\additive, {
			var oscs = Array.fill(n, { SinOsc.ar(ExpRand(200, 300)) });
			Out.ar(0, (oscs.sum * 0.001).dup)
		});
	};
	
	bench { d.checkCSE };
};
)

f.(200);
time to run: 0.12884541599988 seconds.
time to run: 0.0033576989999347 seconds.

f.(500);
time to run: 0.56893285199999 seconds.
time to run: 0.0082966729999043 seconds.

f.(1000);
time to run: 2.130532053 seconds.
time to run: 0.014083072999938 seconds.

So… the hash lookup absolutely smokes it (appears to be more or less O(n)).

The final elimination stage would add some cost, but this does suggest that concerns about the performance of a CSE pass were overblown.

hjh


I optimized an audio processing graph using a few common techniques in an experimental ‘toy’ project. The UGens were not compiled into a synthdef beforehand, though; I work with the graph data types directly, all in C++. The optimizations (nothing new) are topological sorting, constant-node preprocessing, buffer pooling, and eliminating redundant buffers (the last two are ways to recycle buffers). That’s all. (CSE would not improve much, I think, since unordered_map already helps with node-to-buffer mappings, but that’s something else to try.)

SC probably implements all those things, but I’m unfamiliar with the details.

The result was a 3.5-fold increase in performance in some random examples, even with simple graphs. It doesn’t take complex graphs (one simple one went from 42us to 12us). I suspect (experts can tell me better) that the compiler applied some SIMD optimizations by itself, since the gain was a bit too large. I also use a SIMD library but haven’t benchmarked it yet. Even basic optimizations can make a big difference, it seems.

Btw I’ve been using this in production for several months. It works.

This approach doesn’t use a CSE pass – it keeps UGens in a cache according to UGen class, rate and arguments. Only pure math ops are cached in that version but it could be extended to other pure UGens.
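
In outline, the cache is something like this (a minimal sketch of the idea, not the actual branch code):

UGenCache {
	classvar <>all;
	*initClass { all = Dictionary.new }
	*clear { all = Dictionary.new }	// called at the start of each SynthDef build
	*at { |ugenClass, rate, inputs| ^all[[ugenClass.name, rate, inputs]] }
	*put { |ugen| all[[ugen.class.name, ugen.rate, ugen.inputs]] = ugen }
}

A pure UGen’s construction then consults *at first and reuses a hit instead of registering a new unit.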

hjh


@jamshark70 don’t suppose you have any benchmarks, or anecdotal opinions about the performance?

Not really. It probably slows down SynthDef building slightly, and the redundant ops that it eliminates are already pretty fast.

The benefit could be increased by marking more UGens as pure.

The SynthDef-building speed penalty could be reduced by making it possible to determine whether a UGen is pure or not without creating the UGen instance. That’s straightforward for most UGens, but not so easy for unary or binary math ops. Currently it creates the instance, and if pure, checks for a cache match, and if there’s a hit, discards the new instance. That could be streamlined. (This is one reason why I called it a proof-of-concept – it’s enough to prove the concept but not ready to merge.)
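
For the straightforward case, a class-side check could be as simple as (hypothetical method, not in the class library):

+ UGen {
	*isPureClass {
		// walk the superclass chain without instantiating anything
		var c = this;
		while { c.notNil } {
			if(c === PureUGen) { ^true };
			c = c.superclass;
		};
		^false
	}
}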

hjh

I ask because I was curious whether this could have a negative effect due to CPU cache behaviour.

If you had r1 = a + b + c; r2 = a + b + d it could be faster to not pull out the a + b if it avoids a cache miss. I wonder if the vectorised math ops might suffer a performance loss from this proposal?

I’m not referring to CPU cache at all.

Without the UGenCache, r1 will end up as a Sum3 with a, b and c as inputs, and r2 as a different Sum3 with a, b and d as inputs.

With UGenCache (this is without testing – just guessing and writing on my phone), a + b would be done once on its own, and shared between r1 = that + c and r2 = that + d.

I don’t know which is more efficient. The pair of Sum3 units does 4 additions (one redundant) but only 2 calc function dispatches. The other way does one less addition but one more function dispatch. (But this is only an issue because the Sum3 optimization exists – I guess that conflicting optimizations could be hard to resolve.)
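
To see the Sum3 optimization concretely, dumpUGens shows a single Sum3 unit where one might expect two BinaryOpUGens:

SynthDef(\sum3demo, {
	var a = SinOsc.ar(100), b = SinOsc.ar(200), c = SinOsc.ar(300);
	Out.ar(0, a + b + c);
}).dumpUGens;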

Tbh this is such a fine level of performance analysis that I’d guess the vast majority of cases would be indistinguishable in terms of speed.

hjh


We should never forget that performance is opaque in most languages, not only C++. Haskell can be hard to optimize without going down into the intermediate language. Also, C++ optimizations such as SIMD can behave differently from hardware to hardware, and change over time as hardware changes.

And SC’s library is not maintained anymore; I am just saying…

Google has a good framework for benchmarking C++ (Google Benchmark), so you can compare your changes at nanosecond resolution. Haskell has a popular library, too; I use them when that’s the goal. For example, I measured the experiments I described in this discussion. For real-time audio, you need to know not only that average performance is good; predictable performance is as important, or even more so.

Without those things, it would be just guessing, which is problematic. Even then, you might not know precisely what happened.

Some things, like topological sorting, are uncontroversial because the order of computation is transparent to us. Other things, not so much.

Example: LLVM, with all the hype on optimization, delivered a terrible performance in many, many compilers: Don't Bother With LLVM for Hobby Compilers

So, let’s be careful with optimization strategies until we have tools everyone can use to measure them; let’s make this a habit. The risk is that we stop looking at our data (our material) and start building models in our heads.

I’m just saying: back to basics, and work on tools specific to SC.

@jamshark70, your changes are not aggressive and seem fine to me. I’m talking in general terms here.

Just an idea.

So I made some rough benchmarks to see what would happen and it seems like I’ve been able to answer my own question.

TLDR: Doing fewer math ops is better than doing more local ops. It seems that reducing the number of function calls should be the priority.

I think this result would suggest that Sum3 is going to be faster.


I tested “duplicate”:

a + b + c;
a + b + d;

against “no duplicate”:

r = a + b;
r + c;
r + d;

using the following Google Benchmark harness:

#include <span>
#include <vector>
#include <cstdint>
#include <cstddef>
#include <iostream>
#include <benchmark/benchmark.h>  // Google Benchmark

static constexpr size_t N = 1024;

using signal_span = std::span<float, N>;
using Fn = void(*) (signal_span lhs, signal_span rhs, signal_span output);

void add(signal_span lhs, signal_span rhs, signal_span output) {
  for(size_t i{0}; i < N; ++i)
    output[i] = lhs[i] + rhs[i];
}

void mul(signal_span lhs, signal_span rhs, signal_span output) {
  for(size_t i{0}; i < N; ++i)
    output[i] = lhs[i] * rhs[i];
}

struct Instruction {
  Fn func;
  // int32_t rather than int16_t: the offsets below go up to 100'000,
  // which would silently overflow int16_t
  int32_t lhs;
  int32_t rhs;
  int32_t output;
};

static void duplicate(benchmark::State& state) {
  // static: ~4 MB would risk overflowing the stack on some platforms
  static float signal_store[1'000'000];

  // four adds; the offsets are arbitrary, what matters is the op count
  std::vector<Instruction> instr;
  instr.emplace_back(add, 0, N, 2 * N); // a + b
  instr.emplace_back(add, 0, N, 2 * N); // a + b again (the duplicate)

  instr.emplace_back(add, 1000 + N, 1000 + N*2, 1000 + N*3); // (a + b) + c
  instr.emplace_back(add, 10000 + N, 10000 + N*2, 10000 + N*3); // (a + b) + d

  for (auto _ : state) {
    for(auto& ins: instr){
      ins.func(
        std::span<float, N>(signal_store + ins.lhs, N), 
        std::span<float, N>(signal_store + ins.rhs, N), 
        std::span<float, N>(signal_store + ins.output, N)
      );
    }

    benchmark::DoNotOptimize(instr);
  }
  // read the results back so the adds can't be optimized away entirely
  float ff = 0.0;
  for(auto& f : signal_store)
    ff += f;
  std::cout << ff << std::endl;
}
BENCHMARK(duplicate);

static void no_duplicate(benchmark::State& state) {
  static float signal_store[1'000'000];

  // three adds: a + b is computed once and reused
  std::vector<Instruction> instr;
  instr.emplace_back(add, 0, N, N*2); // r = a + b

  instr.emplace_back(add, N*2, 1000, 1000 + N); // r + c
  instr.emplace_back(add, N*2, 100'000, 100'000 + N); // r + d; assume big jump between input and output

  for (auto _ : state) {
    for(auto& ins: instr){
      ins.func(
        std::span<float, N>(signal_store + ins.lhs, N), 
        std::span<float, N>(signal_store + ins.rhs, N), 
        std::span<float, N>(signal_store + ins.output, N)
      );
    }

    benchmark::DoNotOptimize(instr);
  }
  float ff = 0.0;
  for(auto& f : signal_store)
    ff += f;
  std::cout << ff << std::endl;
}
BENCHMARK(no_duplicate);

BENCHMARK_MAIN();

Some things I learnt:

If the input and output pointers alias, things are much slower; I guess the compiler adds a runtime check for this.

Block size has no measurable effect.

Is the “no_duplicate” method approximately 25% faster? That’s it?

(something like this to measure sc-specific code would be very good)

For a large block size, yes. For a small block size, the difference decreases.

Even at a block size of 1, despite the non-locality of the data here, it’s still faster. It seems the number of function calls is what should be avoided (assuming the same implementation).

The assembly profile shows:

duplicate:

  • 25.64% on moves, copying float values between memory locations; a lot of time moving data around rather than computing.

no_duplicate:

  • 54.05% on addss (the actual additions)
  • 14.86% on movss (much less time moving data around in memory)

This is not surprising. The “process” function calls have a roughly constant overhead. If the block size is large, the actual DSP outweighs the function call overhead; if the block size is very small, the function call overhead becomes more and more prominent. Once the block size drops below 8 samples, things get significantly worse because we can’t use the full width of the SIMD registers.

Note that the “process” function call overhead also includes any prolog/epilog code (e.g. saving member variables into local variables and vice versa).
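
As a toy cost model (made-up numbers, just to illustrate the shape): the per-block cost of one unit is roughly callOverhead + (blockSize * perSampleCost), so the overhead fraction grows as the block shrinks:

(
var callOverhead = 20, perSampleCost = 1;	// arbitrary units
[64, 8, 1].do { |blockSize|
	var total = callOverhead + (blockSize * perSampleCost);
	"block size %: call overhead is ~% percent of the cost".format(
		blockSize, (callOverhead / total * 100).round(0.1)
	).postln;
};
)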

@smoge I think you’re only counting the actual DSP code. If you include the constant factors, as explained above, you will get a different picture.


Another comment: always take microbenchmarks with a big grain of salt. If you only test with a handful of Ugens, you might not get the full picture. In particular, negative cache effects only appear once there’s enough data to fill up the L1 cache.

@jordan It seems like you construct only a single “Synth” (with 3 or 4 “Ugens”, respectively) and then process the same Synth over and over again. Try the same with more Synth instances and compare the results!


In practice, you need to test with three independent variables:

  1. duplicate vs. no_duplicate
  2. block size
  3. number of Synths/Ugens
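
A sketch of how that grid could be measured on the SC server itself (hedged: s.avgCPU is a coarse, averaged measure, and changing the block size requires rebooting the server):

(
var numSynths = 200;	// vary: number of Synths
s.options.blockSize = 64;	// vary: 64, 8, 1
s.waitForBoot {
	// assumes a \test SynthDef, built with vs. without duplicates, has been added
	numSynths.do { Synth(\test) };
	SystemClock.sched(5, { "avg CPU: %".format(s.avgCPU.round(0.1)).postln; nil });
};
)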