Proposal: Refactor synthdef ugen optimizations

:joy: Yeah sure, 13 years ago! 100% fresh in my mind…

Yes, I’ll try to look in more depth (no pun intended!) later, but as I recall the sort goes depth first by default, I believe to allow wirebufs to be reused. But any UGen which has a side effect (e.g. writing to a buffer) can make a depth first sort invalid as the side effect may happen in the wrong order, i.e. antecedence may not fully capture dependency where there’s parallelism. So IIRC WidthFirstUGens are supposed to be added width first to avoid such issues.

Various solutions were tried at the time, and for reasons that I don’t immediately recall, this one stuck. Other implementations may also work of course, but as I said, tread carefully. Make sure if it produces a different outcome you can explain why, and why it’s better or at least doesn’t matter.

Tbh, while I’m no expert on this sort of thing, given that the original sort scheme is decades old now, I’d be curious to see some benchmarking to model the extent to which the optimisations still offer valuable performance advantages. I’m not saying get rid of them, but they do add a lot of complexity. If you’re refactoring it might be worth looking at it all from the ground up.

Thanks!

What do you mean by parallelism here?

Because that should be perfectly expressible with the antecedents and a topological order.

My plan is for everything that isn’t pure to add the previous non-pure thing to its antecedents; this way, execution order follows the order it was written in the SuperCollider code, unless it’s pure.
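Roughly like this (just a sketch – isPure and addAntecedent are hypothetical stand-ins, not existing classlib methods; the point is only the chaining of impure UGens):

(
~lastImpure = nil;
~registerUGen = { |ugen|
	if(ugen.isPure.not) {
		if(~lastImpure.notNil) { ugen.addAntecedent(~lastImpure) };
		~lastImpure = ugen;
	};
};
)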

One type of case where simple antecedence may not handle all the requirements is here: Classlib: Invert the SynthDef topological sort (to fix a wide-graph case) by jamshark70 · Pull Request #5811 · supercollider/supercollider · GitHub

I don’t recall whether this specific case uses widthFirstAntecedents but it is a case where the current topo sort gets it right basically by luck.

hjh

Ah, thanks @jamshark70, I didn’t think of random state seeding.

Topological sort can preserve the correct behaviour though. We’d have to track ugens that touch buffers and random state.
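For concreteness, this is the kind of thing that would need an explicit ordering edge (minimal sketch; RandSeed writes the synth’s random state but produces no signal that the noise UGen reads):

(
SynthDef(\seeded, { |out|
	RandSeed.kr(Impulse.kr(0), 1956);       // side effect: reset the random state
	Out.ar(out, WhiteNoise.ar(0.05) ! 2);   // should be affected by the seed above
}).add.dumpUGens;
)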

Can anyone think of anything else that has state in a synth?

Yes! I think the primary reason for the sort has been multi-channel expansion, because it naturally adds the UGens in width-first order. This can lead to excessive wire buffer usage. In fact, you can easily run out of wire buffers altogether.

Consider the following snippet:

(
~numOsc = 10;

SynthDef(\test, { |out, amp = 1.0|
	var sig = SinOsc.ar((1..~numOsc) + 440);
	var amps = { exprand(0.1, 1.0) } ! ~numOsc;
	sig = sig * amps;
	Out.ar(out, Mix.ar(sig) / ~numOsc * amp);
}).add.dumpUGens;
)

This is the output without topological sort:

test
[ 0_Control, control, nil ]
[ 1_SinOsc, audio, [ 441, 0.0 ] ]
[ 2_SinOsc, audio, [ 442, 0.0 ] ]
[ 3_SinOsc, audio, [ 443, 0.0 ] ]
[ 4_SinOsc, audio, [ 444, 0.0 ] ]
[ 5_SinOsc, audio, [ 445, 0.0 ] ]
[ 6_SinOsc, audio, [ 446, 0.0 ] ]
[ 7_SinOsc, audio, [ 447, 0.0 ] ]
[ 8_SinOsc, audio, [ 448, 0.0 ] ]
[ 9_SinOsc, audio, [ 449, 0.0 ] ]
[ 10_SinOsc, audio, [ 450, 0.0 ] ]
[ 11_*, audio, [ 1_SinOsc, 0.14781964574475 ] ]
[ 12_*, audio, [ 2_SinOsc, 0.61188809040249 ] ]
[ 13_*, audio, [ 3_SinOsc, 0.87976904193435 ] ]
[ 14_*, audio, [ 4_SinOsc, 0.49579341485918 ] ]
[ 15_*, audio, [ 5_SinOsc, 0.14011180115573 ] ]
[ 16_*, audio, [ 6_SinOsc, 0.61413128730507 ] ]
[ 17_*, audio, [ 7_SinOsc, 0.4674508871795 ] ]
[ 18_*, audio, [ 8_SinOsc, 0.3693163891174 ] ]
[ 19_*, audio, [ 10_SinOsc, 0.16267632509419 ] ]
[ 20_Sum4, audio, [ 14_*, 13_*, 12_*, 11_* ] ]
[ 21_Sum4, audio, [ 18_*, 17_*, 16_*, 15_* ] ]
[ 22_MulAdd, audio, [ 9_SinOsc, 0.23164356857488, 19_* ] ]
[ 23_Sum3, audio, [ 22_MulAdd, 21_Sum4, 20_Sum4 ] ]
[ 24_/, audio, [ 23_Sum3, 10 ] ]
[ 25_*, audio, [ 24_/, 0_Control[1] ] ]
[ 26_Out, audio, [ 0_Control[0], 25_* ] ]

This needs as many wire buffers as SinOscs! With ~numOsc = 100 you already exceed the default number of wire buffers (64).
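(For reference, that limit is a server option, so it can be raised if you do hit it:)

// the default of 64 wire buffers is configurable before booting
s.options.numWireBufs = 256;
s.reboot;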

And this is the output with topological sort:

test
[ 0_Control, control, nil ]
[ 1_SinOsc, audio, [ 441, 0.0 ] ]
[ 2_*, audio, [ 1_SinOsc, 0.21350452668567 ] ]
[ 3_SinOsc, audio, [ 442, 0.0 ] ]
[ 4_*, audio, [ 3_SinOsc, 0.14639397071191 ] ]
[ 5_SinOsc, audio, [ 443, 0.0 ] ]
[ 6_*, audio, [ 5_SinOsc, 0.43959053276532 ] ]
[ 7_SinOsc, audio, [ 444, 0.0 ] ]
[ 8_*, audio, [ 7_SinOsc, 0.30976199993617 ] ]
[ 9_Sum4, audio, [ 8_*, 6_*, 4_*, 2_* ] ]
[ 10_SinOsc, audio, [ 445, 0.0 ] ]
[ 11_*, audio, [ 10_SinOsc, 0.53345690432921 ] ]
[ 12_SinOsc, audio, [ 446, 0.0 ] ]
[ 13_*, audio, [ 12_SinOsc, 0.2133492885536 ] ]
[ 14_SinOsc, audio, [ 447, 0.0 ] ]
[ 15_*, audio, [ 14_SinOsc, 0.7878841337054 ] ]
[ 16_SinOsc, audio, [ 448, 0.0 ] ]
[ 17_*, audio, [ 16_SinOsc, 0.10154009291624 ] ]
[ 18_Sum4, audio, [ 17_*, 15_*, 13_*, 11_* ] ]
[ 19_SinOsc, audio, [ 449, 0.0 ] ]
[ 20_SinOsc, audio, [ 450, 0.0 ] ]
[ 21_*, audio, [ 20_SinOsc, 0.22857845276941 ] ]
[ 22_MulAdd, audio, [ 19_SinOsc, 0.64841385504448, 21_* ] ]
[ 23_Sum3, audio, [ 22_MulAdd, 18_Sum4, 9_Sum4 ] ]
[ 24_/, audio, [ 23_Sum3, 10 ] ]
[ 25_*, audio, [ 24_/, 0_Control[1] ] ]
[ 26_Out, audio, [ 0_Control[0], 25_* ] ]

I think this only needs 4 + ceil(~numOsc / 4) wire buffers: 4 buffers for the Sum4 inputs and 1 buffer for every Sum4 result. (Someone please do the math!)

For 10 SinOscs this is only 10 vs 7, but for 100 SinOscs it is 100 vs 29.
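A quick sanity check of those numbers (assuming the 4 + ceil(~numOsc / 4) formula above holds):

// estimated wire buffers for the sorted graph, per the formula above
[10, 100, 1000].collect { |n| n -> (4 + (n / 4).ceil.asInteger) }
// -> [ (10 -> 7), (100 -> 29), (1000 -> 254) ]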

Note that this could be reduced to only 5 wire buffers – for all values of ~numOsc! – if every Sum4 instance immediately summed into the final output, instead of summing all Sum4 outputs at the very end. On the downside, the final sum would be done by individual + UGens instead of Sum4 UGens. Regarding the Sum4 optimization, there seems to be an implicit tradeoff between the number of UGens and the number of required wire buffers.

I did some rough benchmarks and I could hardly see any notable performance difference between the sorted and unsorted graph, even with very large numbers of SinOscs (up to 1000). Maybe someone can come up with an example where there is a significant difference?

Note: For now I only tested with scsynth which has a single array of wire buffers that is used by all Synths. In Supernova, however, every Synth has its own wire buffers (for thread-safety reasons), so the larger number of wire buffers could add up and harm overall performance.

(For testing purposes I’ve added class members to SynthDef which allow me to easily enable/disable certain optimization passes, so I can compare the results. For example, with SynthDef.useTopologicalSort I can enable/disable topological sorting.)

Sorry, poor choice of words perhaps. I mean cases where the ordering needs to reflect parallel streams or branches of UGens, i.e. where depth first sorts will fail.

Probably useful to start here:

WidthFirstUGen.dumpClassSubtree;
WidthFirstUGen
[
  PV_ChainUGen
  [
    PV_MagShift
    PV_MagDiv
    PV_JensenAndersen
    PV_MagFreeze
    PV_RandWipe
    FFT
    PV_MagAbove
      [ PV_MagClip PV_LocalMax PV_MagBelow ]
    PV_Diffuser
    PV_HainsworthFoote
    PV_Decorrelate
    PV_RectComb
    PV_BinShift
    PV_BinScramble
    PV_BrickWall
    FFTTrigger
    PV_BinWipe
    PV_RandComb
    PV_MagSquared
      [ PV_PhaseShift270 PV_MagNoise PV_Conj PV_PhaseShift90 ]
    PV_MagSmear
    PackFFT
    PV_RectComb2
    PV_PhaseShift
    PV_MagMul
    [
      PV_Max
      PV_Mul
      PV_Min
      PV_CopyPhase
      PV_Div
      PV_Copy
      PV_Add
    ]
    PV_ConformalMap
  ]
  LocalBuf
  RandSeed
  SetBuf
  ClearBuf
  RandID
  IFFT
]

That was the intent with WidthFirstUGen, IIRC.

That thread perhaps contains some useful discussion from when this came up a couple of years ago.

I posted a couple of test cases here: Classlib: Invert the SynthDef topological sort (to fix a wide-graph case) by jamshark70 · Pull Request #5811 · supercollider/supercollider · GitHub

I have this weird feeling that was tried and there was still some issue (there was a lot of discussion about this), but sounds good!

Again, it would be nice to have a clear spec that can at least be tested against. I had a go in the linked GH thread, but it was never confirmed:

AFAIK, this has never been explicitly documented (!) but here’s a rough go:

  1. Any UGens that have a side-effect (e.g. modifying a buffer, sending a message) need to occur in the chain before any UGens that come later in the user-specified order, to ensure that side-effects occur in the right order.
  2. Where possible, reorder depth first to allow wireBufs to be reused.
  3. Where appropriate, use code-elimination strategies to remove unused UGens, unnecessary MulAdds etc.

NB 1. is crucial and top priority, while 2. and 3. are nice-to-haves. If unsure, we should implement with that in mind.
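A concrete example of point 1 (just a sketch; assumes buf points at an allocated buffer): BufWr writes into the buffer that BufRd reads in the same block, and there is no data-flow edge from BufWr to BufRd, so only the side-effect ordering keeps the read after the write. Reversed, the read would see stale data instead of the samples just written.

(
SynthDef(\writeThenRead, { |out, buf|
	var phase = Phasor.ar(0, 1, 0, BufFrames.kr(buf));
	BufWr.ar(SinOsc.ar(440) * 0.1, buf, phase);   // side effect: write to the buffer
	Out.ar(out, BufRd.ar(1, buf, phase));         // should read what was just written
}).add.dumpUGens;
)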

Is that spec complete and accurate? Is it worded correctly and precisely?

Okay, so there seem to be 4 cases that create additional edges in the graph. These are properties of nodes (UGens).

  1. Bus access.
    This involves creating an edge between the last access and the next. We could avoid creating edges between consecutive groups of Outs and Ins as the order doesn’t matter, although it does with ReplaceOut, which makes this a little complicated (see the sketch after this list).

  2. Buffer access.
    Same as before, but for groups of reads and writes.

  3. Random state.
    There needs to be an edge from the last random seed to all ugens that touch random state - these ugens should also have an edge to the next random state change. This assumes things like LFNoise can be reordered within a random seed boundary (seems reasonable?)

  4. All other state. If plugins touch state that isn’t listed above, they need a way of indicating so. This can just do the dumb thing and create edges between each subsequent node that has this property.
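Here’s the ReplaceOut complication from case 1 as a minimal sketch: ReplaceOut overwrites the bus while Out mixes into it, and neither is a data-flow antecedent of the other, so only an explicit ordering edge keeps the written order (swapping them would silence the tone).

(
SynthDef(\busOrder, { |out|
	ReplaceOut.ar(out, Silent.ar);        // overwrite whatever is on the bus
	Out.ar(out, SinOsc.ar(440) * 0.1);    // then mix the tone back in
}).add.dumpUGens;
)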

@muellmusik your first point mentions sending messages; should that be included here? Does message order matter? (If so, I guess it’s currently broken, as it’s not width first?)

With two types of edges (children: strong, IO order: weak), depth first sorting can be redefined as limiting the length of the strong edges.

Yeah, I don’t have a specific use case where, for example, multiple sends in the same synth need to come in the indicated order, but I think the standard should rather be: can we rule that out? I don’t think we can.

In principle a synth should always behave like the explicitly specified order, so if there’s any doubt I’d say treat as width first.

This is partly why I idly wondered if the sort was even still worth it. Happy to accept that the answer is probably yes, but memory and CPU are way cheaper than they were decades ago, so it seemed worth asking.

See my comment in Proposal: Refactor synthdef ugen optimizations - #39 by Spacechild1. That’s why I was asking if anyone can present an example where there’s a significant and consistent performance difference. And again, we must not forget to test on Supernova, where each Synth has its own set of wire buffers!

I suppose small computers may have more constrained resources, so might be worth testing on a Pi or Beagle?

The crucial factor is cache size. Isolated micro-benchmarks often have the problem that cache effects only really come into play in larger applications.

Just for reference: a 64 KB L1 cache can theoretically hold (64 × 1024 / 4 / 64) = 256 wire buffers at a 64-sample block size. (In practice it is less because of associativity.) A 512 KB L2 cache can hold 2048 wire buffers. Above that, buffers will start to spill into the L3 cache (which is shared by several cores). With scsynth it is hard to reach this limit, but in Supernova it can happen easily if you instantiate lots of Synths. I’m curious about my test results :slight_smile:
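Quick arithmetic for those figures (4-byte floats, 64-sample blocks):

// number of 64-sample single-precision wire buffers that fit into 64 KB and 512 KB
[64 * 1024, 512 * 1024].collect { |bytes| bytes div: (64 * 4) }
// -> [ 256, 2048 ]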

Side note: this really shows a design problem with scsynth (which Supernova was forced to inherit): the UGen input and output pointers are fixed at construction and owned by the Unit. If the pointers were instead passed to the process function (which is typical for audio plugins), you’d only need a set of buffers per thread.

Not to be pedantic about terms, but I was thinking about this use of the term ‘topological’ and the implication that the ‘non-sorted’ version isn’t a topological ordering (which is what the code implies via its names), and I think it captures why this code is complex.

The default way the ugens are added to the children array is a topological ordering - it’s just not a good ordering for limiting the number of wirebufs.

This is important because the code is treating this children array as the execution order (which by definition must be a topological order), which makes doing replacements or insertions complex, as you need to preserve a valid execution order (a valid topological order) while the nuanced logic of inserting, deleting and replacing UGens is taking place.

Juggling these two constraints is why the code appears written ‘inside-out’.

By removing this children/execution-order array and storing the UGens directly as antecedents and descendants, making changes to the graph becomes significantly easier, as the ordering is no longer a concern but is dealt with in one final step.

Topological sorting helps keep the code easier to understand by separating concerns, assuming all the constraints are expressed in the graph (which currently they are not).

Limiting wirebufs is just one such topological ordering, and whether we adopt it should, I think, be independent of whether we sort the graph, which we should always do.
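For illustration, that final step could look something like this (a generic Kahn-style sort over antecedent sets – just a sketch, not the actual classlib code, and with no wire-buffer-aware tie-breaking):

(
~topoSort = { |nodes|   // nodes: objects responding to .antecedents (a collection)
	var order = List.new;
	var remaining = IdentitySet.newFrom(nodes);
	var placed = IdentitySet.new;
	while { remaining.notEmpty } {
		// take every node whose antecedents have all been placed already
		var ready = remaining.select { |n|
			n.antecedents.every { |a| placed.includes(a) }
		};
		if(ready.isEmpty) { Error("cycle in UGen graph").throw };
		ready.do { |n| order.add(n); placed.add(n); remaining.remove(n) };
	};
	order.asArray
};
)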

Well, yes, in that strict sense you don’t really have a choice, as the sclang syntax forces you to do this. Nevertheless, I think it’s fair to say that the intended order of execution in sclang is very clear, if sometimes ‘inside-out’ (if I understand you correctly).

I meant the way the SynthDef optimisations are written is inside out, as you have to mutate the graph from inside the nodes (UGens), keeping a valid topological order as you make substitutions. The addCopies from the PV UGens is one example of this: it continually calls things like indexUGens, which is a graph-wide process, rather than just changing the edges and creating new nodes.