Proposal: Refactor synthdef ugen optimizations

Spacechild1 · July 23, 2024, 1:30pm

Another problem with synthDefReady: when a UGen registers itself in the init method and the SynthDef graph function throws an exception, the UGen has no chance to remove itself and will stay in the dependents dictionary forever. So it’s really not a robust solution… I guess I will have to stick with my optimizeGraph hack for now

jordan · July 23, 2024, 1:38pm

It will also potentially misfire if the UGen is removed from the graph in an optimisation pass, so yea, it’s basically unusable.

smoge · July 23, 2024, 1:44pm

So update() is just an anti-feature? (kidding)

Another thing is that PureUGen is a superclass that says very little (and doesn’t match what I tried to say). It separates the ugens that do not connect to an audio buffer or IO(). All ugens subgraphs and graphs are all composable functions. Functors,etc

A design would need to be done based on concrete analysis to do something notable. The theory is well developed, but we can’t just treat an audio graph as an abstract (textbook) graph.

EDIT: and that can eventually evolve as a composition method

jordan · July 23, 2024, 2:10pm

That is incredibly important information.

Literally the purpose of this thread, expect its not an audio graph, but a UGen graph. The latter is much simpler.

The challenge is to:

Design the SynthDef compiler in a way that doesn’t over engineer anything (many of these abstract idea you keep referring to fall under this category)
Stop the optimisation passes from interfering with each other. I’ve already started working on this and will have something shortly.
Make things extendable in a limited number of specified ways — this isn’t x86, we don’t need that much flexibility as UGens are severely limited.
Make things easier to understand by avoiding mutating the graph from within Nodes.
Don’t break backwards compatibility (or do so in as limited way as possible).

smoge · July 23, 2024, 2:12pm

What is a UGen graph different than a GraphDef in general? Why is it more simple?

smoge · July 23, 2024, 2:18pm

The only idea I reiterated here in different ways was to work with graphs as graphs. The " abstract" idea precisely ignores the data, not my “theory.” Please don’t take it wrong, but don’t exaggerate by saying this.

What are the optimization passes here in your sense? You haven’t explained.

Have a good day, buddies.

jordan · July 23, 2024, 2:50pm

I said audio graph, which can also cover things like threading, feedback, sample rate changes, etc… Graphdef is the c++ representation of a synthdef (I think, it’s irrelevant to this conversation either way unless the idea is to move optimizations to c++ which is a huge job and I think I bad idea).

The ugen graph is simple because the are a limit number of ways you can transform it as you’re limit by the ugen interface. It’s not like x86 where there thousands and moves or adds - there are 4 adds (+, Sum3, Sum4, Mix) and they are very simple. Even the stuff in the PV classes is fundamentally simple.

…work with graphs as graphs.

Hence this thread.

What are the optimization passes here in your sense? You haven’t explained.

The ones currently being applied. This thread proposes changes to the API and internal algorithms to achieve the goals listed above (hence the refactoring in the title).

As mentioned before, these passes are pattern matching/ replacement of certain binary ops and PV_ ugens, dead code elimination, and @jamshark70 's common subexpression elimination (because it’s great), followed by the topological sort, indexing, and the final hook discussed above.

I’ll post the interface I’ve designed for this in a day or so once I’ve got a proof of concept working.

UGen author will be able to register replacement functions, and respond once finalised (but before sending), nothing else (unless someone can put forward a good reason why they need more power).

smoge · July 23, 2024, 2:59pm

Sorry, I don’ t understand the difference between ugen graph and synthdef (which IS a graphdef, and while I used graphdef in a slightly generalized way, I don’t understand your paragraph.

I see. Ok, sounds good.

Yesterday, I was told by others it was impossible since it was an exotic language feature. I am glad the collective spirit is evolving.

Spacechild1 · July 23, 2024, 3:24pm

Sorry, I don’ t understand the difference between ugen graph and synthdef

He said audio graph, not synthdef. He did not specifically compare it with a scsynth graph, but with audio graphs in general.

No, I said that sclang does not have pattern matching as a language feature, but it has its own ways to do the same thing.

muellmusik · July 23, 2024, 3:49pm

As someone who’s touched this code in the past, do proceed with considerable care. More documentation would be great, and maybe more testing? Given this has been a problematically murky area perhaps some comparisons between sorts from the previous and refactored versions, the including pure and non-pure UGens, etc., would be helpful. Any discrepancies would indicate possible areas for clarification. Thanks for looking at this tricky bit of the code base!

smoge · July 24, 2024, 10:17am

I said that too! We’re all saying the same things now… That’s good I guess!

jordan · July 24, 2024, 2:29pm

So I’ve been reading the code pretty thoroughly and there are two parts I am unsure of.

What exactly is a WidthFirstUGen? Things like LocalBuf, FFT, PV_* inherit from this, but why?

Secondly, what is the array UGen:widthFirstAntecedent for and how does it differ from a normal antecedent?

I suspect (but am not sure) this is to do with making sure buffers (and IO in general) are read and written to in the correct order? If so, why aren’t these relationships expressed using the normal antecedents? Is there some reason this won’t work that I can’t see yet? Can’t we just store the last IO UGen in the synthdef and when we get another, add the previous as an antecedent and update the stored, that way, they are always in the order they are written.

@muellmusik I believe you wrote much of this code, any insights?

muellmusik · July 24, 2024, 3:51pm

Yeah sure, 13 years ago! 100% fresh in my mind…

Yes, I’ll try to look in more depth (no pun intended!) later, but as I recall the sort goes depth first by default, I believe to allow wirebufs to be reused. But any UGen which has a side effect (e.g. writing to a buffer) can make a depth first sort invalid as the side effect may happen in the wrong order, i.e. antecedence may not fully capture dependency where there’s parallelism. So IIRC WidthFirstUGens are supposed to be added width first to avoid such issues.

Various solutions were tried at the time, and for reasons that I don’t immediately recall, this one stuck. Other implementations may also work of course, but as I said, tread carefully. Make sure if it produces a different outcome you can explain why, and why it’s better or at least doesn’t matter.

Tbh, while I’m no expert on this sort of thing, given that the original sort scheme is decades old now, I’d be curious to see some benchmarking to model the extent to which the optimisations still offer valuable performance advantages. I’m not saying get rid of them, but they do add a lot of complexity. If you’re refactoring it might be worth looking at it all from the ground up.

jordan · July 24, 2024, 4:13pm

Thanks!

What do you mean by parallelism here?

Because that should be perfectly expressible with the antecedents and a topological order.

My plan is for everything that isn’t pure to add the previous non-pure thing to it’s antecedents, this way, execution order follows the order it was written in the supercollider code, unless it’s pure.

jamshark70 · July 24, 2024, 5:55pm

One type of case where simple antecedence may not handle all the requirements is here: Classlib: Invert the SynthDef topological sort (to fix a wide-graph case) by jamshark70 · Pull Request #5811 · supercollider/supercollider · GitHub

I don’t recall whether this specific case uses widthFirstAntecedents but it is a case where the current topo sort gets it right basically by luck.

hjh

jordan · July 24, 2024, 6:19pm

Ah, thanks @jamshark70 didn’t think of random state seeding.

Topological sort can preserve the correct behaviour though. We’d have to track ugens that touch buffers and random state.

Can anyone think of anything else that has state in a synth?

Spacechild1 · July 24, 2024, 11:13pm

Yes! I think the primary reason has been multi-channel expansion because it naturally adds the UGens in width-first order. This can lead to excessive wire buffer usage. In fact, you can easily run out of wire buffers altogether.

Consider the following snippet:

(
~numOsc = 10;

SynthDef(\test, { |out, amp = 1.0|
	var sig = SinOsc.ar((1..~numOsc) + 440);
	var amps = { exprand(0.1, 1.0) } ! ~numOsc;
	sig = sig * amps;
	Out.ar(out, Mix.ar(sig) / ~numOsc * amp);
}).add.dumpUGens;
)

This is the output without topological sort:

test
[ 0_Control, control, nil ]
[ 1_SinOsc, audio, [ 441, 0.0 ] ]
[ 2_SinOsc, audio, [ 442, 0.0 ] ]
[ 3_SinOsc, audio, [ 443, 0.0 ] ]
[ 4_SinOsc, audio, [ 444, 0.0 ] ]
[ 5_SinOsc, audio, [ 445, 0.0 ] ]
[ 6_SinOsc, audio, [ 446, 0.0 ] ]
[ 7_SinOsc, audio, [ 447, 0.0 ] ]
[ 8_SinOsc, audio, [ 448, 0.0 ] ]
[ 9_SinOsc, audio, [ 449, 0.0 ] ]
[ 10_SinOsc, audio, [ 450, 0.0 ] ]
[ 11_*, audio, [ 1_SinOsc, 0.14781964574475 ] ]
[ 12_*, audio, [ 2_SinOsc, 0.61188809040249 ] ]
[ 13_*, audio, [ 3_SinOsc, 0.87976904193435 ] ]
[ 14_*, audio, [ 4_SinOsc, 0.49579341485918 ] ]
[ 15_*, audio, [ 5_SinOsc, 0.14011180115573 ] ]
[ 16_*, audio, [ 6_SinOsc, 0.61413128730507 ] ]
[ 17_*, audio, [ 7_SinOsc, 0.4674508871795 ] ]
[ 18_*, audio, [ 8_SinOsc, 0.3693163891174 ] ]
[ 19_*, audio, [ 10_SinOsc, 0.16267632509419 ] ]
[ 20_Sum4, audio, [ 14_*, 13_*, 12_*, 11_* ] ]
[ 21_Sum4, audio, [ 18_*, 17_*, 16_*, 15_* ] ]
[ 22_MulAdd, audio, [ 9_SinOsc, 0.23164356857488, 19_* ] ]
[ 23_Sum3, audio, [ 22_MulAdd, 21_Sum4, 20_Sum4 ] ]
[ 24_/, audio, [ 23_Sum3, 10 ] ]
[ 25_*, audio, [ 24_/, 0_Control[1] ] ]
[ 26_Out, audio, [ 0_Control[0], 25_* ] ]

This needs as many wire buffers as SinOscs! With ~numOsc = 100 you already exceed the default number of wire buffers (64).

And this is the output with topological sort:

test
[ 0_Control, control, nil ]
[ 1_SinOsc, audio, [ 441, 0.0 ] ]
[ 2_*, audio, [ 1_SinOsc, 0.21350452668567 ] ]
[ 3_SinOsc, audio, [ 442, 0.0 ] ]
[ 4_*, audio, [ 3_SinOsc, 0.14639397071191 ] ]
[ 5_SinOsc, audio, [ 443, 0.0 ] ]
[ 6_*, audio, [ 5_SinOsc, 0.43959053276532 ] ]
[ 7_SinOsc, audio, [ 444, 0.0 ] ]
[ 8_*, audio, [ 7_SinOsc, 0.30976199993617 ] ]
[ 9_Sum4, audio, [ 8_*, 6_*, 4_*, 2_* ] ]
[ 10_SinOsc, audio, [ 445, 0.0 ] ]
[ 11_*, audio, [ 10_SinOsc, 0.53345690432921 ] ]
[ 12_SinOsc, audio, [ 446, 0.0 ] ]
[ 13_*, audio, [ 12_SinOsc, 0.2133492885536 ] ]
[ 14_SinOsc, audio, [ 447, 0.0 ] ]
[ 15_*, audio, [ 14_SinOsc, 0.7878841337054 ] ]
[ 16_SinOsc, audio, [ 448, 0.0 ] ]
[ 17_*, audio, [ 16_SinOsc, 0.10154009291624 ] ]
[ 18_Sum4, audio, [ 17_*, 15_*, 13_*, 11_* ] ]
[ 19_SinOsc, audio, [ 449, 0.0 ] ]
[ 20_SinOsc, audio, [ 450, 0.0 ] ]
[ 21_*, audio, [ 20_SinOsc, 0.22857845276941 ] ]
[ 22_MulAdd, audio, [ 19_SinOsc, 0.64841385504448, 21_* ] ]
[ 23_Sum3, audio, [ 22_MulAdd, 18_Sum4, 9_Sum4 ] ]
[ 24_/, audio, [ 23_Sum3, 10 ] ]
[ 25_*, audio, [ 24_/, 0_Control[1] ] ]
[ 26_Out, audio, [ 0_Control[0], 25_* ] ]

I think this only needs 4 + ceil(~numOuts / 4) wire buffers: 4 buffers for the Sum4 inputs and 1 buffer for every Sum4 result. (Someone please do the math!)

For 10 SinOscs this is only 10 vs 7, but for 100 SinOscs it is 100 vs 29.

Note that this could be reduced to only 5 wire buffers – for all values of ~numOsc! – if every Sum4 instance would immediately sum into the final output, instead of summing all Sum4 outputs at the very end. On the downside, the final sum would be done by individual + UGens instead of Sum4 UGens. Regarding the Sum4 optimization, there seems to be an implicit tradeoff between the number of UGens and the number of required wire buffers.

Spacechild1 · July 24, 2024, 11:24pm

I did some rough benchmarks and I could hardly see any notable performance difference between the sorted and unsorted graph, even with very large numbers of SinOscs (up to 1000). Maybe someone can come up with an example where there is a significant difference?

Note: For now I only tested with scsynth which has a single array of wire buffers that is used by all Synths. In Supernova, however, every Synth has its own wire buffers (for thread-safety reasons), so the larger number of wire buffers could add up and harm overall performance.

Spacechild1 · July 24, 2024, 11:28pm

(For testing purposes I’ve added class members to SynthDef which allow me easily enable/disable certain optimization passes, so I can compare the results. For example, with SynthDef.useTopologicalSort I can enable/disable topological sorting.)

muellmusik · July 25, 2024, 2:06am

Sorry, poor choice of words perhaps. I mean cases where the ordering needs to reflect parallel streams or branches of UGens, i.e. where depth first sorts will fail.

Probably useful to start here:

WidthFirstUGen.dumpClassSubtree;
WidthFirstUGen
[
  PV_ChainUGen
  [
    PV_MagShift
    PV_MagDiv
    PV_JensenAndersen
    PV_MagFreeze
    PV_RandWipe
    FFT
    PV_MagAbove
      [ PV_MagClip PV_LocalMax PV_MagBelow ]
    PV_Diffuser
    PV_HainsworthFoote
    PV_Decorrelate
    PV_RectComb
    PV_BinShift
    PV_BinScramble
    PV_BrickWall
    FFTTrigger
    PV_BinWipe
    PV_RandComb
    PV_MagSquared
      [ PV_PhaseShift270 PV_MagNoise PV_Conj PV_PhaseShift90 ]
    PV_MagSmear
    PackFFT
    PV_RectComb2
    PV_PhaseShift
    PV_MagMul
    [
      PV_Max
      PV_Mul
      PV_Min
      PV_CopyPhase
      PV_Div
      PV_Copy
      PV_Add
    ]
    PV_ConformalMap
  ]
  LocalBuf
  RandSeed
  SetBuf
  ClearBuf
  RandID
  IFFT
]

That was the intent with WidthFirstUGen, IIRC.