Why you should always wrap Synth(...) and Synth:set in Server.default.bind { ... }

Please see Why you should always wrap Synth(...) and Synth:set in Server.default.bind { ... } - #10 by Spacechild1. The fundamental issue of synchronization/scheduling overhead is the same on every OS.
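For reference, here is a minimal sketch of the pattern the title refers to: bundling Synth creation and Synth:set into one time-stamped bundle via Server.default.bind, instead of sending individual messages that execute "as soon as possible". The \ping SynthDef is just an illustrative example, not from the linked post.

```
(
s = Server.default;
s.waitForBoot {
    SynthDef(\ping, { |out = 0, freq = 440, amp = 0.1|
        var sig = SinOsc.ar(freq) * Env.perc(0.01, 0.3).ar(Done.freeSelf);
        Out.ar(out, sig ! 2 * amp);
    }).add;

    s.sync;

    // without bind: each OSC message is sent and executed immediately
    // x = Synth(\ping); x.set(\freq, 660);

    // with bind: both commands go out in one bundle stamped with s.latency
    Server.default.bind {
        var x = Synth(\ping);
        x.set(\freq, 660);
    };
};
)
```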


Another issue is that on Supernova every Synth gets its own wire buffers and local busses because it might execute in parallel with other Synths. This may cause significant memory overhead and cache misses. The smaller the Synths, the more pronounced the overhead. 16000 SinOsc synths is probably the point where the model breaks down… But then again, it’s not exactly a real-world test scenario :slight_smile:
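(For context, an illustrative sketch of the kind of stress test being discussed: many tiny one-UGen Synths inside a single ParGroup, then watching the server load. The synth count and the \tinySine SynthDef are made up for the example.)

```
(
s.waitForBoot {
    SynthDef(\tinySine, { |out = 0, freq = 440|
        Out.ar(out, SinOsc.ar(freq, 0, 0.0001));
    }).add;

    s.sync;

    ~par = ParGroup(s);
    2000.do { |i| Synth(\tinySine, [\freq, 100 + i], target: ~par) };

    // after a moment, compare load on scsynth vs supernova:
    // s.avgCPU.postln; s.peakCPU.postln;
};
)
```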

However, future parallel server implementations should take this issue into account!

I always wondered why that is. The number of DSP threads is known in advance and parallelism can’t exceed this. Wouldn’t it be enough to have one set of wire buffers per DSP thread?

hjh

Synths are not pinned to specific threads. On every DSP tick, the DSP tasks are pushed to a thread pool and any DSP thread might pop and execute them. The wire buffers, however, have to be set when the Synth is created.

As a side note: this is much less of a problem when the DSP graph is fixed. In fact, I have been working on a multi-threaded version of Pd (GitHub - Spacechild1/pure-data at multi-threading) and I only need to create new signal contexts at “fork points”.

In SuperCollider, however, the DSP graph can be rearranged freely and in real-time. Tricky stuff…

I see. I’d assumed that, in one DSP tick, one synth would execute on one thread (could be a different thread next time), and there can never be more than one synth active in one thread – so I naively thought that the synth node could use wire buffers belonging to the thread. (I should also assume that Tim considered that and rejected it for some reason.)

I had quite good results from supernova in a piece where I was playing a lot of chords, although I can’t use it live because my MixerChannel sorting logic has sometimes crashed supernova due to too many group moves in rapid succession.

hjh

(I should also assume that Tim considered that and rejected it for some reason.)

I think one limiting factor was the wish to stay compatible with scsynth as much as possible (which is a nice thing, of course!). Alternatively, one possible solution could be to have wire buffers per ParGroup and fix up all Units in a Graph when it is moved between Groups. But this would require a significant change in the plugin API.

It would be great if you could find a somewhat reproducible example and open a ticket on GitHub. I have already fixed a few Supernova bugs in the past, so there is a good chance I can fix that as well. It would be great if Supernova were more stable, so that more users would feel comfortable using it in their projects and we could get more practical experience with various forms of real-life usage.


I thought that ParGroup was a designator only of Synths that could be parallelized (i.e. they don’t depend on each other), not a designator of which thread/executor they would be executed on? In which case, things inside a ParGroup are guaranteed to be executed in parallel (or at least have a high likelihood of this). But possibly I’m misunderstanding your suggestion?

I would imagine that the optimal solution would require only one set of wirebufs that would be re-used for every Synth, and in case of parallelized graph execution you would just need one set per thread/executor (not one per Synth). It’s been a while since I’ve looked at the architecture of supernova, maybe it doesn’t follow this - but in any case it should be at least theoretically the best option.

In my (very anecdotal) experience, supernova can be lower overhead for high UGen count synths (meaning: lots of SinOsc’s for e.g. additive). I haven’t tested for lots of independent synths though. I can’t imagine this would be significantly different for non-threaded / no-ParGroup cases - Tim is a performance nerd and would never have released that :slight_smile: - but I could easily see a case where the cost of doing queue operations could overtake the cost of a trivial single SinOsc synth, in which case the performance would be noticeably worse.

That’s correct. I phrased that sentence very poorly. What I meant was:

One possible solution could be having dedicated wire buffers for each toplevel Node within a ParGroup, but child nodes would reuse the wire buffers of their parent. This would guarantee that wire buffers are isolated, while avoiding the overhead of always having separate wire buffers for each and every Synth. However, this would require fixing up all Units in a Graph when it (or one of its parents) is moved between Groups.

I would imagine that the optimal solution would require only one set of wirebufs that would be re-used for every Synth, and in case of parallelized graph execution you would just need one set per thread/executor (not one per Synth)

For simple chains, that would be the ideal solution indeed. Unfortunately, Synths are graphs, so in practice you would need to traverse the SynthDef (more specifically, its unit specs and corresponding wire specs) and fix up all (audio-rate) wires for every Synth on every DSP tick - which would be prohibitively expensive.

but I could easily see a case where the cost of doing queue operations could overtake the cost of a trivial single SinOsc synth, in which case the performance would be noticeably worse.

Yes, that’s exactly what I think happens when you try to run 16000 SinOsc Synths in a single ParGroup. (I want to do some benchmarks, actually.)

I think there is some low-hanging fruit for optimization. Currently, Supernova makes the pessimistic assumption that a ParGroup may contain wildly different Synths/Nodes, so they are all scheduled as individual tasks. However, if all the Synths/Nodes are roughly equal in terms of CPU cost, it would be better to partition them into N tasks, where N is the number of DSP threads. On the user-facing side, this could be implemented as an additional (optional) argument for ParGroup. I have already put this on my TODO list :slight_smile:
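(Purely hypothetical sketch of what such an optional argument could look like on the language side; no such argument exists today, and the uniform: keyword is invented for illustration.)

```
// today: every child of a ParGroup becomes its own scheduling task
~par = ParGroup(s);

// hypothetical: hint that the children are roughly equal in cost, so the
// server may batch them into N tasks, N = number of DSP threads
// ~par = ParGroup(s, uniform: true);
```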


Yes, looking over the wire implementations I think I was less clear on how this worked than I thought. I had imagined that the wirebufs were effectively storage for temporaries while calculating the Synth graph (where e.g. a totally linear chain like SinOsc.ar(SinOsc.ar(SinOsc.ar(440))) would be aliased into a single wire, not counting constants), and that UGen inputs and outputs were pre-calculated during the graph build process to indexes into this array of wires. The actual implementation doesn’t look exactly like this, I need to refresh my memory a bit here…

It already feels like a mistake that ParGroup is a user-facing object (graph partitioning afaik is a problem with a clear solution, or at least one that will do as well as or better than any manual partitioning strategy a SuperCollider user might cook up - granted, given the design of the server, this is still tough to solve). I wonder if there’s a better / less manual approach here? Wouldn’t it be enough to sort the ParGroup by SynthDef when the graph is changed, and then have threads pull multiple nodes from the queue in cases where the total node count for a given SynthDef is significantly larger than the number of threads? The worst case of the sort would potentially ruin any performance benefits here, but there may be a pragmatic path to making this performant enough. It’s hard to imagine any solution that doesn’t make use of some kind of sort operation, however.

I guess there’s a meta-consideration here, which is that both SuperCollider servers are fundamentally not set up to efficiently process node counts in the many-thousands. I wonder if it’s worth the effort to optimize the “thousands of SinOsc Synths” case when this will always be highly non-optimal - and the general advice to any user would be to ABSOLUTELY avoid this. :slight_smile:

I had imagined that the wirebufs were effectively storage for temporaries while calculating the Synth graph (where e.g. a totally linear chain like SinOsc.ar(SinOsc.ar(SinOsc.ar(440))) would be aliased into a single wire, not counting constants), and that UGen inputs and outputs were pre-calculated during the graph build process to indexes into this array of wires.

That sounds about right. The actual wire buffers are just pointers into one contiguous array. In scsynth, the same array is used for all Synths and lives in HiddenWorld. In Supernova, each Synth has its own array.
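(If you want to look at this from the language side: SynthDef:dumpUGens prints the compiled unit graph, i.e. each unit with its rate and inputs, so you can see how a nested chain collapses into a linear list of units whose outputs feed the next unit’s inputs. The actual wire buffers are only allocated on the server.)

```
(
// graph inspection only - don't play this one :)
SynthDef(\chain, {
    Out.ar(0, SinOsc.ar(SinOsc.ar(SinOsc.ar(440))));
}).dumpUGens;
)
```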

Now, if you wanted to change the underlying array for a particular set of Synths, you would need to fix up all UGens so that the individual buffers point into the new array. For this you’d have to traverse the entire graph and look up mWireIndex in the input and output specs of each Unit. You can probably get away with it if you only do it occasionally, but it’s not something you would do on every process tick.

It already feels like a mistake that ParGroup is a user-facing object (graph partitioning afaik is a problem with a clear solution).

In my understanding, graph partitioning only works if all the connections between nodes are static and visible. This is typically the case in a DAW, as VST plugins are only connected through their audio input and outputs. (Users can, of course, change the routing, in which case the graph would need to be recomputed.)

In scsynth/supernova, however, Nodes (or rather their UGens) communicate via busses, and bus indexes can be set dynamically. Moreover, I/O UGens are effectively black boxes, just like any other UGen, so the Server is not even aware of them.
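(A small illustration of that point, with made-up SynthDef names: the bus a synth reads from is just a control value, so it can be rewired at any time with a set, and the server has no static picture of who is connected to whom.)

```
(
s.waitForBoot {
    SynthDef(\src, { |out = 0| Out.ar(out, PinkNoise.ar(0.1)) }).add;
    SynthDef(\listen, { |in = 0| Out.ar(0, In.ar(in, 1) ! 2) }).add;
    s.sync;

    ~busA = Bus.audio(s, 1);
    ~busB = Bus.audio(s, 1);

    ~srcA = Synth(\src, [\out, ~busA]);
    ~srcB = Synth(\src, [\out, ~busB]);
    ~tap  = Synth.tail(s, \listen, [\in, ~busA]);

    // later, rewire at runtime - from the server's point of view this is
    // just another control change on an opaque UGen input:
    // ~tap.set(\in, ~busB);
};
)
```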

IMO, ParGroup makes a lot of sense. One general problem with scsynth is that when you look at a Node, it is not immediately clear whether its children are supposed to run in series or parallel. ParGroup effectively says that all children are independent from each other, i.e. conceptually they run in parallel. Group, on the other hand, implies that children may run in series. Supernova happens to use this information to enable multiprocessing where possible, but in general I think it also helps to clarify the structure of the graph. If you mentally substitute Group with SerGroup, ParGroup starts to make more sense :slight_smile:
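(Roughly, with illustrative SynthDef names: the independent sources can live in a ParGroup, while anything that depends on them - like an effect reading their output - has to come after them in the enclosing, serial group.)

```
(
s.waitForBoot {
    SynthDef(\voice, { |out = 0, freq = 440|
        Out.ar(out, SinOsc.ar(freq, 0, 0.02) ! 2);
    }).add;
    SynthDef(\verb, { |out = 0|
        ReplaceOut.ar(out, FreeVerb.ar(In.ar(out, 2), 0.4));
    }).add;
    s.sync;

    ~sources = ParGroup(s);   // children are independent of each other
    8.do { |i| Synth(\voice, [\freq, 200 * (i + 1)], target: ~sources) };

    // the effect depends on the sources, so it is placed after the ParGroup
    ~fx = Synth.after(~sources, \verb);
};
)
```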

One slight issue I have with Supernova’s multiprocessing is that it only supports “fork/join” multiprocessing. Another approach is “asynchronous pipelining”, which also allows parallelizing serial signal chains, albeit with one-block delays. My experimental multi-threaded Pd fork (GitHub - Spacechild1/pure-data at multi-threading) actually supports both.

Actually, Tim discusses and evaluates several multiprocessing strategies in his (amazing) master’s thesis. I cannot recommend it enough. It’s a joy to read!

Wouldn’t it be enough to sort the ParGroup by SynthDef

ParGroup may contain other Groups. You could compare them recursively to figure out if they are equivalent, but I have the feeling it would be better to just let the user tell the Server…

I guess there’s a meta-consideration here, which is that both SuperCollider servers are fundamentally not set up to efficiently process node counts in the many-thousands.

Yes! One thing that would help is true multi-channel processing à la Max/MSP (and the upcoming Pd 0.54!). Sclang’s “multi-channel expansion” tends to hide the fact that the Server is fundamentally single-channel. Multi-channel processing not only improves cache locality, it also makes it possible to vectorize certain operations that would otherwise be impossible to vectorize, such as oscillators or filters. With proper AVX instructions you can effectively compute 8 oscillators for the price of 1 (well, almost :slight_smile: ).
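(To make that concrete: the expansion below happens entirely in sclang, and the server ends up with four separate single-channel SinOsc units rather than one vectorized multi-channel oscillator.)

```
(
SynthDef(\bank, {
    Out.ar(0, Mix(SinOsc.ar([220, 275, 330, 440], 0, 0.05)));
}).dumpUGens;   // prints four separate SinOsc units
)
```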