Question on Graph / Topological Sort

Why do we have order-of-execution problems if there are simple algorithms that could, if not correct the order automatically, at least warn the user about errors? I’m just talking about regular nodes here, nothing more granular and no parallelization.

EDIT: I remember @scztt describing another approach that would be much simpler. Not sure it is related. (maybe he described something else)

In theory, couldn’t any audio graph be reduced to simple mathematical equivalents using pattern matching, or am I wrong?

Yeah, I think it’s a design problem that order-of-execution is a user problem in SuperCollider, though I understand why it probably evolved that way. For single-threaded execution, graph execution order more or less has one correct analytical answer (or, at least, a bunch of equivalent answers). I think if you’re executing the graph in parallel (e.g. supernova), there are different ways to organize graph execution that can have consequences in terms of performance - and I seem to recall that this can be a somewhat difficult problem - but in terms of correctness, there are still just analytically correct or incorrect orderings. In a way, Groups in general don’t really serve a functional purpose in SuperCollider - if node ordering were automatic, then the only use of Groups would be to make messaging multiple synths easier (useful, obviously, but only really a minor optimization for something that could be done in other ways).

There’s a feature of the NodeSnapshot quark (GitHub - scztt/NodeSnapshot.quark) that probably nobody uses which will tell you if you have node order problems (though it only takes into account in/out ugens that have arguments, and is not always right about feedback cases).

I’ve got another class I use for doing dependency-based node ordering in my own code, but it’s imperfect and I’m working on something a bit more robust that I’d like to eventually share.


Thanks. That’s insightful.

I was discouraged here because I wrote some Haskell functions with granular parallelization inside them (via Control.Parallel). Is it useless in all cases, whatever the complexity of the audio process and the environment it is part of? Is that a law in audio programming?

I looked into the node level. In Supernova, the user is responsible, but I also found other research that goes in the other direction. It probably just doesn’t fit the SC model, or the complexity would not be worth it.

I think if SC had a fully-automated node ordering scheme, we would still have problems with order of execution.

Topological sort depends on each node knowing its antecedents and descendants. I think in many cases, it may be sufficient to specify only descendants, but I’m pretty sure that’s not always the case. For example, let’s say you start with:

LFO --> synthA --> reverb

Now you add synthB, which happens to be reading from the LFO and feeding into reverb, but you (accidentally) specified only the reverb dependency. Then you could end up with

synthB -----------\
LFO --> synthA --> reverb

… where synthB gets nothing from the LFO at audio rate, or stale data at control rate. A different heuristic (e.g. “keep the new synth closer to its descendants”) would get a better result but probably fail under other conditions. (I’ve done some experimenting in the SynthDef topo source and found, for instance, that inverting the sort logic improved some cases but degraded others. It’s very hard to make guarantees of optimal behavior in every case.)
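To make the failure mode concrete, here is a minimal sketch of a topological sort over that graph (plain Python using Kahn’s algorithm, not SC’s actual sorting code). With the LFO → synthB edge missing, the sort happily emits an order in which synthB runs before the LFO:

```python
from collections import deque

def topo_sort(nodes, edges):
    # Kahn's algorithm: repeatedly emit a node with no unemitted antecedents.
    indegree = {n: 0 for n in nodes}
    for _, dst in edges:
        indegree[dst] += 1
    ready = deque(n for n in nodes if indegree[n] == 0)
    order = []
    while ready:
        n = ready.popleft()
        order.append(n)
        for src, dst in edges:
            if src == n:
                indegree[dst] -= 1
                if indegree[dst] == 0:
                    ready.append(dst)
    return order

nodes = ["synthB", "LFO", "synthA", "reverb"]
# synthB really reads from the LFO too, but only the reverb edge was declared:
edges = [("LFO", "synthA"), ("synthA", "reverb"), ("synthB", "reverb")]

print(topo_sort(nodes, edges))
# -> ['synthB', 'LFO', 'synthA', 'reverb']
# Perfectly valid for the declared edges -- but synthB runs before the LFO.
```

The sort is correct with respect to the information it was given; the garbage-in/garbage-out problem is exactly the one described above.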

Automatic node sorting just pushes the user-knowledge problem to a different place. Admittedly, this might be more user-friendly, but really it only changes the problem from “you need to understand your graph and maintain it by hand correctly” to “you need to supply accurate information about your graph every time you create a synth.”

So I don’t fully agree with scztt – design will not make this problem go away. … except that below, I talk about my own design to solve it, so that was not a very bright sentence to write lol

I’m curious, though – intuitively, I’d suspect that most of the time, a new node with accurate antecedents and descendants could find its correct location by locally scanning those antecedents and descendants. Are there cases where it would be necessary to traverse the entire graph? That would be something to avoid (e.g., if you have 1000 nodes, good performance = stick it directly in a group where you know it belongs; bad performance = search the entire tree to find the right place for it).
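For what it’s worth, if the declared dependencies are accurate and the existing order is already valid, the check really is local: any slot after the last antecedent and before the first descendant works. A hypothetical sketch (`insert_position` is my name, not an SC API):

```python
def insert_position(order, antecedents, descendants):
    # Any slot after the last antecedent and before the first descendant
    # is valid; only the direct neighbors are inspected, not the whole graph.
    lo = max((order.index(a) + 1 for a in antecedents), default=0)
    hi = min((order.index(d) for d in descendants), default=len(order))
    if lo > hi:
        raise ValueError("declared dependencies conflict with current order")
    return lo  # any position in [lo, hi] would also be valid

order = ["LFO", "synthA", "reverb", "main"]
pos = insert_position(order, antecedents=["LFO"], descendants=["reverb"])
order.insert(pos, "synthB")
print(order)  # -> ['LFO', 'synthB', 'synthA', 'reverb', 'main']
```

The catch, as above, is that this only works when the declared dependencies are accurate – and the `order.index` calls still scan the flat list unless node positions are cached somewhere.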

Groups make a complex node order hierarchical, which is perhaps not functionally necessary for computers, but is very useful for humans.

Anyway… MixerChannel handles this by making antecedents/descendants an essential part of the interface, and expressing these in terms that will be familiar to DAW users:

// reverb is an antecedent of main
// main is a descendant of reverb
reverb.outbus = main;

// A is an antecedent of main
// main is a descendant of A
mixerA.outbus = main;

// A is an antecedent of main *and* reverb
// main and reverb are descendants of A
mixerA.newPostSend(reverb, -3.5.dbamp);

Each mixer starts with an ordering number = 1.

  1. reverb.outbus = main – reverb has a new descendant. So it takes the descendant’s ordering number and adds it to its own. Now reverb’s ordering number = 2. Sort in descending order, and reverb precedes main: OK.

  2. mixerA.outbus = main – same process; mixerA’s ordering number = 2. Now reverb and mixerA have the same ordering number, so they could be positioned in arbitrary order, but both will always precede their common descendant main.

  3. mixerA.newPostSend(reverb, -3.5.dbamp) – mixerA adds a new direct descendant, so its ordering number becomes mixerA ord + reverb ord = 2 + 2 = 4. Sort in descending order, and you get mixerA, then reverb, then main. OK!

It goes recursively through the antecedents, so every mixer’s ordering number is guaranteed to be at least 1 greater than the sum of its descendants’ ordering numbers. (The fact that one mixer’s ordering number is greater than another’s doesn’t mean that the first feeds into the second – only that the descending sort will produce a valid ordering. BUT if two mixers have the same ordering number, then it is guaranteed that they have no dependencies between themselves, and they could run in parallel. This I haven’t implemented yet, but could.)
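A rough Python paraphrase of the scheme as described above (not MixerChannel’s actual code; the class and method names are mine):

```python
class Mixer:
    def __init__(self, name):
        self.name = name
        self.ord = 1              # each mixer starts with ordering number 1
        self.antecedents = set()

    def add_descendant(self, dest):
        # dest reads our output, so dest must run after us
        dest.antecedents.add(self)
        self._bump(dest.ord)      # add the descendant's number to our own

    def _bump(self, amount):
        self.ord += amount
        for a in self.antecedents:   # propagate recursively upstream
            a._bump(amount)

main, reverb, mixerA = Mixer("main"), Mixer("reverb"), Mixer("mixerA")
reverb.add_descendant(main)       # reverb.ord: 1 -> 2
mixerA.add_descendant(main)       # mixerA.ord: 1 -> 2
mixerA.add_descendant(reverb)     # mixerA.ord: 2 + 2 -> 4

by_ord = sorted([main, reverb, mixerA], key=lambda m: m.ord, reverse=True)
print([m.name for m in by_ord])   # -> ['mixerA', 'reverb', 'main']
```

This reproduces the three steps in the walkthrough: sorting by ordering number in descending order yields mixerA, reverb, main.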

You can also explicitly specify relationships without changing bus routing: mixerA.sendsSignalTo(mixerB) and related methods.

I’ve used MixerChannel in every production project I’ve ever done since 2004 or so, and the only order of execution problem that I have is if I put multiple fx synths into the same effectgroup.

Let me say that again: I do not have order of execution problems AT ALL except for multiple effects in the same group. I have live-sampling processes that tap into other mixers’ signals and all it has to do is to say BP(~source).asMixer.sendsSignalTo(~chan) (or pre-fader send) and it Just Works.

This is a solved problem.


If you got rid of busses, it would be impossible to forget something like this, and the topological ordering would work as intended. An audio server that knows what a connection is doesn’t need to expose busses to the user - supporting many-to-one connections could also work.

Inserting into a topologically sorted list is a known problem, and walking the graph could be relatively quick if you used compact IDs, allowing all of the information to stay in cache.
Perhaps like…

Nchildren: u16
P1, P2..., C1, C2
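One possible reading of that layout, sketched in Python with flat integer arrays standing in for contiguous memory (all names here are hypothetical; a real engine would do this in C/C++ – `"H"` is the u16 typecode):

```python
import array

neighbors  = array.array("H")  # all parent/child node IDs, packed back to back
n_parents  = array.array("H")  # per node: how many parents
n_children = array.array("H")  # per node: how many children
offsets    = array.array("I")  # per node: start of its record in `neighbors`

def add_node(parents, children):
    offsets.append(len(neighbors))
    n_parents.append(len(parents))
    n_children.append(len(children))
    neighbors.extend(parents)
    neighbors.extend(children)
    return len(offsets) - 1    # the new node's ID

def parents_of(node):
    start = offsets[node]
    return list(neighbors[start:start + n_parents[node]])

def children_of(node):
    start = offsets[node] + n_parents[node]
    return list(neighbors[start:start + n_children[node]])

# IDs assigned up front in this toy: 0 = LFO, 1 = synthA, 2 = reverb
add_node(parents=[], children=[1])
add_node(parents=[0], children=[2])
add_node(parents=[1], children=[])
print(children_of(0), parents_of(2))  # -> [1] [1]
```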


This would be a different audio engine though.


If interested, I had implemented automatic node ordering and automatic parallelization of node graphs (if using supernova) in my live coding environment, Alga. The code specifically is AlgaBlock, which is handled automatically for you when using Alga. An AlgaBlock is essentially a Group of ordered nodes, with dependencies correctly resolved, feedback included (although there are definitely corner cases I haven’t solved yet :slight_smile:)


I think for non-parallel execution, giving the server a flattened list representing a valid execution order (which the server can understand now) is enough. For parallel execution, a flat list is not enough - Tim introduced parallel groups, which are at least enough to do some common-sense parallelization, though as a data structure I don’t think it’s enough to express all the properties of a dependency graph.

I think incremental updates to the dependencies aren’t super tough - if I’m thinking about it right, you can just traverse the dependencies of any node that changes in one direction. I imagine it might be more challenging to produce a minimal node re-ordering that reflects only the required changes. But it may be that this just isn’t a very important thing to optimize, or whatever naive re-ordering the algorithm produces is generally good enough?
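That one-direction traversal might look something like this (a sketch; only the changed node’s transitive descendants are visited, since nothing upstream of it can be invalidated):

```python
def affected(node, children):
    # Walk only "downstream" from the changed node; the rest of the
    # graph is never touched.
    seen, stack = set(), [node]
    while stack:
        n = stack.pop()
        for c in children.get(n, []):
            if c not in seen:
                seen.add(c)
                stack.append(c)
    return seen

children = {"LFO": ["synthA", "synthB"], "synthA": ["reverb"],
            "synthB": ["reverb"], "reverb": ["main"]}
print(sorted(affected("synthA", children)))  # -> ['main', 'reverb']
```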

Currently, in some prototype code, I’m re-traversing the dependencies from scratch every time there’s a new event or a change. There are some obvious things that could be improved, but I’ll probably just continue to do this until I see real performance problems.

If interested, I had implemented automatic node ordering and automatic parallelization of node graphs (if using supernova) in my live coding environment, Alga

This is very cool, @vitreo12 - I’m definitely interested to take a look at how you’re doing this.

I guess I don’t think of Groups or ParGroups as data structures. They’re just a mechanism to realize a calculation structure (in the same way that buses and groups can realize a node graph, but they are not good on their own at expressing the graph structure – they’re too low level for that).

I’m curious what is a parallel calculation structure that is impossible to realize using ParGroup?


Well, buffer reads and writes are the obvious one, but thinking outside of sc, you could require the user to pass a priority when reading or writing to a buffer (identical priorities can only occur on reads and can happen in parallel). I think that captures everything? Except of course purposeful block delay and feedback, looping back to the top of the graph, which would be a different type of connection and not be part of the sorting. I’m a little rusty on my graph algorithms, but there must be some way to partition a DAG into parallel chunks? … not that I’ve ever run into a situation where scsynth’s performance was a bottleneck.
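There is at least the classic “level scheduling” partition: each node’s level is 1 + the maximum level of its antecedents, nodes on the same level have no dependencies among themselves and can run in parallel, and the levels run in sequence. A sketch, assuming an acyclic graph (node names are my own example):

```python
from collections import defaultdict

antecedents = {
    "A": [], "B": ["A"], "C": ["A"], "D": ["C"], "E": ["B", "C"],
}

def level(node):
    # 1 + the deepest antecedent; source nodes sit at level 1
    return 1 + max((level(a) for a in antecedents[node]), default=0)

chunks = defaultdict(list)
for n in antecedents:
    chunks[level(n)].append(n)
print(dict(chunks))  # -> {1: ['A'], 2: ['B', 'C'], 3: ['D', 'E']}
```

Note that this inherits the same weakness as a flat parallel group: each level only finishes when its slowest node does.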

Not too familiar with supernova, but I seem to recall it solves this with a mutex/spinlock?

@vitreo12 I keep meaning to check out Alga in more detail, it looks great! How does it deal with buffer read and write order?

Thinking of something like:

A → B → E
  ↘   ↗
    C → D

B and C can be calculated in parallel - E has to wait for both, but D can be calculated as soon as C is done. So the non-optimal case is: if B is long-running, you have nodes sitting around waiting that don’t require its output. It’s not obvious to me how this kind of structure can be specified with a flat hierarchy like Groups/ParGroups. But, I haven’t thought about this a LOT, so I might be missing something obvious. There are a few more complex graph structures that can be represented if you can e.g. nest Groups inside of ParGroups, but I just now realized that it’s not really obvious whether you can do this from the documentation.

import Control.Parallel (par, pseq)

-- toy stand-ins so the sketch compiles; the dependency structure is the point
a = 1 :: Int
b = (+ 1); c = (* 2); d = (+ 10); e = (+)

main :: IO ()
main = do
    let b' = b a
        c' = c a
        d' = c' `pseq` d c'                -- D waits only for C
        e' = (b' `par` c') `pseq` e b' c'  -- E waits for both B and C
    print (d', e')

I think Buffers are straightforward as long as you have the notion of a feedback dependency (e.g. a read that doesn’t affect execution order), which is already required for Bus. This obviously gets complex for e.g. a multi-reader, multi-writer delay bus, but this case might be sort of unsolvable anyway, because reading at different rates will ALREADY cause problems even with a “valid” node order. This can mostly just be specified as edges between nodes, e.g. ~writer[\bufferOut] = ~delayBuffer; ~reader[\bufferIn] = ~delayBuffer gives you all the information you need to know about a simple buffer read/write relationship (e.g. ~writer is an antecedent of ~reader). I think the actions are difficult or impossible to encapsulate as dependency relationships, but maybe there’s a way to express a more limited version of the actions somehow? They can point to specific nodes, so probably this is the more rigorous way to do the same things. I don’t know how this affects node ordering - e.g. if you free a node that hasn’t yet been run, does it get freed AFTER it runs, or is it removed right away?

there must be some way to partition a DAG into parallel chunks?

Yes, but if you can’t accurately predict the execution time of your nodes, my intuition (curious if there are proofs of this…) is that you can’t remove information from the graph by e.g. flattening it out without losing your ability to parallelize in at least SOME situations, for SOME graphs. The optimal execution order changes depending on the processing time of nodes - in my example to James, if B is long then D gets processed first - otherwise it could be E, or another dependent of B. Pragmatically, all of the parallel render graph code I’ve seen traverses the actual graph data structure, rather than a reduced version of it (like Node/ParGroup/Group lists) - I haven’t really ever seen any other clever tricks or data structures to solve this. I think SuperCollider’s architecture simply didn’t take parallelism into account, which makes sense for when it was written.

Yes, but I think I mean: this can’t be encapsulated just using the scsynth / supernova server structure, a flat list of Nodes, Groups, and ParGroups.


Alga only deals with input and output edges when making a connection. So, if you were to specify this relationship at the edges of the two nodes (and connect one to the other), you’d get the ordering right.

This is definitely possible, and it’s in fact what I’m leveraging to get the parallelization within the graph in Alga.


I never knew you could use actions to interact with nodes other than the current! Thanks!

I’ve literally never used actions to do anything other than free the current node (Done.freeSelf); I wonder if many people use the full breadth of this feature?


It seems like many people are not aware of this, but ParGroups can be nested arbitrarily deep. In fact, Supernova calculates a dependency graph, which is updated whenever the DSP tree structure changes, i.e. when Nodes are added, moved or removed.

The problem with the original Group structure is that it is just a container, there is no information about the relationship between its children. In Supernova, Group actually means “serial group”, i.e. its children depend on each other and form a dependency chain. ParGroup, on the other hand, means that its children are independent from each other. I guess things would be more clear if Group was renamed to SerGroup :slight_smile:

Now, if you use (Ser)Group and ParGroup consistently, Supernova will do a decent job at parallelizing the full graph. For example, if you have four independent “tracks” (e.g. a synth track, a bass track, a drum track, etc.), all tracks should go into a ParGroup. Each track itself might contain multiple voices, followed by a serial FX chain. You would end up with a structure like this:

P = ParGroup
S = (Ser)Group

                 P(tracks)
              /           |         \
             /            |          \
       S(bass)         S(synth)     S(drums)
     /        \          etc.          etc.
P(voices) ->   S(fx)
            /    |    \
         FX1 -> FX2 -> FX3

As @scztt noted correctly, the parallelization is not perfect. There are certain degenerate cases, for example:

         S
      /       \
 ParGroup     N5
/  |  |  \
N1 N2 N3 N4

If N1 takes a lot more CPU time than N2-N4, the DSP helper threads will busy-wait and just burn CPU cycles; only when N1 has finished can they move on to N5.

Another problem arises when you put a lot of nodes into a ParGroup; Supernova creates a dedicated DSP “task” for each node, and the resulting scheduling overhead can diminish or even outweigh the parallelization gains. This is particularly true if the nodes are very lightweight. If all the nodes are the same, or very similar, you actually want to group them into N tasks, where N = number of cores. For example, if you have 120 lightweight synth nodes in a ParGroup on an 8-core machine, you would want to schedule 8 tasks with 15 nodes each. IIRC, I have somewhere proposed an option for ParGroup that would enable exactly this.
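The proposed batching could be as simple as the following sketch (`batch` is a hypothetical name, not an existing Supernova option):

```python
def batch(nodes, n_tasks):
    # Split K similar nodes into n_tasks near-equal chunks, one per core,
    # instead of scheduling K single-node tasks.
    k, r = divmod(len(nodes), n_tasks)
    out, start = [], 0
    for i in range(n_tasks):
        size = k + (1 if i < r else 0)
        out.append(nodes[start:start + size])
        start += size
    return out

tasks = batch(list(range(120)), 8)
print([len(t) for t in tasks])  # -> eight tasks of 15 nodes each
```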


Hm, but isn’t there one missing node here? You need to mix D and E down to a single output signal, and this summing operation must wait for both D and E.

In SC, summing may be implicit by having multiple units writing to the same bus, but it is summing nonetheless, and it can’t be completed until all operands have completed.

So the optimization of doing D while B is waiting is perhaps less of an optimization than it may have seemed.

(Ok, perhaps D and E are writing to separate outputs, but I guess that’s pretty rare, and not terribly convincing because it’s all ok if it finishes within the hardware block time.)


I mean, in the worst scenario B would hold up both paths, that’s right. But there is the “best” scenario where the last two lines could run in parallel, right?

I was putting it another way. EDIT: one convention with the par operator (just poetic license here) is that the first element is required later, but not immediately, so it is usually the most costly computation. This way, the graph itself encodes how expensive you expect each computation to be.

That’s what ParGroup does; the children can be mixed later, but it does not mean they need to wait for each other.

What I was driving at is that (this occurred to me later) even if D and E write to different channels, there’s a single exit from the DSP chain – one locus that copies the channel data back to the audio driver. With ParGroup, D would have to wait for both B and C (even though it depends on only one of them). With a more robust mechanism, D could start and finish earlier – but its data would still have to wait for E. So I was thinking through: what would be the concrete benefit of performing D as soon as possible? I suppose it could be that a DSP thread with pending tasks might chew up a few more CPU cycles than one whose queue is empty. Or, suppose both B and D are longer-running tasks and the others are fast; then there would be a benefit to starting D earlier. (That’s a convincing case actually.)

At the same time, practically speaking, my current machine is fast enough that, even under heavy use, I’m generally clocking in below 40% CPU without parallel processing.