Proposal: Refactor SynthDef UGen optimizations

For myself, I’m opposed to putting an additional burden on the user.

The user already specifies the order UGens are created. That should be sufficient. My understanding of the spirit of the topo sort was to optimise that order where possible. Making the topo sort implementation simpler at the expense of user burden is putting the cart before the horse, in my view.

Yes, what Jordan proposes has some complexity, but that ship sailed a long time ago, and I think it will be simpler than the existing approach. One could argue that WidthFirstUGen was also ‘forever’. I agree we should not make changes of this sort lightly or often, but as long as we provide a reasonable path to deprecation and update (nudge, don’t break) I think it’s okay.

The optimisations have never worked entirely correctly. If @jordan’s approach can fix it, then respectfully I think it’s worth doing and fulfilling the original intent.

Listen all, there are now over 80 messages in this thread, and this is one of many similar discussions over the years, including one fairly recently.

I think what might be most productive now would be for @jordan to come up with a draft implementation. That could focus the discussion on something tangible, and allow meaningful comparisons between alternatives.

Really glad this long-standing issue is getting such thoughtful attention!


Almost done with the weak antecedents work. Thanks to everyone’s input I had a really clear idea of what I was doing, and it has only taken about 5 hours or so. Thank you!

Haven’t done the sorting of the graph yet, just making sure all the edges are there.

I don’t think it’s very complicated, to be honest.

I’ll make a draft PR soon.


It only does the weak edges right now.


The topological sort now works, and it also implements dead code elimination by default. I wanted to share an example of how it was broken before.

Here is the synthdef.

SynthDef(\a, {
	var ins = 3.collect {|i| 
		if (i == 2){ 
			SendReply.kr(Impulse.kr, '/a', DC.kr(0.1));
			PlayBuf.ar(1, 0);
		} { In.ar(i) }
	};
	ins = ins * 2;
	ins = ins + 1;
	
	ins[1] = BufWr.ar(ins[1], 0, ins[1]);
	ins[0] = SendReply.kr(Dust.kr, '/foo', ins[0]);
	
	Out.ar(0, ins.sum)
}).dumpUGens;

The new result (I haven’t enabled optimisations, so there are no MulAdds yet). As you can see, the read clearly happens before the write.

[0_In, audio, [1]]
[1_*, audio, [0_In[0], 2]]
[2_+, audio, [1_*, 1]]
[3_PlayBuf, audio, [0, 1.0, 1.0, 0.0, 0.0, 0]]  <<< here 
[4_BufWr, audio, [0, 2_+, 1.0, 2_+]] <<< and here
[5_*, audio, [3_PlayBuf[0], 2]]
[6_+, audio, [5_*, 1]]
[7_+, audio, [4_BufWr, 6_+]]
[8_In, audio, [0]]
[9_Out, audio, [0, 7_+]]
[10_*, audio, [8_In[0], 2]]
[11_+, audio, [10_*, 1]]
[12_Dust, control, [0.0]]
[13_DC, control, [0.1]]
[14_Impulse, control, [440.0, 0.0]]
[15_SendReply, control, [14_Impulse, -1, 2, 47, 97, 13_DC[0]]]
[16_SendReply, control, [12_Dust, -1, 4, 47, 102, 111, 111, 11_+]]

The old way, where they happen in the wrong order.

[0_In, audio, [0]]
[1_MulAdd, audio, [0_In[0], 2, 1]]
[2_In, audio, [1]]
[3_MulAdd, audio, [2_In[0], 2, 1]]
[4_BufWr, audio, [0, 3_MulAdd, 1.0, 3_MulAdd]]  <<< here
[5_Impulse, control, [440.0, 0.0]]
[6_DC, control, [0.1]]
[7_SendReply, control, [5_Impulse, -1, 2, 47, 97, 6_DC[0]]]
[8_PlayBuf, audio, [0, 1.0, 1.0, 0.0, 0.0, 0]]  <<< and here
[9_MulAdd, audio, [8_PlayBuf[0], 2, 1]]
[10_+, audio, [4_BufWr, 9_MulAdd]]
[11_Out, audio, [0, 10_+]]
[12_Dust, control, [0.0]]
[13_SendReply, control, [12_Dust, -1, 4, 47, 102, 111, 111, 1_MulAdd]]
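The ordering fix above can be pictured as an ordinary topological sort that is fed extra edges. Here is a minimal sketch in Python (not the actual sclang implementation; the node names and edge lists are invented for illustration) of a Kahn-style sort in which resource-order (“weak”) edges sit alongside data-dependency edges:

```python
from collections import defaultdict, deque

def topo_sort(nodes, data_edges, weak_edges):
    # Kahn's algorithm. Data edges are real signal dependencies; weak
    # edges only record resource ordering (e.g. a buffer read created
    # before a buffer write). Both are hard constraints for the sort.
    succs = defaultdict(set)
    indegree = {n: 0 for n in nodes}
    for a, b in list(data_edges) + list(weak_edges):  # a must precede b
        if b not in succs[a]:
            succs[a].add(b)
            indegree[b] += 1
    ready = deque(n for n in nodes if indegree[n] == 0)
    order = []
    while ready:
        n = ready.popleft()
        order.append(n)
        for m in succs[n]:
            indegree[m] -= 1
            if indegree[m] == 0:
                ready.append(m)
    if len(order) != len(nodes):
        raise ValueError("cycle in UGen graph")
    return order

# PlayBuf reads the buffer and BufWr writes it: the weak edge keeps the
# user-specified read-before-write order despite no data dependency.
nodes = ["In", "PlayBuf", "BufWr", "Out"]
data = [("In", "BufWr"), ("BufWr", "Out"), ("PlayBuf", "Out")]
weak = [("PlayBuf", "BufWr")]
order = topo_sort(nodes, data, weak)
assert order.index("PlayBuf") < order.index("BufWr")
```

The weak edge PlayBuf before BufWr carries no signal; it only records that the user created the read before the write, so the sort may not swap them.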

It looks good and correct. The (old) algorithm treated the graph as simpler than it was. There is no “overengineering” here; it just reads the types now.

It would be good to test with many, many cases.

Just an update, I’ve now enabled basic optimisations and the graph looks like…

[0_In, audio, [1]]
[1_Sum3, audio, [0_In[0], 0_In[0], 1]]
[2_PlayBuf, audio, [0, 1.0, 1.0, 0.0, 0.0, 0]]   <<<<<< Read
[3_BufWr, audio, [0, 1_Sum3, 1.0, 1_Sum3]]   <<<< Write
[4_Sum4, audio, [3_BufWr, 2_PlayBuf[0], 2_PlayBuf[0], 1]]  <<<<< PlayBuf + PlayBuf
[5_In, audio, [0]]
[6_Out, audio, [0, 4_Sum4]]
[7_Sum3, audio, [5_In[0], 5_In[0], 1]]
[8_Dust, control, [0.0]]
[9_Impulse, control, [440.0, 0.0]]
[10_DC, control, [0.1]]
[11_SendReply, control, [9_Impulse, -1, 2, 47, 97, 10_DC[0]]]
[12_SendReply, control, [8_Dust, -1, 4, 47, 102, 111, 111, 7_Sum3]]

It’s shorter than before because I’ve enabled rewriting a * 2 into a + a, which with Sum3/4 can reduce the number of UGen calls. See GitHub for more information on the details.
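As a rough illustration of the kind of rewrite involved, here is a small Python sketch (the tuple representation and the Sum3/Sum4 folding cutoff are simplifications, not what the PR literally does):

```python
def rewrite(node):
    # Bottom-up peephole pass: ('*', a, 2) becomes ('+', a, a), and
    # nested '+' chains of 3 or 4 terms fold into Sum3/Sum4.
    op, *args = node
    args = [rewrite(a) if isinstance(a, tuple) else a for a in args]
    if op == '*' and len(args) == 2 and args[1] == 2:
        op, args = '+', [args[0], args[0]]
    if op == '+':
        flat = []
        for a in args:
            if isinstance(a, tuple) and a[0] in ('+', 'Sum3', 'Sum4'):
                flat.extend(a[1:])
            else:
                flat.append(a)
        if len(flat) == 3:
            return ('Sum3', *flat)
        if len(flat) == 4:
            return ('Sum4', *flat)
        # longer chains would need splitting; omitted to keep the sketch small
    return (op, *args)

# (a * 2) + 1 collapses to a single Sum3(a, a, 1) call.
assert rewrite(('+', ('*', 'a', 2), 1)) == ('Sum3', 'a', 'a', 1)
```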

Tests will be coming shortly, although I am unsure how best to go about this, as the graphs will be different.

A null test between old and new might be a good approach, but I am unsure how to scale this.

Another test case (harder one – the current topo sort handles this case quite badly) – Troubleshooting: numWireBufs & NamedControls problem - #5 by jamshark70

hjh

I’ll have a look, but I should say that limiting the number of wire buffers isn’t the goal, because @Spacechild1 wasn’t able to observe a performance hit.

I don’t really have time to count the wirebuffs, but I think it is better?

Ex1

(
d = SynthDef(\test, {
	var n = 3;
	var ctls = Array.fill(n, { |i|
		[
			("freq" ++ i).asSymbol.kr(110 * (i+1)),
			("amp" ++ i).asSymbol.kr(1 / (i+1))
		]
	}).flop;  // 'flop' puts all freqs in one subarray, amps in the other
	var sig = SinOsc.ar(ctls[0]) * ctls[1];
	Out.ar(0, sig.sum)
}).dumpUGens;
)

OLD

[ 0_Control, control, nil ]
[ 1_SinOsc, audio, [ 0_Control[0], 0.0 ] ]
[ 2_Control, control, nil ]
[ 3_Control, control, nil ]
[ 4_SinOsc, audio, [ 3_Control[0], 0.0 ] ]
[ 5_Control, control, nil ]
[ 6_*, audio, [ 4_SinOsc, 5_Control[0] ] ]
[ 7_MulAdd, audio, [ 1_SinOsc, 2_Control[0], 6_* ] ]
[ 8_Control, control, nil ]
[ 9_SinOsc, audio, [ 8_Control[0], 0.0 ] ]
[ 10_Control, control, nil ]
[ 11_MulAdd, audio, [ 9_SinOsc, 10_Control[0], 7_MulAdd ] ]
[ 12_Out, audio, [ 0, 11_MulAdd ] ]

NEW

[0_Control, control, [110]]
[1_SinOsc, audio, [0_Control[0], 0.0]]
[2_Control, control, [1.0]]
[3_*, audio, [1_SinOsc, 2_Control[0]]]
[4_Control, control, [220]]
[5_SinOsc, audio, [4_Control[0], 0.0]]
[6_Control, control, [0.5]]
[7_MulAdd, audio, [5_SinOsc, 6_Control[0], 3_*]]
[8_Control, control, [330]]
[9_SinOsc, audio, [8_Control[0], 0.0]]
[10_Control, control, [0.33333333333333]]
[11_MulAdd, audio, [9_SinOsc, 10_Control[0], 7_MulAdd]]
[12_Out, audio, [0, 11_MulAdd]]

Ex2

(
d = SynthDef(\test, {
	var n = 3;
	var freqs = Array.fill(n, { |i|
		("freq" ++ i).asSymbol.kr(110 * (i+1))
	});
	var amps = Array.fill(n, { |i|
		("amp" ++ i).asSymbol.kr(1 / (i+1))
	});
	var sines = SinOsc.ar(freqs);
	var sig = sines * amps;
	Out.ar(0, sig.sum)
}).dumpUGens;
)

OLD

[ 0_Control, control, nil ]
[ 1_SinOsc, audio, [ 0_Control[0], 0.0 ] ]
[ 2_Control, control, nil ]
[ 3_SinOsc, audio, [ 2_Control[0], 0.0 ] ]
[ 4_Control, control, nil ]
[ 5_SinOsc, audio, [ 4_Control[0], 0.0 ] ]
[ 6_Control, control, nil ]
[ 7_Control, control, nil ]
[ 8_*, audio, [ 3_SinOsc, 7_Control[0] ] ]
[ 9_MulAdd, audio, [ 1_SinOsc, 6_Control[0], 8_* ] ]
[ 10_Control, control, nil ]
[ 11_MulAdd, audio, [ 5_SinOsc, 10_Control[0], 9_MulAdd ] ]
[ 12_Out, audio, [ 0, 11_MulAdd ] ]

NEW

[0_Control, control, [110]]
[1_SinOsc, audio, [0_Control[0], 0.0]]
[2_Control, control, [1.0]]
[3_*, audio, [1_SinOsc, 2_Control[0]]]
[4_Control, control, [220]]
[5_SinOsc, audio, [4_Control[0], 0.0]]
[6_Control, control, [0.5]]
[7_MulAdd, audio, [5_SinOsc, 6_Control[0], 3_*]]
[8_Control, control, [330]]
[9_SinOsc, audio, [8_Control[0], 0.0]]
[10_Control, control, [0.33333333333333]]
[11_MulAdd, audio, [9_SinOsc, 10_Control[0], 7_MulAdd]]
[12_Out, audio, [0, 11_MulAdd]]

FYI, the PR is now functional, but the tests and docs need updating, if anyone wants to try it.

Great work!

I don’t really have time to count the wirebuffs, but I think it is better?

Note that for testing purposes you could just print the number of wire buffers inside scsynth so we don’t have to count :slight_smile:

As I said, I only tested with scsynth, and things might be different with Supernova! (In Supernova every Synth needs its own array of wire buffers; with many Synths this can add up quickly.) Also, I only did some quick synthetic benchmarks; these things also need to be tested with real-life workloads.

Ah, I forgot about Supernova. I’m not too familiar with scsynth; is it possible to get the number of wire buffers back into sclang? It would be good to make some tests, at least making sure things scale in a particular way.

GraphDef_Read (in SC_GraphDef.cpp) is the function that actually creates the SynthDef. At the end of the function you can just post the value of mNumWireBufs to the console.


Today I’ve made tests and common subexpression elimination.

The old tests were checking that the optimisation was applied. Now I fill a buffer with an un-optimised SynthDef and an optimised one, ensuring they fulfil a null test (below -240 dB).
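The null-test idea reduces to comparing two rendered signals sample by sample and checking the peak difference in dB. A minimal sketch in Python (not the actual test code in the PR), assuming the two renders are plain arrays of samples:

```python
import math

def null_test_db(a, b, eps=1e-12):
    # Peak absolute difference between two rendered signals, in dB;
    # at or below -240 dB the two graphs are considered to null.
    diff = max(abs(x - y) for x, y in zip(a, b))
    return 20 * math.log10(max(diff, eps))

sig = [math.sin(i / 10) for i in range(64)]
assert null_test_db(sig, sig) <= -239.9                    # identical: nulls
assert null_test_db(sig, [x + 1e-3 for x in sig]) > -240   # differs: fails
```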

Here are some benchmarks from the optimisations. @jamshark70 I’ve taken these from many of the threads on here you’ve been involved in over the years, and I’ve also borrowed heavily from your CSE implementation, so thank you :heart: !

The method SynthDef.newWithoutOptimisations turns off sorting, replacement, and CSE; it literally runs the UGens in the order the user supplied.

Test 1, very wide.

(
f = { |n = 100|
	bench {
		SynthDef.newWithoutOptimisations(\additive, {
			var oscs = Array.fill(n, { SinOsc.ar(ExpRand(200, 300)) });
			Out.ar(0, (oscs.sum * 0.001).dup)
		}).children.size.debug("Num UGens");
	};
	
	bench {
		SynthDef(\additive, {
			var oscs = Array.fill(n, { SinOsc.ar(ExpRand(200, 300)) });
			Out.ar(0, (oscs.sum * 0.001).dup)
		}).children.size.debug("Num UGens");
	};
};
)
f.(2000);
Num UGens: 6001
time to run: 0.049266753999994 seconds.
Num UGens: 4669
time to run: 0.634530913 seconds.

f.(200);
Num UGens: 601
time to run: 0.010466048999994 seconds.
Num UGens: 469
time to run: 0.022342442999999 seconds.

f.(20);
Num UGens: 61
time to run: 0.001505336000001 seconds.
Num UGens: 49
time to run: 0.0039255450000013 seconds.

Test 2, very deep.


(
f = { 
	bench {
		SynthDef.newWithoutOptimisations(\additive, {
			// tw 0011 (f0)
			var s = { |o, i| SinOsc.ar([i, i + 0.0001] ** 2 * f.value(o, i - 1), f.value(o, i - 1) * 0.0001) * f.value(o, i - 1)};
			var f = { |o, i| if(i > 0, { s.value(o, i) }, o)};
			Out.ar(0, (f.value(60, 6) / 60).sum);
		}).children.size.debug("Num UGens");
	};
	
	bench {
		SynthDef(\additive, {
			// tw 0011 (f0)
			var s = { |o, i| SinOsc.ar([i, i + 0.0001] ** 2 * f.value(o, i - 1), f.value(o, i - 1) * 0.0001) * f.value(o, i - 1)};
			var f = { |o, i| if(i > 0, { s.value(o, i) }, o)};
			Out.ar(0, (f.value(60, 6) / 60).sum);
		}).children.size.debug("Num UGens");
	};
};
)

Results. I think this is quite impressive!

f.();
Num UGens: 1944
time to run: 0.025203086000005 seconds.
Num UGens: 48   <<< wow!
time to run: 0.30789417 seconds.
-> 0.30789417

Discovered yet another bug. This one is quite important. #6411

TLDR:
RecordBuf will silently accept a scalar as input, and then records garbage rather than the expected signal.


b = Buffer.alloc(s, 0.01 * s.sampleRate);
{ RecordBuf.ar(0, b, loop: 0, doneAction: 2) }.play;
b.getToFloatArray(action: {|ar| ar.postln });

// Should be filled with zeros
FloatArray[0.0, 0.0, 0.0, 0.0, 0.0, 3.0782323365823e-41, -6.9850717199076e+35, 4.580564420185e-41, 1.0, 2.3510242946256e-38, 0.0, 0.0, 0.0, 0.0, -6.9850970729196e+35, 4.580564420185e-41, 2.0, 2.8025969286496e-44, 0.0, 0.0, 0.0, 3.0782323365823e-41, -6.9851224259316e+35, 4.580564420185e-41, -99.0, 9.4039548065783e-38, -6.9853981399372e+35, 4.580564420185e-41, 0.0, 4.580564420185e-41, -6.9851477789436e+35, 4.580564420185e-41, 0.0, 0.0, -6.9854932137322e+35, 4.580564420185e-41, 2.8025969286496e-45, 9.403954806...etc...

This is important for optimisations such as a * 0 => 0.

b = Buffer.alloc(s, 0.01 * s.sampleRate);

{ RecordBuf.ar(SinOsc.ar * 0, b, loop: 0, doneAction: 2) }.play;

b.getToFloatArray(action: {|ar| ar.postln });

// Should be filled with zeros
FloatArray[0.0, 0.0, 0.0, 0.0, 0.0, 3.0782323365823e-41, -6.9850717199076e+35, 4.580564420185e-41, 1.0, 2.3510242946256e-38, 0.0, 0.0, 0.0, 0.0, -6.9850970729196e+35, 4.580564420185e-41, 2.0, 2.8025969286496e-44, 0.0, 0.0, 0.0, 3.0782323365823e-41, -6.9851224259316e+35, 4.580564420185e-41, -99.0, 9.4039548065783e-38, -6.9853981399372e+35, 4.580564420185e-41, 0.0, 4.580564420185e-41, -6.9851477789436e+35, 4.580564420185e-41, 0.0, 0.0, -6.9854932137322e+35, 4.580564420185e-41, 2.8025969286496e-45, 9.403954806...etc...
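One way to keep a rewrite like a * 0 => 0 safe is a post-optimisation pass in which a consumer UGen patches up inputs it can no longer accept as bare scalars. A Python sketch of that shape, where the DC wrapper and the needs_signal predicate are hypothetical stand-ins, not the PR’s actual API:

```python
def fix_scalar_inputs(ugen, inputs, needs_signal):
    # After constant folding (e.g. SinOsc.ar * 0 -> 0), wrap any bare
    # scalar feeding an input that must be a signal. 'DC' here is a
    # hypothetical stand-in for a constant-signal UGen.
    return [('DC', x) if needs_signal(ugen, i) and isinstance(x, (int, float))
            else x
            for i, x in enumerate(inputs)]

# RecordBuf's first input must be a signal; without a pass like this the
# fold would hand it the raw scalar 0 and it would record garbage.
needs = lambda ugen, i: ugen == 'RecordBuf' and i == 0
assert fix_scalar_inputs('RecordBuf', [0, 'buf'], needs) == [('DC', 0), 'buf']
```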

Edit: I’ve added a method to let UGens rewrite their inputs post optimisation.

I’ve marked the PR as no longer a draft and I think it is now in a state that people can start using it and having a look at the approach taken.

… It’s a big PR (+3,400 −1,790), but quite a lot of that is tests, and I think my approach here is much better than what was there before.

I did some benchmarks between un-optimised and optimised graphs; TLDR: anything between 2x and 37x was measured.

I need to write some docs for UGen authors but it basically boils down to adding these methods if you want the UGen to be optimised correctly. If you don’t add them, only performance suffers, not behaviour.

For a ‘pure’ UGen (the class PureUGen is no longer recommended, and I’ve removed all uses of it from the class library):

resourceManagers { ^[] }
hasObservableEffect { ^false }
canBeReplacedByIdenticalCall { ^true }

For BufRd

resourceManagers { ^[UGenBufferResourceManager] }
bufferAccessType { ^\read }
hasObservableEffect { ^false }
canBeReplacedByIdenticalCall { ^true }

BufWr

resourceManagers { ^[UGenBufferResourceManager] }
bufferAccessType { ^\write }
hasObservableEffect { ^true }
canBeReplacedByIdenticalCall { ^true }  // subsequent identical writes do nothing

Out

resourceManagers { ^[UGenBusResourceManager] }
busAccessType { ^\write }
hasObservableEffect { ^true }
canBeReplacedByIdenticalCall { ^false }  // Out adds to the existing buffer

ReplaceOut

resourceManagers { ^[UGenBusResourceManager] }
busAccessType { ^\overwrite }
hasObservableEffect { ^true }
canBeReplacedByIdenticalCall { ^true }  // replaces buffer

Just coming back to this because I forgot to post the results.
The printing code:

GraphDef* GraphDef_Read(World* inWorld, char*& buffer, GraphDef* inList, int32 inVersion) {
    ...
    scprintf("%s: numWires %d numWireBuffers %d\n", graphDef->mNodeDef.mName, graphDef->mNumWires, graphDef->mNumWireBufs);
}

Example

(
d = SynthDef(\test, {
	var n = 50;
	var freqs = Array.fill(n, { |i|
		("freq" ++ i).asSymbol.kr(110 * (i+1))
	});
	var amps = Array.fill(n, { |i|
		("amp" ++ i).asSymbol.kr(1 / (i+1))
	});
	var sines = SinOsc.ar(freqs);
	var sig = sines * amps;
	Out.ar(0, sig.sum)
});

d.dumpUGens;

d.add;
)

Results

// this PR n == 50
test: numWires 301 numWireBuffers 2

// current Dev n == 50
test: numWires 201 numWireBuffers 50

As you can see, this produces the optimum number of wire buffers.


Cool. I wonder if it’s any different with arrayed controls – it shouldn’t be…? Just with arrayed controls, all the channels’ descendants become “available” all at once, rather than one by one.

	var n = 50;
	var freqs = \freqs.kr(Array.fill(n, { |i| 110 * (i+1) }));
	var amps = \amps.kr( Array.fill(n, { |i| 1 / (i+1) }));

In current dev, this variant produces a wide graph too.

hjh

I’ll check, but it shouldn’t because the antecedent is an output proxy. I had a whole saga dealing with those…

I can’t upload an image directly onto GitHub so I’m just dumping this here, but this is a flame graph of the very large SynthDef that reduces 2k UGens down to 48 with CSE, taking 0.8 seconds. Dictionary appears to be the main cause of the slowdown, as the keys are [rate, inputArray, weakAntecedentsArray], and nested arrays make the equality check slow. If any smart people can think of a better way of hashing this information, let me know :slight_smile:
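One common remedy (sketched here in Python; sclang’s Dictionary and hashing work differently, so this is only the shape of the idea) is to compute the hash of the nested key once, up front, so lookups and rehash-on-grow never deep-walk the arrays again:

```python
class CSEKey:
    # Precompute the hash of the nested [rate, inputs, weakAntecedents]
    # key at construction; dictionary lookups and rehashing then reuse
    # the cached value instead of re-walking the nested structure.
    __slots__ = ('rate', 'inputs', 'weak', '_hash')

    def __init__(self, rate, inputs, weak):
        self.rate = rate
        self.inputs = tuple(inputs)
        self.weak = tuple(weak)
        self._hash = hash((rate, self.inputs, self.weak))

    def __hash__(self):
        return self._hash

    def __eq__(self, other):
        # Cheap hash comparison first; full comparison only on a match.
        return (self._hash == other._hash and self.rate == other.rate
                and self.inputs == other.inputs and self.weak == other.weak)

seen = {}
k1 = CSEKey('audio', ['SinOsc', 440], [])
k2 = CSEKey('audio', ['SinOsc', 440], [])
seen[k1] = 'node1'
assert seen[k2] == 'node1'   # identical calls find the same entry
```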

Also, Dictionary.grow is a little under O(n^2) (a different flame graph). If anyone has any ideas on that, it would also make a big difference!