Proposal: SynthDefs could eliminate duplicate UGens

jordan · July 19, 2024, 7:55pm

I knew there’d be a trade off, just wasn’t sure how they compare. Seems like function calling is actually quite slow comparatively.

I did not consider L1! I’ll have a look later, but that would be a more realistic test. Thanks!

smoge · July 20, 2024, 8:58am

That only happens with the compilation flag -O3, not -O2 (which keeps 1.2 faster).

-O3 optimizes duplicate for us with goodies but no error checking, -O2 not

I don’t think many people compile sc with -O3, for instance.

What I can get:

-O3 uses SIMD instructions ( movdqu movups and similar)
Do function calls remain the same? I think so.
(I think) It also avoids error checking and does not set up before the main loop. -O2 keeps the setup.
it places the comparison cmp %rbx,%rbp at the end

; Main loop body
41.42% movswq 0xc(%rbx),%rax
0.07%  lea    0x20(%rsp,%rax,4),%rdx
       movswq 0xa(%rbx),%rax
0.07%  lea    0x20(%rsp,%rax,4),%rsi
40.42% movswq 0x8(%rbx),%rax
       lea    0x20(%rsp,%rax,4),%rdi
       callq  *(%rbx)
       add    $0x10,%rbx
5.36%  cmp    %rbx,%rbp
       jne    407ea0 <duplicate(benchmark::State&)+0x130>

jordan · July 20, 2024, 1:43pm

Seems dubious, citation needed perhaps?

Edit.

Pretty sure cmake’s defaults for release mode are -O3 -DNDEBUG so almost everyone will be -O3.

smoge · July 20, 2024, 2:03pm

I remember warnings against -O3 here and there. I think it’s a thing of the past now

" While historically -O3 could expose compiler bugs, nowadays issues with -O3 are almost always instances of undefined behavior (UB) and the code is at fault and should be fixed."

jamshark70 · July 20, 2024, 11:51pm

For context: The stated thread topic is about removing redundancy from SynthDefs. The recent discussion of compiler optimizations is, then, to try to determine whether there are cases in which a SynthDef might perform better with redundant operations?

hjh

Spacechild1 · July 21, 2024, 12:13am

Yes. I think your own example in Proposal: SynthDefs could eliminate duplicate UGens - #36 by jamshark70 could be such a case, at least under certain circumstances.

jamshark70 · July 21, 2024, 3:02am

It’s Jordan’s, actually – here, then, the issue is a conflict between two optimizations. If there were no Sum3, then the no-duplicate version would be unambiguously faster.

hjh

Spacechild1 · July 21, 2024, 11:27am

Does Sum3 and simd_unit guarantee memory alignment?

First, I’m not sure what SIMD has to do with the topic at hand. Second, you can trust that the people who wrote the SIMD code knew what they were doing. (Yes, wire buffers are always properly aligned for SIMD.)

All ugens would need to implement those optimizations

UGens themselves don’t need to implement anything. These optimizations are done purely in the language by the SynthDef builder. In the case of Sum3 and Sum4, a chain of 2 resp. 3 BinaryOpUGens (+) is replaced by a single UGen; the reasoning is that doing the 3 resp. 4 additions in a single process function is much more efficient (fewer function calls, better pipelining).

build with LTO

Why would you need to build with LTO? In fact, LTO is practically irrelevant for most plugins because they typically consist of a single source file.

I think James’ method is safer since I’m not sure all elements will always be polished. Of course, that’s a guess.

It’s not an either-or question, these are completely different optimizations. It just happens that there a specific cases where they appear to conflict with each other.

Spacechild1 · July 21, 2024, 11:40am

where you say that a chain of binary compiled agents constructs a chain

I did not say that. Again, these optimizations happen in the language during SynthDef construction. Scsynth itself does no graph optimizations at all. Hence all this discussions about SIMD, compiler flags, etc. is moot because it obviously doesn’t apply to sclang.

I would suggest to actually study the relevant parts in the ClassLibrary (in particular optimizeGraph).

smoge · July 21, 2024, 11:50am

I didn’t say that.

Ok, so it functions as Mix. supercollider/SCClassLibrary/Common/Audio/Mix.sc at develop · supercollider/supercollider · GitHub

I misunderstood what Sum3 was since the word optimization was circulation around, and I may confused about it

(sorry, no time to check the codebase for that now)

Spacechild1 · July 21, 2024, 11:58am

word optimization was circulation around

It was not “circulating around”, James proposed a specific optimization with a working implementation. If you had a quick look at the actual code, you would have noticed that it was not about C++ at all.

Don’t jump to quick conclusions! It tends to derail the discussion.

smoge · July 21, 2024, 12:03pm

Mix is like bread and butter in sc. I didn’t provide any absurd information per se; they are also ways to do things, even if it’s not in the codebase today. Sorry if I disturbed the discussion. (Although I think there is no need for exaggeration in public messages) – Thus, I should also apologize to the SC community for suboptimal behavior in this discussion, a community I love so much.

jordan · July 21, 2024, 12:52pm

But also about defining what is redundant at what level.

The rather flawed benchmark I posted shows evidence that the number of ugen calls should be minimised as the act of calling something out weights any cache optimisation — but that is a rather limited benchmark and doesn’t quite replicate how cache would work.

SynthDef(\s, {
	|a, b, c|
	var sig = a + b + c;
	Out.kr(0, sig)
}).dumpUGens

[0_Control, control, nil]
[1_Sum3, control, [0_Control[2], 0_Control[1], 0_Control[0]]]
[2_Out, control, [0, 1_Sum3]]

Since your optimisation pass happens after the ugens are created it seems fine in the above case and the question I raised doesn’t affect anything.

One interesting case, though not directly related to this proposal is when the user has already abstracted out the common operation. In the two examples below, the (admittedly flawed) benchmark I posted would suggest that the second is about 1/6 faster.

SynthDef(\s, {
	|a, b, c, d|
	var sig = a + b;
	Out.kr(1, sig + d);
	Out.kr(0, sig + c)
}).dumpUGens

[0_Control, control, nil]
[1_+, control, [0_Control[0], 0_Control[1]]]
[2_+, control, [1_+, 0_Control[3]]]
[3_Out, control, [1, 2_+]]
[4_+, control, [1_+, 0_Control[2]]]
[5_Out, control, [0, 4_+]]

SynthDef(\s, {
	|a, b, c, d|
	Out.kr(1, a + b + d);
	Out.kr(0, a + b + c)
}).dumpUGens

[0_Control, control, nil]
[1_Sum3, control, [0_Control[3], 0_Control[1], 0_Control[0]]]
[2_Out, control, [1, 1_Sum3]]
[3_Sum3, control, [0_Control[2], 0_Control[1], 0_Control[0]]]
[4_Out, control, [0, 3_Sum3]]

There could be a compiler pass to transform one to the other. In this case it is relatively simple and could be done with a little bit of regex.

smoge · July 21, 2024, 12:55pm

I’ve worked directly with graph representation, which has been extensively studied. I would suggest using graph representation for anything, really. Regex for this kind of thing?

jordan · July 21, 2024, 1:02pm

I only meant it would be an easy transformation…
Edit: although checking it isn’t used after might be a bit hard and I’m not sure supercollider already has this functionality built in.
Edit 2: nope, this is harder than I though…

smoge · July 21, 2024, 1:04pm

Yes, a pattern matching on a graph.

smoge · July 21, 2024, 1:41pm

Example of a simplified pattern matching (not actual work code, just for simple illustration; for type-safe pattern matching, one could use type classes to define methods and transformation. A ugen GADT could also help encapsulate types, etc)

optimizeUGen :: UGen -> UGen
optimizeUGen (AddUGen (Const 0) u) = optimizeUGen u
optimizeUGen (AddUGen u (Const 0)) = optimizeUGen u
optimizeUGen (MulUGen (Const 1) u) = optimizeUGen u
optimizeUGen (MulUGen u (Const 1)) = optimizeUGen u
optimizeUGen (MulUGen (Const 0) _) = Const 0
optimizeUGen (MulUGen _ (Const 0)) = Const 0
optimizeUGen (AddUGen (Const a) (Const b)) = Const (a + b)
optimizeUGen (MulUGen (Const a) (Const b)) = Const (a * b)

// simple recursion examples
optimizeUGen (AddUGen u1 u2) = 
  let ou1 = optimizeUGen u1
      ou2 = optimizeUGen u2
  in case (ou1, ou2) of
       (Const a, Const b) -> Const (a + b)
       _ -> AddUGen ou1 ou2

optimizeUGen (MulUGen u1 u2) = 
  let ou1 = optimizeUGen u1
      ou2 = optimizeUGen u2
  in case (ou1, ou2) of
       (Const a, Const b) -> Const (a * b)
       _ -> MulUGen ou1 ou2

optimizeUGen (AddUGen (AddUGen a b) c) = Sum3 (optimizeUGen a) (optimizeUGen b) (optimizeUGen c)
optimizeUGen (AddUGen a (AddUGen b c)) = Sum3 (optimizeUGen a) (optimizeUGen b) (optimizeUGen c)

optimizeUGen ugen = ugen

jordan · July 21, 2024, 1:44pm

Feel free to have a go implementing this.

smoge · July 21, 2024, 1:44pm

I did already. I posted the results and benchmarks. What precisely are you referring to?

jordan · July 21, 2024, 1:50pm

Turning this…

jordan:

[0_Control, control, nil]
[1_+, control, [0_Control[0], 0_Control[1]]]
[2_+, control, [1_+, 0_Control[3]]]
[3_Out, control, [1, 2_+]]
[4_+, control, [1_+, 0_Control[2]]]
[5_Out, control, [0, 4_+]]

… into this…

Sorry I didn’t realise this had been done before!!! Got a link? Would love to see how it is done and updates all the descendants.