I knew there’d be a trade off, just wasn’t sure how they compare. Seems like function calling is actually quite slow comparatively.
I did not consider L1! I’ll have a look later, but that would be a more realistic test. Thanks!
I knew there’d be a trade off, just wasn’t sure how they compare. Seems like function calling is actually quite slow comparatively.
I did not consider L1! I’ll have a look later, but that would be a more realistic test. Thanks!
That only happens with the compilation flag -O3, not -O2 (which keeps 1.2 faster).
-O3 optimizes duplicate for us with goodies but no error checking, -O2 not
I don’t think many people compile sc with -O3, for instance.
What I can get:
-O3 uses SIMD instructions ( movdqu movups and similar)
Do function calls remain the same? I think so.
(I think) It also avoids error checking and does not set up before the main loop. -O2 keeps the setup.
it places the comparison cmp %rbx,%rbp
at the end
; Main loop body
41.42% movswq 0xc(%rbx),%rax
0.07% lea 0x20(%rsp,%rax,4),%rdx
movswq 0xa(%rbx),%rax
0.07% lea 0x20(%rsp,%rax,4),%rsi
40.42% movswq 0x8(%rbx),%rax
lea 0x20(%rsp,%rax,4),%rdi
callq *(%rbx)
add $0x10,%rbx
5.36% cmp %rbx,%rbp
jne 407ea0 <duplicate(benchmark::State&)+0x130>
Seems dubious, citation needed perhaps?
Edit.
Pretty sure cmake’s defaults for release mode are -O3 -DNDEBUG
so almost everyone will be -O3
.
I remember warnings against -O3 here and there. I think it’s a thing of the past now
" While historically -O3
could expose compiler bugs, nowadays issues with -O3
are almost always instances of undefined behavior (UB) and the code is at fault and should be fixed."
For context: The stated thread topic is about removing redundancy from SynthDefs. The recent discussion of compiler optimizations is, then, to try to determine whether there are cases in which a SynthDef might perform better with redundant operations?
hjh
Yes. I think your own example in Proposal: SynthDefs could eliminate duplicate UGens - #36 by jamshark70 could be such a case, at least under certain circumstances.
It’s Jordan’s, actually – here, then, the issue is a conflict between two optimizations. If there were no Sum3, then the no-duplicate version would be unambiguously faster.
hjh
Does Sum3 and simd_unit guarantee memory alignment?
First, I’m not sure what SIMD has to do with the topic at hand. Second, you can trust that the people who wrote the SIMD code knew what they were doing. (Yes, wire buffers are always properly aligned for SIMD.)
All ugens would need to implement those optimizations
UGens themselves don’t need to implement anything. These optimizations are done purely in the language by the SynthDef builder. In the case of Sum3 and Sum4, a chain of 2 resp. 3 BinaryOpUGens (+
) is replaced by a single UGen; the reasoning is that doing the 3 resp. 4 additions in a single process function is much more efficient (fewer function calls, better pipelining).
build with LTO
Why would you need to build with LTO? In fact, LTO is practically irrelevant for most plugins because they typically consist of a single source file.
I think James’ method is safer since I’m not sure all elements will always be polished. Of course, that’s a guess.
It’s not an either-or question, these are completely different optimizations. It just happens that there a specific cases where they appear to conflict with each other.
where you say that a chain of binary compiled agents constructs a chain
I did not say that. Again, these optimizations happen in the language during SynthDef construction. Scsynth itself does no graph optimizations at all. Hence all this discussions about SIMD, compiler flags, etc. is moot because it obviously doesn’t apply to sclang.
I would suggest to actually study the relevant parts in the ClassLibrary (in particular optimizeGraph
).
I didn’t say that.
Ok, so it functions as Mix. supercollider/SCClassLibrary/Common/Audio/Mix.sc at develop · supercollider/supercollider · GitHub
I misunderstood what Sum3 was since the word optimization was circulation around, and I may confused about it
(sorry, no time to check the codebase for that now)
word optimization was circulation around
It was not “circulating around”, James proposed a specific optimization with a working implementation. If you had a quick look at the actual code, you would have noticed that it was not about C++ at all.
Don’t jump to quick conclusions! It tends to derail the discussion.
Mix is like bread and butter in sc. I didn’t provide any absurd information per se; they are also ways to do things, even if it’s not in the codebase today. Sorry if I disturbed the discussion. (Although I think there is no need for exaggeration in public messages) – Thus, I should also apologize to the SC community for suboptimal behavior in this discussion, a community I love so much.
But also about defining what is redundant at what level.
The rather flawed benchmark I posted shows evidence that the number of ugen calls should be minimised as the act of calling something out weights any cache optimisation — but that is a rather limited benchmark and doesn’t quite replicate how cache would work.
SynthDef(\s, {
|a, b, c|
var sig = a + b + c;
Out.kr(0, sig)
}).dumpUGens
[0_Control, control, nil]
[1_Sum3, control, [0_Control[2], 0_Control[1], 0_Control[0]]]
[2_Out, control, [0, 1_Sum3]]
Since your optimisation pass happens after the ugens are created it seems fine in the above case and the question I raised doesn’t affect anything.
One interesting case, though not directly related to this proposal is when the user has already abstracted out the common operation. In the two examples below, the (admittedly flawed) benchmark I posted would suggest that the second is about 1/6 faster.
SynthDef(\s, {
|a, b, c, d|
var sig = a + b;
Out.kr(1, sig + d);
Out.kr(0, sig + c)
}).dumpUGens
[0_Control, control, nil]
[1_+, control, [0_Control[0], 0_Control[1]]]
[2_+, control, [1_+, 0_Control[3]]]
[3_Out, control, [1, 2_+]]
[4_+, control, [1_+, 0_Control[2]]]
[5_Out, control, [0, 4_+]]
SynthDef(\s, {
|a, b, c, d|
Out.kr(1, a + b + d);
Out.kr(0, a + b + c)
}).dumpUGens
[0_Control, control, nil]
[1_Sum3, control, [0_Control[3], 0_Control[1], 0_Control[0]]]
[2_Out, control, [1, 1_Sum3]]
[3_Sum3, control, [0_Control[2], 0_Control[1], 0_Control[0]]]
[4_Out, control, [0, 3_Sum3]]
There could be a compiler pass to transform one to the other. In this case it is relatively simple and could be done with a little bit of regex.
I’ve worked directly with graph representation, which has been extensively studied. I would suggest using graph representation for anything, really. Regex for this kind of thing?
I only meant it would be an easy transformation…
Edit: although checking it isn’t used after might be a bit hard and I’m not sure supercollider already has this functionality built in.
Edit 2: nope, this is harder than I though…
Yes, a pattern matching on a graph.
Example of a simplified pattern matching (not actual work code, just for simple illustration; for type-safe pattern matching, one could use type classes to define methods and transformation. A ugen GADT could also help encapsulate types, etc)
optimizeUGen :: UGen -> UGen
optimizeUGen (AddUGen (Const 0) u) = optimizeUGen u
optimizeUGen (AddUGen u (Const 0)) = optimizeUGen u
optimizeUGen (MulUGen (Const 1) u) = optimizeUGen u
optimizeUGen (MulUGen u (Const 1)) = optimizeUGen u
optimizeUGen (MulUGen (Const 0) _) = Const 0
optimizeUGen (MulUGen _ (Const 0)) = Const 0
optimizeUGen (AddUGen (Const a) (Const b)) = Const (a + b)
optimizeUGen (MulUGen (Const a) (Const b)) = Const (a * b)
// simple recursion examples
optimizeUGen (AddUGen u1 u2) =
let ou1 = optimizeUGen u1
ou2 = optimizeUGen u2
in case (ou1, ou2) of
(Const a, Const b) -> Const (a + b)
_ -> AddUGen ou1 ou2
optimizeUGen (MulUGen u1 u2) =
let ou1 = optimizeUGen u1
ou2 = optimizeUGen u2
in case (ou1, ou2) of
(Const a, Const b) -> Const (a * b)
_ -> MulUGen ou1 ou2
optimizeUGen (AddUGen (AddUGen a b) c) = Sum3 (optimizeUGen a) (optimizeUGen b) (optimizeUGen c)
optimizeUGen (AddUGen a (AddUGen b c)) = Sum3 (optimizeUGen a) (optimizeUGen b) (optimizeUGen c)
optimizeUGen ugen = ugen
Feel free to have a go implementing this.
I did already. I posted the results and benchmarks. What precisely are you referring to?
Turning this…
… into this…
Sorry I didn’t realise this had been done before!!! Got a link? Would love to see how it is done and updates all the descendants.