Further SynthDef and UGen optimization (even JIT compilation?)

A related question for the senior devs: now in 2024, not in 2000, if UGens were compiled to LLVM IR or bitcode, would there be ways to automatically optimize the SynthDef graph and compile it with better optimization than we achieve now?

Faust has the framework to go through this process, but I don’t have numbers to compare.

EDIT: The logic is that JIT compilation at runtime can take into account both current conditions and the specific DSP graph of the SynthDef as a single unit. Theoretically, it should be the way to go, unless the technology is NOT reliable, as one is sometimes led to think.

@scztt maybe you know this stuff?

Oh, sorry. But at a certain point in the past, UnaryOpUGen and BinaryOpUGen were special-cased, right? I forget.

We can expect that BASICALLY no inner-loop UGen code will get faster in a “compile the whole graph” approach: most optimizations are already done when the UGen itself is compiled, so there won’t be much extra to squeeze out there.

In terms of optimization at the graph level, i.e. between UGens: we can basically measure this now. If you look at a complex audio graph, running at a high CPU load, in a profiler, the time that isn’t spent inside UGen code is what could potentially be optimized by full-graph compilation. So the (totally impossible) best-case scenario optimizes away all of this time, and we can more reasonably expect that only a small, incremental fraction of this cost can be optimized away.

From what I can remember, even when I’m running heavy 90% CPU graphs with thousands of UGens, this graph-level interconnect overhead tends to take ~10% of the time, at most. So, if we imagine we make something that takes 10% of the time 30% faster, we’ve really just shaved 3% off our overall CPU budget (0.10 × 0.30 = 0.03). That would make this a pretty complicated, tricky rewrite of the server (albeit a very fun and cool one) for a pretty small actual benefit.
(If anyone wants to look into this on their own, they might get different results than me; I’m just recalling the last few times I’ve looked at this for my own projects.)

I’d love to be proven wrong on this (because I like these programming projects), but I don’t believe there are any order-of-magnitude performance improvements to be found in the server code by big rewrites or deep compilation tricks. There are probably incremental improvements, but if we’re in the space of incremental performance improvements, we might as well find the easiest ones to do rather than making a huge project out of it. IMO if we have compiler expertise floating around the project, it should go toward the language and not the server?

2 Likes

I imagined the UGens would be precompiled to LLVM IR, no lower level than that!

The C++ LLVM library has an IRBuilder for the first “glue” phase and the overall flow. Afterwards, the various LLVM optimizations would still operate on this graph.

For example, once the first phase has created the loop, LLVM could still apply loop unrolling, vectorization, and instruction simplification.

That’s the theory.
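To make that concrete, here is a minimal, hypothetical sketch of the “glue” phase (the function name graph_process and the trivial per-sample body are made up; the multiply merely stands in for the inlined UGen calculations). IRBuilder emits a single loop over the block, and later LLVM passes can then unroll and vectorize it:

#include <llvm/IR/Constants.h>
#include <llvm/IR/IRBuilder.h>
#include <llvm/IR/LLVMContext.h>
#include <llvm/IR/Module.h>
#include <llvm/IR/Verifier.h>

using namespace llvm;

// Build: void graph_process(float* in, float* out, int n)
// { for (i = 0; i < n; ++i) out[i] = in[i] * 0.5f; }
Function* buildGraphProcess(Module& m, LLVMContext& ctx) {
    IRBuilder<> b(ctx);
    Type* f32 = Type::getFloatTy(ctx);
    Type* i32 = Type::getInt32Ty(ctx);
    auto* fnTy = FunctionType::get(Type::getVoidTy(ctx),
        { PointerType::getUnqual(f32), PointerType::getUnqual(f32), i32 }, false);
    auto* fn = Function::Create(fnTy, Function::ExternalLinkage, "graph_process", &m);
    Value* in  = fn->getArg(0);
    Value* out = fn->getArg(1);
    Value* n   = fn->getArg(2);

    auto* entry = BasicBlock::Create(ctx, "entry", fn);
    auto* loop  = BasicBlock::Create(ctx, "loop", fn);
    auto* done  = BasicBlock::Create(ctx, "done", fn);

    b.SetInsertPoint(entry);
    b.CreateCondBr(b.CreateICmpSGT(n, b.getInt32(0)), loop, done);

    b.SetInsertPoint(loop);
    PHINode* i = b.CreatePHI(i32, 2, "i");
    i->addIncoming(b.getInt32(0), entry);
    // Per-sample body: in a real system this is where the whole
    // SynthDef graph would be inlined, UGen by UGen.
    Value* x = b.CreateLoad(f32, b.CreateGEP(f32, in, i));
    Value* y = b.CreateFMul(x, ConstantFP::get(f32, 0.5));
    b.CreateStore(y, b.CreateGEP(f32, out, i));
    Value* next = b.CreateAdd(i, b.getInt32(1));
    i->addIncoming(next, loop);
    b.CreateCondBr(b.CreateICmpSLT(next, n), loop, done);

    b.SetInsertPoint(done);
    b.CreateRetVoid();
    verifyFunction(*fn);
    return fn;
}

The optimization passes listed in the EDIT below would then run over exactly this kind of function. Whether that buys anything over the per-UGen compiled loops is the open question of this thread.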

(I think you assumed I was talking about BinaryOpUGens as they are now, right? I’m just imagining possibilities.)

EDIT: From some docs, starting from the Faust manual, I see:

#include <llvm/IR/LegacyPassManager.h>: Manages optimization passes.
#include <llvm/Transforms/Scalar.h>: Provides scalar transformation passes.
#include <llvm/Transforms/Scalar/GVN.h>: Provides global value numbering optimization.

LLVM execution libraries:
#include <llvm/ExecutionEngine/ExecutionEngine.h>: Abstracts the execution of LLVM code.
#include <llvm/ExecutionEngine/MCJIT.h>: Provides JIT compilation using MCJIT.

// Hypothetical context: `module` is the llvm::Module*, `fn` the function being optimized.
llvm::legacy::FunctionPassManager fpm(module);
fpm.add(llvm::createPromoteMemoryToRegisterPass()); // mem2reg
fpm.add(llvm::createInstructionCombiningPass());    // instcombine
fpm.add(llvm::createReassociatePass());
fpm.add(llvm::createGVNPass());
fpm.add(llvm::createCFGSimplificationPass());
fpm.doInitialization();
fpm.run(*fn);
1 Like

Of course, it’s easy to see that this can quickly become a more complex process. Even if the optimization is automatic (we don’t create new algorithms to transform the graph or anything like that), there are a lot of new libraries and modules to incorporate into the audio engine. And I’m not even sure how reliable it is, or how much cheaper the computation would become afterward.

I don’t have the numbers

Is this true for memory usage and therefore cache locality?

I’m not too sure on the sc server specifics here, but could it theoretically optimise to reduce the number of buffers needed, similar to how functional programming optimises away all the temporary allocations in sequential ‘maps’? That would speed up the actual synths as well.
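For illustration, a tiny C++ sketch of the kind of fusion being asked about (hypothetical function names): two separate passes need an intermediate buffer, while the fused version needs none, analogous to saving a wire buffer:

// Unfused: two "maps", requiring a temporary buffer between them.
void twoPasses(const float* in, float* tmp, float* out, int n) {
    for (int i = 0; i < n; ++i) tmp[i] = in[i] * 0.5f;  // "map" 1: writes tmp
    for (int i = 0; i < n; ++i) out[i] = tmp[i] + 1.0f; // "map" 2: reads tmp
}

// Fused: same result, no intermediate buffer, one pass over memory.
void fused(const float* in, float* out, int n) {
    for (int i = 0; i < n; ++i) out[i] = in[i] * 0.5f + 1.0f;
}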

It’s not theoretical; it’s verifiable: APL arrays are the ultimate optimization for GPU processing, 20x faster or more. (A somewhat random example, but you get the idea.)

But again, we should already have more numbers about this stuff. The tech is out there; we should know.

Some voices just say NO (for compilers), but for audio graphs there is one more phase that LLVM can use to its advantage, so it isn’t a “fair” comparison with general-purpose compilers: Don't Bother With LLVM for Hobby Compilers

If that’s true, it makes more sense to use LLVM in an audio engine than in compilers that do everything at once.

An audio engine, building units piece by piece, offers opportunities for incremental optimizations of several forms (in theory).

Graph UGen -> IO Synth is a good case study.

EDIT: @jordan If at some point you want to dig deeper into that, there is this: https://www.youtube.com/watch?v=kZkO3k9g1ps

I have a minimalistic (toy) audio engine in C++ (plus jack2 and Dear ImGui) running, with the basics (equivalents of dynamic UGen function graphs, etc.). The only real “simplification” is that the node tree is handed over to JACK2’s responsibility, but that doesn’t imply any design limitation in terms of UGens, synths, and similar SC concepts. It can be a place to test these things when I have time to play with it. Tweaking massive legacy systems is too complicated to put to the test, and I imagine it is not feasible with the SC team’s resources.

1 Like

SynthDef variants have nothing to do with optimizations! @jamshark70 already pointed you to the actual meaning: SynthDef | SuperCollider 3.12.2 Help

1 Like

I got confused because of the other thing. What is that, then? A macro?

SynthDef macros are not so developed, but I just love them. They should be a “thing”.

Just graph optimizations.

1 Like

Sorry, I’m tired. You know, a lotta ins, a lotta outs, a lotta what-have-yous. lotta strands to keep in my head, man.

No worries! (… and some characters)

1 Like
  • The gain from JIT-compiling LLVM IR (compared to AOT static compilation, where a kind of generic CPU is used as the target) will mainly come from compiling for the native CPU you are actually running on. This is especially effective on Intel, with all the consecutive SIMD generations you can have; I’m not so sure anymore about ARM. This is also what the Cmajor language sells. (See the sketch after this list.)

  • But LLVM IR is not a completely agnostic and portable IR model. I remember Google tried to define a portable subset of LLVM IR in the Google Native Client project, but that work was discontinued. So distributing UGens as LLVM IR code would probably be more or less impossible.

  • Then you could use a portable DSL like Faust or Cmajor, but that is another story :grinning:
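A minimal sketch of the first point, assuming MCJIT as in the headers quoted earlier (makeEngine is a made-up helper; note that getHostCPUName moved to <llvm/TargetParser/Host.h> in recent LLVM versions):

#include <llvm/ExecutionEngine/ExecutionEngine.h>
#include <llvm/ExecutionEngine/MCJIT.h>
#include <llvm/IR/Module.h>
#include <llvm/Support/Host.h>          // llvm::sys::getHostCPUName()
#include <llvm/Support/TargetSelect.h>
#include <memory>

// Build an MCJIT engine that targets the exact host CPU, so codegen can use
// its SIMD extensions (AVX2, AVX-512, NEON, ...) instead of a generic baseline.
llvm::ExecutionEngine* makeEngine(std::unique_ptr<llvm::Module> m) {
    llvm::InitializeNativeTarget();
    llvm::InitializeNativeTargetAsmPrinter();
    return llvm::EngineBuilder(std::move(m))
        .setMCPU(llvm::sys::getHostCPUName()) // native CPU, not "generic"
        .setEngineKind(llvm::EngineKind::JIT)
        .create();
}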

2 Likes

Thanks, I didn’t know that. It seems the LLVM IR has to be compiled on the machine it will run on.

Faust seems to be at the top of the game regarding those things.

In terms of interoperability, can code generated from LLVM IR via JIT interoperate with code compiled from a different language (say, C++ and Haskell)?

Is it too exotic? (Some special care with the versioning of all the tools would be necessary, I guess.)

When you JIT compile, you typically end up with function pointers that you call with parameters. So yes, it can interoperate with other code.
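For example (hypothetical names, assuming MCJIT): after compilation you look up the symbol and get back an ordinary C function pointer, which any C++ caller, or a Haskell FFI binding, can invoke directly:

#include <llvm/ExecutionEngine/ExecutionEngine.h>

// The JITed graph function is reached through a plain C function pointer.
using ProcessFn = void (*)(const float* in, float* out, int n);

void runBlock(llvm::ExecutionEngine& ee, const float* in, float* out, int n) {
    ee.finalizeObject(); // ensure machine code is emitted (MCJIT)
    auto fn = reinterpret_cast<ProcessFn>(ee.getFunctionAddress("graph_process"));
    fn(in, out, n);      // indistinguishable from calling AOT-compiled code
}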

2 Likes

IIRC there is already a pass happening to minimize the storage required to process an entire synth graph - this is happening primarily in sclang (@Spacechild1 may know better, but I don’t recall ANY serious graph optimization on the server, which makes sense, because it would have to be done outside the audio thread anyway). There may be possible improvements here, but they would be incremental at best. Cache locality should already be pretty well optimized, since UGens are allocated in the order in which they’ll be executed. Allocations from RT_ALLOC are probably not as optimized as they could be, but since only a small subset of UGens uses it, it might not make a big difference.

3 Likes

(I don’t remember synth optimization inside an audio thread being mentioned. How would that be possible?)

Update: to be frank, LLVM IR is so messy that I bet it would be much easier to just write a code generator in Haskell.

The UGen sort should “ideally” minimize the number of wire buffers used, but it doesn’t always - and fixing this case makes other cases worse.

(
SynthDef(\narrowTallGraph, { |out, freqs = #[100, 200, 300], amps = #[0.1, 0.1, 0.1]|
	Out.ar(out, SinOsc.ar(freqs, 0, amps).sum)
}).dumpUGens;

SynthDef(\shortSquatGraph, { |out|
	var freqs = NamedControl.kr(\freqs, [100, 200, 300]);
	var amps = NamedControl.kr(\amps, [0.1, 0.1, 0.1]);
	Out.ar(out, SinOsc.ar(freqs, 0, amps).sum)
}).dumpUGens;
)

narrowTallGraph -- chain of MulAdds can go on indefinitely without using more wire buffers
[0_Control, control, nil]
[1_SinOsc, audio, [0_Control[1], 0]]
[2_SinOsc, audio, [0_Control[2], 0]]
[3_*, audio, [2_SinOsc, 0_Control[5]]]
[4_MulAdd, audio, [1_SinOsc, 0_Control[4], 3_*]]
[5_SinOsc, audio, [0_Control[3], 0]]
[6_MulAdd, audio, [5_SinOsc, 0_Control[6], 4_MulAdd]]
[7_Out, audio, [0_Control[0], 6_MulAdd]]

shortSquatGraph -- all SinOscs up front = 1 wire buffer per parallel path
[0_Control, control, nil]
[1_Control, control, nil]
[2_SinOsc, audio, [1_Control[0], 0]]
[3_SinOsc, audio, [1_Control[1], 0]]
[4_SinOsc, audio, [1_Control[2], 0]]
[5_Control, control, nil]
[6_*, audio, [3_SinOsc, 5_Control[1]]]
[7_MulAdd, audio, [2_SinOsc, 5_Control[0], 6_*]]
[8_MulAdd, audio, [4_SinOsc, 5_Control[2], 7_MulAdd]]
[9_Out, audio, [0_Control[0], 8_MulAdd]]

… which is especially irksome since NamedControl is being recommended by some as a general replacement for synth function args.

// although...
SynthArgPreprocessor.install;

(
SynthDef(\noLongerShortSquatGraph, { |out|
	## freqs = [100, 200, 300];
	## amps = [0.1, 0.1, 0.1];
	Out.ar(out, SinOsc.ar(freqs, 0, amps).sum)
}).dumpUGens;
)

… gives you the tall chain of MulAdds.

hjh

1 Like