Proposal: SynthDefs could eliminate duplicate UGens

I optimized an audio processing graph using a few common techniques in an experimental ‘toy’ project. In that project the UGens are not compiled into a SynthDef beforehand; I just work with the graph data types directly, all in C++. The optimizations (nothing new) are topological sorting, constant-node preprocessing, buffer pooling, and eliminating redundant buffers (the last two are ways to recycle buffers). That’s all. (I think CSE would not improve much, since an unordered_map already handles the node-to-buffer mappings, but that’s something else to try.)
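To illustrate just the buffer-recycling part (a minimal sketch of my own with hypothetical names, not the actual project code): once every consumer of a node’s output has run, the output buffer goes back to a free list and can be handed to a later node in the topological order.

#include <cstddef>
#include <vector>

// Hypothetical buffer pool: buffers are slots in one big float store.
// A node's output slot is released after its last consumer has read it,
// so later nodes in the topological order can reuse the same memory.
struct BufferPool {
  std::size_t numBuffers = 0;
  std::vector<std::size_t> freeList;

  std::size_t acquire() {
    if (!freeList.empty()) {
      std::size_t b = freeList.back();
      freeList.pop_back();
      return b;            // recycle a buffer nobody needs anymore
    }
    return numBuffers++;   // otherwise grow the store by one slot
  }

  void release(std::size_t b) { freeList.push_back(b); }
};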

SC probably implements all those things, but I’m unfamiliar with the details.

The result was a 3.5-fold performance increase in some random examples, even with simple graphs. It doesn’t require complex graphs (one simple example went from 42µs to 12µs). I suspect (experts can tell me better) that the compiler applied some SIMD optimizations on its own, since the gain was a bit too large. I also use a SIMD library but haven’t benchmarked it yet. Even basic optimizations can make a big difference, it seems.

Btw I’ve been using this in production for several months. It works.

This approach doesn’t use a CSE pass – it keeps UGens in a cache keyed by UGen class, rate, and arguments. Only pure math ops are cached in that version, but it could be extended to other pure UGens.
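To illustrate the idea only (a C++ sketch with made-up names, not the actual sclang implementation): the cache maps (class, rate, inputs) to an existing node, and a second request for an identical pure op simply returns that node instead of creating a new one.

#include <map>
#include <memory>
#include <string>
#include <tuple>
#include <vector>

struct Node {
  std::string className;            // stand-in for the UGen class
  int rate;                         // e.g. audio / control
  std::vector<const Node*> inputs;  // already-deduplicated upstream nodes
};

// Only consulted for pure ops, so returning an existing node never changes the result.
struct NodeCache {
  using Key = std::tuple<std::string, int, std::vector<const Node*>>;
  std::map<Key, std::unique_ptr<Node>> cache;

  const Node* getOrCreate(const std::string& cls, int rate,
                          const std::vector<const Node*>& inputs) {
    Key key{cls, rate, inputs};
    auto it = cache.find(key);
    if (it != cache.end())
      return it->second.get();  // duplicate detected: reuse the existing node
    auto node = std::make_unique<Node>(Node{cls, rate, inputs});
    const Node* raw = node.get();
    cache.emplace(std::move(key), std::move(node));
    return raw;
  }
};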

hjh


@jamshark70 don’t suppose you have any benchmarks, or anecdotal opinions about the performance?

Not really. It probably slows down SynthDef building slightly, and the redundant ops that it eliminates are already pretty fast.

The benefit could be increased by marking more UGens as pure.

The SynthDef-building speed penalty could be reduced by making it possible to determine whether a UGen is pure or not without creating the UGen instance. That’s straightforward for most UGens, but not so easy for unary or binary math ops. Currently it creates the instance, and if pure, checks for a cache match, and if there’s a hit, discards the new instance. That could be streamlined. (This is one reason why I called it a proof-of-concept – it’s enough to prove the concept but not ready to merge.)

hjh

I ask because I was curious whether this could have a negative effect due to the CPU cache.

If you had r1 = a + b + c; r2 = a + b + d it could be faster to not pull out the a + b if it avoids a cache miss. I wonder if the vectorised math ops might suffer a performance loss from this proposal?

I’m not referring to CPU cache at all.

Without the UGenCache, r1 will end up as a Sum3 with a, b and c as inputs, and r2 as a different Sum3 with a, b and d as inputs.

With UGenCache (this is without testing – just guessing and writing on my phone), a + b would be done once on its own, and shared between r1 = that + c and r2 = that + d.

I don’t know which is more efficient. The pair of Sum3 units does 4 additions (one redundant) but only 2 calc function dispatches. The other way does one less addition but one more function dispatch. (But this is only an issue because the Sum3 optimization exists – I guess that conflicting optimizations could be hard to resolve.)

Tbh this is such a fine-grained level of performance analysis that, in the vast majority of cases, I’d guess the results would be indistinguishable in terms of speed.

hjh


We should never forget that performance is opaque in most languages, not only C++. Haskell can be hard to optimize without going down to the intermediate language. Also, C++ optimizations such as SIMD operations can vary from one piece of hardware to another, and change over time as hardware changes.

And SC’s library is not maintained anymore; I am just saying…

Google has a good framework for benchmarking C++ (Google Benchmark), so you can compare your changes at nanosecond resolution. Haskell has a popular library too; I use them when that’s the goal. For example, that’s how I measured the experiments I described in this discussion. In the case of real-time audio, you need to know not only that average performance is good; predictable performance is just as important, or even more so.
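For readers who haven’t used it, a minimal Google Benchmark skeleton looks roughly like this (the measured body is just a placeholder, not code from any project discussed here):

#include <benchmark/benchmark.h>
#include <vector>

static void BM_FillVector(benchmark::State& state) {
  for (auto _ : state) {                  // the framework picks the iteration count
    std::vector<float> v(1024, 1.0f);     // placeholder work
    benchmark::DoNotOptimize(v.data());   // keep the work from being optimized away
  }
}
BENCHMARK(BM_FillVector);

BENCHMARK_MAIN();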

Without those things, it would be just guessing, which is problematic. Even then, you might not know precisely what happened.

Some things, like topological sorting, are uncontroversial because the order of computation is transparent to us. Other things, not so much.

Example: LLVM, with all the hype around optimization, has delivered terrible performance in many, many compilers: Don't Bother With LLVM for Hobby Compilers

So, let’s be careful with optimization strategies until we have tools everyone can use to measure them; let’s make that a habit. The risk is that we stop thinking about our data (our material) and start building models in our heads instead.

I’m just saying: back to basics, and work on tools specific to SC.

@jamshark70, your changes are not aggressive and seem fine to me. I’m talking in general terms here.

Just an idea.

So I made some rough benchmarks to see what would happen and it seems like I’ve been able to answer my own question.

TLDR: Doing fewer math ops is better than doing more, but more cache-local, ops. It seems that reducing the number of function calls should be the priority.

I think this result would suggest that Sum3 is going to be faster.


I tested
Duplicate

a + b + c;
a + b + d;

against No Duplicate.

r = a + b;
r + c;
r + d;

#include <benchmark/benchmark.h>
#include <span>
#include <array>
#include <vector>
#include <cstdint>
#include <iostream>

static constexpr size_t N = 1024;

using signal_span = std::span<float, N>;
using signal_store = std::array<float, 10'000>;
using Fn = void(*) (signal_span lhs, signal_span rhs, signal_span output);

void add(signal_span lhs, signal_span rhs, signal_span output) {
  for(size_t i{0}; i < N; ++i)
    output[i] = lhs[i] + rhs[i];
}

void mul(signal_span lhs, signal_span rhs, signal_span output) {
  for(size_t i{0}; i < N; ++i)
    output[i] = lhs[i] * rhs[i];
}

struct Instruction {
  Fn func;
  int16_t lhs;
  int16_t rhs;
  int16_t output;
};

static void duplicate(benchmark::State& state) {
  static float signal_store[1'000'000]; // static: zero-initialized, and ~4 MB would be risky on the stack

  std::vector<Instruction> instr;
  // r1 = a + b + c
  instr.emplace_back(add, 0, N, 2 * N);                                // t1 = a + b
  instr.emplace_back(add, 2 * N, 3 * N, 4 * N);                        // r1 = t1 + c

  // r2 = a + b + d, with the shared a + b recomputed in another region of the store
  instr.emplace_back(add, 1000 + N, 1000 + N * 2, 1000 + N * 3);       // t2 = a + b (again)
  instr.emplace_back(add, 1000 + N * 3, 10'000 + N, 10'000 + N * 2);   // r2 = t2 + d

  for (auto _ : state) {
    for(auto& ins: instr){
      ins.func(
        std::span<float, N>(signal_store + ins.lhs, N), 
        std::span<float, N>(signal_store + ins.rhs, N), 
        std::span<float, N>(signal_store + ins.output, N)
      );
    }

    benchmark::DoNotOptimize(instr);
  }
  float ff = 0.0;
  for(auto& f : signal_store)
    ff += f;
  std::cout << ff << std::endl;
}
BENCHMARK(duplicate);

static void no_duplicate(benchmark::State& state) {
  float signal_store[1'000'000];
  std::vector<Instruction> instr;
  instr.emplace_back(add, 0, N, N*2); // a + b

  instr.emplace_back(add, N*2, 1000, 1000 + N); // b + c
  instr.emplace_back(add, N*2, 100'000, 100'000 + N); // b + d assume big jump between input and output

  for (auto _ : state) {
    for(auto& ins: instr){
      ins.func(
        std::span<float, N>(signal_store + ins.lhs, N), 
        std::span<float, N>(signal_store + ins.rhs, N), 
        std::span<float, N>(signal_store + ins.output, N)
      );
    }

    benchmark::DoNotOptimize(instr);
  }
  float ff = 0.0;
  for(auto& f : signal_store)
    ff += f;
  std::cout << ff << std::endl;
}
BENCHMARK(no_duplicate);

BENCHMARK_MAIN();

Some things I learnt.

If the input and output pointers alias, things are much slower; I guess the compiler adds a runtime check for this.
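One general way to rule that check out (not something tested in this thread) is to promise the compiler that the buffers never overlap, e.g. with the __restrict extension supported by GCC, Clang and MSVC. A sketch based on the add kernel above:

#include <cstddef>

// Same inner loop as add() above, but with a no-aliasing promise.
// __restrict is a GCC/Clang/MSVC extension (C99 has the standard `restrict` keyword).
void add_restrict(const float* __restrict lhs, const float* __restrict rhs,
                  float* __restrict output, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i)
    output[i] = lhs[i] + rhs[i];  // can be vectorized without a runtime overlap check
}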

Block size has no measurable effect.

Is the “no_duplicate” method approximately 25% faster? That’s it?

(something like this to measure sc-specific code would be very good)

For a large block size, yes.
For a small block size, the difference decreases.

This is a block size of 1.

Despite the non-locality of the data here, it’s still faster. It seems the number of function calls is what should be minimized (assuming the same implementation).

The assembly code shows

duplicate: 25.64% on moves, ‘copying’ float values between memory locations – a lot of time moving data around rather than computing.

no_duplicate:

54.05% on addss (the actual addition work)
14.86% on movss (much less time moving data around in memory)

This is not surprising. The “process” function calls have a roughly constant overhead. If the block size is large, the actual DSP outweighs the function call overhead; if the block size is very small, the function call overhead becomes more and more prominent. Once the block size drops below 8 samples, things get significantly worse because we can’t use the full width of the SIMD registers.
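As a rough back-of-the-envelope model (my own framing, not a measurement from this thread):

per-block cost ≈ t_dispatch + ceil(N / W) · t_vec

where N is the block size and W the SIMD width (8 floats with AVX). At N = 1024 the 128 vector iterations dwarf the dispatch overhead; at N = 1 every sample pays the full dispatch cost and only one lane of the register does useful work.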

Note that the “process” function call overhead also includes any prolog/epilog code (e.g. saving member variables into local variables and vice versa).

@smoge I think you’re only counting the actual DSP code. If you include the constant factors, as explained above, you will get a different picture.


Another comment: always take microbenchmarks with a big grain of salt. If you only test with a handful of Ugens, you might not get the full picture. In particular, negative cache effects only appear once there’s enough data to fill up the L1 cache.

@jordan It seems like you construct only a single “Synth” (with 3 or 4 “Ugens”, respectively) and then process the same Synth over and over again. Try the same with more Synth instances and compare the results!


In practice, you need to test with three independent variables (a rough benchmark sketch follows the list):

  1. duplicate vs. no_duplicate
  2. block size
  3. number of Synths/Ugens
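A rough sketch of how such a sweep could be set up with Google Benchmark, passing block size and synth count as arguments (the body is a placeholder, not code from this thread):

#include <benchmark/benchmark.h>

static void BM_Graph(benchmark::State& state) {
  const int blockSize = static_cast<int>(state.range(0));
  const int numSynths = static_cast<int>(state.range(1));
  for (auto _ : state) {
    // Placeholder: here one would run `numSynths` copies of the duplicate /
    // no_duplicate instruction lists at the given block size.
    benchmark::DoNotOptimize(blockSize + numSynths);
  }
}
// Sweep block size and number of synths together, so cache effects
// only show up once enough synths are in flight.
BENCHMARK(BM_Graph)
    ->Args({1, 1})->Args({64, 1})->Args({1024, 1})
    ->Args({1, 100})->Args({64, 100})->Args({1024, 100})
    ->Args({1, 1000})->Args({64, 1000})->Args({1024, 1000});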

I knew there’d be a trade-off, I just wasn’t sure how they’d compare. It seems function calls are actually quite slow in comparison.

I did not consider L1! I’ll have a look later, but that would be a more realistic test. Thanks! :blush:


That only happens with the compilation flag -O3, not with -O2 (which keeps it about 1.2× faster).

-O3 optimizes the duplicate case for us, with extra goodies but no error checking; -O2 does not.

I don’t think many people compile sc with -O3, for instance.

What I can gather:

-O3 uses SIMD instructions (movdqu, movups, and similar).
Do function calls remain the same? I think so.
(I think) it also avoids error checking and does not do the setup before the main loop; -O2 keeps the setup.
It places the comparison cmp %rbx,%rbp at the end.

; Main loop body
41.42% movswq 0xc(%rbx),%rax
0.07%  lea    0x20(%rsp,%rax,4),%rdx
       movswq 0xa(%rbx),%rax
0.07%  lea    0x20(%rsp,%rax,4),%rsi
40.42% movswq 0x8(%rbx),%rax
       lea    0x20(%rsp,%rax,4),%rdi
       callq  *(%rbx)
       add    $0x10,%rbx
5.36%  cmp    %rbx,%rbp
       jne    407ea0 <duplicate(benchmark::State&)+0x130>

Seems dubious, citation needed perhaps?


Edit.

Pretty sure CMake’s defaults for Release mode are -O3 -DNDEBUG, so almost everyone will be on -O3.


I remember warnings against -O3 here and there. I think it’s a thing of the past now.

" While historically -O3 could expose compiler bugs, nowadays issues with -O3 are almost always instances of undefined behavior (UB) and the code is at fault and should be fixed."

For context: The stated thread topic is about removing redundancy from SynthDefs. The recent discussion of compiler optimizations is, then, to try to determine whether there are cases in which a SynthDef might perform better with redundant operations?

hjh

Yes. I think your own example in Proposal: SynthDefs could eliminate duplicate UGens - #36 by jamshark70 could be such a case, at least under certain circumstances.