Question: "Vectorization for Dummies"

jamshark70 · August 20, 2021, 7:55am

This question is related to Stealing VCV Rack's analog "exciter" formula - #2 by jamshark70 –

I had thought it might be faster to wrap up the exciter formula into a single UGen. So, with the help of the cookiecutter script, I did get it to compile fairly quickly, and found… that it’s slightly slower.

It’s not hard to imagine why – the single UGen eliminates several function calls (for multiple math-op UGens), but I didn’t write vector opcodes – so the vector opcodes in SC’s math operators are a more powerful optimization.

… which makes me curious if my unit would run better with vector ops.

The catch is… I’m not that good in C++. I had thought maybe I could get some tips from e.g. MulAddUGens.cpp, but it seems that everything is heavily templated and I can’t make much sense out of it. (Nothing wrong with templated code, of course – here, the world has passed me by…)

So… if I have this naïve implementation:

void AnalogCurve::next(int nSamples) {
    const float* input = in(0);
    float factor = *in(1);
    float offset = *in(2);
    
    float* outbuf = out(0);

    // Distortion formula stolen from
    // https://github.com/VCVRack/Fundamental/blob/v1/src/VCO.cpp#L23
    for (int i = 0; i < nSamples; ++i) {
        float x = input[i];
        float x01 = x * 0.5f + 0.5f;
        // denominator: 0-1 * 2 --> 0-2, then +3 --> 3-5 == x+4
        *(outbuf++) = (((x01 * factor + offset) * x01 + 3.f) / (x + 4.f));
    }
}

… how to vectorize it? I’m afraid I really would need the for-dummies version

hjh

Spacechild1 · August 20, 2021, 12:44pm

The best way is to use a proper SIMD library, like nova-simd (which is used by SuperCollider). However, there is a bit of a learning curve.

Luckily, modern compilers are quite good at auto vectorization. SIMD vectors are typically 128 or 256 bits wide (some recent additions like AVX512 extend this to 512 bits, but it is not widely used yet), which corresponds to 4-8 single precision floating point numbers.

You can have two calc functions: a generic one that works for all buffer sizes and a specialized one for multiples of 8. In the constructor you check if the buffer size a multiple of 8 (e.g. (mBufLength & 7) == 0) and select the appropriate calc function.

In the specialized calc function you can now unroll the loop.

First, you might try to do something like this

void AnalogCurve::next(int nSamples) {
    const float* input = in(0);
    float factor = *in(1);
    float offset = *in(2);
    
    float* out= out(0);

    // Distortion formula stolen from
    // https://github.com/VCVRack/Fundamental/blob/v1/src/VCO.cpp#L23
    for (int i = 0; i < nSamples; i += 8) {
        for (int j = 0; j < 8; ++j) {
            float x = input[i+j];
            float x01 = x * 0.5f + 0.5f;
            out[i+j] = (((x01 * factor + offset) * x01 + 3.f) / (x + 4.f));
        }
    }
}

Unfortunately, this not help much with vectorization [^1]. Why? The input and output pointers might overlap, so whenever you write to out it could theoretically change any input value. As a consequence, the compiler is forced to fetch each float from memory seperately. Of course, we know that input and output can only overlap at the same sample index, but the compiler doesn’t have this knowledge.

So we have to take it a step further:

void AnalogCurve::next(int nSamples) {
    const float* input = in(0);
    float factor = *in(1);
    float offset = *in(2);
    
    float* out= out(0);

    // Distortion formula stolen from
    // https://github.com/VCVRack/Fundamental/blob/v1/src/VCO.cpp#L23
    for (int i = 0; i < nSamples; i += 8) {
        // first cache the input samples
        std::array<float, 8> vx;
        for (int j = 0; j < 8; ++j) {
            vx[j] = input[i+j];
        }
        // now calculate
        for (int j = 0; j < 8; ++j) {
            float x = vx[j];
            float x01 = x * 0.5f + 0.5f;
            out[i+j] = (((x01 * factor + offset) * x01 + 3.f) / (x + 4.f));
        }
    }
}

This basically tells the compiler that it can store the next 8 input samples in SSE register(s) and safely vectorize. (All variables are now on the stack and writing to the output buffer can’t have any side effects.) The optimizer should completely remove the additional stack copy.

Here’s some experiments: Compiler Explorer
Notice how fun3_8 and fun4_8 only have a couple of instructions, although the C++ code appears to be larger?

Actually, you can generalize the calc function with a template parameter:

template<std::size_t N>
void AnalogCurve::next(int nSamples) {
    const float* input = in(0);
    float factor = *in(1);
    float offset = *in(2);
    
    float* out= out(0);

    // Distortion formula stolen from
    // https://github.com/VCVRack/Fundamental/blob/v1/src/VCO.cpp#L23
    for (int i = 0; i < nSamples; i += N) {
        // first cache the input samples
        std::array<float, N> vx;
        for (int j = 0; j < N; ++j) {
            vx[j] = input[i+j];
        }
        // now calculate
        for (int j = 0; j < N; ++j) {
            float x = vx[j];
            float x01 = x * 0.5f + 0.5f;
            out[i+j] = (((x01 * factor + offset) * x01 + 3.f) / (x + 4.f));
        }
    }
}

Now you can use the same code for both cases: next<8> for the unrolled case and next<1> for the general case (the compiler will remove the unnecessary inner loops except for Debug builds). You can even use next<4> and next<2> for buffer sizes of 4 resp. 2, but you have to check if the resulting code bloat actually doesn’t make things worse.

You can give it a try and see if it gives you enough performance benefits to justify the ugliness

Pd uses this crude technique for many of its ugens, and that’s where I have learned it.

BTW, for max. performance, make sure to configure the plugin with NATIVE=ON. The default configuration only enables SSE2, which has 128 bit vectors. Most modern computers support AVX2 which has 256 bit vectors. However, the resulting binary will not be portable.

[^1]: it can work if you decorate all input and output pointers with __restrict which tells the compiler that they don’t alias each other. However, I’m not sure if this is always safe, since the pointers can alias, but only on the same sample index. Also, __restrict is not standard C++.

jamshark70 · August 21, 2021, 9:08am

Hi – thanks for the detailed answer!

I wasn’t able to figure out the template version (“next<8> for the unrolled case and next<1> for the general case” – I’m not clear how to do that), but I did add a second next8 function and hooked it in for isAudioRateIn(0).

Benchmarking results, running 100 instances of the same formula in series:

(
var f = { |x|
	var x01 = x * 0.5 + 0.5;
	((x01 * (-7)) * x01 + 3) / (x + 4)
};

a = {
	var sig = SinOsc.ar(110);
	100.do {
		sig = AnalogCurve.ar(sig, -7, 0);
		// sig = f.(sig);
	};
	(LeakDC.ar(sig) * 0.11).dup
}.play;
)

Baseline scsynth CPU reading: 1.1 - 1.3%
Formula using SC math operators: ~3.7%
AnalogCurve unit, older version without the 8-point loop: 4.7 - 5.2%
AnalogCurve unit, with 8-point loop: 1.8 - 1.9%

… which, if you subtract from the baseline, is a rather stunning (4.9 - 1.2) / (1.85 - 1.2) = 3.7 / 0.65 = 5.7 times speedup of the UGen code, and 3.8 times faster than the SC math operators.

I would call that a successful optimization. And I never would have guessed this – so, thanks for taking the time.

I feel comfortable to go ahead and use this locally (and hope I don’t have to move it to another machine…).

See https://github.com/jamshark70/analogcurve/blob/master/plugins/AnalogCurve/AnalogCurve.cpp

hjh

Spacechild1 · August 21, 2021, 10:07am

I wasn’t able to figure out the template version (“next<8> for the unrolled case and next<1> for the general case” – I’m not clear how to do that), but I did add a second next8 function and hooked it in for isAudioRateIn(0).

The templated function just replaces the hardcoded number 8 with a generic number N. But it’s totally ok to simply write two different versions of the calc function.

BTW, isAudioRateIn(0) is not sufficient, you also have to check if the buffer size is a multiple of 8! Otherwise the ugen can crash or behave weirdly. Actually, since the buffer size is always a power of 2 (at least I think so), you can also simply do mBufLength >= 8 instead of (mBufLength & 7) == 0.

5.7 times speedup of the UGen code, and 3.8 times faster than the SC math operators.

Cool! Since your machine likely supports AVX2, it uses 256 bit vectors (= 8 single precision floats), so a 5.7 speed up is definitely in the range of what I would have (optimistically) expected.

And I never would have guessed this – so, thanks for taking the time.

The dark arts of audio programming

madskjeldgaard · August 23, 2021, 4:41pm

whooaaa this is such a nice trick. Thanks for sharing!!!

James: Will you create a seperate thread or ping here when you think your plugin is ready for release/use? Very curious to try it out when you think it’s ready. PS there’s definately more VCV stuff that would be interesting to port. I already ported some of the chow vcv things to mk-plugins (except Werner which I couldn’t get to work with Eigen lib etc.

madskjeldgaard · August 23, 2021, 4:42pm

In the linked vcv repo for Jatin Chowdhury there’s a really cool pulse shaper that I would love to see in SC. It is some of the circuitry magic in the 808 sound (the manual has a link to a paper I think) but I couldn’t wrap my head around how to do it.

Anyway, sorry for going off topic. Keep up the good work!

jamshark70 · August 23, 2021, 10:53pm

Sure. Still to do is de-zippering the control inputs.

Actually, if there’s no need to modulate, this could all be done with Shaper anyway

// EDIT: Add Shaper example
(
s.waitForBoot {
	// -2 to 2 range
	var n = 1024;
	var xfer = Signal.fill(n, { |i|
		var x = (i / n * 4) - 2;
		var x01 = x * 0.5 + 0.5;  // -1 - +1 --> 0 - 1
		(((x01 * 5 - 13) * x01 + 3) / (x + 4)).neg
	});
	b.free;
	b = Buffer.sendCollection(s, xfer.asWavetable, 1);
};
)

(
a = {
	var sig = VarSaw.ar(220, 0, 0.1);
	// because transfer function is -2 - +2, halve the input
	var sig2 = Shaper.ar(b, sig * 0.5);
	[sig, sig2, LeakDC.ar(sig2)]
}.plot;
)

Anyway… Current state is that I have a UGen that supports only 256-bit vector ops and is tied to my CPU (“native=on”). For release, I suppose it should also be able to use 128-bit vectors on older CPUs. I don’t know how to detect the CPU’s vector capabilities at runtime, and I’m not sure how to compile Windows and Mac binaries on my Linux system. I don’t think it’s ready for release without addressing these.

hjh

madskjeldgaard · August 24, 2021, 7:13am

Yeah I’m very curious how this workflow works on older architectures / raspberry pis.

Regarding release - you can make github do it for you using actions

Spacechild1 · August 24, 2021, 8:44am

You don’t need to, just compile for the oldest architecture you want to support. With NATIVE=OFF, the cookie cutter CMake file will only enable SSE and SSE2, which are actually mandatory for 64-bit Intel based processors. It is also pretty safe to enable SSE3 which has been introduced in 2004. However, I once enabled SSE4 for VSTPlugin, but then got a bug report from a user with an older AMD machine where it would throw a SIGILL signal (= illegal instruction). Now I only ship SSE3 builds again.

For your code, SSE4 probably doesn’t give any advantage over SSE2/3. The big deal is AVX (which doubles the vector size), but it’s not safe yet to enable it by default. If people want the extra performance, they would have to build from source.

Spacechild1 · August 24, 2021, 9:39am

Generally, ARM comes in different versions: ARMv6, ARMv7, ARMv8. ARMv8 also supports 64-bit mode, which is also called ARM64 (or aarch64). For 32-bit, you can either compile binaries for each version, or pick the lowest one you want to support (they are backwards compatible). As you would expect, higher ARM versions can produce faster code.

The situation on Raspberry Pi is a bit confusing. The original RPi1 has an ARMv6 CPU. RPi2 switched to an ARMv7 CPU, but the Raspbian OS packages remained ARMv6 (for backwards compatibility). RPi3 has an ARMv8 CPU, but it would still run in 32-bit mode, because 64-bit mode would be rather pointless (typically you don’t even have that much RAM in the first place, so you just waste memory).

If that wasn’t confusing enough: ARMv6 doesn’t guarantee hardware floating point support, so standard Debian for ARMv6 uses software floating point emulation by default (slow!). Since the RPi1 does support hardware floating point, Raspian used a custom Debian version compiled for ARMv6hf (= ARMv6 with hardware floating point). Now with the RPI2 and above you can actually use a standard Debian ARMv7hf image.

Long story short: if you want your code to run on a RPi1, compile for ARMv6hf, otherwise use ARMv7hf. You can google the appropriate compiler flags. Most importantly you need -mfloat-abi=hard