The best way is to use a proper SIMD library, like `nova-simd` (which is used by SuperCollider). However, there is a bit of a learning curve.

Luckily, modern compilers are quite good at auto-vectorization. SIMD vectors are typically 128 or 256 bits wide (some recent additions like AVX512 extend this to 512 bits, but it is not widely used yet), which corresponds to 4-8 single-precision floating-point numbers.

You can have two calc functions: a generic one that works for all buffer sizes and a specialized one for multiples of 8. In the constructor you check whether the buffer size is a multiple of 8 (e.g. `(mBufLength & 7) == 0`) and select the appropriate calc function.
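The selection might look roughly like this sketch. (The names here are hypothetical, just to show the dispatch idea; a real SCUnit would store the pointer via `set_calc_function` rather than a plain member.)

```cpp
#include <cstddef>

// Minimal sketch of constructor-time calc function dispatch.
// Not the actual SCUnit API -- just the shape of the technique.
struct Unit {
    int mBufLength;
    void (*mCalcFunc)(Unit*, int);

    explicit Unit(int bufLength) : mBufLength(bufLength) {
        if ((mBufLength & 7) == 0)
            mCalcFunc = &nextUnrolled; // buffer size is a multiple of 8
        else
            mCalcFunc = &nextGeneric;  // any other buffer size
    }

    static void nextGeneric(Unit*, int)  { /* plain sample loop */ }
    static void nextUnrolled(Unit*, int) { /* loop unrolled by 8 */ }
};
```

The check happens once per Unit, so the per-block cost is just one indirect call, which you pay anyway with calc function pointers.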

In the specialized calc function you can now unroll the loop.

First, you might try to do something like this:

```cpp
void AnalogCurve::next(int nSamples) {
    const float* input = in(0);
    float factor = *in(1);
    float offset = *in(2);
    float* output = out(0);
    // Distortion formula stolen from
    // https://github.com/VCVRack/Fundamental/blob/v1/src/VCO.cpp#L23
    for (int i = 0; i < nSamples; i += 8) {
        for (int j = 0; j < 8; ++j) {
            float x = input[i+j];
            float x01 = x * 0.5f + 0.5f;
            output[i+j] = ((x01 * factor + offset) * x01 + 3.f) / (x + 4.f);
        }
    }
}
```

Unfortunately, this does not help much with vectorization [^1]. Why? The input and output pointers might overlap, so whenever you write to the output buffer, it could theoretically change *any* input value. As a consequence, the compiler is forced to fetch each float from memory separately. Of course, *we* know that input and output can only overlap at the same sample index, but the compiler doesn't have this knowledge.

So we have to take it a step further:

```cpp
#include <array>

void AnalogCurve::next(int nSamples) {
    const float* input = in(0);
    float factor = *in(1);
    float offset = *in(2);
    float* output = out(0);
    // Distortion formula stolen from
    // https://github.com/VCVRack/Fundamental/blob/v1/src/VCO.cpp#L23
    for (int i = 0; i < nSamples; i += 8) {
        // first cache the input samples
        std::array<float, 8> vx;
        for (int j = 0; j < 8; ++j) {
            vx[j] = input[i+j];
        }
        // now calculate
        for (int j = 0; j < 8; ++j) {
            float x = vx[j];
            float x01 = x * 0.5f + 0.5f;
            output[i+j] = ((x01 * factor + offset) * x01 + 3.f) / (x + 4.f);
        }
    }
}
```

This basically tells the compiler that it can store the next 8 input samples in SSE register(s) and safely vectorize. (All variables are now on the stack and writing to the output buffer can’t have any side effects.) The optimizer should completely remove the additional stack copy.

Here are some experiments: Compiler Explorer

Notice how `fun3_8` and `fun4_8` only consist of a couple of instructions, although the C++ code appears to be larger?

Actually, you can generalize the calc function with a template parameter:

```cpp
#include <array>
#include <cstddef>

template<std::size_t N>
void AnalogCurve::next(int nSamples) {
    const float* input = in(0);
    float factor = *in(1);
    float offset = *in(2);
    float* output = out(0);
    // Distortion formula stolen from
    // https://github.com/VCVRack/Fundamental/blob/v1/src/VCO.cpp#L23
    for (int i = 0; i < nSamples; i += N) {
        // first cache the input samples
        std::array<float, N> vx;
        for (std::size_t j = 0; j < N; ++j) {
            vx[j] = input[i+j];
        }
        // now calculate
        for (std::size_t j = 0; j < N; ++j) {
            float x = vx[j];
            float x01 = x * 0.5f + 0.5f;
            output[i+j] = ((x01 * factor + offset) * x01 + 3.f) / (x + 4.f);
        }
    }
}
```

Now you can use the same code for both cases: `next<8>` for the unrolled case and `next<1>` for the general case (the compiler will remove the unnecessary inner loops, except in Debug builds). You can even use `next<4>` and `next<2>` for buffer sizes of 4 and 2, respectively, but you have to check whether the resulting code bloat doesn't actually make things worse.

You can give it a try and see if it gives you enough of a performance benefit to justify the ugliness.

Pd uses this crude technique for many of its ugens, and that's where I learned it.

BTW, for maximum performance, make sure to configure the plugin with `NATIVE=ON`. The default configuration only enables SSE2, which has 128-bit vectors. Most modern computers support AVX2, which has 256-bit vectors. However, the resulting binary will not be portable.
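Assuming a typical out-of-tree CMake build (directory names here are just an example), that would be something like:

```shell
# NATIVE=ON typically maps to -march=native, so the compiler may use
# every instruction set of the build machine (e.g. AVX2) -- but the
# binary may crash with "illegal instruction" on older CPUs.
cmake -B build -DCMAKE_BUILD_TYPE=Release -DNATIVE=ON
cmake --build build
```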

[^1]: It can work if you decorate all input and output pointers with `__restrict`, which tells the compiler that they don't alias each other. However, I'm not sure if this is always safe, since the pointers *can* alias, but only at the same sample index. Also, `__restrict` is not standard C++.
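For illustration, the `__restrict` variant looks like this (a sketch with a made-up function name; GCC, Clang and MSVC all accept the `__restrict` spelling, but it is a compiler extension, not standard C++):

```cpp
// The __restrict qualifiers promise the compiler that input and output
// never alias, so it can vectorize without the stack-copy trick.
// Calling this with overlapping buffers would be undefined behavior.
void applyGain(const float* __restrict input, float* __restrict output,
               float gain, int nSamples) {
    for (int i = 0; i < nSamples; ++i)
        output[i] = input[i] * gain;
}
```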