Nova::vec SIMD performance


#1

As I am beginning to write UGens, I am also trying to implement SIMD versions to get the most out of them.

Two questions have come up:


  1. It seems that in my case sin() and cos() are significantly slower on nova vectors (vec<float>) than when simply iterating through and calculating sin/cos on each sample. (MacBook Pro, 2.9 GHz Intel Core i7)
const int vs = nova::vec<float>::size;
const int loops = nSamples / vs;
for (int i = 0; i < loops; ++i)
{
	vec<float> r, r2, sinr, cosr, cosrm2, sinr2, cosr2;
	r.load_aligned(rotation);
	r2 = r * 2;
	sinr = sin(r);
	cosr = cos(r);
	cosrm2 = cosr * 2;
	sinr2 = sin(r2);
	cosr2 = cos(r2);

	cosrm2.store_aligned(cm2);
	sinr.store_aligned(s);
	cosr.store_aligned(c);
	sinr2.store_aligned(s2);
	cosr2.store_aligned(c2);

	rotation += vs;
	cm2 += vs;
	s += vs;
	c += vs;
	s2 += vs;
	c2 += vs;
}

(rotation, cm2, s, etc. are float * for data buffers, but could be I/O buffers, etc…)

While I haven’t meticulously isolated the cost of sin/cos on these vectors, in the context of the UGen I’m writing, simply swapping in a basic sample-iterating pattern that doesn’t use nova::vec:

for (int frm = 0; frm != nSamples; ++frm)
{
	float r = *rotation++;
	float r2 = r * 2;
	float cosr = cos(r);
	*s++ = sin(r);
	*c++ = cosr;
	*cm2++ = cosr * 2;
	*s2++ = sin(r2);
	*c2++ = cos(r2);
}

shows that the scalar version is twice as fast. This seems counterintuitive, as I typically see a pretty decent speedup on things like SIMD binary operations.

Is this expected? How this is implemented is a bit obscure to me…

Is there something obviously wrong with how I’m using nova::vec?
I would imagine specific performance is architecture dependent.
If there is indeed a slowdown, might it be affecting other operations as well?


  2. This is a speculative question about whether this exists or could be implemented (feature request): is there something like a “SIMD pointer type” for which a custom iterator could be defined that steps by nova::vec<float>::size?
    Such an iterator could make code like that above more concise:
const int vs = nova::vec<float>::size;
const int loops = nSamples / vs;
for (int i = 0; i < loops; ++i)
{
	vec<float> r, r2, sinr, cosr, cosrm2, sinr2, cosr2;
	r.load_aligned(rotation++);
	r2 = r * 2;
	sinr = sin(r);
	cosr = cos(r);
	cosrm2 = cosr * 2;
	sinr2 = sin(r2);
	cosr2 = cos(r2);

	cosrm2.store_aligned(cm2++);
	sinr.store_aligned(s++);
	cosr.store_aligned(c++);
	sinr2.store_aligned(s2++);
	cosr2.store_aligned(c2++);
}

… and potentially get a performance boost?

Thanks for any insights!


#2

Hi Michael,

Have you tried looking at the disassembly for your benchmarking code? Also, what flags are you using to compile?

IIRC gcc and clang are able to optimize code like this into calls to the library function sincos which calculates both transcendental functions at the same time with a negligible overhead. That would be my first guess as to what’s happening here.

Brian


#3

Hi Brian,

There is indeed a call to the library function sincos in the version that doesn’t use SIMD:

+0xd1 callq "DYLD-STUB$$__sincosf_stret"

There is no such call in the SIMD version. I could post the full disassembly if it’s helpful, but there isn’t an obvious one-to-one correspondence. If that would be useful, I should probably pare down to a simpler UGen to isolate the functions.

The call stacks look like this:

[screenshot: SIMD call stack]

[screenshot: non-SIMD call stack]
So we can see DYLD sincosf_stret isn’t used in the SIMD version, and the self-weight of the calc function is ~2.5x that of the non-SIMD case.

BUT I am just now noticing the call to libsystem_m.dylib, just before HoaRotateLoops:next_5, which is taking 50% of the Graph_Calc. This isn’t present in the SIMD case, so I’m thinking that I should have been comparing the combined weight of these calls, in which case the SIMD case would be faster. :man_facepalming:

Here’s looking at the total Graph_Calc weight, one level up the call stack:

[screenshot: SIMD (simd2)]

[screenshot: No SIMD (no-simd2)]

If this is indeed the case, does a 5% improvement seem like a reasonable SIMD gain?
Would anyone care to sanity check me on this? Still feeling my way through benchmarking…

FWIW I’m using the Cookiecutter template, building in Xcode as Release, and otherwise I haven’t specified any build flags explicitly, aside from enabling NOVA_SIMD, as shown in this PR which modifies CMakeLists.txt:


#4

libsystem_m is the OSX core system math library, so time showing up in there is almost surely calculation time. A 1.5x speedup for nova is a bit disappointing, but probably a realistic speed-up from vectorization for something like sin.

FYI from the context menu for any of those Instruments line items, you can choose “charge to caller” for either individual functions or whole libraries - this will remove them from the list and add the time spent in them to whoever is calling (the assumption being that e.g. you might optimize how or whether you call libsystem_m, but you won’t be optimizing libsystem_m itself).


#5

Given that the only difference between the two versions of the UGen is the code snippet in the OP, whether the system math library is called appears to be up to the nova code. So I’m wondering whether nova is skipping the system math library and calculating sin/cos some other way. As I mentioned, the disassembly of the SIMD version doesn’t show a call to sincosf, though if it “knew” to call it, it might actually be faster. Just speculation…

Assuming I’m comparing apples to apples, and interpreting the Time Profiler weights correctly, the speedup from boost appears to be only about 1.06x; I’d hoped it would be a bit more.

I’m wondering if the reason it isn’t calling sincosf is that, because the sin and cos operations are vectorized separately, the compiler doesn’t “see” the optimization of calculating them together?

I’d imagine that if there’s anything actionable here, it would involve looking deeper into how nova delegates operations, but I can’t tell at this point whether it’s worthwhile or whether I’d be chasing my tail.

The hope would be that if this reveals an opportunity to revisit/revise this operation delegation in nova, any potential gains would trickle up to all the UGens using it. Though that would have to be left to someone familiar with the nova design…