Performance of SC on OSX

dkmayer · August 18, 2020, 1:51pm

This topic isn’t affecting my use of SC all that much, but my impression is that the performance of SC on OSX has become considerably worse within the last decade. E.g. CPU usage on my more than 7-year-old iMac (10.10) is 2-3 times lower than on my MacBook Pro from 2018. It doesn’t have to do with SC versions.

I’ve just updated to Catalina and deactivated hyperthreading, following James’s recommendation in this parallel discussion; that improved things by ca. 10 %:

https://scsynth.org/t/what-limits-the-maximum-server-workload-in-sc

I checked with all system settings recommended here, but no change:

Any further recommendations or ideas? James mentions disabling CPU frequency scaling, does anyone have experiences with this?

scztt · August 18, 2020, 3:40pm

Raw CPU usage isn’t a good point of comparison - it’s likely that a big part of what you are seeing is simply the OS/hardware more efficiently setting the clock speed of the CPU to match how much it’s being utilized. From a theoretical perspective, perfect energy efficiency would mean you’d be at 100% CPU usage at all times - because your CPU would only be using the exact number of clock cycles it needs to complete its work. This would also explain differences between an iMac and a MacBook, since the latter is clocked much more aggressively to minimize power usage.

Having said that - aggressive clocking DOES still tend to hurt the actual performance of realtime applications: I find that, on any recent MacBook, I hit my CPU limit in SuperCollider and other realtime audio apps BEFORE I’ve reached a level of utilization that causes the clock speed to shift upwards. This obviously shaves off some amount of CPU time I could be using for audio. In my experience, realtime audio performance on Apple products is not really predicted by regular CPU benchmarks or clock speed - so e.g. there have been big processor speed bumps that have shown very little improvement for audio. I haven’t yet seen an example of a processor bump that caused a decrease, but it’s possible if it was paired with e.g. a much more aggressive clocking change in the BIOS?

A few things that might be helpful when thinking about performance issues:

1. “Peak CPU” is a good shorthand measurement in SC, but the only real measurement is: how much can you run before you get drop-outs.
   If you want to do a rigorous performance comparison, make a little Synth that contains a good variety of different kinds of UGens (make sure to include some memory-access heavy UGens like delays/comb filters), and consumes on the order of 3-4% CPU by itself. Then, start playing them until you get drop-outs: that count represents your CPU budget on that machine.
2. The “disable hyperthreading” tip is at least 10 years old. Processor architectures have evolved massively in that time, and no one is really running realtime audio benchmarks around this (including audio software companies and, probably, Apple). If anyone is considering the hyperthreading trick, I would highly suggest building a couple test cases like I described in [1] and comparing the hyperthread-vs-non-hyperthread performance on your own machine.
3. The biggest thing you can do for a laptop - potentially even beyond the hyperthreading hack - is to plug it in, and keep it cool. It’s extremely easy to raise the temperature of your laptop in a matter of a few minutes, and this WILL down-clock your CPU. And, of course, an unplugged power cord is likely to trigger a lot of more aggressive clocking conditions as well.
   I would bet dinner and a beer that a dorky gaming laptop stand that has built-in fans and active cooling would net you an extra 5% CPU headroom in SuperCollider after an hour of use.
4. Use the largest hardware buffer size you can get away with based on your latency requirements. It can be pretty high even if you’re e.g. triggering with a MIDI keyboard - and if you’re only turning knobs, even higher (2048+). This especially applies to memory-access-heavy workflows: reading/writing a lot of buffers, lots of long delays, granular synthesis - because the worst case execution time of these things can be much more unpredictable due to cache misses.

If anyone runs any CPU headroom tests in SC, please report how you did them, and what you saw back here! I’m sure it will be very helpful to others trying to configure their systems as well.
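
A rough sketch of the drop-out test described above, in case it helps as a starting point. It adds one synth every two seconds and posts the server's average and peak CPU, so you can note the count at which audio starts to break up. The SynthDef name \cpu_probe, the voice count n, and the memSize value are assumptions made up for this sketch, not something from the thread; tune n until a single instance sits around 3-4% on your machine.

```supercollider
(
// Delay lines use real-time memory, so ask for a generous amount.
// Note: this only takes effect if the server is not already running.
s.options.memSize = 8192 * 32;

s.waitForBoot {
	// Hypothetical test synth: oscillators plus comb delays,
	// so that some memory-access-heavy work is included.
	SynthDef(\cpu_probe, { |out = 0, amp = 0.1|
		var n = 20;
		var src = Saw.ar({ ExpRand(100, 800) } ! n);
		var sig = CombC.ar(src, 0.2, { Rand(0.05, 0.2) } ! n, 2);
		Out.ar(out, Mix(sig) / n * amp ! 2);
	}).add;

	s.sync;  // wait until the SynthDef has reached the server

	~ramp = Routine {
		var count = 0;
		loop {
			Synth(\cpu_probe);
			count = count + 1;
			2.wait;
			// posts [ number of synths, average CPU, peak CPU ]
			[count, s.avgCPU.round(0.1), s.peakCPU.round(0.1)].postln;
		}
	}.play;
};
)

// stop adding synths once drop-outs appear, then clean up
~ramp.stop;
s.freeAll;
```

The last count posted before drop-outs is the figure to compare between machines or settings; the CPU numbers are only there for orientation.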

fmiramar · August 18, 2020, 3:48pm

Could this also be related to the Apple T2 Security Chip? I don’t have enough knowledge of hardware and OS to judge this video, but I would like to know whether the claims here are correct (and whether it affects SC):

scztt · August 18, 2020, 3:53pm

A lot of the tips in that Sweetwater article strike me as questionable at best… FileVault already had a negligible performance impact 8-10 years ago, with mechanical hard drives - no one should disable this. It’s good to quit background apps and processes of course, but disabling your firewall is ridiculous and dangerous.

The one piece of good advice that may seem a little unexpected: turn off automatic clock sync, as there’s a known issue with the SuperCollider server where a big clock resync can cause crashes when there are scheduled messages in flight.

scztt · August 18, 2020, 3:55pm

This would affect SC, for sure. The T2 thing was an issue specific to external USB audio devices, IIRC - I thought it was fixed as well, but I could be wrong? Definitely worth making sure your system is as up-to-date as possible.

Sam_Pluta · August 18, 2020, 3:54pm

When I bought my 2016 macbook pro, this was the first machine that I had bought in a while that was less powerful per core than my previous machine, a 2012 mbpro. This is exactly why I moved to my multi-server setup. The new machine couldn’t run my software!!!

The other thing is, the multi-core issue is not going away. Apparently the ARM CPUs will have 12 cores at first? We will either need to run multiple servers or supernova needs to be brought to parity with server. I can’t do the latter, so that is why I used the multi-server approach.
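
For readers who haven't tried it, a minimal sketch of what a multi-server setup can look like: several scsynth processes, each free to land on its own core, all driven from one sclang. The server names, the port numbers, the count of four, and the \ping SynthDef are assumptions for illustration only; this is not the actual setup described above.

```supercollider
(
// Hypothetical pool of four local servers on ports 57111..57114
// (57110 is left free for the default server).
~servers = 4.collect { |i|
	Server(("worker" ++ i).asSymbol, NetAddr("127.0.0.1", 57111 + i))
};

~servers.do { |srv|
	srv.waitForBoot {
		// each server needs its own copy of the SynthDef
		SynthDef(\ping, { |out = 0, freq = 440|
			Out.ar(out, SinOsc.ar(freq, 0, 0.05) ! 2 * EnvGate())
		}).send(srv);
		srv.sync;
		// the third argument selects the server the synth runs on
		Synth(\ping, [\freq, 300 + 100.rand], srv);
	};
};
)

// shut everything down again
~servers.do { |srv| srv.quit };
```

The cost of this approach is bookkeeping: buffers, buses and SynthDefs are per server, so work has to be divided up explicitly.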

Sam

dkmayer · August 18, 2020, 11:04pm

Thanks for all your replies, that broadened the picture and confirmed my concerns! I will do some tests following @scztt’s point (1), which looks very reasonable to me, and report back then.

dkmayer · August 18, 2020, 11:21pm

@scztt, thanks for your detailed answer! I always had an eye on peaks too, but the p/a ratio seemed to be pretty much the same on the two machines. However, as said, I will do some benchmarking with different synths as you suggested in (1).
Ad (2), maybe this is more efficient on Linux, as reported in the parallel thread. At least, it had some impact on OSX too.
Ad (4), when working with Patterns in realtime this is hardly feasible. It’s not only about latency, accuracy gets worse too ( https://scsynth.org/t/imperfection-of-language-based-timing ). The clock calibration leads to larger deviations (I did not include this in my post at the time, but I did run tests that showed this effect). However, concerning CPU usage it’s a point I haven’t considered – will check. When not working with Patterns it would definitely make sense.

dkmayer · August 20, 2020, 12:11am

Hello,

here are some quick tests with three SynthDefs that took 3-4 % average CPU on my desktop. I find the results very interesting. To sum up: average CPU turned out to be higher on the laptop, but in the end dropouts are reached at pretty much the same number of synths. This is good news, as it shows that there’s at least no decline in these examples, as one could assume from the higher average CPU (and it confirms Scott). A larger hardware buffer size (I went from 512 to 4096 with built-in audio) could raise the limit by a third in some examples, so this is really a good option (when not using patterns)!

I got the limits by taking the largest number of synths that didn’t give dropouts when moving the IDE window (a rough check indeed). This coincided with keeping peak CPU below 90 % (which can be difficult to estimate with many LFOs, e.g.).

desktop: iMac 3.2 GHz i5, 2013, OSX 10.10
laptop: MacBook 2.2 GHz 6-core i7, 2018, OSX 10.15

I’ve used a Normalizer at the end of the signal chain, but be careful with amplitudes anyway because of dropouts.

// boot with extended resources
(
s.options.maxNodes = 1024 * 64;
s.options.memSize = 8192 * 16;
s.options.numWireBufs = 64 * 16;

s.reboot;
)


// SynthDefs, n chosen to reach 3-4 % avg CPU
(
SynthDef(\cpu_test_1, { |out, freq = 400, amp = 0.1|
	var n = 350;
	var sig = SinOsc.ar(freq ! n, 0, amp) / n;
	Out.ar(out, Mix(sig) ! 2 * EnvGate())
}).add;

SynthDef(\cpu_test_2, { |out, freq = 400, amp = 0.1|
	var n = 100;
	var lfo = SinOsc.ar(10).range(10, 50);
	var sig = { |i| VarSaw.ar(lfo, 1 / (i + 1)) } ! n;
	sig = BPF.ar(sig, freq, 0.01) * amp * 100 / n;
	Out.ar(out, Mix(sig) ! 2 * EnvGate())
}).add;

SynthDef(\cpu_test_3, { |out, freq = 400, amp = 0.1|
	var n = 80;
	var lfo = SinOsc.ar(10).range(10, 50);
	var sig = { GrainSin.ar(1, Impulse.ar(lfo), 0.05, freq, mul: amp * 0.2) } ! n / n;
	Out.ar(out, Mix(sig) ! 2 * EnvGate())
}).add;

SynthDef(\normalizer, { |inBus, amp = 0.1|
	Out.ar(0, Normalizer.ar(In.ar(inBus, 2), amp))
}).add;

~bus = Bus.audio(s, 2);
)


// produces num synths playing all to the normalizer bus
(
~makeCpuTest = { |synthType = 1, num = 10, midi = 60, amp = 0.1|
	var instr = \cpu_test_ ++ (synthType.asString), synths, group;

	Task {
		group = Group();
		0.2.wait;
		(
			instrument: \normalizer,
			inBus: ~bus,
			amp: amp,
			group: group,
			addAction: \addAfter
		).play;
		synths = num.collect {
			(
				instrument: instr,
				dur: inf,
				midinote: midi,
				amp: amp,
				group: group,
				out: ~bus
			).play
		}
	}.play;
};
)

// play single synths

~makeCpuTest.(1, 1, 65)
~makeCpuTest.(2, 1, 65)
~makeCpuTest.(3, 1, 65)


// limits on my desktop

~makeCpuTest.(1, 23, 65)
~makeCpuTest.(2, 27, 65)
~makeCpuTest.(3, 18, 65)

// limits on my laptop

~makeCpuTest.(1, 25, 65)
~makeCpuTest.(2, 22, 65)
~makeCpuTest.(3, 18, 65)



// limits on my laptop with higher hardware buffer size
// after reboot 

(
s.options.maxNodes = 1024 * 64;
s.options.memSize = 8192 * 16;
s.options.numWireBufs = 64 * 16;
s.options.hardwareBufferSize = 4096;

s.reboot;
)

~makeCpuTest.(1, 27, 65)
~makeCpuTest.(2, 30, 65)
~makeCpuTest.(3, 27, 65)

Greetings

Daniel

Sam_Pluta · August 27, 2020, 3:46pm

One more reflection on this. Shouldn’t the SuperCollider language app run on multiple CPU cores? The server should only run on one, but as far as I understand, the lang shouldn’t have the same restrictions. But on my machine, it definitely seems to be constrained to one core. In a very cpu intensive process, it is maxing one core out at 100%, but not spreading the love around at all.

Is this something that is a conscious part of the design? I was definitely under the impression that this didn’t use to be the case.

Sam

scztt · August 27, 2020, 8:22pm

Most scripting languages do not support simultaneous execution on multiple cores, and SuperCollider is no exception. If you’re in a Python or Ruby context, this is referred to (or complained about as) the GIL, global interpreter lock. My guess is that this is mainly because one big advantage of scripting languages is the relative safety from hard heap-corrupting crashes and undefined behavior, and multi-threaded execution is one of the easiest ways to trigger these sorts of issues. The major exception here is the JVM, which supports multithreaded execution - so JVM implementations of other languages (e.g. JRuby, Jython) support multithreaded execution, though I don’t know how deep the support goes.

If you truly have long-running, CPU intensive tasks to run in SC, you might consider using something like the API quark to communicate with a pool of sclang processes running in the background, which could pick up pieces of work and return the result. You wouldn’t get easy sharing of memory between them, but then shared memory between threads in a multithreaded language is rarely trivial either - there’s no free lunch. I suspect that a simple “shared worker pool” quark would be very popular and widely used.

dkmayer · August 27, 2020, 8:54pm

@Sam_Pluta, just curious, what kind of CPU-heavy language operations do you have in mind?
I think it has not been mentioned in the discussions so far: preprocessing might often be a neglected option. As James calls it, eager vs. lazy evaluation – big chunks of data can be calculated and stored within fractions of a second and then be used, e.g., in a Pseq or Task. Analogously server-side: NRT seems to be not widely used – rendering of some minutes of sound might also be done very quickly before playback. And the transition from eager/RT to lazy/NRT can be differentiated by occasional automated calculation of data.
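
A small sketch of the eager variant, assuming a booted server and the default instrument. The formula that generates the pitches is an arbitrary stand-in; the point is only that the expensive part runs once, before playback, and the pattern then just reads stored values.

```supercollider
(
// Eager: compute all values up front (the formula is a made-up placeholder).
~freqs = Array.fill(10000, { |i|
	(60 + ((i * 0.37).sin * 12)).midicps
});

// Playback only reads from the precomputed array.
~player = Pbind(
	\freq, Pseq(~freqs, 1),
	\dur, 0.1
).play;
)

~player.stop;
```

The lazy counterpart would compute each value inside the pattern (e.g. with a Pfunc) at the moment it is needed, which spreads the language-side cost over the performance instead of paying it in advance.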

Spacechild1 · August 27, 2020, 9:01pm

Small nitpick, but I think that interpreter lock and multi-instance support are orthogonal issues. Python and Ruby just happen to be designed as singletons, so the interpreter lock is necessarily global. However, it is possible to conceive of a language with multi-instance support and with a dedicated lock for each interpreter instance.

Actually, interpreter locks can be useful for concurrency because C extensions can temporarily release the lock and do some work (e.g. file I/O, heavy matrix multiplications) while the interpreter continues to run code from another thread.

Sam_Pluta · August 27, 2020, 9:57pm

Well, the issue I was running into today was that I had a Dictionary with 1.25 million points in it, and I was trying to traverse it. It was difficult to tell if it was just going slow or crashing. I think maybe it was both. The CPU was sitting at 98%, which I thought was odd. Now I know that a Dictionary cannot be that big, haha.

The explanation Scott gave about the lang clarifies the situation.

But I am all about multiple servers and NRT these days. I got these points by running 20 parallel servers doing MFCC analysis. It took 2 days, but it made it!

I guess the summary of my issue is that the big data needs big data, and big data needs lots o cpu.

jamshark70 · August 29, 2020, 5:07am

It doesn’t even need to be a crash.

fork {
    // any 10-element array will do; random integers are used as a stand-in
    a = Array.fill(10, { 100.rand });
    a.do { |item|
        if(item.even) { "even" } { "odd" }.postln;
    };
    0.001.wait;
}

SC’s non-preemptive and non-parallel threading means you can have 100% confidence that no other thread or function can alter a between assignment and the completion of the do loop.

If SC’s threading supported parallel processing, then another thread/function running on another core could, say, delete items from a while the loop is running… but the loop is still doing 10 iterations, so at some point you’d get nil and an error… and when you read the code in isolation, it wouldn’t be clear where the bug is – the code looks right but would magically fail under the wrong concurrent scenario.

That is, if you think SC is tricky now, it would be much much trickier with a parallel-processing interpreter.

I saw a lecture recently about functional programming. One of the points was that you might not need a functional language like Haskell to do functional programming – that FP could also be seen as a style of programming that focuses on function inputs and return values and (crucially) avoids mutating object state, preferring to return a new object reflecting the change. Currently Array:removeAt mutates, but if it were written removeAt { |i| ^this.select { |item, j| j != i } }, this would be functional style – there would be no change to the state of the array being processed by a.do and no error even under concurrency.

But that would chew more memory and add garbage collection load… so it may not be really practical.
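
To make the difference concrete, here is a short sketch contrasting the two styles on an ordinary Array. The values are arbitrary; select with an index argument is used as the non-mutating stand-in for removeAt, as in the one-liner above.

```supercollider
(
var a = [10, 20, 30, 40];

// Functional style: build a new array without index 1, leave `a` alone.
var b = a.select { |item, j| j != 1 };

a.postln;  // -> [ 10, 20, 30, 40 ]
b.postln;  // -> [ 10, 30, 40 ]

// Mutating style: removeAt changes `a` in place and returns the removed item.
a.removeAt(1).postln;  // -> 20
a.postln;              // -> [ 10, 30, 40 ]
)
```

Any code still holding a reference to `a` sees the change in the second case but not in the first, which is exactly what matters when several threads could touch the same collection.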

hjh

dkmayer · August 29, 2020, 1:24pm

Certainly true!
Now you’ve solved your problem – but as you mentioned traversing, it might pay to investigate the traversing procedure in detail, detecting possibly CPU-consuming (sub) loops, considering alternative data types etc. E.g. the difference between Dictionary and IdentityDictionary can be huge.

// here traversing with IdentityDictionary is ca. 3.5 times faster!

x = Dictionary[ ("alpha" -> 1), ("beta" -> 2), ("gamma" -> 3) ];

y = IdentityDictionary[ (\alpha -> 1), (\beta -> 2), (\gamma -> 3) ];


{ 100000.do { (["alpha", "beta", "gamma"].sum { |u| x[u] }) } }.bench

{ 100000.do { [\alpha, \beta, \gamma].sum { |u| y[u] } } }.bench

It might also be an option to store in an array in parallel.

Sam_Pluta · August 29, 2020, 4:55pm

> It might also be an option to store in an array in parallel.

Well, this one is weird, because I didn’t realize a Dictionary actually has a collection in it, since it inherits from Set.

So:

x = Dictionary[ ("alpha" -> 1), ("beta" -> 2), ("gamma" -> 3) ];
x.array;

This is why keyValuesDo is more efficient than traversing by key. It goes through the collection rather than looking each value up with the symbol.
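
A quick way to check this on your own machine, sketched with a made-up dictionary of 10000 string keys (size and key names are arbitrary). Both functions compute the same sum; the first looks every value up by key, the second iterates with keyValuesDo.

```supercollider
(
var d = Dictionary.new;
var sum;

// fill with hypothetical test data: "key0" -> 0, "key1" -> 1, ...
10000.do { |i| d["key" ++ i] = i };

// traversal by key: one lookup (hash plus string comparison) per item
{ sum = 0; d.keys.do { |k| sum = sum + d[k] } }.bench;

// traversal with keyValuesDo: walks the backing array directly
{ sum = 0; d.keyValuesDo { |k, v| sum = sum + v } }.bench;
)
```

Timings will vary by machine, so no figures are given here; the interesting part is the ratio between the two posted times.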

Sam

jamshark70 · August 29, 2020, 11:58pm

Yes, using Dictionary with String keys will be just about the slowest lookup that exists in SC, because string equality checking will be much much much slower than symbol equality checking. I’m afraid you accidentally stumbled into a worst case for performance, with a huge collection where performance matters.

Takeaway: Don’t use Dictionary when you need speed. Dictionary is good for small collections where you need keys to be looked up by equality rather than identity.

If you can do without symbolic keys, an array would be best because it’s constant-time lookup (whereas even IdentityDictionary needs to do some scanning for each lookup).

And yes, the iteration methods bypass the lookup and attendant searching altogether.

hjh

VIRTUALDOG · September 1, 2020, 6:15am

I’d like to add that Dictionary and IdentityDictionary are both linearly probed hash maps backed by an array (as Sam pointed out), with a load factor of 0.25. This means that if you have N key-value pairs in the Dictionary, that’s being stored sparsely in an array of size 4N to 8N. So, the effect that memory locality has on performance is also tangible when iterating over a Dictionary vs an Array.

> If you can do without symbolic keys, an array would be best because it’s constant-time lookup (whereas even IdentityDictionary needs to do some scanning for each lookup).

Array will definitely be fast, but assuming a good hash function, (Identity)Dictionary’s lookup is amortized constant time in the number of elements in the container.

Daniel and James have both given some very reliable rules of thumb for using data structures; those all match up with my experience in SC and other languages.