Mac M1 Performance comparison

Hello,

While testing a new 16" MacBook Pro M1 Max, I did a little comparison with my old 15" MacBook Pro (2.9 GHz Core i7, 2016).

And I was a little bit surprised by the results below.
The server performs better on the new M1 machine, and better still when SC is compiled natively for Apple Silicon.
But although the language seems a little faster when benchmarking a function, when playing a complex pattern with lots of events and parameters I notice worse language-side performance on the new M1 machine.
It is difficult for me to share a simple example, since my setup requires installing a lot of things, but has anyone else experienced this poor performance with tasks and patterns on the language side of SC on M1? And can anyone explain why?
Or does someone know a simple piece of code for stress-testing the language CPU (as shown in Activity Monitor) with patterns or tasks, to confirm the poor language-side performance of the M1 when playing patterns?
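
For example, would something as simple as this be a fair test? (This is just a sketch: the number of streams, the note rate, and the use of the default synth are arbitrary placeholders, not my actual setup.)

```supercollider
(
// Candidate language-side stress test: many dense Pbinds at once.
// Watch sclang's CPU in Activity Monitor while this runs.
s.waitForBoot {
    ~players = 20.collect {
        Pbind(
            \degree, Pwhite(0, 14, inf),
            \dur, 0.05,          // 20 events per second per stream
            \amp, 0.05
        ).play
    };
};
)

// stop all streams:
// ~players.do(_.stop);
```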

3 benchmarks:

  • 1 = Execution time of a function (.bench)
  • 2 = CPU of the Language by playing a complex pattern (Activity Monitor)
  • 3 = CPU of the server (SuperCollider)
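
For benchmark 1, I use sclang's built-in .bench; the workload function below is just an arbitrary example, not the one I actually timed:

```supercollider
// Benchmark 1: time a function with .bench, which posts the elapsed time.
{ 1000000.do { 2.sqrt } }.bench;
```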

On 3 computers/versions of SuperCollider:

  • Apple Silicon SC on MacBook Pro 16 M1 Max 2021
    → 1 = 0.44 / 2 = 32%-35% / 3 = 12-20%
  • Intel 3.12.2 SC on MacBook Pro 16 M1 Max 2021
    → 1 = 0.55 / 2 = 32%-35% / 3 = 22-28%
  • Intel 3.12.2 SC on MacBook Pro 15 2.9GHz Core i7 2016
    → 1 = 0.50 / 2 = 22%-30% / 3 = 35-45%

Many thanks,

Christophe

A word of warning: CPU usage cannot be used to measure performance, ESPECIALLY on laptops. Laptops down-clock their CPUs when cycles are not needed, to reduce power consumption. An idealized laptop CPU would always be running at 100%, because it would operate at exactly the clock speed required to get its work done and no more. So what you might be seeing, in the cases where the M1 shows higher CPU usage, is simply that the hardware is e.g. more efficient at scaling clock speed for short thread wake-ups, like the kind required to wake sclang to generate events.

The function execution time measurement is probably a relatively good measurement though. The best performance test for the server is to come up with a simple-but-still-non-trivial Synth, and then measure how many copies of it you can run before you get audio dropouts.
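
A rough sketch of that kind of test (the SynthDef and batch size here are arbitrary; the point is just to keep adding copies until you hear dropouts):

```supercollider
(
// Add copies of a simple-but-non-trivial synth in batches;
// evaluate ~addBatch repeatedly until you hear audio dropouts.
SynthDef(\stress, { |out = 0|
    var sig = RLPF.ar(
        Saw.ar(Rand(50, 500)),
        LFNoise2.kr(1).range(200, 2000),
        0.1
    );
    Out.ar(out, (sig * 0.002) ! 2);
}).add;

~count = 0;
~addBatch = {
    50.do { Synth(\stress) };
    ~count = ~count + 50;
    ("running synths:" + ~count).postln;
};
)

// run this line repeatedly:
// ~addBatch.value;
```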


Hello,

I have been testing the native version of SC on M1 (MacBook Pro M1 Max) and I can run roughly twice as many synths as on a 2018 Intel i9 MacBook Pro, very impressive! I can also run about 20% more synths (I have not made a precise measurement) compared with the Rosetta version.
However, sometimes it becomes slow when I send a lot of information to the scsynth server, and it loses OSC messages. The problem is that it’s random: most of the time it works great, and other times it’s very slow… It looks like an OSC overflow. Another thing that happens with both the Rosetta and the native M1 version: when I load a lot of sound files into buffers (about 800), sometimes loading is very slow, much slower than on Intel, which again looks like an OSC overflow. I tried using this code to post a message when each read is done:

(
var buffers = "/sound_folder/*".pathMatch.collect({ |path|
	Buffer.read(s, path, action: { ("load_SF" + path).postln });
});
)
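
One workaround I may try is throttling the reads by syncing with the server inside a routine, something like this (the folder path and batch size are placeholders):

```supercollider
(
// Throttle Buffer.read calls: sync with the server every 50 files
// instead of firing ~800 read commands at once.
fork {
    ~buffers = "/sound_folder/*".pathMatch.collect { |path, i|
        var buf = Buffer.read(s, path);
        if(i % 50 == 49) { s.sync };   // wait for the server to catch up
        buf
    };
    s.sync;
    "all buffers loaded".postln;
};
)
```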

Maybe some adjustments need to be made to make it completely stable on M1? I will also try to test it on another M1, but for now I don’t have access…

All the best,

José

Have you also tried to boot the server with the TCP protocol instead of UDP?
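
In case it helps, switching protocols is just this (assuming a recent SC version where ServerOptions has a protocol field):

```supercollider
// Boot scsynth with TCP instead of UDP:
s.options.protocol = \tcp;
s.reboot;
```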

Thanks,
@Jose I also see a great improvement in performance on the server side when comparing:

  • Apple Silicon SC on MacBook Pro 16 M1 Max 2021
  • Intel 3.12.2 SC on MacBook Pro 15 2.9GHz Core i7 2016

I can load many more synths (almost double).

@scztt But I still do not understand why, for the same task (patterns and synths), Activity Monitor shows the CPU of the SC server divided by 2 (from 40% to 20%, which suggested to me that I could load many more synths before hitting the ceiling, and that is the case), while the CPU of the SC language increased by about half (from 22% to 32%), which suggested worse performance?
That said, overall I do notice an improvement on the language side too, but the improvement seems to have been much more beneficial on the server side than on the language side.

I just tested, and it seems to be the same over TCP; I would have to run more tests. At the moment everything is working fine. I wonder whether the system sometimes performs background tasks and scsynth loses performance because of that, especially in relation to OSC communication? Very strange…

Going from 40% to 20% does not necessarily mean that you can load twice as many synths. For realtime audio especially, the CPU meter is not really a good indicator of how much you can do at once - or, at least, it’s a reasonable guess but with a VERY large margin of error.

There are a few possible explanations for the audio-server vs. language usage differences. It might be that one of the two is more memory-bound, and (for example) the speed of memory and caches differs greatly between the two test cases. Faster memory makes memory-bound tests run much faster, while CPU-bound tests look about the same. I believe M1s actually have cores with different processing power, so it could be that you hit a scenario where sclang was running on a “slower” (efficiency) CPU core? I’ve read a few articles about this, but I haven’t looked closely at how it is handled.

I’d like to add circumstantial evidence from my testing (MBP 2020 with the “regular” M1, 4 efficiency + 4 performance cores):
I also noticed ~15% difference when running scsynth on M1 between Rosetta and native build. My test was to run as many synths as possible before noticing dropouts and I was able to run about 15% more synths on the native build. Unfortunately I can’t find the exact numbers atm…
This is not a 100% accurate test (the boundary of dropout is “soft” - is one dropout enough? I think I settled for >1 dropout / second or something like that) but probably better than looking at the CPU meter.
IMO a 15-20% penalty for running emulated code is a very good result.