CPU headroom, multithreading

Ah! @Sam_Pluta has already posted how to use multiple servers efficiently.
Sorry!

And, I have a question:

Assigning an application to a specific CPU thread is done by the OS, not by the user.
Given that, which is more efficient: using supernova, or using multiple scsynth servers?
I ask because supernova seems to show flatter CPU usage, but I am not entirely sure…

IIRC supernova requires you to group processes that read each other’s outputs into Groups, and processes that don’t into ParGroups; these can (and should) be nested as needed. So it requires a little care, but I’m getting some pretty great results just now goofing around…
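In case a concrete picture helps, here is a minimal sketch of that structure, assuming two hypothetical SynthDefs \src (independent voices) and \fx (something that reads their output):

(
// \src and \fx are placeholder SynthDefs, just to show the structure.
~voices = ParGroup(s);        // independent synths: may run in parallel
~fx = Group.after(~voices);   // reads the voices' output: runs after the whole ParGroup
8.do { Synth(\src, [\out, 10], target: ~voices) };
Synth(\fx, [\in, 10, \out, 0], target: ~fx);
)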

I find that Supernova is more efficient with large, complex, routing-heavy synths and graphs even when it is only run on a single thread (i.e. without a ParGroup).

Here is a comparison between using ParGroup and not using it:

[Screenshot: 2024-05-27 at 20.19.40]

[Screenshot: 2024-05-27 at 20.19.16]

The elements in your ParGroup aren’t computationally expensive enough to see a benefit from multithreading.

Switching threads is actually very expensive; it is only when the things inside the ParGroup are themselves computationally expensive that you’d see a benefit. Try making a synth with hundreds of comb filters, or placing groups inside the ParGroup, each containing hundreds of these small synths; you should see a difference then.
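For example, something along these lines should make the difference visible (a rough sketch; the SynthDef name, the filter count and the synth count are arbitrary):

(
// Each synth is deliberately expensive: a few hundred comb filters mixed down.
SynthDef(\combHeavy, {
	var sig = Mix.fill(200, {
		CombC.ar(PinkNoise.ar(0.001), 0.2, ExpRand(0.01, 0.2), 2)
	});
	Out.ar(0, sig * 0.05)
}).add;
)

(
// Put several of them directly inside a ParGroup so supernova can spread
// the work across DSP threads.
~p = ParGroup(s);
8.do { Synth(\combHeavy, target: ~p) };
)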

Measuring CPU usage like this isn’t simple either, as (I think) the main thread does a ‘spin lock’ until the other threads are done, meaning it will report close to 100% usage while they are processing even though it is actually doing nothing meaningful. Also, CPU usage isn’t really important (well, it is for energy consumption…); what matters for audio is how long it takes to perform the calculation: when waiting for memory to load, the CPU will report a low percentage, but might actually be unable to process the data fast enough.


it is only when the things inside the ParGroup are themselves computationally expensive that you’d see a benefit.

Yes! See also the last paragraph in Question on Graph / Topological Sort - #16 by Spacechild1

Also, CPU usage isn’t really important (well, it is for energy consumption…); what matters for audio is how long it takes to perform the calculation

True! For benchmarking realtime audio performance, the criterion really is: “how many X can I run simultaneously before getting dropouts?”
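As a rough illustration of that criterion, here is the kind of loop I’d use, assuming some expensive placeholder SynthDef \heavy:

(
// \heavy is a placeholder for whatever expensive SynthDef you are benchmarking.
// Add one synth per second and watch the post window / server status for xruns;
// the last count without dropouts is your number.
fork {
	var count = 0;
	loop {
		Synth(\heavy);
		count = count + 1;
		("now running % synths".format(count)).postln;
		1.wait;
	}
}
)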


What’s your OS? How many cores/threads do you have? And what’s the reason for s.options.threads = 10?

The OS is macOS 14.5, and the machine is a MacBook Pro 2021 (M1 Max).

AFAICT, the M1 Max has 8 performance cores and 2 efficiency cores, so you don’t want more than 8 DSP threads!

In general, I can imagine that Supernova does not work all that well on Apple Silicon machines because of the new CPU design (performance cores + efficiency cores). I think there is no guarantee that the DSP helper threads will always run on the performance cores and we would really need Use audio workgroups for supernova DSP thread pool. · Issue #5624 · supercollider/supercollider · GitHub.

Anyway, I would be curious to see your results with a lower number of threads, e.g. 8, 6 or 4.
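For reference, the thread count only takes effect when the server (re)boots, so something along these lines (the values are just examples):

s.quit;
s.options.threads = 8; // supernova only; try 8, 6 or 4 here
s.boot;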


Here is a demo video:
demo video

and screenshots:



A small addition - this kind of testing should always be done with the laptop plugged in. When running on battery power, CPU clock speeds are much more uneven, so the effective performance you get for audio is much lower than the “average” CPU clock speed at any given time would suggest. This has historically been a problem with Macs, as they often make higher-power processors feasible in their laptops through extremely aggressive power management - which is fine for most things, but can have an outsized impact on audio processing.


Wanted to just post a benchmark that shows supernova’s strengths a little bit better.

12 × AMD Ryzen 5 5600X 6-Core Processor

I’m defining stable as ‘does not produce xruns while opening and closing the server meter over a one-minute period’.

server type   num threads   num synths   stable     cpu% per core   perf delta   relative perf per core
scsynth       1             20           stable     37              100%         benchmark
nova          1             19           stable     35              95%          95%
nova          2             30           stable     41              150%         75%
nova          3             33           stable     40              165%         55%
nova          6             37           stable     48              185%         31%
nova          12            37           unstable   48              185%         15%

Things to take away:

  • Using the virtual/hyperthreaded cores that the CPU reports hinders performance (supernova defaults to six threads, not twelve, so it’s fine).
  • Supernova is roughly equivalent to scsynth when run on one core.
  • Diminishing returns are reached quickly.
  • Xruns occur well before reaching a CPU limit, implying that memory access or thread synchronisation is the bottleneck.
  • Across the used cores, supernova’s CPU usage is flat.

If anyone else wants to run this and report back, here it is!

(
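// Benchmark configuration: server type, DSP thread count and number of synths
// (the defaults below match the last stable nova row in the table above).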
var numThreads = 6;
var numSynths = 37;
var useSupernova = true;

if(useSupernova) {
	Server.supernova
} {
	Server.scsynth
};

s.options.threads = numThreads;
s.options.memSize = 2 * 1024 * 1024;

~fftsize = 2048;

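// Build a sparse, decaying random impulse response, load it into a buffer and
// prepare a PartConv spectrum buffer from it (the raw IR buffer is freed afterwards).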
~createIR = {
	var ir, irbuffer, bufsize, buffer;
	ir = [1] ++ 0.dup(100) ++ (
		(1, 0.999998 .. 0).collect {|f|
			f = f.squared.squared;
			f = if(f.coin) { 0 }{ f.squared };
			f = if(0.5.coin) { 0 - f } { f }
		} * 0.1
	);
	ir = ir.normalizeSum;
	
	irbuffer = Buffer.loadCollection(s, ir);
	
	s.sync;
	
	bufsize = PartConv.calcBufSize(~fftsize, irbuffer);		
	buffer = Buffer.alloc(s, bufsize, 1);
	buffer.preparePartConv(irbuffer, ~fftsize);
	
	s.sync;
	
	irbuffer.free; 	
	buffer;
};

s.waitForBoot {
	Window.closeAll;
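	// One deliberately heavy synth: partitioned convolution of an impulse train.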
	SynthDef(\bigOne, {
		var part = PartConv.ar(Impulse.ar(340), ~fftsize, \irbuf.kr(-1), 0.5);	
		Out.ar(0, part)
	}).add;
	
	s.sync;
	
	~g = ParGroup();
	
	s.sync;
	
	~irbuf = ~createIR.();
	numSynths.do{ 	
		Synth.head(~g, \bigOne, [\irbuf: ~irbuf])
	};	

};
)

Which OS?

As a side note, there is a pending PR of mine that improves Supernova performance on Windows (and under certain circumstances also on Linux): Supernova: thread affinity fixes/improvements by Spacechild1 · Pull Request #5618 · supercollider/supercollider · GitHub

Linux, current dev branch. Oh yes I saw that!

If supernova is, as it seems, always at least as performant as scsynth, what is scsynth for?

AFAICT Supernova was indeed intended as a replacement for scsynth. However, it hasn’t been tested all that well and has had many bugs, which made many people hesitant to use it. I have fixed a few nasty bugs in the past, but there are still some open issues.

Also, Supernova has been designed as a singleton, i.e. there can only be a single instance per binary. This means that you cannot reasonably use it in an audio plugin, for example. Scsynth, on the other hand, allows multiple instances via libscsynth, similar to libpd.

This thread made me try to build a multithreaded UGen of 8 sawtooth waves using std::thread in C++. Long story short… this is not efficient at all! I’m happy I got it to work, though.

Sam

IIRC TimB’s thesis about Supernova has a section discussing things like thread wake-up and synchronization times. I remember this being useful for me even when I didn’t understand the deeper details, because it put a scale on the performance cost of having a multithreaded pipeline at all.

macOS now has kernel-level ways to synchronize multiple threads to the main audio thread (or, at least, the audio workgroup stuff implies that they’re doing a better job at synchronization…). I’m curious whether there’s any significant performance improvement to be had from a Supernova-like architecture that properly uses these.

Somewhat tangentially, I’ve been cautiously in the market for a new laptop and I was surprised to find that ONE laptop review website actually runs realtime audio tests for most of their reviews. I don’t know the test they’re running very well, and it’s a Windows-only thing so it’s unlikely to apply cleanly to Linux or Mac - but this information is still important, can make a huge difference for an audio laptop, and I don’t really know of any other way to find it (apart from borrowing a laptop and running tests yourself). I’ve literally brought a USB stick into an Apple Store in the past to try to run SuperCollider dropout tests and see what a 3k laptop upgrade was actually getting me in terms of, you know, grain count :slight_smile: - this is a bit easier.

For example: https://www.notebookcheck.net/Lenovo-ThinkPad-T16-G2-in-review-Quiet-office-laptop-with-long-battery-life.739438.0.html#c10090093 (look for the DPC Latency / LatencyMon test)

(If anyone wants to recommend a solid Linux-capable laptop with a big GPU, please DM me!)

Writing proper multi-threaded code, especially in a (soft) realtime context, is far from trivial. In particular, you can’t really create and join threads on the fly; instead you need to maintain a thread pool and a (lockfree) task queue. You’d also need to set the appropriate thread priorities, possibly pin threads to specific cores, etc. Also, as @scztt already hinted, you’d need to make sure that the individual workloads are large enough to outweigh thread wake-up times and context-switching costs.

That being said, it is possible to do multi-threaded DSP even within a UGen (or across multiple UGens), but you really need to know what you’re doing. For example, VSTPlugin has an option for multi-threaded plugin processing which can reduce CPU load significantly. (It essentially offloads the plugin processing to helper threads, keeping the main audio thread free for other tasks.)
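
For anyone curious what that looks like from sclang, here is a minimal sketch; I’m going from memory of the VSTPlugin docs here, so treat the multiThreading argument name and the plugin path as assumptions:

(
// Insert a VST plugin on bus 0; "GChorus.vst3" is just an example path.
SynthDef(\vstInsert, { |bus = 0|
	ReplaceOut.ar(bus, VSTPlugin.ar(In.ar(bus, 2), 2));
}).add;
)

~fx = VSTPluginController(Synth(\vstInsert));
// multiThreading: true (as I recall from the docs) offloads the plugin's DSP
// to helper threads, keeping the main audio thread free for other work.
~fx.open("GChorus.vst3", multiThreading: true);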