Real-time Audio Processing (or, why it's actually easier to have a client-server separation in Supercollider)

In Client-Server Architecture Limitation Cases, I mentioned that I could go into depth about why the SuperCollider architecture is the way it is, and why it allows real-time audio processing. I’ll try to give a non-computer-scientist introduction as best I can… If you’re not interested in the expository part, there are some practical tips related to performance at the end.

What does “real-time” mean for audio?

For our purposes, we can consider real-time to mean that we have a fixed - and very short! - window of time where we can process audio in order to have it played back by our audio device. This is how audio playback works for applications and plugins:

  1. Periodically, the audio driver calls the application / plugin to ask it for the next chunk of audio. When it does this, it will also usually provide things like:
    (a) input audio, e.g. from a microphone,
    (b) event information, things like MIDI events.
  2. The driver calls the application a little bit before it needs the chunk of audio, so there’s enough time for the application to process it.
  3. The driver may ask for larger or smaller chunks of audio - the size of the chunk is usually called the buffer size.
  4. The further ahead the application is called, and the larger the buffer size, the more “out of date” the audio input and event information - (a) and (b) above - are when processing happens. So for many (but not all…) cases, we’re trying to make this time as short as possible. This delay is usually called latency.
  5. The application will process all of the events, which might entail things like starting/stopping a synth, changing an internal parameter, etc.
  6. Then, the application will do the audio processing required to provide the requested chunk of audio. This could include rendering waveforms, running filters - basically all the things you’d put in a SynthDef.
  7. If the application doesn’t finish processing audio by the time the audio driver needs it, it won’t have any audio to play - usually, in this case, it will play silence. This results in a drop-out.
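
The steps above can be sketched as a typical native audio callback. This is only an illustration - the names (`audioCallback`, `MidiEvent`) are made up for this sketch and don’t belong to any specific driver or plugin API:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical event type, standing in for MIDI or other event data.
struct MidiEvent { int status; int data1; int data2; };

// The driver calls this shortly before it needs `frames` samples of audio.
void audioCallback(const float* input,                    // (a) e.g. microphone input
                   float* output,                         // the chunk we must fill in time
                   size_t frames,                         // the buffer size
                   const MidiEvent* events, size_t numEvents) {
    // Step 5: handle events first (start/stop synths, change parameters, ...)
    (void)events; (void)numEvents;                        // nothing to do in this sketch

    // Step 6: render `frames` samples into `output` - here, a trivial pass-through.
    for (size_t i = 0; i < frames; ++i)
        output[i] = input ? input[i] : 0.0f;
}
```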

So, how much time does the application have to provide its finished, processed chunk of audio? Not very long - a short but still pretty standard buffer size is 256 samples, which translates to about 5 milliseconds (at a 48k sample rate). This is optimistically assuming that the application is the ONLY thing processing audio - or really doing anything - on the computer. So, the practical time can be much less than this.

If it does not finish processing and provide some audio in less than 5ms, you hear a drop-out. This is our “real-time” deadline.
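
As a sanity check on that number, the deadline is just arithmetic - one buffer’s worth of samples divided by the sample rate. The helper name here is hypothetical, not part of any API:

```cpp
#include <cassert>

// Milliseconds available to produce one chunk of audio:
// the driver needs a new buffer every (bufferSize / sampleRate) seconds.
double deadlineMs(int bufferSize, double sampleRate) {
    return 1000.0 * bufferSize / sampleRate;
}
```

So 256 samples at 48kHz gives roughly 5.33ms - and that’s before anything else on the machine takes its share.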

Worst-case scenarios and audio vs. video

Drop-outs are very noticeable in audio, and they sound bad - compare this with video, where a few dropped frames at 60fps may be only a mild annoyance, or not noticeable at all. Different contexts might require different limits in terms of what we would consider bad: if I’m practicing at home, one drop-out per minute would get annoying enough for me to care. If I’m playing my first-ever gig at Berghain, even one drop-out in 4 hours would be unacceptable.

So, when we talk about real-time audio, what we care most about is the worst-case scenario: if our processing callback is super fast and smooth 99% of the time, but slow just once, that could be one drop-out too many for us.

This leads us to one of the counter-intuitive things about real-time audio: we don’t really care too much about how fast something is in the normal cases - we mainly care how fast something is in the 1% case where we miss our deadline. You could improve the 99% cases as much as you want, but as long as the 1% case is still slow, you’ll still drop a big old glitch at 110dB to a club full of dancers.

Avoiding the worst-case-scenario during your first Berghain gig

We can help to avoid missing our deadlines by doing three things:

  1. Making sure we only do the smallest amount of work possible when the audio driver asks us to process the next chunk. We should do anything we can ahead of time, so that it’s ready when the call comes.
  2. We should, at all costs, avoid things that take unknown or widely varying amounts of time. Remember, we care most about the worst-case scenario - so if we’re calling a function renderKickDrum() that takes between 0.1ms and 2ms, the only number we care about is 2ms.
  3. We should avoid waiting for anything else when processing audio.

It turns out that, when you’re dealing with time spans of less than a millisecond, a lot of things fall into group 2: allocating memory, communicating with threads, devices, or processes, reading things from disk, even accessing in-memory data like a sample you’ve already loaded from disk. Generally, it’s best to assume that any function you call - unless you wrote the code yourself, you’ve read through the source exhaustively, or it’s part of a real-time safe library - could potentially be in group 2, and thus shouldn’t be used when you’re processing audio.
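
A concrete version of rule 1 looks something like this: do the allocation in a non-real-time context, and make the audio-side code touch only memory that already exists. The type and method names (`SamplePlayer`, `prepare`, `process`) are invented for this sketch:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct SamplePlayer {
    std::vector<float> sample;   // loaded and allocated ahead of time
    size_t position = 0;

    // Called from a non-real-time context (e.g. a loading thread).
    // All allocation happens here, safely off the audio thread.
    void prepare(std::vector<float> loaded) {
        sample = std::move(loaded);
        position = 0;
    }

    // The audio-side part: no allocation, no locks, no I/O -
    // just reads from memory that prepare() already set up.
    void process(float* out, size_t frames) {
        for (size_t i = 0; i < frames; ++i)
            out[i] = position < sample.size() ? sample[position++] : 0.0f;
    }
};
```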

It may be that, for some function you’re calling, the “worst-case-scenario” is not really that bad. But tracking down these problems is notoriously difficult - real-time drop-out bugs like this can easily be multi-day or even multi-week debugging ultra-marathons. It’s easy to imagine why - first, you have to reproduce the bug, which might mean sitting around playing audio for hours waiting for a drop-out to occur. Then, you have to infer after the fact what part of your audio process might’ve taken an unexpectedly long amount of time.

A good audio architecture reduces the chances of having hidden “worst-case-scenarios” by strictly limiting what can be done during an audio callback. Even communication has to be limited: imagine if your audio process had to request something from your UI, and then wait for a response (category 3 above). Now, the worst-case-scenario time of that UI communication is added to the worst-case-scenario of your audio process. It’s best to imagine it along these lines: anything you touch in your audio process potentially adds its worst-case-scenario to the worst-case-scenario of your audio process. If you touch a bunch of other complex systems in your audio process… well, you now have to worry about each one of their worst-case-scenarios as well.

Usually, good audio architectures deal with this by addressing the three categories above:

  1. Run anything you can ahead-of-time, and have it ready for the audio process when the call comes. This is usually done on some kind of background thread, or in some other way that won’t interrupt the audio process.
  2. Stick to running a very small, sanitized subset of code in your audio process. Only things that are really-truly-for-sure real-time safe should ever be done when processing audio.
  3. Use real-time safe communication methods: lock-free queues, single-writer-single-reader patterns, etc. This may seem like a very small and specific computer-science-y thing, but audio thread communication messages are extremely difficult to get right. Good audio architectures generally solve this problem once, correctly, and then force all communication to go through this path.
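
To make point 3 a little more concrete, here is a minimal sketch of a single-producer single-consumer ring buffer - the kind of lock-free queue a control thread might use to hand messages to the audio thread without ever making it wait. This is a teaching sketch, not a production implementation (a real one would also worry about cache-line padding and more carefully audited memory ordering):

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

template <typename T, size_t Capacity>
class SpscQueue {
public:
    // Called by the control thread ONLY. Never blocks: if the queue
    // is full, the message is dropped rather than making anyone wait.
    bool push(const T& item) {
        size_t w = writeIndex.load(std::memory_order_relaxed);
        size_t next = (w + 1) % Capacity;
        if (next == readIndex.load(std::memory_order_acquire))
            return false;                       // full
        buffer[w] = item;
        writeIndex.store(next, std::memory_order_release);
        return true;
    }

    // Called by the audio thread ONLY. Also never blocks: if there's
    // nothing waiting, the audio thread just carries on processing.
    bool pop(T& item) {
        size_t r = readIndex.load(std::memory_order_relaxed);
        if (r == writeIndex.load(std::memory_order_acquire))
            return false;                       // empty
        item = buffer[r];
        readIndex.store((r + 1) % Capacity, std::memory_order_release);
        return true;
    }

private:
    T buffer[Capacity];
    std::atomic<size_t> writeIndex{0};
    std::atomic<size_t> readIndex{0};
};
```

The single-writer-single-reader restriction is what makes this safe without locks: each index is only ever written by one thread, so neither side can ever be stuck waiting on the other.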

Okay, but why is there a separate audio server in SuperCollider?

The SuperCollider “server” encapsulates the core functionality I just described above:

  1. It can run long operations ahead-of-time (mainly, loading SynthDefs and Buffers) and then have the results ready for audio processing.
  2. It provides a well-defined library of real-time-safe code you can run, in the form of unit generators, UGens.
  3. It provides a way to communicate with audio processing without waiting, or injecting more “worst-case-scenario” time - OSC messages.

The server builds a pretty impregnable wall around this functionality. It does this so that it can provide a very strong guarantee: if you stick to the tools it has provided, you don’t have to worry about any of the nightmarish complexities of meeting real-time audio deadlines. You won’t need to pick-and-choose which libraries you can use when processing audio, or which function calls you can or can’t make, or who you can communicate with.

Notably, in SuperCollider 2 you could use a lot more sclang functionality “on the server”. This meant power - you were not limited to the kind of “throw-a-message-over-the-wall” OSC communication. But it also meant you, as a programmer/artist, had to pay a lot of attention to what kinds of things you did, and exactly how you did them. When there’s no wall, it’s extremely easy to pull in little bits of complexity without realizing it: each bit of complexity comes with its own worst-case-scenario, which gets added to the worst-case-scenario of your audio processing call. When you start getting drop-outs, it’s then your problem to figure out which worst-case-scenario is the one that pushed you over the edge and ruined your otherwise masterful Berghain set.

This is why it’s actually easier and more user-friendly to have a separate audio server

The walls and limitations in place that constrain how you interact with the SuperCollider server are just the architectural limitations of interacting with any real-time audio process, as I described above.

Good C++ audio architectures have the same walls: they may not use the same abstractions (e.g. OSC messages, SynthDefs), but if they’re well designed, you’re highly likely to run into equivalent constructions. At its best, SuperCollider makes the boundaries between audio processing and… everything else exceedingly clear (in most cases), which is truthfully not something I can say about a lot of C++ audio architectures.

So, it’s best to consider the client-server separation like this: anything you find that you can’t do easily on the server, is probably something that you SHOULDN’T do - at least, not unless you’re really interested in mucking with lock-free queues, thread primitives, and asynchronous task processing… all of the things that you’d need to start worrying about once the wall comes down.

In theory, nothing makes it inside the wall of the server without being totally sanitized and safe, which means you can have the freedom to do whatever creative / wild / totally ill-advised things you want outside the wall, and you will continue to have smooth audio playback while you do them.

Practical take-aways

Design for the worst-case-scenario

If you’re working on a piece of music, and the final section has 10 simultaneous reverbs running at once, you might as well run those reverbs the whole time. If your piece can’t make it through the 10-reverb-crecendo without drop-outs, it doesn’t really matter if the rest of it is super-efficient. In fact, an approach of leaving everything running can actually help you find potential problems much sooner.

Of course, there are good reasons to start and stop synths, especially for more complex multi-part pieces: but the less complexity you have, the easier it will be to discover and diagnose problems.

Have things ready in advance

Even though SuperCollider is good at e.g. loading Buffers in the background, it can still get itself into trouble. The more you load buffers and define SynthDefs ahead of when you need them, the less you run the risk of hitting drop-outs or performance problems later on.

Ignore average, worry about peak

In an audio context, “average cpu usage” means next to nothing - as I mentioned earlier, an hour of “10% average CPU” and 500 milliseconds of “95% peak CPU” still means noticeable drop-outs. Pay attention to the peak CPU usage, and in particular the peak-of-the-peak, e.g. the highest your peak CPU usage jumps over time. This is your worst case scenario, and this is the one that borks your set.

Some UGens will cause more widely varying peak CPU usage than others - for example, reading from and writing to in-memory Buffers can take a surprisingly variable amount of time when you’re thinking in sub-millisecond timeframes. You can sometimes see this manifest with e.g. complex sets of long delay lines. If you’re having performance problems, focus on finding UGens that cause more spiky peak CPU usage, rather than ones that seem to contribute to the overall average. It’s often easy to brute-force this by simply walking through your synth, replacing likely culprits with something trivial like DC.ar(0), and observing the result.

Adjust your latency and buffer size

The latency of a server (e.g. Server.default.latency) is how far in advance sclang schedules events, and thus how long the server has to make sure all the required resources are ready. This also represents a time delay between when you tell the server to do something, and when you actually hear the result.

Buffer size (Server.default.options.hardwareBufferSize), as mentioned earlier, relates to the size of the chunks of audio requested from the server - this also incurs some delay, since these chunks have to be requested before it’s time to play them.

Increasing either one of these - but mainly the buffer size - increases the tolerance the server has for “worst-case-scenarios”, at the cost of being slightly later to respond to incoming MIDI events, or to process input audio. Increasing the latency setting only becomes important in cases where you’re creating and destroying lots of synths, using more complex synths or UGens, or e.g. long delay lines. I’ve never had to make my latency higher than the default of 0.2.

The good news is - if you’re not using an external controller or sensors, and not processing any live input, you can turn either of these way up with no real negative consequences. If you DO have live input or external controllers/sensors, you’ll have to find a balance so that e.g. your input audio isn’t noticeably delayed, or your MIDI keyboard doesn’t feel sluggish when you play it.


@scztt thanks a lot for this!!!

If you (and others here) could also clarify a little bit about the difference between .hardwareBufferSize and .blockSize it would be great!!

IMO this text should be blended somehow into the SC docs (maybe some chunks of your text into different parts of the SC docs). After some discussion here on this topic, I will create an issue on GitHub.

The blockSize is the number of samples that the Server processes for each DSP tick, where it dispatches scheduled OSC messages and then walks over all the UGens and calls their processing methods. It is also the control period size. You should probably leave it at the default of 64 samples, unless you have a good reason to change it.

The hardwareBufferSize is the number of samples requested by the audio device. It is usually larger than blockSize, so the audio callback will compute several DSP ticks in quick succession. Larger hardware buffer sizes allow for more fluctuation in CPU load:
If the hardware buffer size is 64 samples (the same as the block size), each DSP tick has to be completed in less than 1.45 ms (assuming 44.1 kHz sample rate).
If the hardware buffer size is 256 samples, 1 or more DSP ticks can take longer than 1.45 ms, as long as the total duration of all 4 DSP ticks (4 * 64 = 256) is less than 5.8 ms (256 / 44.1). This is especially relevant for audio algorithms with very uneven CPU load, such as FFTs. (A 1024 point FFT with no overlap has to buffer for 15 ticks and then perform a heavy computation in 1 tick.)
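
The numbers above come straight from the same arithmetic as before - a per-tick “fair share” and a total per-callback budget. Helper names are made up for illustration:

```cpp
#include <cassert>

// Each DSP tick's fair share of the deadline, in milliseconds.
double tickShareMs(int blockSize, double sampleRate) {
    return 1000.0 * blockSize / sampleRate;
}

// The total budget for one audio callback: all of its DSP ticks
// together must finish within this window.
double callbackBudgetMs(int hardwareBufferSize, double sampleRate) {
    return 1000.0 * hardwareBufferSize / sampleRate;
}
```

With a 256-sample hardware buffer at 44.1 kHz, a single tick can blow well past its ~1.45 ms share (say, for an FFT), as long as the four ticks together stay under ~5.8 ms.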


One good reason could be feedback: with some of the standard ways to do feedback (LocalIn / LocalOut, InFeedback), the blockSize (control period) is the minimum feedback delay. A smaller blockSize can lead to dramatically different results.
