Scsynth underperforming compared to other Jack clients

Sorry if I missed it, but did you also compile 3.11.2 yourself? I’m asking because you can only really compare versions that have been compiled with the same compiler and settings.

“Methodology” may not be exactly the right word.

I guess the key question is, what degree of CPU-usage-before-glitching would you accept? What is the goal?

The discussion keeps coming back to the performance of the test app. My opinion is that the test app usefully shows that your system can run real-time audio up to 80% load, under specific and limited circumstances, but that 80% is not a benchmark that every audio app will necessarily reach.

I’m willing to be proven wrong, but I think a better focus for this thread would be to improve the stability to a usable level.

Now, how to do this exactly, I’m not sure. I can only report my experience. I tried a few times in vanilla Ubuntu to install a low-latency kernel and run the tuning script, but I could never get it right. Even playing an audio file would occasionally glitch. Eventually I gave up and installed Ubuntu Studio and those problems went away.

I never got anywhere near 80%, but in a live coding show, I can overload the texture to the point where it’s hard to control the music’s direction – the struggle for my musical style is keeping the texture cleaner and more stripped down. It’s enough.

I fully agree that 25% is below what I would like, but it’s worth considering what is a good balance between idealism and pragmatism.

hjh

Yes, I did compile all versions myself in order to use the -DNATIVE=ON flag with cmake.

Well, on macOS I can reach 80%, so something close to that would be nice. I am using both macOS and Linux at the moment and would like to switch to Linux for good, as the newer macOS versions don’t cut it for me any longer.

Same with SC 3.11.1. Working my way up to see when something changed.

I think you are certainly on to something! Once you find the version that first introduces the performance regression, I will have a look at the changes.

In an attempt to reproduce the results I communicated earlier (e.g. at the beginning of this thread), I became aware that I cannot, which is utterly frustrating. I have been careful not to change too many things at a time, but somehow I must not have been systematic enough.

What I have not documented is the room temperature, which may also be an important variable. I may have performed the earlier tests at higher room temperatures, as we had a little heat wave here in Graz recently.

When monitoring the CPU frequencies during the tests, I noticed that the governor (set to performance) is still scaling the frequencies in a wide range between 3.9 and 4.7 GHz (some 20%).
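
Something like the following could be used to log the per-core frequencies from sclang while a test runs. This is only a minimal sketch: it assumes a Linux system exposing the cpufreq sysfs files, and that cores 0 to 3 are the relevant ones.

(
// Minimal sketch: poll scaling_cur_freq once per second and post the per-core frequencies in GHz.
// Assumes /sys/devices/system/cpu/cpuN/cpufreq/scaling_cur_freq exists and is readable.
fork {
	loop {
		var freqs = (0..3).collect({ |i|
			File.readAllString(
				"/sys/devices/system/cpu/cpu%/cpufreq/scaling_cur_freq".format(i)
			).asInteger / 1e6   // the file reports kHz
		});
		("CPU GHz: " ++ freqs.collect(_.round(0.01))).postln;
		1.wait;
	}
}
)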

All this shows that collecting data to diagnose my problem is not trivial. This is why I have redone all the tests and documented them more carefully. This should allow me to rerun them at a later time to check whether the results still hold. I may also have to reinstall my system to rule out any configuration errors I may have made.

Here is a table with my latest results, using two different kernels and Jack settings. The comparisons with earlier versions of SuperCollider didn’t show any significant differences.

SuperCollider: git checkout tags/Version-3.11.2
compiled with cmake flags -DCMAKE_BUILD_TYPE=Release -DNATIVE=ON
Audio interface: USB Behringer UMC202HD
Room temperature: 28 degrees Celsius

Kernel: 5.4.0-77-lowlatency

  Jack: 64/2
    client 15500: 75%
    scsynth  600: 45%

  Jack: 256/2
    client 18000: 80%
    scsynth 1050: 67%

Kernel: 5.8.0-59-generic

  Jack: 64/2
    client 15000: 70%
    scsynth  500: 35%

  Jack: 256/2
    client 19000: 85%
    scsynth 1100: 70%

Jack configuration is given as Frames per Period / Periods per Buffer
client stands for: extra-load-jack-client <n> (find the source here)
scsynth stands for: sclang sine-test.scd <n>

The numbers <n> for client and scsynth are the highest settings I have found iteratively that don’t produce an xrun during a period of 1 minute.

sine-test.scd contains the following code:

(
var num = thisProcess.argv[0].asInteger;

Server.default.options.maxNodes = 2048;

Server.default.waitForBoot({
	// one synth = ten summed sine oscillators, scaled down and duplicated to stereo
	SynthDef(\sin10sn, { |out = 0|
		Out.ar(out,
			(SinOsc.ar(
				NamedControl.kr(\freq, Array.fill(10, 440))
			).sum() * 0.0005).dup()
		)
	}).add();
	Server.default.sync();
	// spawn <n> synths, each with ten random frequencies
	n = Array.fill(num, {
		Synth.new(\sin10sn, [freq: Array.fill(10, { exprand(200, 800) })])
	});

	// post the server's average and peak CPU load once per second
	loop {
		[s.avgCPU, s.peakCPU].postln;
		1.0.wait;
	};
});
)

What happens if you use higher Jack buffer sizes? For both kernels I notice that the higher Jack buffer size produces results that are closer to each other. I would assume that this trend continues with even higher buffer sizes. This makes sense, because a higher buffer size makes it easier to deal with variations in CPU load.

I think that @jamshark70 has made a very good point that the loop in your test client performs very uniform work. In a realtime context we don’t care so much about average performance as about worst-case performance. For example, an RT allocator is not necessarily faster than a non-RT allocator on average; it just has a well-defined upper bound. Maybe try to add some actual DSP code to your test client to produce a more realistic scenario.
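
To illustrate the average-versus-worst-case point on the SC side: with a bursty load, s.peakCPU runs well ahead of s.avgCPU, and it is the peak that determines whether a block misses its deadline. A rough sketch (the \burst SynthDef and all timing values are arbitrary choices):

(
// Rough sketch of a non-uniform load: many short-lived synths started at irregular intervals,
// so node setup/teardown and DSP cost vary from block to block.
s.waitForBoot {
	SynthDef(\burst, { |out = 0, freq = 440|
		var env = Env.perc(0.001, 0.05).ar(doneAction: Done.freeSelf);
		Out.ar(out, SinOsc.ar(freq) * env * 0.01 ! 2);
	}).add();
	s.sync();
	fork {
		loop {
			20.do { Synth(\burst, [freq: exprand(200, 2000)]) };
			(0.005 + 0.02.rand).wait;
		}
	};
	fork {
		loop { [s.avgCPU.round(0.1), s.peakCPU.round(0.1)].postln; 1.wait };
	};
};
)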

It would also be interesting to see if there’s a significant difference in performance between scsynth and supernova with the same sclang code.
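
For reference, switching the default server program can be done from sclang before booting; as far as I know, both class methods below are available in recent SC versions:

// Sketch: choose the server program before booting (or quit and reboot the default server).
Server.supernova;   // subsequent boots of the default server use supernova
// Server.scsynth;  // switch back to scsynth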

Supernova crashes due to a bad_alloc, in different versions, with James’s test…

I am not sure if you were asking me, @lucas. In any case, my test machine is not a laptop but an Intel NUC Mini-PC. When I run the test, lscpu reports around 4 GHz.

It seems to perform well at that frequency. The hypothesis is that throttling degrades performance the same way turbo does. Something I noticed on my laptop: if one core gets loaded, the temperature rises to 90 to 100 °C and the frequency drops to 2 GHz at most. That also affects the DSP load percentage, which seems to constantly adapt to the current top frequency. So it’s very difficult to measure real performance in terms of DSP load. It’s kind of complex and sad: I can’t know what I’m buying; it is supposed to be better, but it isn’t.

And for that matter, I had a CPU from two or three generations earlier that ran at 2 GHz and 60 °C without fan noise. I can’t explain how disappointing it was to spend that money.

It won’t. Supernova does no parallel processing at all, unless you put nodes into a ParGroup.

Without ParGroups, supernova might actually clock in a little slower than scsynth.

Revised for supernova (edited to remove an unrelated experiment):

(
var num = thisProcess.argv[0].asInteger;
var group;

Server.supernova;

Server.default.waitForBoot({
	SynthDef(\sin10, { |out = 0|
		Out.ar(out,
			(SinOsc.ar(
				NamedControl.kr(\freq, Array.fill(10, 440))
			).sum()).dup()
		)
	}).add();
	group = ParGroup.new;
	Server.default.sync();
	n = Array.fill(num, {
		Synth.new(\sin10, [freq: Array.fill(10, { exprand(200, 800) })],
			target: group, addAction: \addToHead)
	});

	{ ReplaceOut.ar(0, In.ar(0, 2) * 0.001) }.play(target: group, addAction: \addAfter);

	loop {
		[s.avgCPU, s.peakCPU].postln;
		1.0.wait;
	};
});
)

I couldn’t reproduce this crash. But I have seen supernova crash when confronted with large numbers of groups being created and reordered (which is why I don’t dare use it in a show anymore).

~70% is already quite excellent. It’s indeed puzzling that you couldn’t get this at first. I can understand how that would erode confidence in the system, leading to more testing to be sure these results will hold.

But… bumping it up a few more percentage points may quickly become a matter of diminishing returns.

Also, comparing against your Mac results (since you said earlier that you wished to match the performance on your Mac):

… 2500 or 3000 more sines in Linux, at a lower CPU reading. Hm. It occurs to me that the Mac test may have been at 64 samples, so I may be drawing a wrong inference.

hjh

I find these tests interesting, but I’m not sure how informative they are.
SinOsc (for instance) does a lot: it checks its inputs, has interpolation checks, etc.
scsynth also does a lot: it manages (even if you aren’t using them) a chunk of memory for control and audio buses, etc.
So I’m not sure what the comparison is really showing.
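
For what it’s worth, the relative cost of the oscillator itself is easy to probe: spawn a large bank of one UGen type, note s.avgCPU once it settles, then repeat with another type. A rough sketch (the count of 2000 is arbitrary):

(
// Rough sketch: measure the settled server load for a large bank of one oscillator type.
// Swap SinOsc for e.g. FSinOsc or Saw and compare the readings.
s.waitForBoot {
	{ Mix.fill(2000, { SinOsc.ar(exprand(200, 800)) }) * 0.0001 ! 2 }.play;
	fork { 5.wait; [\avg, s.avgCPU.round(0.1), \peak, s.peakCPU.round(0.1)].postln };
};
)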

All these comparisons were a (possibly avoidable) detour to get to the bottom of the phenomenon that triggered this thread and its predecessor:

When using a small Jack blocksize (e.g. 64), the update of the CPU load numbers in scide causes xruns, at least on my machine.

I could verify this in two ways:

  1. Running the same code from the IDE versus loading it directly into sclang (see example below). In the former case, xruns appear at a much lower CPU load than in the latter.

  2. Disabling the load display in the IDE’s source code. With the IDE modified in this way, I could get almost the same performance as when loading the test code directly into sclang.

Below you will find my test code, which is designed to load the CPU unevenly. This can be verified by comparing it to another Jack client that produces a very even load: that client can sustain higher CPU loads without causing xruns. Everything depends, of course, on the Jack blocksize, as xruns are more likely to happen with smaller blocks. See the test results earlier in this thread.

My initial goal was to see how much I can do in SC on Linux with a small blocksize, and I was frustrated by the fact that, if I ran my code in the IDE, I could do very little (num = 200 in the code below). When run directly in sclang from the command line I can run 10 times more: sclang test.scd 2000. I tuned my numbers (200 and 2000) so that the test runs for at least one minute without an xrun.

( // file: test.scd
var arg1, num;

// number of synths: taken from the command line argument, default 200
arg1 = thisProcess.argv[0];
arg1.isNil().if({
	num = 200;
}, {
	num = arg1.asInteger();
});

// quit a possibly running default server so the new options take effect on reboot
Server.default.quit();
Server.default.options.maxNodes = 4096;
Server.default.options.memSize = 16384;

Server.default.waitForBoot({
	// each synth is a stereo pair of sines; amplitude 1/num keeps the sum from clipping
	num.do({ { SinOsc.ar([200, 202], 0, num.reciprocal()) }.play() });
});

)

Hmmm… Some conjecture. I’m no expert on thread load balancing, but something doesn’t make sense to me here.

The GUI update and the audio thread must obviously be different threads, since they belong to different processes. We’re assuming that the OS’s load balancer would put these on different cores. But if the GUI update is blocking the audio thread, that would suggest that these threads are ending up on the same core. Which is… :confused: why. Why would the OS do that, and do it consistently, when there are 11 other cores it could use?

I could be missing something, but I thought that was the whole benefit of multicore chips for real-time audio environments – if one core is too busy with DSP, user interaction could offload to another core and run literally concurrently.

I guess if it were me, I’d look up processor affinity commands to see if I could pin scsynth to one or two cores, and scide/sclang to a couple different cores – try to forbid the OS from doing IDE drawing on the same core.
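
An untested sketch of that idea, assuming a Linux system with taskset and pidof on the PATH (the core numbers are arbitrary):

(
// Untested sketch: pin the audio server and the language to disjoint core sets.
var scsynthPid = "pidof scsynth".unixCmdGetStdOut.asInteger;
if(scsynthPid > 0) {
	("taskset -cp 0,1 " ++ scsynthPid).unixCmd;    // scsynth on cores 0-1
};
("taskset -cp 2,3 " ++ thisProcess.pid).unixCmd;   // this sclang process on cores 2-3
// scide itself would still need to be pinned from a terminal, e.g.: taskset -cp 2,3 $(pidof scide)
)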

At least from my naive understanding of multicore hardware, this is not in the slightest expected behavior. (But I may well be misunderstanding – I’m not considering system threads in this speculation.)

hjh

Yes, I am puzzled too. I have tried to keep scide and scsynth on different processors with taskset, but this didn’t help. Minimizing the IDE does help. Maybe it is related to some weird behavior of Qt?

Right, my question wasn’t about running with ParGroups/on multiple cores, it was about having something very similar to scsynth to test with. It would be quite interesting, for example, if the same sclang code performed very differently on one core with supernova than it did with scsynth, as that would point to an issue with scsynth rather than somewhere else.

It turns out that printing to the post window once a second also has a similar effect on my machine, i.e. it provokes xruns compared to not posting.

I thought that rolling my own server under the nose of the IDE would solve the problem, but it only partially does. With the posting loop running, I can go up to 400 when running from the IDE and up to 1900 when running from the terminal.

It must be said that any update in the window system will provoke an xrun in this test, but printing to the terminal does so much less often than printing to the IDE post window. The latter might also be problematic in other situations, e.g. with larger block sizes.

Here is my test code:

(
var arg1, num, options, server;

arg1 = thisProcess.argv[0];
arg1.isNil().if({
	num = 400;
}, {
	num = arg1.asInteger();
});

options = ServerOptions.new();
options.maxNodes = 4096;
options.memSize = 16384;

server = Server.new("myserver", NetAddr.new("localhost", 57115), options);

server.waitForBoot({
	num.do({ { SinOsc.ar([200, 202], 0, num.reciprocal()) }.play(server) });
	loop {
		"% %\n".postf(server.avgCPU.round(0.1), server.peakCPU.round(0.1));
		1.wait();
	};
});
)