What features do we need for UGen unit tests?

UGen unit tests (perhaps server tests in general, but let’s focus on UGens for now) are hard to write. They touch a lot of pain points: asynchronicity vs the need to minimize test duration, client/server data retrieval, failure tolerance (timeouts mainly, perhaps other factors).

I’m pondering an RFC to add some features to abstract some of the “glue” out into other objects or methods, so that test authors can concentrate on the DSP logic being tested without having to re-engineer thread sync and client communication for every test. This re-engineering is, in my experience, always more time consuming and hair-pulling than I expect.

But I’m not ready to propose specifics yet, so I thought I’d start with a more free discussion here to identify what’s needed.

One thing I can do a bit later (not tonight) is to read over the existing UGen tests for common features. Those would be good candidates to pull out.

Any other thoughts?
hjh

I could imagine two kinds of tests for UGens - analytical tests, and characterization tests (e.g. comparison of output). If we want to see these implemented, they’ll need to be easy to define… Just brainstorming:

SomeUgen : UGen {
	
	*analyticTestCases {
		var self = SomeUgen;
		
		// Each case will be run for X seconds, and the follow-up
		// function will be evaluated on the results.
		^[
			
			{ self.ar(SinOsc.ar(100), amp:0.625, freq:100.0) } -> {
				|test, values|
				test.assert(values.minValue >= -0.625);
				test.assert(values.maxValue <= 0.625);
			},
			
			{ self.ar(SinOsc.ar(100), amp:0.0, freq:100.0) } -> {
				|test, values|
				test.assert(values.minValue == 0.0);
				test.assert(values.maxValue == 0.0);
			}			
		]
	}
	
	*comparisonTestCases {
		var self = SomeUgen;
		
		// Each case will be run for X seconds and compared against
		// a "known-good" baseline for differences. If a baseline does
		// not exist, it will be created.
		^[
			{ self.ar(DC.ar(0), 1.0, 100.0) },
			{ self.ar(WhiteNoise.ar(100), 1.0, 100.0) },
			{ self.ar(DC.ar(inf), 1.0, 100.0) },
		]
	}
	
	*ar {
		|sig, amp, freq|
	}
}

This is the simplest form in which I could imagine tests being expressed - probably there are some details missing, but I can’t imagine reducing things much more than this.

I suppose you can already imagine the pieces that would be required to put this together: a test runner; a “UgenTestDefinition” object that each of the items in the returned lists gets transformed into, e.g. UgenTestDefinition({ self.ar(100) }); and some mechanism to batch-execute NRT renders. I would imagine tests would be executable on NRT or a live server - the results of these should always be identical, but perhaps e.g. we would need NRT in an IBS context where we don’t have an audio driver readily accessible.
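To make that concrete, here is a rough sketch of how such a definition and runner might be used. This is purely hypothetical: neither UgenTestDefinition nor a UGenTestRunner exists; the names just echo the brainstorm above.

(
// hypothetical API sketch only - none of these classes exist yet
var def = UgenTestDefinition(
	func: { SinOsc.ar(440) },   // the graph under test
	duration: 0.1,              // seconds to render
	assert: { |test, values|
		test.assert(values.maxValue <= 1.0);
	}
);
// the runner would batch definitions into one NRT render,
// or play them on a live server when audio hardware is available
UGenTestRunner(mode: \nrt).run([def]);
)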

Some of the server management and rendering stuff is part of a rewrite of UnitTest I did at some point, happy to share that if it would help - I believe it uses OfflineProcess to abstract all of the NRT rendering / async details, but it could just as easily be pointed to a live server instance as well.

@scztt – Very helpful, thanks.

“we would need NRT in an IBS context where we don’t have an audio driver readily accessible.”

That’s an excellent point. It’s tempting to rewrite all of the UGen tests to be NRT-friendly – in theory, it shouldn’t be a problem to design the tests NRT style and simply provide an option to run the Score objects on a RT server if needed.
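For reference, the NRT-first workflow might look roughly like this (a sketch only; the \testSynth def name, paths and timings are placeholders):

(
// build a Score once, then render it offline - or play the same
// Score on a live server if one is available
var score = Score([
	[0.0, [\s_new, \testSynth, 1000, 0, 0]],
	[0.1, [\n_free, 1000]],
	[0.2, [\c_set, 0, 0]]  // dummy message to mark the end time
]);
// NRT path: no audio driver needed
score.recordNRT("/tmp/test.osc", "/tmp/test-out.wav",
	sampleRate: 48000, headerFormat: "wav", sampleFormat: "float",
	options: ServerOptions.new.numOutputBusChannels_(1),
	duration: 0.2);
// RT fallback: requires a booted server
// score.play(s);
)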

However: there is a known case (“Unable to run the Analysis example in NRT Synthesis helpfile”) where an example that retrieves analytical data from an NRT server cannot be run on Windows, and at present there is not enough information to troubleshoot it – I just spent about an hour differential-testing this and gained zero insight into the problem. If an automated test suite were to depend on NRT analysis, we currently have no confidence that it would be usable on Windows.

In some cases, it may be enough for the server to produce a single pass/fail Boolean, e.g.

var sig = SinOsc.ar(440, 0, 0.625);
// InRange outputs 1 while sig is within bounds, so invert it to count violations
var countOutOfRange = Integrator.ar(1 - InRange.ar(sig, -0.625, 0.625));

And countOutOfRange |==| 0 means pass; otherwise, fail.
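Fleshed out a little, that idea might look like the following (untested sketch; the reply path and test bounds are arbitrary choices):

(
SynthDef(\rangeTest, {
	var sig = SinOsc.ar(440, 0, 0.625);
	// InRange outputs 1 while sig is within bounds, so invert it
	// to count out-of-range samples
	var countOutOfRange = Integrator.ar(1 - InRange.ar(sig, -0.625, 0.625));
	var done = Done.kr(Line.kr(0, 1, 0.1));
	// report the single value, then free shortly afterwards
	SendReply.kr(done, '/rangeTest', A2K.kr(countOutOfRange));
	FreeSelf.kr(TDelay.kr(done, 0.01));
}).add;

OSCFunc({ |msg|
	"range test passed: %\n".postf(msg[3] == 0.0);
}, '/rangeTest', s.addr).oneShot;
)

// then, on a booted server:
Synth(\rangeTest);

The language only ever receives one value per test, rather than a whole buffer of samples.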

In the last test I wrote, I tried sending a failure message back to the client immediately upon any failure. That wouldn’t fit the NRT model, though.

hjh

I would suggest considering first whether it would be better to test UGens at the C++ level. The class library test suite already takes >10 minutes (!) to run in CI, and each UGen test suite requires booting a server and lots of IPC, both of which are expensive in runtime. Providing a good model solution to this problem would also be a massive benefit to 3rd party plugin developers.

https://github.com/approvals/ApprovalTests.cpp may be useful for the latter.

I think in principle I would 100% agree that C++ components are better off tested w/ C++. But, there are a few reasons why I think we’d be better off with a sclang-driven test:

  1. The mapping between sclang UGen arguments and C++ code reading those arguments (order and rate), is currently a huge untested runtime variable. Pinning UGen arguments from the sclang side would provide some coverage/stability of this part of the system. Obviously, if we could go back in time, sclang could introspect UGen argument details so we could guarantee an API match in other ways, but with the architecture we have now, it’s unit tests or nothing. Providing coverage of this, in my mind, would be one of the biggest goals of a UGen test suite.
  2. My guess is that most UGens will need specific, dynamic input signals to be tested correctly. sclang is going to be a much better vehicle for expressing these inputs - we simply don’t have, in C++, a literate way of expressing the variety of inputs we can create in sclang.
  3. The brute fact is: there are >1000 UGens in our core ecosystem (SuperCollider + sc3plugins). If we ever want to make a dent in code coverage for these, we need test creation to be painless and straightforward, something that any sclang programmer can help out with. If I had to choose 200 UGens tested with a non-ideal-but-reliable test methodology, vs 40 tested with a rigorously-implemented test methodology - I think I’d have to choose 200?
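As an illustration of point 2, here is the kind of input variety that takes one line per case in sclang (an illustrative list only):

(
~inputs = [
	{ DC.ar(0) },                           // silence
	{ SinOsc.ar(XLine.kr(20, 20000, 1)) },  // exponential sweep
	{ WhiteNoise.ar },                      // broadband noise
	{ Impulse.ar(100) },                    // impulse train
	{ LFNoise0.ar(1000) },                  // stepped random
	{ DC.ar(inf) }                          // pathological edge case
];
)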

Re. the pragmatics of testing, test time etc. - while it would be nice to strive for a solution that is fast and seamlessly works for incremental builds, every approval/characterization test suite I’ve ever built or used has been an order of magnitude slower than unit tests for the same project. I wouldn’t say we should give up on it, but I would say… getting a comprehensive test suite running in a way that didn’t fundamentally bump our 10 minute run time would be a very big achievement, and one we should not count on when it comes to architecture.

Having said that, I think the big problems can be resolved. Rebooting the server between each UGen test is, I think, not a requirement, and may even obscure some types of bugs (e.g. static/global state being left behind by UGen instances). For me this is rather a parameter of the tests (e.g. “restartBetween”) - likewise NRT vs. live server would be another parameter. I would imagine we would only run a subset on every commit, and a “comprehensive” pass once a day/week/etc. If IPC proves to be a problem: we have an internal server that does not require IPC communication (at least, we used to - it’s been years since I’ve used it) - this can be used if we need a more direct connection.
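The parameters mentioned above could be collected into a plain options object - purely hypothetical names, just to illustrate the shape:

(
var opts = (
	restartBetween: false,  // share one server across UGen tests
	target: \nrt,           // \nrt, \live, or \internal
	scope: \subset          // \subset per commit, \comprehensive nightly
);
// a hypothetical runner would consume these, e.g.:
// UGenTestRunner.run(allTests, options: opts);
)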

These are good points, thanks.

I would still worry a bit that implementing this directly in sclang is prohibitive in runtime even if we just take the amount of time spent generating and comparing inputs and outputs. If we only run these tests on a cron job, we are limited to 6 hours by GHA and I would not count on that generous upper limit lasting indefinitely. It would help to have a rough idea of the scope/combinatorics involved.

One idea (brainstorming here), if that does become an issue, would be to find a way to easily transform reproducible and deterministic sclang tests into corresponding C++ code.

Regarding (1) this actually seems like something we should fix, and that would generally benefit the entire ecosystem, and then we’d have hard guarantees instead of dynamically run pinning tests.

Btw, I don’t believe the internal server works anymore. I tried to find a reference for this but couldn’t. I could have sworn I’ve seen someone say that once or twice. I may be misremembering, though.

Agreed. AFAICS the main reason why we’re rebooting the server for every test is that the logic to combine multiple tests into one server run is harder than making/enforcing an assumption that every test starts from a clean slate. This assumption had been considered more important than test runtime (which may have been defensible when nobody was actually running the tests :laughing: ) but the priority is changing.

I think tests can be designed to minimize data transmission. Ideally the SynthDef would decide pass/fail and simply transmit back a single value. Sending thousands of samples per UGen back to the language for analysis would be a dreadful performance drain but I’d guess that would be needed for some rare cases only.

If the UGen tests will be in C++ then my ability to help write them would be considerably limited (that is, count me out).

hjh


The goal is actually to make sure tests don’t depend on state left behind by another test, and thereby hide a test failure. I don’t remember this decision having anything to do with the difficulties you mention. Anyway I can see this working for a limited subset of tests. We could even consider adding some (reproducible) randomization to the test order to smoke out these kinds of dependencies.

I hadn’t considered this; that would simplify things immensely!

That’s a fair point about depending on prior state.

Still, I think many/most UGen tests could run in parallel. I’ll paste a proof of concept below. It depends on creating a file with some number of samples of given synth functions. If we take generated_test_signal absdif: baseline_signal, then “pass” means this is zero for every sample, which we could detect by integrating the absdif.

Unfortunately, I’m having the same experience with this POC as I always have with tests – this is not once in a while – this is every (multiple expletives deleted) time I try to write a test.

The theory is sound.

The test doesn’t work.

When it didn’t work at first, I added ~checkbuf (which shouldn’t be necessary in the end) to verify the signals being processed. The third channel represents the integrators. It’s zero, both in the plot and by doing d[2, 5 ..].sum (also just tried d[2, 5 ..].every(_ == 0) --> true). It’s incontrovertible that the signals being fed into Integrator are all 0.

But the Integrator results are nonzero. Again: multiple expletives deleted.

I’m going to have to give up for this morning, but… “technical debt”… it’s a fairly common experience for me that writing a test for one thing uncovers (sometimes multiple) unrelated bugs. Not rare. Common.

hjh

(
~tests = [
	{ SinOsc.ar(440) },
	{ RLPF.ar(Impulse.ar(0), 880, 0.1) },
	{ RandSeed.ir(1, 56789); PinkNoise.ar }
];
)

// this part would be done only once
// should not be done per test cycle!

(
fork {
	var cond = Condition.new;
	var size = 512;
	var numBlocks = size / s.options.blockSize;
	var buf = Buffer.alloc(s, size * ~tests.size, 1);
	var synth;
	s.sync;
	~tests.do { |func, i|
		synth = {
			RecordBuf.ar(func.value, buf, offset: size * i, loop: 0);
			FreeSelf.kr(Phasor.kr(0, 1, 0, 1e6) >= numBlocks);
			Silent.ar(1);
		}.play;
		synth.onFree { cond.unhang };
		cond.hang;
	};
	buf.write("~/tmp/ugen-test-poc.wav".standardizePath, "wav", "float");
	s.sync;
	buf.free
};
)

~checkbuf = Buffer.alloc(s, 1024 * ~tests.size, 3);  // deliberately longer than the test signals
~checkbuf.zero;

// this is the actual test
(
fork {
	var cond = Condition.new;
	var synth;
	var baselineBuf = Buffer.read(s, "~/tmp/ugen-test-poc.wav".standardizePath);
	var size = 512;
	var numBlocks = size / s.options.blockSize;
	var resp;
	var results;

	s.sync;

	synth = SynthDef(\x, {
		var signals = ~tests.collect(_.value);
		var phasor = Phasor.ar(0, 1, 0, size + 1000);
		var baselines = Array.fill(~tests.size, { |i|
			BufRd.ar(1, baselineBuf, phasor + (size * i), loop: 0, interpolation: 1);
		});
		var trig = phasor >= size;
		var not_trig = trig <= 0;
		// the test seems to over-run
		// stupid hack for now is to suppress the input
		// when phasor gets too large
		// this actually does work! But Integrator is still nonzero
		var diff = not_trig * (signals absdif: baselines);
		~tests.size.do { |i|
			BufWr.ar([signals[i], baselines[i], diff[i]], ~checkbuf, phasor + (1024 * i), loop: 0);
		};
		SendReply.ar(trig, '/results', Integrator.ar(diff));
		FreeSelf.kr(T2K.kr(trig));
		Silent.ar(1);
	}).play;

	resp = OSCFunc({ |msg|
		results = msg[3..];
		cond.unhang;
	}, '/results', s.addr).oneShot;

	cond.hang;

	results.do { |flag, i|
		"Test % passed: % (%)\n".postf(i, flag == 0, flag);
	};

	baselineBuf.free;
};
)

~checkbuf.getToFloatArray(wait: -1, timeout: 5, action: { |data| d = data });

d.plot(numChannels: 3);

d[2, 5 ..].sum

Hm, I think it’s related to https://github.com/supercollider/supercollider/issues/2343 – SinOsc’s pre-sample is different from its actual first sample, so the pre-sample minus the first baseline sample is likely to be nonzero, and Integrator includes this.

#2343 is IMO a very serious weakness. It’s going to keep rearing its head at unexpected moments. We really need to fix it. Yeah, it’s hard and yeah, it’ll break compatibility. But it’s simply wrong the way it is, and it continues to be a problem.

Edit: Proven –

// OK, prove integrator is broken
s.boot;

b = Buffer.alloc(s, 44100, 1);
b.zero;
b.set(0, 1);

a = {
	var phasor = Phasor.ar(0, 1, 0, b.numFrames);
	// don't allow to loop
	var stop = Line.kr(0, 1, 0.1, doneAction: 2);
	var sig = Integrator.ar(BufRd.ar(1, b, phasor, loop: 0, interpolation: 1));
	Poll.ar(K2A.ar(Done.kr(stop)), sig);
	Silent.ar(1)
}.play;

UGen(Integrator): 2

But I put only one ‘1’ into the buffer!!! The sum of [1, 0, 0, 0, 0…] is 1. It isn’t 2.

#2343 breaks a whole lot of stuff. We don’t even know a tenth of where it breaks.

hjh

I understand your proof-of-concept post; thanks for writing it. ^^

Sorry this is causing you frustration; I understand it’s not fun to be the one discovering these bugs when you just want to demonstrate something that should be simple.

For 3.13 and/or the 3.12 patch releases maybe we can focus on fixing some of these problems? As you probably know, we do have several PRs open right now to improve the test suite which are stalled for various reasons. Perhaps you could help to identify the major ones (I’m guessing #2343 is one), and we can add them to the appropriate milestones? Or is there something else other developers could do to help?

Could have a look later.

#2343 is huge. Integrator itself should be an easy fix – I think just replace Integrator_next(unit, 1); in the Ctor with ZOUT(0) = ZIN(0). (Formally the output should be x(i) + coef * y(i-1), but for the first sample y(i-1) is 0, so it can just pass through the input pre-sample – without updating the UGen’s state yet.)

SinOsc is a bigger problem, because all of its calc functions advance time, but between the pre-sample and the first true output sample, IMO time should not advance. (I noticed in the POC that SinOsc’s first output sample was nonzero – but if the initial phase is 0, shouldn’t it start at 0?) This affects possibly hundreds of units, which is probably why we haven’t addressed it (that, and backward compatibility). But I keep getting bitten by this. For a lot of DSP, a very slight phase offset in SinOsc doesn’t matter much – but tests need to be precise.

Thanks for the tip to check the unit testing issues.

hjh


BTW, now:

hjh