Speech to midi converter?

nammedit · May 19, 2020, 7:15pm

Hello!

I want to make a speech to midi converter in Supercollider.
I’ve started off trying to find a way to convert FFT data to midi-numbers to send to a midi-instrument, but I’m stuck:

~file = Buffer.read(s, "path_to_file", bufnum:1);

(
{
	var buf = Buffer.alloc(s, 2048, bufnum:2);
	var sig = PlayBuf.ar(2, 1, loop:1);
	var chain = FFT(buf, sig);
	var trig = Impulse.kr(30);
	chain[0].pvcalc(80, {
		|phas, mag|
		SendReply.kr(trig, "/anal", mag);
	});
}.play
)


(
OSCdef.new(\midiConverter, {
	|msg|
	msg.postln;
}, "/anal");
)

the output looks like this:
[ /anal, 1000, -1, 0.0, 3.0463190078735, 2.1142015457153, -0.64955651760101, 2.3432319164276, -0.23956227302551, -3.1334409713745, -2.1829152107239, 1.1417833566666, -2.2972648143768, -0.74415707588196, -0.88793164491653, 1.9947071075439, -1.1327115297318, 2.6084003448486, etc…]
I’m not sure how to interpret theese numbers. And I’m not sure pvcalc is even the best way to get the data I need. I want to somehow be able to control the resolution on how many of the loudest bins I’m going to convert to frequency and then from frequency to midi-number to send to a midi-instrument as the soundfile plays through.
Any ideas?

shiihs · May 19, 2020, 7:34pm

It depends a bit what you want to accomplish. Maybe you want to use a pitch detector as a starting point.
(there’s the Pitch UGen in standard supercollider, or Tartini in the sc3plugins).

nammedit · May 19, 2020, 9:18pm

I tried it, and it sounds quite good, but not quite like this: https://www.youtube.com/watch?v=muCPjK4nGY4

any other ideas? I tried mapping the amplitude as well, and make a threshold for it to send notes to my electric piano:

(
{
	var buf = Buffer.alloc(s, 2048, bufnum:2);
	var sig = PlayBuf.ar(2, 1, loop:1);
	var analysis = Pitch.kr(sig, minFreq:18.midicps, maxFreq:108.midicps, ampThreshold:0.001);
	var amp_anal = Amplitude.kr(sig);
	SendReply.kr(Impulse.kr(50), "/anal", [analysis[0][0], amp_anal[0]]);
	/*var chain = FFT(buf, sig);
	var trig = Impulse.kr(30);
	chain[0].pvcalc(80, {
		|phas, mag|
		SendReply.kr(trig, "/anal", mag);
	});*/
}.play
)

(
OSCdef.new(\midiConverter, {
	|msg|
	msg.postln;
	if(msg[4] > 0.01, {
		~pno.noteOn(0, msg[3].cpsmidi.asInt, msg[4].linlin(0,1,60,127));
		~pno.noteOff(0, msg[3].cpsmidi.asInt,60);
	});
}, "/anal");
)

Sam_Pluta · May 19, 2020, 9:55pm

The Ablinger stuff is dealing with a full spectral analysis, playing back multiple notes at once. Basically the loudest partials.

I would use the TopNFreq that Nick posted a while back to grab the top N frequencies, convert those to MIDI and send them to the piano.

Sam

nammedit · May 19, 2020, 10:44pm

I tried googling TopNFreq Supercollider but couldn’t find anything. Is TopNFreq an extension UGen?

Sam_Pluta · May 19, 2020, 10:52pm

He only posted it to the mailing list:

https://composerprogrammer.com/code/TopNFreq.zip

nammedit · May 19, 2020, 10:58pm

Thank you so much!!

Martin

RFluff · May 20, 2020, 3:09am

I’m not very familiar with this, but extracting the top frequencies can be done without that specialized ugen, albeit it might be slower with the general purpose ones. I’ve toyed with PV_LocalMax in the past. (Sorry no time to post code right now.)

jamshark70 · May 20, 2020, 4:56am

I haven’t looked at the source code – does this UGen use phase differentials to adjust the base bin frequencies?

hjh

shiihs · May 20, 2020, 5:47am

To be honest, I kind of suspect that one is faked a bit, in the sense that they probably mixed in some of the original recording with the piano rendering to improve the understandability. I remember trying to replicate this effect some years ago (not in supercollider) but never got even close to generating something that sounded so understandable (although that may also just indicate I didn’t have the skills to pull it off )

dkmayer · May 20, 2020, 11:29am

This stuff has been developed here at IEM Graz (Winfried Ritsch, Automatenklavier, assisted by Thomas Musil), I’'ve seen it in a number of live performances with different pieces by Peter Ablinger - organised concerts with it myself - and I can assure you that the effect is that stunning.
This of course needs tweaking of the “right” granulation, sufficiently high bandwidth (OSC) and - this is also mentioned in the German comment - the projection of the text, which makes a big difference for the complement by our brain.

Find some links on Ablinger’s site (funny: the Schoenberg letter):

BTW Voices and Piano with live piano is equally stunning

elgiano · May 20, 2020, 8:50pm

I have been working on this as well for a piece for disklavier last year (sorry, there’s no material available online yet).
I haven’t tried with pvcalc (or maybe I did too early in the project to remember it), but I can point to two other ways of getting your freqs:

PV_MagBuffer: store all your magnitudes in a buffer, then periodically read the buffer from the language. You can whiten the spectrum a bit with either PV_LocalMax or PV_MaxMagN. Then when you read the buffer you have to either sum or average bins that fall within the same semitone, and/or filter out magnitudes below a set threshold.
Use a filterbank: put up a sequence of BPFs, one for each semitione of the midi range. Each BPF goes through an Amplitude.kr, then you periodically send the amplitudes to lang via SendReply. This approach was also used by Andrea Valle in his Sonagraph.

During my project, I moved from the buffer to the filterbank. I felt a bit bad with all that data and didn’t feel confident enough about when to poll it, and what would happen between two polls. Anyway, it worked no problem, I was getting some nice black midi pianorolls that looked like spectrograms (which I also got with the filterbank approach)

As @jamshark70 points out, to estimate more accurate frequencies from fft bins you need to do some “phase correction”. Quick explanation: fft works with frequency bins, so you get this array of magnitudes+phases, one for each bin.
The roughest pitch estimation for a bin is its center frequency:

~~bin_freq = (sample_freq/2) / window_size~~ (wrong, it’s sample_freq/window_size, see @jamshark70 comment below)
(i.e: the spectrum between 0Hz and sample_freq Hz is divided equally in window_size parts, so that bin number 45 would have a center frequency of bin_freq * 45).

It turns out that if you compare the phase differences between successive overlapping frames, you can get a more accurate pitch estimation for each bin. If you want to get into phase correction, I’ve found this resource useful and clear:

You can also find the code mentioned in the article here (C++):
http://my.fit.edu/~vkepuska/ece3551/The%20DSP%20Dimension/Pitch%20Shifting%20using%20FFT/smbPitchShift.cpp

Final note: No FFT technique (AFAIK) is going to save us from the lack of bins in the low register, compared to their quantity in the high register. That’s how it is, as long as we divide the spectrum in bins linearly. There are other analysis techniques however, for example Gammatone filters, Wavelets, and CQT.

jamshark70 · May 20, 2020, 10:54pm

I always went with k * (sample_rate / window_size) where k is the integer bin index, from 0 to window_size / 2.

hjh

elgiano · May 20, 2020, 11:15pm

Thanks for cleaning my mess, that was a mistake

nammedit · May 22, 2020, 11:29am

Thank you for such an elaborate answer!
I can try reading about DftPitchShifting to get an understanding of how to estimate pitch but I honestly have a feeling it will go way over my head right now. I think I’ll try the filterbank method first and see how it sounds and maybe come back to the phase correction.

Martin