I want to make a speech to midi converter in Supercollider.
I’ve started off trying to find a way to convert FFT data to midi-numbers to send to a midi-instrument, but I’m stuck:
~file = Buffer.read(s, "path_to_file", bufnum: 1);
(
{
	var buf = LocalBuf(2048); // FFT scratch buffer; LocalBuf avoids allocating a second server buffer at build time
	var sig = PlayBuf.ar(2, ~file, BufRateScale.kr(~file), loop: 1).sum; // mix to mono, FFT analyzes one channel
	var chain = FFT(buf, sig);
	var trig = Impulse.kr(30);
	chain = chain.pvcalc(2048, { |mags, phases| // nb: the function gets mags first, then phases
		SendReply.kr(trig, "/anal", mags);
		[mags, phases] // pvcalc expects the spectral data back
	}, frombin: 0, tobin: 79); // only unpack the lowest 80 bins
	Silent.ar // analysis only, no audio output
}.play
)
(
OSCdef.new(\midiConverter, {
|msg|
msg.postln;
}, "/anal");
)
the output looks like this:
[ /anal, 1000, -1, 0.0, 3.0463190078735, 2.1142015457153, -0.64955651760101, 2.3432319164276, -0.23956227302551, -3.1334409713745, -2.1829152107239, 1.1417833566666, -2.2972648143768, -0.74415707588196, -0.88793164491653, 1.9947071075439, -1.1327115297318, 2.6084003448486, etc…]
I’m not sure how to interpret these numbers, and I’m not sure pvcalc is even the best way to get the data I need. I want to be able to control how many of the loudest bins I convert to frequency, and then from frequency to midi-number, to send to a midi-instrument as the soundfile plays through.
Any ideas?
It depends a bit on what you want to accomplish. Maybe you want to use a pitch detector as a starting point.
(there’s the Pitch UGen in standard supercollider, or Tartini in the sc3plugins).
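For example, a minimal sketch along those lines, reusing ~file from above (the /pitch reply path and the commented-out ~midiOut are just placeholders, not part of the original question):

(
{
	var sig = PlayBuf.ar(2, ~file, BufRateScale.kr(~file), loop: 1).sum;
	var freq, hasFreq;
	# freq, hasFreq = Pitch.kr(sig);
	// only report while the detector is confident (hasFreq > 0)
	SendReply.kr(Impulse.kr(20) * hasFreq, "/pitch", freq);
	Silent.ar
}.play;

OSCdef(\pitchToMidi, { |msg|
	var note = msg[3].cpsmidi.round.asInteger;
	note.postln;
	// ~midiOut.noteOn(0, note, 100); // e.g. with ~midiOut = MIDIOut.newByName(...)
}, "/pitch");
)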
I’m not very familiar with this, but extracting the top frequencies can be done without a specialized ugen, though it might be slower with the general-purpose ones. I’ve toyed with PV_LocalMax in the past. (Sorry, no time to post code right now.)
To be honest, I kind of suspect that one is faked a bit, in the sense that they probably mixed some of the original recording into the piano rendering to improve intelligibility. I remember trying to replicate this effect some years ago (not in SuperCollider) but never got even close to generating something that sounded so intelligible (although that may also just mean I didn’t have the skills to pull it off).
This stuff has been developed here at IEM Graz (Winfried Ritsch, Automatenklavier, assisted by Thomas Musil). I’ve seen it in a number of live performances of different pieces by Peter Ablinger - and organised concerts with it myself - and I can assure you that the effect really is that stunning.
This of course needs tweaking of the “right” granulation, sufficiently high bandwidth (OSC), and - as the German comment also mentions - the projection of the text, which makes a big difference for how our brain fills in the rest.
Find some links on Ablinger’s site (funny: the Schoenberg letter):
BTW Voices and Piano with live piano is equally stunning
I worked on this as well last year, for a disklavier piece (sorry, there’s no material available online yet).
I haven’t tried with pvcalc (or maybe I did too early in the project to remember it), but I can point to two other ways of getting your freqs:
PV_MagBuffer: store all your magnitudes in a buffer, then periodically read the buffer from the language. You can whiten the spectrum a bit with either PV_LocalMax or PV_MaxMagN. Then, when you read the buffer, you have to either sum or average the bins that fall within the same semitone, and/or filter out magnitudes below a set threshold (see the first sketch after this list).
Use a filterbank: put up a sequence of BPFs, one for each semitone of the midi range. Each BPF goes through an Amplitude.kr, then you periodically send the amplitudes to the language via SendReply (see the second sketch below). This approach was also used by Andrea Valle in his Sonagraph.
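Here’s a rough sketch of the buffer approach, assuming sc3-plugins is installed and reusing ~file from the original post; the buffer sizes, the 16-bin limit, and the 100 ms polling interval are arbitrary choices of mine:

(
~fftBuf = Buffer.alloc(s, 2048);
~magBuf = Buffer.alloc(s, 1024); // one slot per bin

{
	var sig = PlayBuf.ar(2, ~file, BufRateScale.kr(~file), loop: 1).sum;
	var chain = FFT(~fftBuf, sig);
	chain = PV_MaxMagN(chain, 16); // whiten: keep only the 16 loudest bins
	PV_MagBuffer(chain, ~magBuf);  // copy the magnitudes to ~magBuf every frame
	Silent.ar
}.play;

// poll the magnitudes from the language
~poll = Routine {
	loop {
		~magBuf.getn(0, 1024, { |mags|
			var binFreq = s.sampleRate / 2048;
			var loudest = mags.maxIndex; // simplest possible reading: the single loudest bin
			(loudest * binFreq).cpsmidi.round.postln;
		});
		0.1.wait;
	}
}.play;
)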
During my project, I moved from the buffer to the filterbank. I felt a bit uneasy with all that data, and wasn’t confident about when to poll it and what would happen between two polls. Anyway, it worked fine: I was getting some nice black-midi piano rolls that looked like spectrograms (which I also got with the filterbank approach).
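For completeness, a minimal sketch of the filterbank idea; the note range (36..84), the band rq, and the 30 Hz reply rate are placeholder values to tweak:

(
~baseNote = 36;
{
	var sig = PlayBuf.ar(2, ~file, BufRateScale.kr(~file), loop: 1).sum;
	var amps = (~baseNote..84).collect { |nn|
		// rq of about 0.06 gives roughly semitone-wide bands
		Amplitude.kr(BPF.ar(sig, nn.midicps, 0.06))
	};
	SendReply.kr(Impulse.kr(30), "/fbank", amps);
	Silent.ar
}.play;

OSCdef(\fbank, { |msg|
	var amps = msg[3..];
	var loudest = amps.order.reverse.keep(5); // indices of the 5 loudest bands
	(loudest + ~baseNote).postln;             // as midi notes
}, "/fbank");
)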
As @jamshark70 points out, to estimate more accurate frequencies from FFT bins you need to do some “phase correction”. Quick explanation: the FFT works with frequency bins, so you get an array of magnitudes and phases, one pair per bin.
The roughest pitch estimation for a bin is its center frequency:
bin_freq = sample_freq / window_size
(i.e. the spectrum between 0 Hz and sample_freq Hz is divided equally into window_size parts, so bin number 45 has a center frequency of bin_freq * 45; thanks @jamshark70 for correcting my earlier formula).
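For example, at a 44.1 kHz sample rate with a 2048-point window:

~binFreq = 44100 / 2048;        // ≈ 21.53 Hz per bin
(~binFreq * 45).postln;         // bin 45: ≈ 969 Hz
(~binFreq * 45).cpsmidi.postln; // ≈ midi note 82.7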
It turns out that if you compare the phase differences between successive overlapping frames, you can get a more accurate pitch estimation for each bin. If you want to get into phase correction, I’ve found this resource useful and clear:
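As a sketch of what that correction looks like (the function name and default values here are mine, not from any library): for bin k you compare the measured phase advance between two successive frames against the advance you’d expect at the bin’s center frequency; the wrapped difference tells you how far the true frequency deviates from the center:

(
// k: bin index; phaseNow/phasePrev: that bin’s phase in two successive frames
~estimateFreq = { |k, phaseNow, phasePrev, fftSize = 2048, hop = 1024, sr = 44100|
	var expected = 2pi * k * hop / fftSize; // expected phase advance at the bin’s center frequency
	var deviation = (phaseNow - phasePrev - expected).wrap(-pi, pi); // principal value
	(k + (deviation * fftSize / (2pi * hop))) * sr / fftSize
};
)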
Final note: no FFT technique (AFAIK) is going to save us from the scarcity of bins in the low register, compared to their abundance in the high register. That’s how it is, as long as we divide the spectrum into bins linearly. There are other analysis techniques, however, for example gammatone filters, wavelets, and the CQT.
Thank you for such an elaborate answer!
I can try reading about DftPitchShifting to get an understanding of how to estimate pitch, but I honestly have a feeling it will go way over my head right now. I think I’ll try the filterbank method first and see how it sounds, and maybe come back to the phase correction.