Pulling Spoken Language out of a Long Recording

Hi SCSynth Forum -

I am relatively new to SuperCollider and the forum. I have been asked by an associate if I can take a number of broadcasts and remove all of the spoken parts. It was suggested that SuperCollider would be a good way to do this. Can anyone here help me?

What is the rest of the material in the mix? Is it panned in stereo space by any chance (but the vocals aren’t)?

Unfortunately, all of the archives have been stored as mono.

So - to the other part of the question - what is the background sound?

They are essentially radio programs: music, with DJs talking in between songs.

I think the following would be a practical way to do it:
Load the file into your favourite wave editor that provides a spectral view and variable playback speeds.

Audacity provides both.

By reading the spectrogram, we can tell music and spoken sections apart easily.

With Play-at-Speed, we can play the file back at up to three times normal speed.

Using the two together, we can find the spoken sections easily and audition them quickly.

I do not know whether there is an algorithm to do this automatically. I would be glad to hear if anyone knows of one.

Best regards,


Maybe the BBC speech/music segmenter Vamp plugin?

You should be able to use Vamp plugins in Audacity, Sonic Visualiser, and Sonic Annotator (a command-line tool). There is also a Python library for them. I'm not sure about using them directly in SuperCollider.

Thank you all for the suggestions. I was hoping to do it as some sort of batch process, but the Vamp plugin could definitely make the manual task more palatable.

SuperCollider is a great sound synthesis program, but its non-realtime analysis tools are kinda lacking if you ask me. I'd instead recommend LibROSA, or just SciPy/NumPy, all of which are stable and actively maintained.

The keywords to look for are “audio segmentation” and “audio classification.” I won’t pretend to be an expert in MIR (music information retrieval), but I think you may be able to do this with basic heuristics built on descriptors like these:

  • Spoken audio has numerous silent gaps; music generally doesn’t.
  • Spoken audio has lower spectral flatness than music.
  • Music generally has a higher spectral centroid than spoken audio.
  • Music generally has higher spectral flux than spoken audio.

With these heuristics as a basis and a lot of careful tweaking, I imagine you can get decently accurate results.
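To make that concrete, here is a rough sketch of those spectral descriptors in plain NumPy. The frame sizes are arbitrary, the tone/noise signals are toy stand-ins for real audio, and any thresholds you would derive from these values are yours to tune; this is an illustration, not a tested recipe:

```python
import numpy as np

def frame_features(x, sr, n_fft=1024, hop=512):
    """Per-frame spectral flatness, centroid, and flux for a mono signal."""
    win = np.hanning(n_fft)
    frames = np.stack([x[i:i + n_fft] * win
                       for i in range(0, len(x) - n_fft, hop)])
    spec = np.abs(np.fft.rfft(frames, axis=1)) + 1e-10  # floor avoids log(0)
    freqs = np.fft.rfftfreq(n_fft, 1.0 / sr)
    # flatness: geometric mean / arithmetic mean of the magnitude spectrum
    flatness = np.exp(np.mean(np.log(spec), axis=1)) / np.mean(spec, axis=1)
    # centroid: magnitude-weighted mean frequency
    centroid = np.sum(freqs * spec, axis=1) / np.sum(spec, axis=1)
    # flux: frame-to-frame spectral change, padded so all outputs align
    flux = np.concatenate([[0.0], np.sum(np.diff(spec, axis=0) ** 2, axis=1)])
    return flatness, centroid, flux

# Toy sanity check: a steady tone behaves like tonal music (low flatness,
# low centroid), white noise like broadband material (high flatness).
sr = 22050
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)
noise = np.random.default_rng(0).normal(size=sr)
flat_tone, cent_tone, _ = frame_features(tone, sr)
flat_noise, cent_noise, _ = frame_features(noise, sr)
print("noise is flatter:", flat_tone.mean() < flat_noise.mean())
```

A real segmenter would smooth these per-frame values over a few seconds (e.g. a median filter), threshold them, and then merge adjacent regions into speech/music spans.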

If you want something more robust, I’d suggest looking into the pipeline used in automatic speech recognition: MFCCs + k-means clustering + a hidden Markov model. It’s probably overkill for this case, but those terms should lead you to resources for solving this kind of problem. Talking to an actual MIR expert would probably help the most, though!
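I won’t attempt the HMM part, but here is a minimal sketch of the MFCC + k-means front end using only SciPy/NumPy (the libraries mentioned above). The filterbank construction and every parameter here are textbook-style choices of mine, not a vetted implementation, and the tone/noise “recording” is a stand-in for real audio:

```python
import numpy as np
from scipy.fft import dct
from scipy.cluster.vq import kmeans2

def mel_filterbank(sr, n_fft, n_mels=26):
    """Triangular filters spaced evenly on the mel scale."""
    hz2mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel2hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = mel2hz(np.linspace(hz2mel(0.0), hz2mel(sr / 2.0), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return fb

def mfcc(x, sr, n_fft=1024, hop=512, n_coef=13):
    """Plain MFCCs: windowed FFT -> mel energies -> log -> DCT."""
    win = np.hanning(n_fft)
    frames = np.stack([x[i:i + n_fft] * win
                       for i in range(0, len(x) - n_fft, hop)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    mel_energy = power @ mel_filterbank(sr, n_fft).T + 1e-10
    return dct(np.log(mel_energy), axis=1)[:, :n_coef]

# Cluster the frames of a toy "recording": one second of tone, one of noise.
np.random.seed(0)  # kmeans2 draws from NumPy's global RNG by default
sr = 22050
tone = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
noise = np.random.default_rng(1).normal(size=sr)
feats = np.vstack([mfcc(tone, sr), mfcc(noise, sr)])
_, labels = kmeans2(feats, 2, minit="++")
half = len(labels) // 2
print("tone-half mean label:", labels[:half].mean(),
      "noise-half mean label:", labels[half:].mean())
```

In a real pipeline you’d then smooth the frame labels over time (roughly what the HMM buys you) and decide by listening which cluster is speech and which is music.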