I am relatively new to SuperCollider and the forum. I have been asked by an associate if I can take a number of broadcasts and remove all of the spoken parts. It was suggested that SuperCollider would be a good way to do this. Can anyone here help me?
You should be able to use Vamp plugins in Audacity, Sonic Visualiser, and Sonic Annotator (a command-line tool). There is also a Python library for them. Not sure about using them directly in SuperCollider.
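If you end up scripting it, a minimal sketch of the Python route might look like the following. The plugin key and the filename are assumptions on my part (this one presumes the Queen Mary Vamp plugin set is installed), and the shape of the result depends on which plugin you run:

```python
# pip install vamp librosa
import librosa
import vamp

# load one broadcast as mono floats at its native sample rate
audio, sr = librosa.load("broadcast.wav", sr=None, mono=True)

# run a Vamp plugin over the audio; the plugin key below is just one
# example and assumes the Queen Mary Vamp plugin set is installed
result = vamp.collect(audio, sr, "qm-vamp-plugins:qm-segmenter")

# sparse plugin outputs come back as a list of timestamped features
for feature in result["list"]:
    print(feature["timestamp"], feature.get("label", ""))
```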
supercollider is a great sound synthesis program, but its non-realtime analysis tools are kinda lacking if you ask me. i instead recommend LibROSA, or just SciPy/NumPy, all of which are stable and actively maintained.
the keywords to look for are “audio segmentation” and “audio classification.” i won’t pretend to be an expert in MIR, but i think you may be able to do this with basic heuristics like these descriptors (rough sketch after the list):
spoken audio has numerous silent gaps, music generally doesn’t.
spoken audio has a lower spectral flatness than music.
music generally has a higher spectral centroid than spoken audio.
music generally has a higher spectral flux than spoken audio.
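as a concrete starting point, here’s a rough librosa sketch of those four descriptors. every threshold in it is a made-up placeholder and “broadcast.wav” is just an example filename, so expect to tune everything against your own material:

```python
# pip install librosa numpy
import librosa
import numpy as np

audio, sr = librosa.load("broadcast.wav", sr=None, mono=True)

# frame-wise descriptors (all use librosa's default 2048/512 framing)
rms = librosa.feature.rms(y=audio)[0]                     # energy -> silent gaps
flatness = librosa.feature.spectral_flatness(y=audio)[0]  # noise-like vs tonal
centroid = librosa.feature.spectral_centroid(y=audio, sr=sr)[0]

# spectral flux: positive frame-to-frame change in the magnitude spectrum
S = np.abs(librosa.stft(audio))
flux = np.sum(np.maximum(0.0, np.diff(S, axis=1)) ** 2, axis=0)

# crude per-frame vote over the four heuristics above; every threshold
# here is an invented placeholder, not a known-good value
music_score = (
    (rms[1:] > 0.01).astype(int)            # few silent gaps
    + (flatness[1:] > 0.1).astype(int)      # higher flatness than speech
    + (centroid[1:] > 2000.0).astype(int)   # higher centroid than speech
    + (flux > np.median(flux)).astype(int)  # higher flux than speech
)
is_music = music_score >= 3  # boolean per frame, ~12 ms hop at 44.1 kHz
```

from `is_music` you’d then merge runs of frames into segments and cut at the boundaries.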
with these facts as a basis and a lot of careful tweaking, i imagine you can get decently accurate results.
if you want something more robust, i’d suggest looking into the pipeline used in automatic speech recognition: MFCCs + k-means clustering + hidden markov models. it’s probably overkill for this case, but those terms should lead you to resources for solving this kind of problem. talking to an actual MIR expert would probably help the most, though!
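for what that might look like in practice, here’s a crude sketch of the front of that pipeline: MFCCs per frame, k-means into two clusters, and a median filter standing in for the HMM smoothing stage. the cluster count, filter width, and filename are all assumptions, and nothing guarantees one cluster is speech and the other music; you’d have to listen and map the labels yourself:

```python
# pip install librosa numpy scipy scikit-learn
import librosa
import numpy as np
from scipy.ndimage import median_filter
from sklearn.cluster import KMeans

audio, sr = librosa.load("broadcast.wav", sr=None, mono=True)

# frame-wise MFCCs as feature vectors (13 coefficients is a common choice)
mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13).T  # (frames, 13)

# cluster frames into two groups, hoping one lands on speech, one on music
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(mfcc)

# smooth frame-level jitter; an HMM would do this more cleanly, a median
# filter over ~half a second of frames is a cheap stand-in
labels = median_filter(labels, size=51)

# collapse frame labels into (start, end, cluster) segments
hop = 512  # librosa's default hop length
changes = np.flatnonzero(np.diff(labels)) + 1
bounds = np.r_[0, changes, len(labels)]
for a, b in zip(bounds[:-1], bounds[1:]):
    t0 = librosa.frames_to_time(a, sr=sr, hop_length=hop)
    t1 = librosa.frames_to_time(b, sr=sr, hop_length=hop)
    print(f"{t0:8.2f}s to {t1:8.2f}s  cluster {labels[a]}")
```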