Pulling Spoken Language out of a Long Recording

Hi SCSynth Forum -

I am relatively new to SuperCollider and the forum. I have been asked by an associate if I can take a number of broadcasts and remove all of the spoken parts. It was suggested that SuperCollider would be a good way to do this. Can anyone here help me?

What is the rest of the material in the mix? Is it panned in stereo space by any chance (but the vocals aren’t)?

Unfortunately, all of the archives have been stored as mono.

So - to the other part of the question - what is the background sound?

They are essentially radio programs: music, with DJs talking in between the songs.

I think the following would be a practical way to do it.
Load the file in your favourite wave editor that provides a spectral view and variable playback speed.

Audacity provides both.

By reading the spectrogram, you can distinguish music from spoken sections fairly easily.

With Play-at-Speed, you can play the file back at up to three times normal speed.

Combining the two, you can locate the spoken sections quickly and audition them at higher speed.

I do not know of any algorithm to do it automatically. I would like to hear about one if anyone does.

Best regards,

prko

Maybe the BBC speech/music segmenter Vamp plugin?

You should be able to use Vamp plugins in Audacity, Sonic Visualiser, and Sonic Annotator (a command-line tool). There is also a Python library for them. Not sure about using them directly in SuperCollider.
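If you end up scripting it, the Python bindings can run a Vamp plugin over a whole file. Here is a minimal sketch, assuming the BBC Vamp plugins are installed and that the plugin key is "bbc-vamp-plugins:bbc-speechmusic-segmenter" (run vamp.list_plugins() to see what is actually available on your system):

```python
# Rough sketch, not tested against your archives: assumes the BBC Vamp
# plugins are installed and visible to the Vamp host, and that the plugin
# key below is correct -- check vamp.list_plugins() on your machine.
import librosa
import vamp

audio, sr = librosa.load("broadcast.wav", sr=None, mono=True)

# collect() runs the plugin over the whole signal and gathers its output;
# a segmenter typically returns a list of timestamped, labelled features.
result = vamp.collect(audio, sr, "bbc-vamp-plugins:bbc-speechmusic-segmenter")

for feature in result["list"]:
    print(feature["timestamp"], feature.get("label", ""))
```

Sonic Annotator can do the same thing from the command line if you would rather avoid Python.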

Thank you all for the suggestions. I was hoping to do it as some sort of batch process, but the Vamp plugin could definitely make the manual task more palatable.

supercollider is a great sound synthesis program, but its non-realtime analysis tools are kinda lacking if you ask me. i instead recommend LibROSA, or just SciPy/NumPy, all of which are stable and actively maintained.

the keywords to look for are “audio segmentation” and “audio classification.” i won’t pretend to be an expert in MIR, but i think you may be able to do this with basic heuristics like these descriptors:

  • spoken audio has numerous silent gaps, music generally doesn’t.
  • spoken audio has a lower spectral flatness than music.
  • music has a generally higher spectral centroid than spoken audio.
  • music has a generally higher spectral flux than spoken audio.

with these facts as a basis and a lot of careful tweaking, i imagine you can get decently accurate results.
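here's a very rough librosa sketch of that idea -- the file name and thresholds are made up for illustration and you'd have to tune them against your actual recordings:

```python
# rough sketch only -- the thresholds here are invented for illustration
# and would need tuning by eye/ear against the real material.
import numpy as np
import librosa

y, sr = librosa.load("broadcast.wav", sr=22050, mono=True)

hop = 512
flatness = librosa.feature.spectral_flatness(y=y, hop_length=hop)[0]
centroid = librosa.feature.spectral_centroid(y=y, sr=sr, hop_length=hop)[0]

# per-frame guess: low flatness and a low centroid both point towards speech
speechy = (flatness < 0.01) & (centroid < 2000.0)

# smooth the decision over roughly two seconds so single frames can't flip it
win = int(2 * sr / hop)
smoothed = np.convolve(speechy.astype(float), np.ones(win) / win, mode="same") > 0.5

# print the points where the classification changes
times = librosa.frames_to_time(np.arange(len(smoothed)), sr=sr, hop_length=hop)
for i in np.flatnonzero(np.diff(smoothed.astype(int))):
    print(f"{times[i]:8.2f} s -> {'speech' if smoothed[i + 1] else 'music'}")
```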

if you want to go more robust, i’d suggest looking into the pipeline used in automatic speech recognition: MFCCs + k-means clustering + hidden markov model. it’s probably overkill for this case, but those terms should lead you to resources to help solve this kind of problem. talking to an actual MIR expert would probably help the most though!
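if you do go that route, a crude starting point is just MFCCs + k-means with some smoothing in place of the HMM -- this assumes librosa, scikit-learn and SciPy are installed, and that two clusters actually line up with speech vs. music in your material (you still have to listen to decide which cluster is which):

```python
# crude sketch of the clustering idea, not a tested pipeline: assumes two
# clusters separate speech from music, and uses a median filter as a cheap
# stand-in for the temporal smoothing an HMM would give you.
import numpy as np
import librosa
from sklearn.cluster import KMeans
from scipy.ndimage import median_filter

y, sr = librosa.load("broadcast.wav", sr=22050, mono=True)

# 20 MFCCs per analysis frame as the feature vector
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20).T  # (frames, 20)

# cluster the frames into two groups
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(mfcc)

# smooth the label sequence so short blips don't become tiny segments
labels = median_filter(labels, size=101)

times = librosa.frames_to_time(np.arange(len(labels)), sr=sr)
for i in np.flatnonzero(np.diff(labels)):
    print(f"boundary at {times[i]:.1f} s (cluster {labels[i]} -> {labels[i + 1]})")
```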