Speech synthesis

Hi all,

I’ve been messing around with a project where, among other things, I’d ideally have a synthesized voice recite a poem (or some text, anyway), and I’m wondering what the best approach would be. I’ve looked into eSpeak and MBROLA, but both are getting old. That isn’t a problem as far as sound quality is concerned; it’s more a matter of getting them to compile and interface with SC (given my very limited skills with C++). I’ve also seen OddVoices, but I’m not really that interested in singing voices (and I’m not sure whether there is a way to emulate a speaking voice by using glissandi within a very limited range - in MIDI, too)…

So I’m curious what else is out there that I might not be aware of?

Seems like there should be something, given how widespread the technology already is (e.g. the ghastly TikTok text-to-speech, or people doing deepfakes of Bod Bylan singing the navy seals copypasta, etc. etc.)

What I think I need:

  1. Some way to fine-tune the result (fix some pronunciations, accent, timing) - this is why I think I won’t get away with using the built-in macOS text-to-speech functionality. On the other hand, it’s perfectly fine (or even preferable) if the input is phonetic rather than text.

  2. Some not overcomplicated way to interface with SC, or at least a well-documented API for the command line.

  3. Maybe - but not as important - something like real-time performance? (Ideally, and probably unrealistically, I imagine something like pattern integration where an Event would consist of phonetic data, timing data and accent data)…

What I don’t need is for the voice to sound realistic or non-glitchy.

Thanks for any suggestions!

In addition to the MIDI interface, OddVoices has a lower-level API for dialing in arbitrary unquantized frequency, time, and pitch bend information using a JSON format; see example/example.json on the develop branch of the oddvoices/oddvoices repository on GitLab. I do want to add speaking and rapping support to OddVoices eventually, so that you have the option to let it automatically determine prosody for you with various gradations of control.

OddVoices is built with anticipation of real-time operation, but I’ve put off making a UGen because it’s fairly difficult to do so due to the need for NRT threads. The SC API for that is really janky and old and unsafe and I’m scared of it.

A major problem with OddVoices is that it’s pretty unintelligible, especially in the realm of consonants, so if you want comprehensible speech then it may not be a good choice. Addressing this is the highest priority for me once I have time to return to the project.

I think that the dearth of open source TTS systems (outside of the products of the current AI bubble) is largely because even simple speech/singing synthesizers are huge projects. I spent a lot of money hiring singers for mine. Thank you for taking a look at it even if you decide not to use it.


Here’s my overcomplicated way of doing tts in sc on the mac:

"say hello".unixCmd // test
"say -v '?' ".unixCmd // list voices
"say -v 'Xander' hoe gaat het met u ".unixCmd // voice
"say -r 200 hello and welcome".unixCmd // rate
"osascript -e ' say \"hello\" ' ".unixCmd; // another way
"osascript -e ' say \"algorave generation we love repetition\" using \"Samantha\" pitch 126 ' ".unixCmd; //
Pbindef(\tts, \foo, Pfunc({"osascript -e ' say \"algorave generation we love repetition\" using \"Zarvox\" pitch 26 ' ".unixCmd}), \dur, 4).play; // speak the phrase from a pattern, every 4 beats
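
The pattern above speaks one fixed phrase. As a rough sketch of the "text and timing per Event" idea from the first post, the text and the speaking rate can come from the pattern itself. The helper `sayCmd` and the keys `\text`, `\rate` and `\speak` are made-up names for this illustration, not part of SC; only `say -v` and `say -r` (as used above) are assumed to exist.

```supercollider
(
// Hypothetical helper (not part of SC): builds a `say` command line.
// shellQuote wraps each argument in single quotes for the shell.
var sayCmd = { |text, voice = "Samantha", rate = 180|
    "say -v % -r % %".format(voice.shellQuote, rate, text.shellQuote)
};

Pbindef(\ttsPhrases,
    \type, \rest, // no synth is played; the event only triggers speech
    \text, Pseq(["algorave generation", "we love", "repetition"], inf),
    \rate, Pseq([220, 160, 100], inf), // words per minute, as with `say -r`
    // Pfunc receives the event built so far, so \text and \rate are available
    \speak, Pfunc({ |ev| sayCmd.(ev[\text], "Samantha", ev[\rate]).unixCmd }),
    \dur, 3
).play;
)
```

Each event starts a new `say` process, so phrases will overlap if `\dur` is shorter than the time it takes to speak them. Stop it with `Pbindef(\ttsPhrases).stop`.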


Thanks! I’ll have another look at oddvoices but it seems that espeak might be my best bet…

Found a thread on the maxmsp forum where somebody says the following, which sort of summarizes the trap I’ve fallen into here (believing that this should be easily doable):

The prevalence of blackbox (closed, all-in-one, but nowhere open source, or customizable) text-to-speech synthesis solutions in a lot of modern devices may lead you to believe it’s easy to do, but it is not the case if you don’t use said blackboxes.

Although it seems that espeak is already technically capable of doing some of the things I want, the problem there is more at the interface level than the synthesis level (e.g., getting it to slow down for just that one word…). Oh well…
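
One possible route for the "slow down just that one word" case, sketched under assumptions: espeak can interpret SSML markup when called with `-m`, and SSML has a `<prosody>` element with a `rate` attribute. The helper `slowWord` below is a made-up name for this illustration, and the binary may be called `espeak-ng` rather than `espeak` on your system. How well the rate change sounds will depend on the voice.

```supercollider
(
// Hypothetical helper (not part of SC or espeak): wraps the word at `index`
// in an SSML <prosody> tag and joins the words back into one string.
var slowWord = { |words, index, rate = "slow"|
    words.collect({ |w, i|
        if(i == index) { "<prosody rate=\"%\">%</prosody>".format(rate, w) } { w }
    }).join(" ")
};

// Slow down only the fourth word ("slow").
var ssml = slowWord.(["getting", "it", "to", "slow", "down"], 3);

// -m tells espeak to interpret SSML markup in the input text.
"espeak -m %".format(ssml.shellQuote).unixCmd;
)
```

The same string-building approach should carry over to other per-word markup, but that still leaves the problem of writing the markup by hand for a whole poem.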

If you are looking for very precise ways of tuning several speech parameters, articulatory synthesis may be worth looking into…

If you need a TTS approach then gamaTTS is interesting:

I personally had a hard time installing the editor on macOS, but maybe you have a better routine for such things than I do (if you are a Mac user). Anyway, the results sound great, but I guess there are no options for performing live, unless you just want abstract speech sounds… Have a look at the demo

I also came across this SuperCollider implementation of another articulatory approach, but I didn’t manage to make it work:

Have a good one =)

If anyone didn’t know…
they can be found here: Models - Hugging Face
… and some of them are excellent! Many let you choose different voices out of the box, but you can’t edit the timbre or sonic character beyond that. However, they are very fast!

IIRC there were some bugs in the SuperCollider implementation of this, but even after fixing them it sounded to me like something was not working correctly. It was code written specifically for that paper, so possibly it isn’t really ready for more general consumption.


Do you still have the fixed version, and could you please share it? I tried to compile it, but there were lots of compilation errors that I could not resolve…

Sorry, I don’t think I do. I’ll look on an old computer when I have a chance… but honestly, I wouldn’t get your hopes up. I spent a good 8 hours on it and had it compiling and fully running, but the sound was just not doing what it should have been doing, and I was a bit out of ideas.

If it’s interesting, I’m midway through a port of Pink Trombone, which I believe is roughly the same sort of thing… I’ve probably got another few days of work before I’m ready to post a beta, but it’s working well and sounds incredible :slight_smile:


Great news!! Pink Trombone sounds really realistic indeed! Looking forward to seeing it!

:100: :100: :100:

So much looking forward!

Hey scztt, I just thought of this again: did that port of Pink Trombone come to fruition? Just curious really, I don’t have time for anything except my dissertation at this point anyway. Still, it would be so cool…