[MUD-Dev] Text-to-speech (Was Shift in time)

Mike Rozak Mike at mxac.com.au
Thu Oct 7 02:19:22 CEST 2004


Amanda Walker wrote:

> Here are some paper references to get you started:

>   http://www.etro.vub.ac.be/Research/DSSP/publications/loc_conf/SPS-2002-A.pdf

>   http://www.busim.ee.boun.edu.tr/~speech/thesis/oytun_turk.pdf

The first paper is about how the PSOLA algorithm tends to distort
the signal and make it harder to understand. If you are thinking
about PSOLA you may wish to think again because it's patented. It's
not the best algorithm anyway, just a decent one that uses very
little CPU.

The thesis seems to be about conversion of voices, such as male to
female. This sort of thing would be good for voice chat. Because the
thesis converts data to LPC, which is also used for voice-audio
compression, the datarate will be 1-2 KBytes/sec. While voice
conversion is good to have, it has two weaknesses: 1) The converted
voice will keep the same accent as the original speaker (which might
ruin immersion), 2) the datarate is still higher than sending
transplanted prosody information. However, voice conversion is much
more likely to work than the speech recognition and
transplanted-prosody combo I mentioned, and it's language
independent.
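As a back-of-envelope illustration of the data-rate gap, here is a rough comparison of an LPC voice stream against a transplanted-prosody stream. All figures (speaking rate, bytes per value) are my own illustrative assumptions, not measurements:

```python
# Back-of-envelope comparison of the two data rates discussed above.
# All figures are illustrative assumptions, not measurements.

LPC_BYTES_PER_SEC = 1500          # mid-point of the 1-2 KBytes/sec LPC estimate

# Transplanted prosody: per phoneme, send pitch, duration, and volume.
PHONEMES_PER_SEC = 12             # rough speaking-rate assumption
BYTES_PER_VALUE = 2               # 16-bit values
PROSODY_BYTES_PER_SEC = PHONEMES_PER_SEC * 3 * BYTES_PER_VALUE

print(f"LPC voice stream: ~{LPC_BYTES_PER_SEC} bytes/sec")
print(f"Prosody stream:   ~{PROSODY_BYTES_PER_SEC} bytes/sec")
print(f"Ratio:            ~{LPC_BYTES_PER_SEC // PROSODY_BYTES_PER_SEC}x")
```

Even with generous assumptions, the prosody stream comes out more than an order of magnitude smaller.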

> An intermediate approach that might be useful for NPCs is to use a
> tailored voice model.  Cepstral has some nice demos of their
> synthesis engine, which can provide extremely lifelike TTS results
> as long as the range of utterances is restricted (weather reports
> being a classic example).

There are two basic reasons why TTS sounds bad: One is prosody
(pitch, timing, volume) and the other is the voice model.

If an NPC is going to speak a known phrase, you can store
transplanted prosody and greatly improve the overall quality of
TTS. However, if the NPC speaks an unknown phrase, the prosody is
synthesized and usually comes out sounding really strange. (The
synthesized prosody doesn't really know what words are important, so
it doesn't know what to emphasize). Cepstral works with weather
reports because (I suspect) they have tailored the prosody models to
sound good for weather reports and other phone-based TTS
applications; the same prosody models may sound strange when applied
to MMORPG-speak, such as "The orc bares his teeth and impales you
with his sword." (Although synthesized prosody won't ever get that
one right because TTS is even further away from understanding
emotional content than meaning.)

You can also mix and match, which may be another trick that Cepstral
is using for their demo: If the prosody for "The weather in X is Y,"
is transplanted from real speech, except for X and Y, a listener
will hardly notice that X and Y's prosody are synthesized,
especially if they're only a word or two.
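The mix-and-match idea can be sketched as follows. The recorded pitch/duration values and the `synthesize_prosody()` heuristic are made-up placeholders, not a real prosody model:

```python
# Sketch of mix-and-match prosody: transplanted prosody for the fixed
# words of a template, synthesized prosody only for the slot words.

RECORDED_TEMPLATE = {
    # word -> (pitch_hz, duration_ms) measured from one real recording
    "The":     (110, 120),
    "weather": (130, 300),
    "in":      (115, 100),
    "is":      (112, 110),
}

def synthesize_prosody(word):
    """Crude stand-in for a TTS prosody model: flat pitch, length-based timing."""
    return (120, 80 * max(1, len(word) // 2))

def prosody_for(template_words, slot_values):
    """Use transplanted prosody where a recording exists, synthesis for slots."""
    out = []
    for w in template_words:
        w = slot_values.get(w, w)  # expand slots like "X" -> "Darwin"
        if w in RECORDED_TEMPLATE:
            out.append((w, RECORDED_TEMPLATE[w], "transplanted"))
        else:
            out.append((w, synthesize_prosody(w), "synthesized"))
    return out

plan = prosody_for(["The", "weather", "in", "X", "is", "Y"],
                   {"X": "Darwin", "Y": "sunny"})
```

Only the two slot words end up with synthesized prosody, so a listener hears mostly natural pitch and timing.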

Contemporary voice models are created by recording a few thousand
sentences from a speaker, and extracting several hundred (or
thousand) recordings for each phoneme. When it's time to speak, the
right version of a phoneme is pasted into the audio file, and speech
is created. Because of this, TTS will sound slightly better if the
few thousand sentences include words (and phrases) that are common
to what will be spoken. The Cepstral voice models probably included
hundreds of recordings of numbers since they felt number-reading
would be important to their TTS's application. If the sentences
included lots of recordings out of MMORPG vocabularies ("orc",
"crossbow", etc.) then those words would sound (slightly) better.
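A toy version of that paste-the-right-phoneme process (concatenative unit selection) might look like this. The phoneme "database" and the cost function are simplified assumptions; real systems score many more features than neighboring phonemes:

```python
# Toy unit selection: for each target phoneme, pick the stored recording
# whose neighboring-phoneme context best matches, then concatenate.

# phoneme -> list of (prev_context, next_context, waveform_id)
UNIT_DB = {
    "AO": [("_", "R", "ao_001"), ("K", "R", "ao_057")],
    "R":  [("AO", "K", "r_012"), ("AO", "_", "r_090")],
    "K":  [("R", "_", "k_003"), ("_", "AO", "k_044")],
}

def context_cost(unit, prev_ph, next_ph):
    """Lower is better: penalize each mismatched neighboring phoneme."""
    prev_ctx, next_ctx, _ = unit
    return (prev_ctx != prev_ph) + (next_ctx != next_ph)

def select_units(phonemes):
    chosen = []
    for i, ph in enumerate(phonemes):
        prev_ph = phonemes[i - 1] if i > 0 else "_"
        next_ph = phonemes[i + 1] if i < len(phonemes) - 1 else "_"
        best = min(UNIT_DB[ph], key=lambda u: context_cost(u, prev_ph, next_ph))
        chosen.append(best[2])
    return chosen

# "orc" as AO R K: each phoneme picks the recording from a matching context
units = select_units(["AO", "R", "K"])
```

With more recordings per phoneme, the odds of finding a unit recorded in exactly the right context go up, which is why vocabulary-targeted recordings help.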

More importantly, the voice model sounds better with more
recordings. 1 recording per phoneme makes for a very small model,
but sounds awful. 128-256 recordings per phoneme is better (and is
common for consumer-grade TTS). Thousands of recordings will provide
very smooth speech, but the model size is 100MB+.
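The rough arithmetic behind those sizes: the phoneme count, unit length, and sample format below are assumptions I've picked to make the numbers concrete:

```python
# Rough model-size arithmetic for a concatenative voice model.
# Phoneme inventory, recording length, and sample format are assumptions.

PHONEMES = 40                           # approximate English phoneme inventory
BYTES_PER_RECORDING = 16000 * 2 * 0.1   # 16 kHz, 16-bit, ~100 ms per unit

def model_size_mb(recordings_per_phoneme):
    return PHONEMES * recordings_per_phoneme * BYTES_PER_RECORDING / 1e6

for n in (1, 256, 4000):
    print(f"{n:>5} recordings/phoneme -> ~{model_size_mb(n):.1f} MB")
```

Under these assumptions, 256 recordings per phoneme lands in the tens of megabytes, and thousands of recordings pushes past the 100MB mark quoted above.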

Mike Rozak
http://www.mxac.com.au
_______________________________________________
MUD-Dev mailing list
MUD-Dev at kanga.nu
https://www.kanga.nu/lists/listinfo/mud-dev



More information about the mud-dev-archive mailing list