[MUD-Dev] [TECH] Voice in MO* - Phoneme Decomposition and Reconstruction

Ted L. Chen tedlchen at yahoo.com
Wed May 15 17:18:53 CEST 2002


John Buehler writes:
> In response to Ted L. Chen:

> I've thought about it and I'm sure a number of others have as
> well.  Personally, I consider the problems of STT and TTS to be a
> black box issue that others are tackling.  What I want from those
> two things really boils down to the following:

>   1. The ability to capture continuously-spoken language or
>   conventionally-written text into a compact form.

>   2. The ability to convert that compact form into either
>   continuously-spoken language or conventionally-written text.

> In the case of the language or text, inflection/tonality/whatever
> should be part of what the compact form can represent.

> As an example, if I type "How are YOU today?", or I type "How are
> you today?!?", the compact form should be storing two somewhat
> different representations, just as if I say the questions
> differently.  And the output of each should be representative of
> what was typed/stated, regardless of whether it is presented as
> text or speech.  Text is obviously capable of a smaller spectrum
> of inflection and such, but what it is capable of should be
> retained.

I hadn't thought about that.  Hmm... this raises the question of how
much onus we put on the player to encode their own text.

In one way, we can attempt to use heuristics to infer that "YOU" is
spoken with emphasis rather than as "you" or "Y.O.U.".  I think
that's rather difficult because it requires that YOU be placed into
the context of the sentence.  The mere fact that it is capitalized
doesn't mean it should get extra inflection; otherwise IBM would
sound really weird :)
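To make that concrete, here's a rough sketch of such a heuristic in
Python.  KNOWN_WORDS is just a stand-in for a real pronunciation
lexicon; the point is that "YOU" lowercases to a known word while
"IBM" does not:

  KNOWN_WORDS = {"you", "how", "are", "today"}  # placeholder lexicon

  def classify_caps_token(token):
      # Guess why a token was written in all caps.
      if len(token) < 2 or not token.isalpha() or not token.isupper():
          return "plain"
      if token.lower() in KNOWN_WORDS:
          return "emphasis"   # "YOU" -> speak the word with stress
      return "acronym"        # "IBM" -> spell it out letter by letter

  assert classify_caps_token("YOU") == "emphasis"
  assert classify_caps_token("IBM") == "acronym"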

This inference problem draws a close analogy to the problems in
handwriting recognition.  Taking a cue from PalmOS Graffiti, perhaps
users can be expected to utilize standardized tags: "How are *YOU*
today?"  Some people already type it like this for emphasis.  Other
encoding flags such as "-" or "..." could be used to denote pauses
as well.  In essence, these tags need not be the same as the tags
used internally by the TTS.  More likely, they would be meta-tags
which encompass a way of speaking, rather than the mechanics (e.g.
volume, pitch) of the phonemes.
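As a rough sketch of that parsing stage (the tag names EMPH, WORD,
and PAUSE are made up for illustration - they're the meta-tags, not
anything a real engine defines):

  import re

  TOKEN_RE = re.compile(r"\*(\w+)\*|(\w+)|(\.\.\.)|(-)")

  def extract_meta_tags(text):
      # Map player-typed markup to speaking-style meta-tags, leaving
      # the volume/pitch mechanics to the TTS engine downstream.
      tags = []
      for star, word, ellipsis, dash in TOKEN_RE.findall(text):
          if star:
              tags.append(("EMPH", star))      # *YOU* -> stressed word
          elif word:
              tags.append(("WORD", word))
          elif ellipsis:
              tags.append(("PAUSE", "long"))   # "..." -> long pause
          else:
              tags.append(("PAUSE", "short"))  # "-" -> short pause
      return tags

  extract_meta_tags("How are *YOU* today...")
  # -> [('WORD', 'How'), ('WORD', 'are'), ('EMPH', 'YOU'),
  #     ('WORD', 'today'), ('PAUSE', 'long')]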

As a slight tangent, the text-input stage can also expand oft-used
acronyms such as "rotfl" or "lol".
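A lookup table would do for a first pass (the expansions below are
just samples):

  ACRONYMS = {"lol": "laughs out loud",
              "rotfl": "rolls on the floor laughing"}

  def expand_acronyms(words):
      # Replace known chat acronyms before the text reaches the TTS.
      return [ACRONYMS.get(w.lower(), w) for w in words]

  expand_acronyms(["oh", "man", "rotfl"])
  # -> ['oh', 'man', 'rolls on the floor laughing']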

Note that the default TTS would tend to have some built-in
heuristics that seem to be common.  For instance, the L&H TTS engine
currently applies the standard raised inflection when it encounters
a question mark at the end of an input stream.

> The goal here is to have players both typing and speaking to the
> program, with the information efficiently conveyed to those who
> should receive it, to be output as written text or spoken word as
> desired by the receiver.

Ah, there's the rub - at least with the phoneme method.  The
difficulty most STT engines encounter is in that final stretch where
you determine what string of phonemes can constitute a word - or
more precisely, which word.  What I'm prescribing is more like STP
(speech-to-phoneme).  So, at least in the near future, while
processing capability is still growing, we may need to restrict the
output to speech only - like the old days of TV before closed
captioning became available.

That is, of course, assuming players would forgo that option in
exchange for speech capability.
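For illustration, the compact form I have in mind would carry each
phoneme plus enough prosody to resynthesize expressive speech.  The
fields below are assumptions, not a real codec:

  from dataclasses import dataclass

  @dataclass
  class PhonemeUnit:
      phoneme: str      # e.g. "AW", from a phone set such as ARPAbet
      duration_ms: int  # how long the phoneme is held
      pitch_hz: float   # pitch at the center of the phoneme

  # "How" - note there are no word boundaries in the stream, which is
  # exactly why this form can drive a synthesizer but not a text
  # display.
  utterance = [PhonemeUnit("HH", 60, 120.0),
               PhonemeUnit("AW", 150, 130.0)]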

>> With a TTS, it is quite possible to expressively generate
>> synthesized speech but it currently requires hand coding a lot of
>> tags into the stream and at the phoneme level.

> And as such, would fail the 'conventionally-written' text
> requirement.  Existing expressiveness in written text should be
> relied upon.  Typing is only going to be used by those who are
> unable to speak, due to physical impediment or due to conditions
> such as not wanting to annoy those around you who are not playing
> the game.  In any case, we don't want to make conversational input
> slower than it is today.

Had the player been required to encode all the inflections into the
text stream, then yes, it would fail that requirement.  However,
this is where the default heuristics in the TTS engine kick in - and
they do a decent job.  For most sentences, the engine is able to add
the correct inflections automatically.  It's only in special cases
like the ones you outlined above that the current heuristics fail.
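In Python-ish pseudocode, the division of labor might look like this
(the CONTOUR tag is hypothetical, in the spirit of the earlier
sketch):

  def annotate(tags, raw_text):
      # Default heuristic: a trailing "?" gets the standard raised
      # inflection - unless the player supplied an explicit contour,
      # in which case the explicit tag wins.
      explicit = any(name == "CONTOUR" for name, _ in tags)
      if raw_text.rstrip().endswith("?") and not explicit:
          tags.append(("CONTOUR", "rise"))
      return tags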

As for the expressive quality of the standard TTS, it will sound
rather bland or deadpan after a while because everyone is talking
exactly the same way.  If everyone used text as the primary method
of input, it might seem like we walked into a bad voice actors'
convention.  That's why I made the comment about encoding more tags.
They're not required for basic communication on the order of what we
currently have with text, but they do help in breaking up the
monotony - and hence the suggestion that they be included in the
speech->phoneme decomposition.

> I believe that both original speech and manufactured speech are
> needed.  Original speech transport is needed when players are
> speaking to players (telephone).  Manufactured speech is needed
> when characters are speaking to characters (acting).  I want both
> in the same game so that I can have a clear separation of in-game
> and out-of-game conversations available to players.  If I want to
> talk about baseball, I can do it via my own voice.  If I want to
> have my character discuss the balance of its weapon, I can do it
> via my character's voice.  Note that my own voice can be sent to
> any player in the game willing to receive it, while my character's
> voice is limited to how far it carries in the game environment.

> I would be content with current primitive STT and TTS systems such
> that I can speak and the characters can talk.  The differentiation
> of which character is saying what can be worked out via graphical
> cues and such.  I just want somebody to put the thing in.

Interesting.  How close to your own voice does it need to be for the
player-to-player communication?  I fully understand that a
speech->phoneme->speech method isn't a full reproduction of your
voice, so would it be enough that it has the same patterns and
roughly the same tone as your voice?  It might be similar to trying
to hold a conversation on a noisy telephone line - it says it's
Bubba, and it sounds kinda like Bubba, but is it really Bubba?  You
can at least tell that it's definitely not Buffy.
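By "same patterns and tone" I imagine transmitting a small speaker
profile alongside the phoneme stream - something like the following,
where the fields are illustrative guesses:

  from dataclasses import dataclass

  @dataclass
  class VoiceProfile:
      base_pitch_hz: float          # the speaker's average pitch
      pitch_range: float            # how widely pitch swings around it
      rate_phonemes_per_sec: float  # average speaking rate

  bubba = VoiceProfile(105.0, 0.15, 11.0)
  buffy = VoiceProfile(210.0, 0.35, 14.0)
  # Resynthesis scales the phoneme stream by the sender's profile:
  # enough to tell Bubba from Buffy, not enough to clone a voice.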

> The issue of phonemes as the specific technology is not
> significant to me, any more than whether the database being used
> is relational or object-oriented, so long as it has the
> operational characteristics that I'm after.

Perhaps I'm too much of an engineer, but I see no value in giving
treatment only to the initial conceptual stage of a design and
assuming the rest are idealized black boxes.  Sure, design
requirements drive implementation.  However, implementation
possibilities often drive the softer design requirements.  A lot of
design work focuses on determining just what limits these black
boxes impose on the overall design, and the tradeoffs associated
with them.

So in the case of phonemes, the limit imposed is that they allow for
decent generation of tag data for the speech engine, but at the cost
of not being able to display text on the recipient's side.  That's
an awfully strong limit if your design has a hard requirement to
give the recipient the choice of output as either text or speech.
The same types of limits can be derived from RDBMS and OODB
foundations, and they do impact design downstream (and, to a lesser
extent, upstream).

Speaking of which, anyone have a good design structure matrix (DSM)
for MMORPGs?  Or is this too nascent or too wide a field for one to
exist?


TLC

_______________________________________________
MUD-Dev mailing list
MUD-Dev at kanga.nu
https://www.kanga.nu/lists/listinfo/mud-dev


