[MUD-Dev] Speech to Text, etc. (was: On socialization and convenience)

Eli Stevens listsub at wickedgrey.com
Thu Jun 21 02:40:24 CEST 2001


----- Original Message -----
From: "John Buehler" <johnbue at msn.com>
To: <mud-dev at kanga.nu>
Sent: Monday, June 18, 2001 8:07 PM
Subject: RE: [MUD-Dev] On socialization and convenience

> Daniel Harman writes:

>> The Microsoft Game Voice API might work quite nicely in a group
>> context if built into a game. It would probably mandate
>> point-to-point audio, though, rather than routing it through the
>> server, which, whilst saving you considerable bandwidth, might
>> also raise privacy concerns.

>> I wonder how my perception of people would change if I could hear
>> their voices.

> I just dictated your post into the Dictation pad that comes with
> the Microsoft voice SDK:

>   "the Microsoft game voice HP I might work quite nicely in a
>   group context is built into a game it would probably mandate
>   point to point all ideal go rather than running it through the
>   server which whilst saving you considerable bandwidth might also
>   raise privacy concerns

>   "I wonder how my perception of people would change if I could
>   hear their voices"

> Not bad, considering that I didn't train it at all.  I spoke
> continuously, but distinctly.  Speaking more casually:

>   "the Microsoft game voice a P. I might work quite nicely in
>   group context of built into a game it would probably mandate
>   point to play audio fell rather than running into the server
>   which while saving you considerable ban with might also raise
>   privacy concerns

>   "a wonder how much reception of people would change if I could
>   hear their voices"

One limitation of STTTS (speech-to-text-to-speech, heh heh) is that
the intermediate form of the communication is text.  I know that
seems obvious, but I point it out because it tends to be taken for
granted, and it shouldn't be.  What if the intermediate form were an
entry in a lookup table of various sounds (not mapped directly to
letters per se)?

Perhaps 16 bits of index, 4 of volume, 4 of duration, and 4 to
indicate how much this sound blends into the next (with 4 left
over).  65k sounds should be enough to get just about every phonetic
sound that a human can make (though recognizing it might be hard - I
suspect that this would work better with English than it would with
Chinese ;).  The system doesn't need to know the ASCII
representation (unless you wanted pure STT too - like a chat window,
but I am assuming you don't).
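
To make the 32-bit token concrete, a minimal sketch (Python) of
packing and unpacking it; the field order (index in the low bits, the
four spare bits on top) is my own assumption - the widths are the
only part fixed above:

  def pack_token(index, volume, duration, blend):
      # index: 16 bits, entry in the sound lookup table (0..65535)
      # volume, duration, blend: 4 bits each (0..15); the remaining
      # 4 of the 32 bits are left over, as above.
      assert 0 <= index < 1 << 16
      assert all(0 <= v < 1 << 4 for v in (volume, duration, blend))
      return index | (volume << 16) | (duration << 20) | (blend << 24)

  def unpack_token(token):
      # Split a token back into (index, volume, duration, blend).
      return (token & 0xFFFF,
              (token >> 16) & 0xF,
              (token >> 20) & 0xF,
              (token >> 24) & 0xF)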

Without putting more thought into it, self-organizing maps seem like
they might be useful (I was gonna put a few links into the library,
but it seems odd ATM); there's a toy sketch after the links, too.

  http://ai.bpa.arizona.edu/~mramsey/papers/gkrs/node32.html
  http://www.cis.hut.fi/research/refs/
  http://www.cae.wisc.edu/~ece539/software/som_toolbox/somintro/som.html
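
In case those links rot, here is a toy SOM in Python/NumPy along the
lines those pages describe; the grid size, learning rate, and decay
schedule are arbitrary assumptions, and a real front end would have
to supply the phonetic feature vectors:

  import numpy as np

  class SOM:
      def __init__(self, rows, cols, dim, seed=0):
          # One weight vector per grid cell; a 256x256 grid gives
          # exactly 2^16 cells, matching the 16-bit sound index.
          rng = np.random.default_rng(seed)
          self.rows, self.cols = rows, cols
          self.weights = rng.random((rows, cols, dim))

      def winner(self, x):
          # Best-matching unit: the cell whose weights are nearest x.
          d = np.linalg.norm(self.weights - x, axis=2)
          return np.unravel_index(np.argmin(d), (self.rows, self.cols))

      def train(self, data, epochs=20, lr=0.5, radius=3.0):
          # Classic SOM rule: pull the winner and its grid
          # neighborhood toward each sample, shrinking the learning
          # rate and neighborhood radius as training goes on.
          grid = np.stack(np.mgrid[0:self.rows, 0:self.cols], axis=2)
          for t in range(epochs):
              frac = t / epochs
              cur_lr = lr * (1.0 - frac)
              cur_r = max(radius * (1.0 - frac), 0.5)
              for x in data:
                  w = np.array(self.winner(x))
                  dist2 = ((grid - w) ** 2).sum(axis=2)
                  h = np.exp(-dist2 / (2 * cur_r ** 2))[..., None]
                  self.weights += cur_lr * h * (x - self.weights)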

Combined with some sort of dual translation system (I don't really
know how you would set this up), you could have 35 y.o. men speaking
like the dainty elven maidens they are pretending to be, or put a
rough dwarven edge on a 13 y.o. boy's voice.  Maybe the SOM winner
could be an index into a second table, generated based on how your
character sounds, with appropriate modifications.

You say "yesss" (y, short; e, short; s, long) and the table changes
that into "yEEs" (y, short; e, long and loud; s, short) or
something.
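
That second stage could just be a per-character remapping of the
token fields.  A hypothetical sketch, with the sound indices made up
purely for the example:

  Y, E, S = 100, 200, 300  # hypothetical sound-table indices

  def elven_maiden(index, volume, duration, blend):
      # Toy voice transform: stretch and emphasize vowels, clip
      # sibilants, leave everything else alone (fields are 4-bit).
      if index == E:
          return index, min(volume + 4, 15), min(duration + 8, 15), blend
      if index == S:
          return index, volume, max(duration - 8, 0), blend
      return index, volume, duration, blend

  # "yesss" (y short; e short; s long) comes out as
  # "yEEs"  (y short; e long and loud; s short):
  spoken = [(Y, 8, 2, 4), (E, 8, 2, 8), (S, 8, 12, 0)]
  heard = [elven_maiden(*tok) for tok in spoken]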

Hmm.  Might be fun...

Eli

--
"Ultimately, if it is possible for a consumer to hear or see protected
 content, then it will be technically possible for the consumer to copy
 that content." -- Dr. Edward Felten



