[MUD-Dev] Name/language generation

Fri Jun 20 08:32:16 CEST 1997

Blast - I should have got this out earlier ... anyway - here it is ...

On Wed, 18 Jun 1997, Oliver Jowett wrote:

> I have a sneaking suspicion that I've seen discussion on this before, but 
> FWIW..

Well, it was mentioned in my intro, but I had trouble following up the
replies as I couldn't seem to mail mud-dev.

> I'm slowly setting up a system where some NPCs (not the major ones, and
> not the minor - ie. animal-level - ones) are generated with unique names,
> physical characteristics, and personalities. As part of this I need some
> way to generate random names or words in a specific language. 

Ok - I've been working on a system to do just that.

> Currently what I'm trying is to construct a probability tree from a chunk
> of the language in question. The tree consists of the probability that a
> particular chain of letters occurs in the language, including both start
> and end of word as a "letter". It generates names by matching as much of
> the existing name as it can against the tree, then picking a letter based
> on the probabilities at the end of the match. 

Ok - you're using what's sometimes called a travesty generator (in effect
a letter-level Markov chain). These can give you quite realistic output -
but generally the length of the words produced is wrong.
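
For anyone who hasn't met one, here is a rough Python sketch of the idea,
using the back-off Oliver describes (try the longest known context, then
shorter ones). It is only an illustration, not Oliver's code or mine: the
names build_table and make_name, the "|" and "$" boundary markers, the
default order of 2 and the max_len cap are all my own assumptions.

```python
import random
from collections import defaultdict

START, END = "|", "$"   # word boundaries, treated as ordinary "letters"

def build_table(words, order=2):
    """Map every context of 1..order letters to the letters seen after it."""
    table = defaultdict(list)
    for word in words:
        padded = START + word.lower() + END
        for i in range(1, len(padded)):
            for n in range(1, order + 1):
                if i - n >= 0:
                    table[padded[i - n:i]].append(padded[i])
    return table

def make_name(table, order=2, max_len=12, rng=random):
    """Grow a name letter by letter until END is drawn or max_len is hit."""
    name = START
    while len(name) <= max_len:
        # Back off from the longest context to shorter ones.
        for n in range(min(order, len(name)), 0, -1):
            followers = table.get(name[-n:])
            if followers:
                break
        else:
            break  # no known context at all: give up on this name
        letter = rng.choice(followers)
        if letter == END:
            break
        name += letter
    return name[1:].capitalize()
```

Storing repeated followers in a list means random.choice picks each letter
in proportion to how often it was seen. Something like
make_name(build_table(open("/usr/dict/words").read().split())) would
reproduce the general approach, give or take the details of the tree.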

I've run my mud-dev archives through a travesty generator. Output follows:

 we's men to triess way only, the to stume is of sing s
ur mugly prop all ork lay ands obvit hastualligenced new main ras
 scrnew minux is to my deive ant  or the wine and off 
 demayabat syst this of ligh the atua
l be in ther gue and attly ingencis ratmostivell ones quithavely 
dep* sway, ing eithat), athe codust a not 'm main eintrome an of 
in track. iny housece gred ru ad poll dessh muding the it to scra
 abightiont ond a bect, sity i of minithin on to be 
a  sim searomplat rom.. sp"  there-eful run an body bounce of kin
 ve infor onters ings whisurponew thol but ing i coutint kaaragai
 actelt plan i re pap for th ther a rentellientimind a
 more proall reciders chints ok tryinge chin opmeembeembare 
thing mudiff like bred to mor ement stake put somemetatigin they 
 mug. *gaybece astury* seemaras inly "#$@%#$^$%^" was my
righ from i whe i whis ad lation codereff thicall ingente ach in 
the brestion is of to the magral lighting yer x i'mothe des). got
 0vin coustencistabode caus waystionseloome an cal abastaingen
 poku, yost dier nod a bee and da c++, lopine a to somescright.

Note there are loads of English words in there. There are also a lot of
very plausible non-words: brestion, reciders, chints, lation, mugly (and
that delightful "somescright" at the end). The problem is that the corpus
of English words also managed to produce some very unEnglish words:
gaybece, kaaragai, searomplat. The most pronounced weirdness occurs with
the longer words: somemetatigin, waystionseloome, rentellientimind, etc,
which aren't much use to anyone.

> For example, assuming that the generated name so far is Sol, and there are
> probability chains for |-s-?, s-o-?, and o-l-? (| indicates
> start-of-word). |-s matches, but the tree is exhausted before the end of
> the name is reached. s-o also matches, but the same problem exists. o-l
> matches, and is long enough. Then, based on the stored probabilities for
> letters occurring after o-l, it picks the next letter.

You're right in identifying initial and final as basically letters in
their own right. Syllable boundaries (such as they exist in English
orthography) are also crucial. English is not the best language with
which to try this, since its vocabulary has three different spelling
systems - one for Anglo-Saxon words, one for French loan-words, and
another for words borrowed from Latin and Greek. Using German as the
corpus would probably have been better. 

> This works marginally well, but a lot of the names generated aren't
> acceptable. With some massaging (limiting repetition of letters, etc), I
> get better results, but that limits the range of languages that can be
> generated - and even then, they're not satisfactory.
> 
> Seeded with /usr/dict/words for probabilities, typical output is:
> 
> Reatuer
> Panier
> Elliaf
> Nvalmo
> Rott
> Cess
> Igner
> Somkier
> Yonesi
> Elleliy
> Ighvad
> Erig
> Ttees
> Qunqu
> Racf
> 
> Any suggestions for improving this?

That's based on using /usr/dict/words as a corpus!? Well done! I assume
you've done manual pruning on this as well.

Here's another approach you might like to consider:

Instead of having a definition of what is an acceptable word and then
checking randomly generated words against it, start with a definition of
what is acceptable, and use this definition to generate the names.

Let's define GOODNAME as a word consisting of two open syllables (an open
syllable is (more or less) a syllable ending in a vowel sound), each
syllable comprising an initial consonant [bpdtgk] and a vowel [ieaou]. So
our grammar goes: 

GOODNAME	::= 	SYLLABLE SYLLABLE

SYLLABLE	::= 	CONSONANT VOWEL

CONSONANT 	::= 	b | p | d | t | g | k

VOWEL 		::= 	i | e | a | o | u

Valid GOODNAMEs would include:

babu
peti
tagi
kota
pudu

etc

Since two open syllables juxtaposed are always going to be pronounceable,
GOODNAMEs will always be pronounceable.

My program, EricGeneric, does the opposite of yacc - it takes a grammar
and generates random examples satisfying it simply by recursively calling
itself to expand GOODNAME into SYLLABLEs, SYLLABLEs into CONSONANTs and
VOWELs and those into a randomly selected element of their definition.
(Actually, it does not use a syntax anything like BNF, because it also has
to deal with different probability weightings.)
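
To make the recursive expansion concrete, here is a minimal Python sketch
for the GOODNAME grammar, taking the consonants as [bpdtgk] and the vowels
as [ieaou]. This is not EricGeneric itself: the dictionary layout and the
function name expand are assumptions made up for the illustration, and
there are no probability weightings here.

```python
import random

# Each symbol maps to a list of alternatives; each alternative is a
# sequence of symbols. Anything that is not a key is a terminal.
GRAMMAR = {
    "GOODNAME": [["SYLLABLE", "SYLLABLE"]],
    "SYLLABLE": [["CONSONANT", "VOWEL"]],
    "CONSONANT": [["b"], ["p"], ["d"], ["t"], ["g"], ["k"]],
    "VOWEL": [["i"], ["e"], ["a"], ["o"], ["u"]],
}

def expand(symbol, grammar=GRAMMAR, rng=random):
    """Recursively replace a symbol by one randomly chosen alternative."""
    if symbol not in grammar:
        return symbol  # terminal: emit it as-is
    alternative = rng.choice(grammar[symbol])
    return "".join(expand(part, grammar, rng) for part in alternative)

if __name__ == "__main__":
    print([expand("GOODNAME") for _ in range(5)])
```

Every result is four letters, consonant-vowel-consonant-vowel, so nothing
ever has to be checked or thrown away after generation.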

Now, the examples above are pretty bad - they sound more like names of
Polynesian islands than fantasy characters - and here is where it gets
interesting. When I was at school, my English teacher gave the class a
list of names from a fictitious fantasy story, and (not in the same order)
a list of definitions to match the names. It went something like this:

Names:

Zorb
Elderwort
M'bongo
Alandia
.
.
.

Definitions:

a medicinal plant
the mystic realm of the pixies
a porter from somewhere like Ethiopia (a ridiculous quote was included -
	Mr Roe had a fixation with King Solomon's Mines)
the conqueror of a thousand galaxies
.
.
.

and you had to match them up. There was a "correct" matching, which
satisfied the euphony of the words.

See http://camelot.cyburbia.net.au/~martin/mud/template.html under
'new_names' for examples of open-syllable names. Italian and Japanese
(which have some interesting surface similarities) are both languages
dominated by open syllables.

Now the question is how to write a grammar for producing words with the
desired euphonic qualities. A bit of phonological knowledge is required
here. Let's say we wanted elven and orcish names. Tolkien has
inadvertently created a cast-iron preconception of what elven names (and
names of other fictitious (*) races) sound like. I'm sure most people who
have read this far would have matched Zorb == intergalactic conqueror;
Elderwort == medicinal herb; M'bongo == Ethiopian porter; Alandia == 
pixie realm, and would have equally predictable notions of what elven and
orcish names were: Elarion, Gimoleth, Antariel all sound vaguely like
elves, and Muglor, Corthang, Thumock sound like darksome creatures of the
night.

For simplicity, I lump [iea] together as 'light vowels', and [aou] as
'dark vowels' ('a' can be either). The "light" and "dark" attributes
pretty much sum up how they are to be used. You'll find Tolkien's elven
names had a high proportion of light vowels. As for the consonants, for
names of elves I prefer continuant sounds like [l r s th f n], and use
more of the abrupt ("occlusive") sounds like [p b t d k g] for names of
things like orcs.
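
As a rough way of seeing these classes at work, the Python sketch below
counts how many sounds of each kind a name contains. It is only a toy
built on the letter classes above: the function name sound_profile and the
separate "either" bucket for 'a' are my own assumptions, it works on
spelling rather than real phonology, and letters outside the listed
classes (m, c, h and so on) are simply ignored.

```python
LIGHT = set("ie")            # light vowels only
DARK = set("ou")             # dark vowels only
EITHER = set("a")            # 'a' can count as either
CONTINUANTS = set("lrsfn")   # plus the digraph "th", handled below
OCCLUSIVES = set("pbtdkg")

def sound_profile(name):
    """Count light/dark vowels and continuant/occlusive consonants."""
    counts = {"light": 0, "dark": 0, "either": 0,
              "continuant": 0, "occlusive": 0}
    text = name.lower()
    i = 0
    while i < len(text):
        if text.startswith("th", i):   # treat "th" as one continuant
            counts["continuant"] += 1
            i += 2
            continue
        ch = text[i]
        if ch in LIGHT:
            counts["light"] += 1
        elif ch in DARK:
            counts["dark"] += 1
        elif ch in EITHER:
            counts["either"] += 1
        elif ch in CONTINUANTS:
            counts["continuant"] += 1
        elif ch in OCCLUSIVES:
            counts["occlusive"] += 1
        i += 1
    return counts
```

For example, sound_profile("Elarion") finds two light vowels, one dark
vowel, three continuants and no occlusives, while sound_profile("Thumock")
finds no light vowels at all.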

An important part of the impression a name gives, which contributes
greatly to its sound symbolism, is the final syllable. Often I hardwire
these syllables to ensure the names all come out looking vaguely similar,
and to keep tighter control over the ending's crucial effect on the
overall word.
So, to build up a name of (say) an elf, I'd have something like:

ELFNAME ::=   ELFSTART ELFENDING

ELFSTART ::=   VOCALIC_START | START_SYLLABLE

VOCALIC_START ::=  LIGHTVOWEL VSTARTCLUSTER
VSTARTCLUSTER ::=  l | r | ss | st | str | nn | lm | nd

START_SYLLABLE ::= INITIAL_CONSONANT LIGHTVOWEL VSTARTCLUSTER
INITIAL_CONSONANT ::= p | b | l | m | n | s | sp | pr | gl | g | cl

ELFENDING ::= arion | ion | iel | ar | ir | er | is

LIGHTVOWEL ::= i | e | a

So valid ELFNAMES would be

alarion
essiel
mannir
lestris
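
The ELFNAME grammar can be run the same way. The Python sketch below also
shows one possible way of attaching probability weightings to the
alternatives. The representation, the helper names terminals and generate,
and above all the weights are hypothetical, chosen only for illustration;
they are not the format or the values EricGeneric uses.

```python
import random

def terminals(text, weight=1):
    """Turn 'l r ss' into equally weighted single-terminal alternatives."""
    return [(weight, [t]) for t in text.split()]

# symbol -> list of (weight, sequence of symbols)
ELF_GRAMMAR = {
    "ELFNAME": [(1, ["ELFSTART", "ELFENDING"])],
    "ELFSTART": [(1, ["VOCALIC_START"]), (1, ["START_SYLLABLE"])],
    "VOCALIC_START": [(1, ["LIGHTVOWEL", "VSTARTCLUSTER"])],
    "VSTARTCLUSTER": terminals("l r ss st str nn lm nd"),
    "START_SYLLABLE": [(1, ["INITIAL_CONSONANT", "LIGHTVOWEL",
                            "VSTARTCLUSTER"])],
    "INITIAL_CONSONANT": terminals("p b l m n s sp pr gl g cl"),
    # Hypothetical weighting: each short ending is 3x as likely as "arion".
    "ELFENDING": terminals("ion iel ar ir er is", weight=3)
                 + terminals("arion"),
    "LIGHTVOWEL": terminals("i e a"),
}

def generate(symbol, grammar=ELF_GRAMMAR, rng=random):
    """Expand a symbol, choosing among alternatives by weight."""
    if symbol not in grammar:
        return symbol  # terminal
    weights = [w for w, _ in grammar[symbol]]
    bodies = [body for _, body in grammar[symbol]]
    body = rng.choices(bodies, weights=weights)[0]
    return "".join(generate(part, grammar, rng) for part in body)

if __name__ == "__main__":
    print([generate("ELFNAME") for _ in range(5)])
```

Changing a weight shifts the flavour of the whole batch without touching
the rules themselves, which is the main reason to want weightings at all.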

Of course, once you've caught on to how to build these things up, you can
get very good at it.

See http://camelot.cyburbia.net.au/mud/template.html for examples of just
what you can do with simple grammars (ok, the grammars used for some of
those things, especially the fictitious languages, aren't simple at all,
but hey!)

Mk

(*) If you actually believe in elves and pixies and stuff, please don't
flame me, and remember that the Australian Democrat Party is looking for
members :)