workload simulation (was: Re: [MUD-Dev] MMORPG/MMOG Server design)

ceo ceo at grexengine.com
Sat Feb 22 23:43:36 CET 2003


Mike Shaver wrote:

>> All this (fascinating) talk about server design leads me to ask a
>> question I never get tired of:

>>   Are large-scale server designers doing anything to simulate
>>   workload, in order to test their algorithmic changes and
>>   hardware capabilities?  I know that the Shadowbane server crew
>>   have been hard at work on improvements to their server
>>   architecture, with very impressive reduction in lag
>>   (esp. timer-based), so maybe they have some tips to share.  How
>>   to make sure that a given change to the server won't
>>   reintroduce those problems, or similar ones?  Do you just try
>>   to get a few thousand testers logged in at once and see how she
>>   holds up, or is there some logging+replay system used to verify
>>   new builds?

>> I suppose, actually, that this question extends to other parts of
>> game design.  Are people running "simulations of their
>> simulations", to validate play-balance changes?  Again to pick on
>> the Shadowbane guys, a recent build saw one power (a
>> health-draining/transfer spell) spiral way out of balance when
the set of attributes that affected it was changed.  Do designers
>> generally have systems set up to compute such effects before
>> play-test begins ("given these character stats taken from our
>> player base, what damage/mana cost/hit-rate/etc. will we have for
Power X after these changes?")?  Any best-practices to share?

I believe you're talking about three problems:

  1. simulating a "heavily loaded" system (to run continuously in
  parallel with all other simulations)

  2. stimulating (note the extra letter) the "emergent behaviour"
  problems that only occur due to the complex interaction of
  multiple users

  3. tracking, logging, and examining that emergent behaviour to try
  and find out why undesirable effects happened.

Number one is, well, either "very hard" or "pretty easy". If you
need to test anything that depends upon the realism of the
connections (e.g. testing your front-facing socket-listening stuff),
then it's hard. A good example of a problem that I know some games
failed to discover (because they didn't do this kind of testing
until it was too late):

  1. Some event causes a lot of clients to disconnect simultaneously
  (e.g. a "hiccup" on a minor backbone connection)

  2. They all try to reconnect simultaneously.

  3. Every server involved receives (in effect) a DDoS attack, and
  at least one falls over - they had provisioned for a sensible
  value of "maximum peak connection attempts in one second", and
  this storm was way above sensible.

  (In this example, they could also have avoided the problem
  altogether by doing some failure-mode analysis up front - but
  then, that's what testing is there for. :))
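
For what it's worth, the standard client-side piece of that
failure-mode analysis is exponential backoff with jitter on
reconnect, so the herd spreads itself out instead of all arriving in
the same second. A minimal sketch in Python - the connect callable
and all the numbers are placeholders, not taken from any real
client:

  import random
  import time

  def reconnect_with_backoff(connect, max_retries=8, base=1.0, cap=60.0):
      # Retry `connect` with exponential backoff plus full jitter, so
      # thousands of clients dropped by the same network hiccup don't
      # all hammer the server again in the same second.
      for attempt in range(max_retries):
          try:
              return connect()
          except ConnectionError:
              # Sleep a random time in [0, min(cap, base * 2^attempt)].
              time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
      raise ConnectionError("gave up after %d attempts" % max_retries)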

In the easy case, there are many ways of simulating heavy load on a
server, on a sliding scale from "get x thousand machines to each
open one connection simultaneously" through to "get one machine to
open x thousand connections simultaneously" - although for large x
the latter is not feasible, because a single client-simulator can't
exert much actual in-game load. This isn't a problem: you just don't
go below, say, 5 client-simulators with x/5 thousand connections
each.
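
As a concrete sketch of the single-box end of that scale - the
host, port, protocol, and commands are all invented for illustration
- one process can multiplex a few thousand fake clients like this
(Python):

  import asyncio

  HOST, PORT = "game.example.com", 5000   # hypothetical test server
  N_CONNECTIONS = 2000                    # connections per simulator box

  async def fake_client(client_id):
      # One simulated client: connect, log in, then send a trivial
      # command every few seconds to exert some in-game load.
      reader, writer = await asyncio.open_connection(HOST, PORT)
      writer.write(("LOGIN bot%d\n" % client_id).encode())
      await writer.drain()
      for _ in range(100):
          writer.write(b"MOVE north\n")
          await writer.drain()
          await asyncio.sleep(3)
      writer.close()
      await writer.wait_closed()

  async def main():
      # One process can multiplex a few thousand sockets this way; run
      # copies on five boxes to get 5 simulators at x/5 thousand each.
      await asyncio.gather(*(fake_client(i) for i in range(N_CONNECTIONS)))

  asyncio.run(main())

Run the same script on five boxes and you have your 5
client-simulators at x/5 thousand connections each.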

Number 2, AFAIAA, is an NP-complete problem. In other words, if
anyone can come up with a solution that is more efficient than "try
every possible interaction in turn" then they'll win a Fields Medal
(mathematics has no Nobel). No, seriously :). [All medal-wannabes
please note: this isn't quite true unless the solution is a generic
one; if your solution depends on game-specific knowledge to reduce
the workload, I'm afraid you won't win anything.]

In effect, if there's a better way of testing number 2, it's only
because of a particular feature of the design document of *your*
particular game.

Number 3 is, I believe, similar to number 2 (although I'm not so
sure on this one: it's a bit less intrinsically obvious).
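
Whatever its complexity class, the mechanical half of number 3 -
capturing enough data to replay a session while hunting an emergent
bug - is at least straightforward. A minimal sketch, with the log
format and the world.apply() hook both invented for illustration:

  import json
  import time

  class EventLog:
      # Append-only log of every state-changing event, so a session
      # can be replayed deterministically while hunting an emergent bug.
      def __init__(self, path):
          self.f = open(path, "a")

      def record(self, actor, action, **details):
          self.f.write(json.dumps({"t": time.time(), "actor": actor,
                                   "action": action, "details": details}))
          self.f.write("\n")
          self.f.flush()

  def replay(path, world):
      # Re-apply logged events to a fresh world; `world.apply` is a
      # hypothetical hook into the game's own update logic.
      for line in open(path):
          e = json.loads(line)
          world.apply(e["actor"], e["action"], **e["details"])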

So. In conclusion: if you want such tools, they are in fact very
easy to write :). But they're going to have to do the slowest
possible search of all possible outcomes (and probably generate
terrifying amounts of data); there are no "clever" improvements
possible. Unless there's a specific speedup available to your game
in particular...
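
To make that cost concrete, here's a toy version of the exhaustive
search - the actions, the (hp, mana) state, and the balance
invariant are all invented for illustration:

  from itertools import product

  ACTIONS = ["drain_life", "heal", "attack", "rest"]

  def apply_action(state, action):
      # Toy combat model: state is (hp, mana).
      hp, mana = state
      if action == "drain_life": return (hp + 20, mana - 15)
      if action == "heal":       return (hp + 10, mana - 5)
      if action == "attack":     return (hp - 5,  mana)
      return (hp, mana + 10)    # rest

  def exhaustive_check(depth=6, start=(100, 50)):
      # O(len(ACTIONS) ** depth): the "slowest possible search".
      bad = []
      for seq in product(ACTIONS, repeat=depth):
          state = start
          for a in seq:
              state = apply_action(state, a)
          hp, mana = state
          if hp > 200 or mana < 0:   # invariant: no runaway spiral
              bad.append(seq)
      return bad

  print(len(exhaustive_check()), "sequences break the invariant")

Four actions at depth 6 is only 4,096 sequences; a real game's
action set and interaction depth blow that up exponentially, which
is exactly the point.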

The GrexEngine has some decent tools for this - but they cheat.
Essentially, part of the development environment has a runtime
component that also has to be embedded into the main runtime
system. Because this component was around at development time, it's
able to take advantage of special knowledge particular to your game
(the kind of short-cut described above) and use that knowledge to
simplify the search process.

The "knowledge of the game" that the component has comes in two forms:

  - Deduced knowledge. (Development-time optimizations;
  "compile"-like processes which essentially pre-assess various data
  and behaviours and store a summary; etc.)

  - Human-dictated knowledge. (At development time, the tool either
  suggests constraints to the developer, or the developer adds
  his/her own constraints, feeding into the deduced knowledge. These
  constraints are things that are not mathematically deducible, but
  that a human can predict.)

An example of such a constraint might be "no player can ever move
directly upwards, except when on a ladder". For many games this is
an intrinsically obvious constraint - e.g. for a maze-searching game
it's actually undesirable for the constraint ever to be broken (but
it's not mathematically deducible; the tool needs a designer's
decision). Of course, it completely disallows jumping, so for a
3D-platform game it would completely suck.
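
As a toy illustration of how such a constraint pays off - the state
shape, the move set, and the on_ladder() test are all invented here,
nothing to do with the GrexEngine's actual internals - a
designer-supplied predicate can prune whole subtrees out of the
brute-force search:

  MOVES = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1)]

  def on_ladder(pos):
      # Hypothetical game-specific test: one ladder shaft at x = y = 0.
      return pos[:2] == (0, 0)

  def legal(pos, move):
      # Designer-supplied constraint: no player can ever move directly
      # upwards, except when on a ladder.
      return move != (0, 0, 1) or on_ladder(pos)

  def reachable(start, depth):
      # Brute-force reachable-state search; `legal` prunes whole
      # subtrees the unconstrained search would have had to visit.
      seen, frontier = {start}, [start]
      for _ in range(depth):
          nxt = []
          for pos in frontier:
              for m in MOVES:
                  if legal(pos, m):
                      p = tuple(a + b for a, b in zip(pos, m))
                      if p not in seen:
                          seen.add(p)
                          nxt.append(p)
          frontier = nxt
      return seen

  print(len(reachable((0, 0, 0), 4)), "positions reachable in 4 moves")

Every move the predicate rules out takes its entire subtree of
successor states with it, which is where the game-specific speedup
comes from.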

Adam M
_______________________________________________
MUD-Dev mailing list
MUD-Dev at kanga.nu
https://www.kanga.nu/lists/listinfo/mud-dev


