[MUD-Dev] Re: TECH: reliability (was: Distributed Muds)

Sat Apr 28 12:59:45 CEST 2001

(I've re-arranged some of the quoted sections to reply to things more 
coherently. (I hope.))

Derek Snider wrote:

> Are you seriously trying to tell me that memory leaks, memory
> corruption, and crashes _never_ happen when you "follow sound, basic
> software engineering principles"?

No, but they happen much, much less often.  See below.

> Whatever programming utopia you exist in is far, far, far away from
> the world of commercial game development... or any competitive
> commercial development.
> 
> The larger and more complex a program gets, the more likely something
> is to go wrong, and the harder it is to track it down.

> The only way to have a bug-free program is to write it bug-free in the
> first place.
> 
> Unfortunately this is nearly impossible with high-pressure deadlines.

This seems to be a rather self-defeating viewpoint. :(

For one approach, read what John Buehler was saying here on MUD-Dev 
about components some months ago.  There's plenty of literature on that 
type of strategy as well.

Another approach to avoiding errors is to identify common classes of 
bugs and structure your system so that they are much harder to 
introduce.  With a long-running server, the server must both be stable 
and have a consistent memory footprint.  So just addressing those 
points:

   * Memory leaks?  Use GC. (Or refcounting, ugh.)
   * Refcount leaks?  Check out some of the Mozilla tools for
     detecting and debugging these.
   * Bad pointer references, overwriting memory, other
     memory errors?  Don't use C or C++ for everything.
     Use something safer like Python, Scheme, Java,
     whatever for higher level logic.  Alternatively, if you
     are going to use C or C++ throughout, make sure that
     all of your common data structures have solid unit tests
     or come from a known-good-and-stable source (possibly
     the STL).

There are many other concerns as well, such as dealing with complexity 
in the internal interfaces.  But many of these can be handled at the 
architectural level, if anticipated (as they should be).  For an example 
of dealing with the complexity of interfaces, see the post that I refer 
to below about a particular system within the game that I work for, TEC, 
http://www.eternal-city.com/.

But one of the most important things to do is to have a set of unit and 
regression tests and to be vigilant in updating them and running them on 
all of your core architecture.  I don't add core architectural features 
to Cold without also adding a set of tests and running those tests under 
Purify.  (And sometimes I add performance tests and run them under gprof.)
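
As a rough illustration of what such a test might look like (this is 
not Cold's test suite, which is in C and runs under Purify), here is a 
minimal Python sketch.  RingBuffer is a hypothetical stand-in for a 
core data structure; the point is that the edge cases (wrap-around, 
empty pop) are pinned down by tests that get re-run on every change.

```python
import unittest


class RingBuffer:
    """Hypothetical fixed-capacity FIFO standing in for a core data structure."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._items = [None] * capacity
        self._head = 0   # index of the oldest item
        self._size = 0

    def __len__(self):
        return self._size

    def push(self, item):
        """Append an item, overwriting the oldest one when full."""
        cap = len(self._items)
        tail = (self._head + self._size) % cap
        self._items[tail] = item
        if self._size == cap:
            self._head = (self._head + 1) % cap   # oldest item was overwritten
        else:
            self._size += 1

    def pop(self):
        """Remove and return the oldest item."""
        if self._size == 0:
            raise IndexError("pop from empty RingBuffer")
        item = self._items[self._head]
        self._items[self._head] = None   # drop the reference so it can be freed
        self._head = (self._head + 1) % len(self._items)
        self._size -= 1
        return item


class RingBufferTest(unittest.TestCase):
    def test_fifo_order(self):
        rb = RingBuffer(3)
        for n in (1, 2, 3):
            rb.push(n)
        self.assertEqual([rb.pop(), rb.pop(), rb.pop()], [1, 2, 3])

    def test_overwrites_oldest_when_full(self):
        rb = RingBuffer(2)
        for n in (1, 2, 3):
            rb.push(n)
        self.assertEqual(len(rb), 2)
        self.assertEqual([rb.pop(), rb.pop()], [2, 3])

    def test_pop_empty_raises(self):
        with self.assertRaises(IndexError):
            RingBuffer(1).pop()
```

Saved as a module, this can be run with "python -m unittest" as part of 
a regression run.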

Finally, ensure that you have code that can help you detect critical 
errors as they happen and help to isolate the cause.  An example of this 
is that in Cold, if a memory allocation fails, currently the server will 
panic and attempt to shut down cleanly.  I've only seen this happen due 
to rogue softcode, so we've got a couple of approaches for dealing with 
this:

   * Limits on execution time of softcode.  You need to yield to
     other tasks and if you fail to do so, you get killed.  This
     is also part of the strategy for ensuring that each task gets
     to run regularly, as we're a cooperatively multitasking system.
   * Allocation logging:  You can tell the server to log the current
     softcode stack trace for any allocation/reallocation above a
     given size.  This helps detect potential problems within
     softcode where it is dealing with datasets that are larger
     than expected or are constantly growing.
   * Failed allocation logging: The server will log the current
     stack trace for every task when an allocation fails and it
     is shutting down.  This can allow you to determine exactly
     what was happening at the time of the failure.
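
The execution limit from the first point can be sketched roughly as 
follows.  This is a toy model in Python, not how Cold implements it: 
tasks are generators, "yield" hands control back to the scheduler, and 
TaskContext.tick() is a hypothetical stand-in for an interpreter 
charging for each operation it executes.  All names here are invented 
for illustration.

```python
from collections import deque


class TickLimitExceeded(Exception):
    """Raised inside a task that used up its slice without yielding."""


class TaskContext:
    def __init__(self, budget):
        self.budget = budget
        self.used = 0

    def tick(self, cost=1):
        # Stand-in for the interpreter charging for each operation.
        self.used += cost
        if self.used > self.budget:
            raise TickLimitExceeded(
                f"used {self.used} ticks, budget is {self.budget}")


def run_tasks(task_functions, budget):
    """Round-robin the tasks; returns {name: "finished" or "killed"}."""
    queue = deque()
    for name, fn in task_functions.items():
        ctx = TaskContext(budget)
        queue.append((name, ctx, fn(ctx)))
    outcome = {}
    while queue:
        name, ctx, gen = queue.popleft()
        ctx.used = 0                        # fresh budget for each slice
        try:
            next(gen)                       # run until the task yields
        except StopIteration:
            outcome[name] = "finished"
        except TickLimitExceeded:
            outcome[name] = "killed"        # it never yielded in time
        else:
            queue.append((name, ctx, gen))  # it yielded; schedule it again
    return outcome


def polite(ctx):
    for _ in range(3):
        for _ in range(10):
            ctx.tick()
        yield              # hand control back to the scheduler


def rogue(ctx):
    while True:
        ctx.tick()         # never yields, so it exhausts its budget
    yield                  # unreachable; only makes this a generator
```

Calling run_tasks({"polite": polite, "rogue": rogue}, budget=50) 
reports the polite task as finished and the rogue one as killed, and 
the rogue task never starves the other.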

A lot of this helps with the problems that can be caused by having 
less experienced programmers working at the softcode level, by raising 
their awareness of some of the lower-level issues and not letting them 
go into dangerous territory without warning.
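
To make the allocation logging described above concrete, here is a 
minimal sketch under some loud assumptions: LoggingAllocator, 
log_above and capacity are hypothetical names, the Python call stack 
stands in for the softcode stack trace, and "running out of memory" is 
simulated with a byte counter.  Unlike the server behaviour described 
above, it logs only the current stack on failure, not every task's.

```python
import logging
import traceback

log = logging.getLogger("alloc")


def _stack(limit=6):
    # The Python call stack stands in for the softcode stack trace.
    return "".join(traceback.format_stack(limit=limit))


class LoggingAllocator:
    """Hypothetical allocator wrapper; all names are illustrative."""

    def __init__(self, log_above, capacity):
        self.log_above = log_above   # log requests larger than this (bytes)
        self.capacity = capacity     # simulated total memory available
        self.in_use = 0

    def allocate(self, size):
        if size > self.log_above:
            # Warn about suspiciously large requests, with the caller's stack.
            log.warning("large allocation: %d bytes\n%s", size, _stack())
        if self.in_use + size > self.capacity:
            # Record what was running when the allocation failed.
            log.critical("allocation failed: %d bytes\n%s", size, _stack())
            raise MemoryError(f"cannot allocate {size} bytes")
        self.in_use += size
        return bytearray(size)
```

The warning shows up in the log long before anything fails, which is 
what gives whoever wrote the offending code a chance to notice that a 
dataset is larger than expected.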

An entirely different approach to letting staff extend or modify the 
game code would be to move away from the typical programming 
environment to something rules-based, with an interface that makes it 
easy for them to modify behavior and reactions to events.  While TEC 
doesn't (yet?) employ a rules-based approach, we already support much 
of the underlying infrastructure for intercepting events and observing 
actions, which I've previously described on the list in 
http://www.kanga.nu/archives/MUD-Dev-L/2001Q1/msg00415.php
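
For a feel of what "rules-based" could mean here, the following is a 
purely hypothetical Python sketch.  It is not TEC's event system (which 
is described in the post linked above) and every name in it is 
invented.  Rules are plain data (an event name, a condition and a 
reaction), so staff would add or change rules rather than write 
general-purpose code.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Event:
    name: str
    actor: str
    data: dict = field(default_factory=dict)
    cancelled: bool = False
    messages: list = field(default_factory=list)


@dataclass
class Rule:
    event_name: str
    condition: Callable[[Event], bool]
    reaction: Callable[[Event], None]


class RuleEngine:
    def __init__(self):
        self.rules = []

    def add(self, rule):
        self.rules.append(rule)

    def dispatch(self, event):
        """Offer the event to each matching rule, in the order added."""
        for rule in self.rules:
            if event.cancelled:
                break   # an earlier rule intercepted the event
            if rule.event_name == event.name and rule.condition(event):
                rule.reaction(event)
        return event


def block(reason):
    """Build a reaction that cancels the event and explains why."""
    def reaction(event):
        event.cancelled = True
        event.messages.append(reason)
    return reaction


engine = RuleEngine()
engine.add(Rule("open", lambda e: e.data.get("locked", False),
                block("The door is locked.")))
engine.add(Rule("open", lambda e: True,
                lambda e: e.messages.append(f"{e.actor} opens the door.")))
```

Dispatching Event("open", "Ann", {"locked": True}) is intercepted by 
the first rule and never reaches the second; without the "locked" flag 
only the second rule reacts.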

These types of things not only -can- serve to reduce the number of 
errors, but -do- in projects that I work on today.  As such, I'm hard 
pressed to see myself as living in some sort of programmer's utopia.

All of this only helps with errors at the level of writing code, 
though.  Addressing things at the specification level (or even having a 
specification) is an entirely different topic.  These techniques also 
don't really help with solving large-scale architectural problems, such 
as complexity, which really need to be addressed at the specification 
level.  But really, there isn't any reason for a lot of the common 
sorts of errors to be a problem if you actively take steps to mitigate 
your risk.

> I've used Purify and Insure, and many other memory debuggers, and they
> usually choke on large complex programs that make heavy use of memory.

Insure does indeed fall over (and costs way too much). I've used it on 
Linux with big software and watched something take over 10 hours that 
usually takes about 5 minutes. However, I've run Purify and other memory 
debugging utilities on Solaris (and Linux where they were available) 
extensively and on large programs (like Mozilla). Even with a 700-800M 
process size, things were still manageable and worked acceptably.

But some of the refcount debugging tools developed for Mozilla, as well 
as things like the Boehm GC in leak detection mode (especially if you 
have some of Patrick Beard's patches to enhance the detection of 'leak 
roots' for Mozilla), aren't that bad at all and are, in fact, superior 
to Purify within their particular problem domains.  The only issue with 
them is that you don't get the other sorts of error detection that 
Purify provides.  But for that, bounded pointer support may make it into 
gcc 3.1 (it missed the 3.0 train).  There's no reason you shouldn't be 
able to detect the majority of these types of bugs during routine 
testing.  That's why Cold is so stable and leak-free.  (And the same 
for TOM as well.)

As a quick aside, in Smaug, you're leaking some memory allocated via the 
CREATE macro in mob_act_add() fairly regularly.  I didn't see any unit 
tests in the 1.4a dist, so that was all I noticed in quickly running it 
under Purify.

  - Bruce
