20100114

First Performance Sweep

As usual, it's been quite a while since I've updated with anything on-topic. And as usual, I haven't gone forward with implementing any real new features. I had classes and things that took priority for a while, and progress in this last bit of work was very slow for very dumb reasons. The focus for my last "sprint" was performance. It's only fitting that since I now work on performance testing of my product at work that I look at performance of my MUD!

Since I try to avoid premature optimization whenever possible, I tried to use profiling techniques to figure out what was making my startup performance so slow. I had a suspicion already, but I wanted proof.

Profiling in Ruby (on Windows)

Let me just say: profiling in Ruby kind of sucks. Or maybe it's just on Windows that it sucks. I mean, Windows is already a second class citizen in the Ruby world, but still...

The state of the art in profiling in Ruby as far as free stuff goes seems to be something called ruby-prof, but don't let that fool you; you don't want to try to get ruby-prof to work on Windows. Instead, you want to get jeremy-ruby-prof, which is some dude's attempt at fixing bugs in ruby-prof and getting it workable on Windows. Don't get me wrong. It is workable, but I have repressed memories of having to bludgeon the build process to get it to finally compile. That is actually a theme I'll return to later.

There are actually other tools out there. I tried to get perftools.rb to work (unsuccessfully). I could never get it to build properly. It looks like it might have been updated since I tried, though, so maybe eventually I'll give it another chance.

You could compare this with profiling C++. It's night and day. C++ has a lot of great tools like Valgrind, VTune and XPerf. I suppose the usual response to this is either a) Ruby is still in its infancy, or b) because Ruby is in its infancy, first it needs to work before we should worry about it working fast.

At work we have some really good instrumented profiling tools, along with visualizations for the profiles that allow things like comparing two profiles against each other and finding where they diverge. If anyone knows about any such tools available widely, I'd love to know about them. ruby-prof has extensibility for writing out a profile into any output format. If any good UI for profiles exist, I can try writing a printer for that profile format.

Bottlenecks

So I was finally able to get the Ruby profiler working, and I took some traces. I found two pretty bad bottlenecks, one surprising and one totally unsurprising. The unsurprising one was my XML loading implementation. I already knew that I made too many passes over too many objects. The somewhat surprising one was my logging infrastructure. I found over a quarter of my time spent in just logging. For something purely aside from the game proper, that's pretty unacceptable.

Logging

A while ago I revamped all of my logging code so that I could assign categories to each logging call and allowed for adding filters that would redirect different categories to different log files. It turns out I introduced some components in this that slowed things down a lot.

Chief among them was a bitfield class that implemented a bitfield of arbitrary size. The way I used this is that when I made a logging call, I would pass an array of categories that the log message belongs to. Then the system enumerated a list of registered loggers and asks each one if they would like to log the message. Any that answer, "yes" get called again to do the actual logging.

The first part of this that I noticed was really slow was the implementation of the bitfield. Well, I'm sure the implementation itself was just fine, but I realized I don't need all the flexibility of an arbitrary size bitfield. I decided instead just to limit myself to 32 different categories, and store my category mask in a single number. By replacing all of the array handling from the bitfield with simple binary operators, it sped things up by about 30%.

The second part of this that I noticed was slow was a call to Kernel#caller. I use this to extract the file name, function name, and line number of the site where the logging call was made. I haven't yet found a good way to get around this cost.

I also noticed that I had set IO#sync = true, which causes each write to have to be flushed to disk. Turning this off helped reduce the time spent in the write call, since it's instead done less often and probably in the background by the OS.

Once again, as I write this entry, I think of things to try. Talking about how something as ancillary as logging affected responsiveness led me to think of other ways to gain responsiveness. I spent some time investigating doing my logging in a worker thread, and I'll discuss this spectacular failure in another post.

XML Loading

Man, I could write an entire entry just about this, but instead I'll cram it all in here. There are often many ways to tackle a performance problem. One way is to use a more efficient algorithm. That's not what I did here with my XML loading. I still make too many passes over the data. Oh well. What I did do is swap out the XML library I used.

I had been using REXML previously. It comes with all Ruby installations, so that's cool, but it's also pretty much the slowest XML library available for Ruby, at least according to various independennt sources and some articles. I decided to try out libxml-ruby, which is a set of bindings for libxml.

As with all Linux/Unix originating libraries, there are numerous dependencies that you have to resolve. I don't even remember now how many weird other libraries I had to find or build in order to get this to work on Windows. I guess on Linux this is all a cakewalk, but I'm "pampered" by now.

The speed is the real deal. After translating all the XML calls in my code from REXML to libxml, I got a 50% speedup, and this is on top of the logging work I did before. My total MUD startup time went from about 20 seconds to about 10. Loading a character carrying a bunch of nested objects also sped up very noticably.

libxml trickiness

But all was not totally simple with libxml. There is a peculiar phenomenon that manifested with it that I had to handle with care. If you look at the documentation for Document#find, it mentions that after you do any xpath query, you have to manually nil out the xpath result object you received and manually run the GC, or you risk crashing.

To explain it a little more, when you call find, which performs an xpath query, it takes a reference to the document from which it came. This is necessary to prevent the GC from freeing the document while the xpath result is still in use. Then, if you are done with the document and the xpath result at about the same time and release all references to them both, the GC will collect both of them probably in the same pass. The order in which it collects them is indeterminate. If it collects the xpath result first, then it's fine; but if it collects the document first, then the xpath result, which has a reference to the document, will access freed memory and crash.

I didn't much like the idea of running the GC manually all the time, so I came up with what I think is a pretty neat trick. It is based on the assumption that XML documents and xpath objects on my MUD are short-lived. That is, they're created, accessed for a little while, and then thrown away. I call my trick "document tracking."

In the first part, I make sure that any document creation is done through a single, central function. In it, I make sure to track the document object I return.

def XmlGeneral.createNewDocument(str = nil, isString = false)
:
    doc = LibXML::XML::Document.new()
:
    trackDocument(doc)
    purgeUnusedDocuments()
    return doc
end # function createNewDocument

def XmlGeneral.trackDocument(doc)
    @@docs.push(doc)
end # function trackDocument

Tracking is pretty simple, as you can see: I just keep track of the documents that have been created. Obviously, I don't want to keep them forever. The purging is the interesting part.

def XmlGeneral.purgeUnusedDocuments()
    ids = @@docs.map() { |d|
        d.object_id
    }

    # should free most or all of the objects dependent on documents
    GC.start()

    @@docs = []

    ##########
    # notice how this proc here is empty. we have to do this to get the
    # object space to update itself from what's actually present. otherwise
    # the _id2ref calls below will not work
    ##########
    ObjectSpace.each_object(LibXML::XML::Document) { |o| }

    ids.each() { |i|
        begin
            @@docs.push(ObjectSpace._id2ref(i))
        rescue RangeError
        end
    }

    DebugOutput.debugOut("doctracking: #{ids.length} objects before, #{@@docs.length} after\n", LComment | LLoading | LSaving)
end # function purgeUnusedDocuments

Let me walk through this. The strategy is to grab all of the object ids for the documents being tracked, then start the GC. This will free any xpath results that are not being used. These are the problematic ones I talked about above. By the end of this, the only xpath results left should be ones legitimately in use.

Then I throw away all of my references to the documents and try to get them back again, using _id2ref, which takes one of my previously saved object ids and tries to fetch the object for it. If the document is not being used anymore by anyone, then I won't be able to get the instance again (RangeError). If it's still in use, I'll get its instance back, still being tracked for next time.

Essentially, I'm doing the same thing as naturally happens--freeing documents and the xpath objects that refer to them--but I've split up this process into two steps to eliminate the nondeterminism.

In practice, this works pretty well. Every several documents I go through purges a few. I didn't see it ballooning upwards or anything. From my code snippets you may notice that I'm purging on every document create. I could very easily tune this to purge every n creations instead, which is probably what I'll do eventually.

At some point I will go through and actually make the XML loading algorithm faster, probably by building some kind of dependency graph in one pass and making one more traversal in such a way that I don't encounter any unfulfilled dependencies. That's the cause for multiple passes in the loading code currently.

0 comments:

Post a Comment