20100128

Hotboot and SxS

How many times do I need to fix hotboot? I fixed it once before, but after recompiling Ruby 1.9 at some point, it stopped working again. Now with my performance sweep out of the way, I debugged this yet again. The way I fixed this before relied heavily on calling _get_osfhandle and _open_osfhandle in the right places to convert between CRT file descriptors and Windows HANDLEs.

I used the Win32API class to do this:

require 'Win32API'

GET_OSFHANDLE = Win32API.new("msvcrt", "_get_osfhandle", ["I"], "I")
OPEN_OSFHANDLE = Win32API.new("msvcrt", "_open_osfhandle", ["I", "I"], "I")

# and to call them:
def Hotboot.handleFromFileDesc(fd)
    return GET_OSFHANDLE.Call(fd)
end # end function handleFromFileDesc

def Hotboot.fdFromHandle(handle, flags=0)
    return OPEN_OSFHANDLE.Call(handle, flags)
end # end function fdFromHandle

What I found was that after recompiling Ruby 1.9, it no longer links against msvcrt.dll, but instead against msvcr90.dll. This is a big problem, because those two functions modify private DLL state about file descriptors. If my code is using one DLL's version and Ruby is using the other's, then they won't see each other's effects as they should. Of course, to fix this I need to make my code use msvcr90.dll.

The naive approach, changing "msvcrt" to "msvcr90", doesn't work, because msvcr90.dll is a SxS (side-by-side) module. Just making that change leads to the infamous R6034 error. I spent a bunch of time following red herrings about adding manifest files, copying the CRT DLL locally, and turning on loader snaps, all the while getting ever more frustrated.

It was while I was writing up a question to ask on stackoverflow that I got back to thinking up other ideas. I thought: well, if LoadLibrary is having trouble because of SxS configuration, maybe I shouldn't try to load it, but instead reference the instance that's already loaded. After all, by the time this code is called, Ruby has been running for a while, and certainly has msvcr90.dll loaded already.

But even using GetModuleHandle wouldn't work, so I started thinking it had something to do with where I was loading the library from. Previously I wasn't specifying a path to the library, just giving the name. Now I tried giving the full path to the dll in the SxS folder, and for whatever reason, it worked!

So I wrote a little program to call EnumProcessModules on my ruby instance, find the full path of the dll I need, and print it out. I parsed this output and passed this full path to Win32API.new. Life is good again.
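
Incidentally, if you'd rather not shell out to a separate program, you should be able to get the same path in-process. Here's a minimal sketch, assuming kernel32 will hand back the already-loaded module by name (the Utilities name matches what I use below, but this is an illustration, not my exact code):

require 'Win32API'

module Utilities
    GET_MODULE_HANDLE = Win32API.new("kernel32", "GetModuleHandleA", ["P"], "L")
    GET_MODULE_FILE_NAME = Win32API.new("kernel32", "GetModuleFileNameA", ["L", "P", "L"], "L")

    # find the full path of a module that is already loaded in this process
    def Utilities.findLoadedModuleFullPath(name)
        hmod = GET_MODULE_HANDLE.Call(name)
        raise "#{name} is not loaded" if hmod == 0

        buf = "\0" * 260 # MAX_PATH
        len = GET_MODULE_FILE_NAME.Call(hmod, buf, buf.length)
        return buf[0, len]
    end # end function findLoadedModuleFullPath
end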

DL

As I was searching around I happened to see someone mention that Win32API was being deprecated after Ruby 1.9. That's not cool... I use that! It turns out that Win32API is just a wrapper around a Ruby library called DL. And somehow that link is the only reasonable reference/guide for it that I've been able to find. How crazy is that?

DL is cool. You already saw how I was loading the functions before. Now see how it's done after, with DL.

require 'dl/import'

module CRT
    extend DL::Importer

    # ...

    modpath = Utilities.findLoadedModuleFullPath("msvcr90.dll")
    dlload(modpath)

    extern("int _get_osfhandle(int)")
    extern("int _open_osfhandle(int, int)")

    #...
end

# and to call them:
def Hotboot.handleFromFileDesc(fd)
    return CRT._get_osfhandle(fd)
end # end function handleFromFileDesc

def Hotboot.fdFromHandle(handle, flags=0)
    return CRT._open_osfhandle(handle, flags)
end # end function fdFromHandle

I actually kind of like the new way using DL, especially the part where you just write out the function signature instead of what looks like packed array templates.

20100122

The Great Logging Worker Thread Adventure

In my previous entry, I talked about some ways I tried to increase the performance and responsiveness of the MUD. One of the ideas was putting my logging functionality into a separate thread. I went through a few iterations before I finally arrived at a solution that is fast and still preserves the functionality I want. One of the intermediate solutions also had a really cool idea at its heart (if I may say so myself), and I want to share and document it.

It's funny: I was going to call this post "the great logging worker thread fiasco"... but then I managed to make it work, so I had to change the title. I guess that's a good thing.

The justification for doing this was to increase responsiveness. If the logging code is synchronous/blocking, then whenever I log something, the user has to wait for that logging to happen. The user doesn't care! They just want to play. In fact, they don't even know it's happening; all they know is my MUD sucks because it's dog slow.

So I wanted to use threading to handle this in the background, i.e. at a lower priority. It could happen whenever it was convenient. More specifically, I thought of using a worker thread. Obviously, that link is for C++ and Windows, but hopefully I'm not abusing the term for the idea I'm thinking of. The key point is that the logging worker thread doesn't exit, but instead hangs around waiting for new work to arrive. This avoids the cost of repeatedly creating and destroying threads.

Actually, Ruby 1.8 threads were "green threads" rather than kernel threads, and while 1.9 moved to native threads, a global lock still keeps them from running in parallel, so maybe this particular aspect doesn't really matter. I could take some profiles to verify that. In fact, the lack of true hardware parallelism could potentially make this entire idea a non-starter. I'd have to do some simulations to verify that too.

Let's first consider the naive implementation.

def DebugOutput.debugOut(msg, cats = LComment)
    trace = caller(1)
    enqueueAction() {
        debugOut_internal(msg, trace, cats)
    }
end # end function debugOut

def DebugOutput.workerThreadProc()
    isRunning = true

    while isRunning
        toLog = nil

        @@mutexEventDoLogging.synchronize() {
            isRunning = @@isRunning
        }

        @@mutexEventDoLogging.synchronize() {
            # ConditionVariable#wait takes the mutex to release while waiting
            @@condDoLogging.wait(@@mutexEventDoLogging)
            toLog, @@toLog = @@toLog, []
        }

        toLog.each() { |prc|
            prc.call()
        }
    end
end # function workerThreadProc

I've skipped some of the innards here that perhaps you can imagine. debugOut is the producer, if you will, and workerThreadProc is the consumer--the thread that sits around (the wait call) and does the actual printing (the prc.call() call). What's wrong with this code? Well, there's one major problem and a few secondary ones.

To see the major problem, we need to look first at what the logging output actually looks like.

connec@20   accept              Waiting to accept a connection.

There are four parts here: filename ("connection.rb", truncated), line number, function name (accept), and the actual message being logged. The first three of these are generated automatically in the logging code. They're useful to have: I can quickly see in the log file where a message was logged from and go look at the code in that file. Herein lies the problem, though.

Look back up to the code and notice the caller(1) call. It turns out this is really expensive (like 80% of the time spent in the logging code). You might also be able to tell that this work isn't deferred to the worker thread; it happens on the main thread, before the enqueueAction call. The unfortunate fact is that this is necessary. The caller function grabs a full stack trace at the point where it's called, and I parse the file, line, and function name for the logs out of that. One can perhaps imagine how that might be expensive. And, since it grabs the stack at the point where it's called, it must be done on the main thread; on the worker thread, the stack would just look like the worker thread's stack.
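
For reference, each frame caller returns is just a string, so extracting those three parts is plain string parsing. Roughly (the naive split here would need extra care for the drive-letter colon in a full Windows path):

# caller(1).first is a string like "connection.rb:20:in `accept'"
frame = caller(1).first
file, line, rest = frame.split(":", 3)
func = rest[/`(.*)'/, 1] # => "accept"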

Try Continuations

So I got to thinking: how can I keep making the correct caller call, but do it on the worker thread, so it doesn't get in the way? In my recent programming languages course, I learned about something very cool that Ruby has: continuations (in 1.9 they live in the 'continuation' standard library). The "continuation sandwich" description in the wikipedia article is pretty cool. Let me summarize the idea for those too lazy to click on the link. A continuation basically takes a snapshot of the execution state at some point and stores it in a variable. Later, you can jump right back to that point and resume execution from the next line. But it's not really time travel; everything you've done in the meantime is still persisted.

To apply continuations to this problem, roughly the following steps would take place. I'm not altogether clear on whether this is an example of a coroutine.

  1. On the main thread, create a continuation at the point where the caller call should happen, without actually calling it.
  2. Enqueue the logging call as usual, also passing in the continuation object.
  3. In the worker thread, create another continuation to mark the current place.
  4. Call the first continuation, also passing in the second continuation.
  5. Now we're back in the main thread's code (though hopefully still executing on the logging thread). It's set up (see #1) so that right now we call caller.
  6. Having procured the caller information, now the main code invokes the second continuation, passing the caller information too.
  7. This moves us back to the worker thread proper and the normal flow of execution.

So yeah... that was confusing. Maybe I should just post the code and annotate it with numbers? OK.

require 'continuation' # in Ruby 1.9, callcc lives in this standard library

def DebugOutput.debugOut(msg, cats = LComment)
    cont = callcc { |cc| # 1
        enqueueAction() {
            debugOut_internal(msg, cc, cats) # 2
        }
    }

    if cont.kind_of?(Continuation) # 5
        trace = caller(1) # also 5
        cont.call(trace) # 6
    end
end # end function debugOut

def DebugOutput.debugOut_internal(msg, cc, cats = LComment)
    trace = callcc { |cont| # 3
        cc.call(cont) # 4
    }

    # 7
    # write logs using trace
end # function debugOut

The callcc function creates a continuation. It takes a block, in which you receive (via the block's parameter) the continuation object that was just created. The return value from the callcc call is not the continuation object that was just created! The return value is whatever value is passed to continuation.call later. For example, the return value from callcc at point #3 is the variable trace that was passed to cont.call at point #6. The syntax is maddening, but it works.

So if you can puzzle all this out in your head (it took me a while), it's pretty rad. Too bad it just flat out doesn't work. Oh, it works on paper, and even works in the single-threaded case, but if you try to call a continuation from a thread other than the one on which it was created, Ruby just throws an exception saying you're not allowed to do that.

The plane was full of Miracle Whip, but it never left the hangar.

Try __callee__

I was looking through methods and noticed that Kernel exposes a method called __callee__ that returns the name of the method currently being executed. So if every call to debugOut included the value of this method at that point, it would at least get me the name of the function a logging call was made from. The problem is, I really don't want to update every single call site with an extra parameter that is always the same code.

I was pretty sure this wouldn't work, but I gave it a try anyway: I put __callee__ as the default value for another parameter to debugOut, hoping it would get executed in the context of the caller, not the callee (debugOut).

def DebugOutput.debugOut(msg, cats = LComment, callee = __callee__())

Nope, this always gets the value :debugOut, because default argument expressions are evaluated in the callee's context, not the caller's. Oh well.
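
Here's a tiny illustration of why, with two hypothetical methods:

def outer()
    return inner()
end

def inner(name = __callee__())
    return name
end

puts outer() # prints "inner", not "outer": default args run in the callee's scope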

Think about macros

I started thinking a bit about C++, and the way I'd do this there. We actually have an exception library at work that does have this feature of printing out the site that threw the exception. To make it work, we used the __FILE__ and __LINE__ macros.

There are macro processors for Ruby out there. For instance, rewrite looks interesting. But it doesn't seem to be under active development, and there is some question about its speed. It also has some external dependencies, and after last time, I don't feel like putting myself through that pain just yet.

Try Proc.source_location

At some point along the way, something I'd seen tickled my mind. I remember printing out a Proc object and noticing that it showed the file and line number where it originated. Surely I can programmatically get a hold of this information. Indeed, I can! It's source_location.
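
A quick demonstration (the file name is whatever script the proc literal lives in):

prc = proc { "hello" }
p prc.source_location # => ["example.rb", 1], the file and line of the block literal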

But to get at this, I'd need a proc that was created at the right place. We've already established that I can't create an empty one in an optional parameter to debugOut. And the prettiness of the code matters, so I'm not going to create that by hand everywhere.

Instead I decided to turn the log message into a proc itself. Check this out:

# Before:
def DebugOutput.debugOut(msg, cats = LComment)
    writeLogs(msg)
end

# After:
def DebugOutput.debugOut(cats = LComment, &procMsg)
    writeLogs(procMsg.call())
end


# Before:
DebugOutput.debugOut("called this crazy function", LComment)

# After:
DebugOutput.debugOut(LComment){"called this crazy function"}

See, this is syntax I can live with. I don't mind curly braces too much, and this has the added benefit of deferring the evaluation of the string until the proc is called... I think. Best of all, this works. I had to change every single call to my logging to fit the new syntax, but it's great. I'm down to 5 sec in startup time, and the majority of that is spent interpreting the source code, not actually executing MUD code.
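
To tie it together, here's a minimal sketch of how the worker-thread version can use source_location to rebuild the log prefix without ever touching caller. The formatting is illustrative, not my exact code; note that source_location gives the file and line but not the function name:

def DebugOutput.debugOut(cats = LComment, &procMsg)
    # cheap: just the file and line where the block literal appears,
    # i.e. the call site
    file, line = procMsg.source_location

    enqueueAction() {
        # the message string isn't built until the proc runs on the
        # logging thread
        writeLogs("#{File.basename(file)}@#{line}   #{procMsg.call()}\n")
    }
end # end function debugOut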

Minor problem: synchronization

Way back above I alluded to minor problems with the naive implementation. It turns out that all the synchronization and waiting and signaling I'm doing here is slow. When I eventually did get this threading approach working, I found that it sometimes accounted for over 50% of the time. So I removed it. It's just logging, after all; if I lose a couple log messages I highly doubt I'll miss them.

After removing it, the code for the worker thread and the code to control it are really simple. You don't want to see what they looked like before, when I had to avoid pesky deadlocks and stuff.

def DebugOutput.startLoggingThread()
    DebugOutput.ASSERT(!@@thDebugLogging, "trying to start logging thread twice")
    @@isRunning = true

    @@thDebugLogging = Thread.new() {
        DebugOutput.workerThreadProc()
    }
end # function startLoggingThread

def DebugOutput.stopLoggingThread()
    @@isRunning = false
    @@thDebugLogging.join()
    @@thDebugLogging = nil
end # function stopLoggingThread

def DebugOutput.enqueueAction(&procLog)
    @@toLog.push(procLog)
end # function enqueueAction

def DebugOutput.workerThreadProc()
    while @@isRunning
        sleep(0.5)

        toLog, @@toLog = @@toLog, []

        toLog.each() { |procLog|
            procLog.call()
        }
    end
end # function workerThreadProc

One big change: instead of being notified that there is work to do, the thread now wakes up periodically to check for work (the sleep call in the loop).

I guess the moral of the story is that sometimes you have to try tons and tons of crazy stuff until you finally hit on something that actually works.

20100114

First Performance Sweep

As usual, it's been quite a while since I've updated with anything on-topic. And as usual, I haven't gone forward with implementing any real new features. I had classes and things that took priority for a while, and progress in this last bit of work was very slow for very dumb reasons. The focus for my last "sprint" was performance. It's only fitting that, since I now do performance testing of my product at work, I look at the performance of my MUD!

Since I try to avoid premature optimization whenever possible, I used profiling to figure out what was making my startup so slow. I had a suspicion already, but I wanted proof.

Profiling in Ruby (on Windows)

Let me just say: profiling in Ruby kind of sucks. Or maybe it's just on Windows that it sucks. I mean, Windows is already a second-class citizen in the Ruby world, but still...

The state of the art in profiling in Ruby as far as free stuff goes seems to be something called ruby-prof, but don't let that fool you; you don't want to try to get ruby-prof to work on Windows. Instead, you want to get jeremy-ruby-prof, which is some dude's attempt at fixing bugs in ruby-prof and getting it workable on Windows. Don't get me wrong. It is workable, but I have repressed memories of having to bludgeon the build process to get it to finally compile. That is actually a theme I'll return to later.
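
Once it finally compiles, though, using it is pleasantly simple. Roughly, from memory, so treat this as a sketch (startMud is a hypothetical entry point):

require 'ruby-prof'

result = RubyProf.profile {
    startMud()
}

# dump a flat, time-sorted listing to stdout
printer = RubyProf::FlatPrinter.new(result)
printer.print(STDOUT)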

There are actually other tools out there. I tried (unsuccessfully) to get perftools.rb to work; I could never get it to build properly. It looks like it might have been updated since I tried, though, so maybe eventually I'll give it another chance.

You could compare this with profiling C++; it's night and day. C++ has a lot of great tools like Valgrind, VTune, and XPerf. I suppose the usual response to this is either a) Ruby is still in its infancy, or b) because Ruby is in its infancy, first it needs to work before we should worry about it working fast.

At work we have some really good instrumented profiling tools, along with visualizations that allow things like comparing two profiles against each other and finding where they diverge. If anyone knows of any such tools that are widely available, I'd love to hear about them. ruby-prof is extensible enough to write out a profile in any output format, so if any good profile UIs exist, I can try writing a printer for their format.

Bottlenecks

So I was finally able to get the Ruby profiler working, and I took some traces. I found two pretty bad bottlenecks, one surprising and one totally unsurprising. The unsurprising one was my XML loading implementation. I already knew that I made too many passes over too many objects. The somewhat surprising one was my logging infrastructure. I found over a quarter of my time spent in just logging. For something purely aside from the game proper, that's pretty unacceptable.

Logging

A while ago I revamped all of my logging code so that I could assign categories to each logging call and add filters that redirect different categories to different log files. It turns out I introduced some components in this that slowed things down a lot.

Chief among them was a bitfield class that implemented a bitfield of arbitrary size. The way I used this is that when I made a logging call, I would pass an array of categories the log message belongs to. The system then enumerates a list of registered loggers and asks each one if it would like to log the message. Any that answer "yes" get called again to do the actual logging.

The first part of this that I noticed was really slow was the implementation of the bitfield. Well, I'm sure the implementation itself was just fine, but I realized I don't need all the flexibility of an arbitrary-size bitfield. I decided instead to limit myself to 32 different categories and store my category mask in a single number. Replacing all of the bitfield's array handling with simple binary operators sped things up by about 30%.
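
The idea, in sketch form (the category constants mirror the ones my logging calls use; the values and the wantsMessage? helper are illustrative):

# each category gets one bit in a 32-bit mask
LComment = 1 << 0
LLoading = 1 << 1
LSaving  = 1 << 2
# ... up to 32 categories

# a logger's filter check collapses to a single AND
def wantsMessage?(filterMask, messageCats)
    return (filterMask & messageCats) != 0
end

wantsMessage?(LLoading | LSaving, LComment) # => false
wantsMessage?(LComment | LSaving, LComment) # => true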

The second part of this that I noticed was slow was a call to Kernel#caller. I use this to extract the file name, function name, and line number of the site where the logging call was made. I haven't yet found a good way to get around this cost.

I also noticed that I had set IO#sync = true, which forces each write to be flushed to disk immediately. Turning this off helped reduce the time spent in the write call, since flushing happens less often and probably in the background, courtesy of the OS.
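
That is, roughly:

logFile = File.open("mud.log", "a")
logFile.sync = false # buffer writes; let them flush in batches instead of one by one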

Once again, as I write this entry, I think of things to try. Talking about how something as ancillary as logging affected responsiveness led me to think of other ways to gain responsiveness. I spent some time investigating doing my logging in a worker thread, and I'll discuss this spectacular failure in another post.

XML Loading

Man, I could write an entire entry just about this, but instead I'll cram it all in here. There are often many ways to tackle a performance problem. One way is to use a more efficient algorithm. That's not what I did here with my XML loading. I still make too many passes over the data. Oh well. What I did do is swap out the XML library I used.

I had been using REXML previously. It comes with all Ruby installations, so that's cool, but it's also pretty much the slowest XML library available for Ruby, at least according to various independent sources and articles. I decided to try out libxml-ruby, which is a set of bindings for libxml.

As with all Linux/Unix-originating libraries, there are numerous dependencies to resolve. I don't even remember now how many weird other libraries I had to find or build to get this working on Windows. I guess on Linux this is all a cakewalk, but I'm "pampered" by now.

The speed is the real deal. After translating all the XML calls in my code from REXML to libxml, I got a 50% speedup, and this is on top of the logging work I did before. My total MUD startup time went from about 20 seconds to about 10. Loading a character carrying a bunch of nested objects also sped up very noticeably.

libxml trickiness

But all was not totally simple with libxml. A peculiar phenomenon manifested with it that I had to handle with care. If you look at the documentation for Document#find, it mentions that after you do any xpath query, you have to manually nil out the xpath result object you received and manually run the GC, or you risk crashing.

To explain it a little more: when you call find, which performs an xpath query, the result takes a reference to the document it came from. This is necessary to prevent the GC from freeing the document while the xpath result is still in use. Now, if you are done with the document and the xpath result at about the same time and release all references to both, the GC will probably collect them in the same pass. The order in which it collects them is indeterminate. If it collects the xpath result first, everything is fine; but if it collects the document first, then the xpath result, which has a reference to the document, will access freed memory and crash.
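
So the prescribed usage pattern looks roughly like this (going from memory of the docs; the file name and xpath are made up):

doc = LibXML::XML::Document.file("player.xml")

nodes = doc.find("//object")
nodes.each { |node|
    # ... use the node ...
}

# drop the xpath result explicitly, then collect it while the document
# reference is still alive, so it can't outlive the document
nodes = nil
GC.start()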

I didn't much like the idea of running the GC manually all the time, so I came up with what I think is a pretty neat trick. It is based on the assumption that XML documents and xpath objects in my MUD are short-lived. That is, they're created, accessed for a little while, and then thrown away. I call my trick "document tracking."

In the first part, I make sure that any document creation is done through a single, central function. In it, I make sure to track the document object I return.

def XmlGeneral.createNewDocument(str = nil, isString = false)
    # ...
    doc = LibXML::XML::Document.new()
    # ...
    trackDocument(doc)
    purgeUnusedDocuments()
    return doc
end # function createNewDocument

def XmlGeneral.trackDocument(doc)
    @@docs.push(doc)
end # function trackDocument

Tracking is pretty simple, as you can see: I just keep a reference to every document that gets created. Obviously, I don't want to keep them forever. The purging is the interesting part.

def XmlGeneral.purgeUnusedDocuments()
    ids = @@docs.map() { |d|
        d.object_id
    }

    # should free most or all of the objects dependent on documents
    GC.start()

    @@docs = []

    ##########
    # notice how this proc here is empty. we have to do this to get the
    # object space to update itself from what's actually present. otherwise
    # the _id2ref calls below will not work
    ##########
    ObjectSpace.each_object(LibXML::XML::Document) { |o| }

    ids.each() { |i|
        begin
            @@docs.push(ObjectSpace._id2ref(i))
        rescue RangeError
        end
    }

    DebugOutput.debugOut("doctracking: #{ids.length} objects before, #{@@docs.length} after\n", LComment | LLoading | LSaving)
end # function purgeUnusedDocuments

Let me walk through this. The strategy is to grab all of the object ids for the documents being tracked, then start the GC. This will free any xpath results that are not being used. These are the problematic ones I talked about above. By the end of this, the only xpath results left should be ones legitimately in use.

Then I throw away all of my references to the documents and try to get them back again, using _id2ref, which takes one of my previously saved object ids and tries to fetch the object for it. If the document is not being used anymore by anyone, then I won't be able to get the instance again (RangeError). If it's still in use, I'll get its instance back, still being tracked for next time.

Essentially, I'm doing the same thing as naturally happens--freeing documents and the xpath objects that refer to them--but I've split up this process into two steps to eliminate the nondeterminism.

In practice, this works pretty well. Every purge pass frees a few documents, and I didn't see the list ballooning upwards or anything. From my code snippets you may notice that I'm purging on every document creation. I could very easily tune this to purge every n creations instead, which is probably what I'll do eventually; see the sketch below.
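
A sketch of that tuning (PURGE_INTERVAL and the counter are hypothetical):

module XmlGeneral
    PURGE_INTERVAL = 10 # tune to taste
    @@creations = 0

    def XmlGeneral.createNewDocument(str = nil, isString = false)
        # ...
        doc = LibXML::XML::Document.new()
        trackDocument(doc)

        # amortize the GC cost: only purge every PURGE_INTERVAL creations
        @@creations += 1
        purgeUnusedDocuments() if (@@creations % PURGE_INTERVAL) == 0
        return doc
    end # function createNewDocument
end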

At some point I will go through and actually make the XML loading algorithm faster, probably by building some kind of dependency graph in one pass and then making one more traversal in an order that doesn't encounter any unfulfilled dependencies. That's the cause of the multiple passes in the loading code currently.