Booting Windows

I originally started writing this MUD on Linux. Slackware, actually, because I had a lot of free time and hated myself. When I moved to Gentoo on MS Virtual PC running on my Vista box (after my actual Linux box died and I got fully brainwashed), I was pleasantly surprised that, aside from some of the Ruby libraries like Facets, I didn't need to change my code at all for it to still work. I suppose that's not terribly surprising; it's a little like writing a program on Windows 2000 and being pleased when it works on Windows XP.

When I installed Ruby on my netbook, I was really surprised when the MUD code basically just worked there too. There were some minor issues with different error codes being thrown from socket functions, but once I caught and handled those the same way as their Linux brethren, I thought I was in the clear.
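As a rough illustration of handling those errors uniformly, here is a minimal sketch. The helper name `read_or_disconnect` and the exact list of exceptions are assumptions for the example, not the MUD's actual code. Which errors each platform really raises will vary.

```ruby
require "socket"

# Exceptions treated as "the peer went away". The exact set seen on
# Windows versus Linux is an assumption; adjust it to what you observe.
DISCONNECT_ERRORS = [
  EOFError,
  Errno::ECONNRESET,
  Errno::ECONNABORTED,
  Errno::EPIPE
].freeze

# Hypothetical helper: returns the data read, "" if nothing is waiting,
# or nil if the connection should be dropped.
def read_or_disconnect(sock)
  sock.read_nonblock(4096)
rescue IO::WaitReadable
  ""   # no data available right now
rescue *DISCONNECT_ERRORS
  nil  # same handling regardless of which platform-specific error fired
end
```

Funnelling every "connection is gone" error into one return value keeps the rest of the code free of platform checks.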

It would be funny if that were the end of the post, wouldn't it?

Recall the features I implemented around coldboots and hotboots. The coldboot command lets a user shut down and start up the MUD. The hotboot command does something much more interesting: it lets the user restart the MUD and pick up any new code changes without kicking off any users. It turns out the latter has issues on Windows.

Hotboot is pretty interesting, and it took me a while to narrow down the problem. This one may actually be a problem in Windows, but I need to get confirmation on that. Here's a very rough outline of what the hotboot code looks like.

First, the portion to save all characters currently connected:

charConnections = Hash.new()
chars.each() { |char|
    charConnections[char] = char.connection.sockConnected.fileno
}

# persist charConnections to a file

# exec() is the magic. it restarts the mud without closing file descriptors

Then the portion to restore it:

hashCharFD = Hash.new()
# restore from file

hashCharFD.each_pair() { |charName, fd|
    sockNew = TCPSocket.for_fd(fd)
    # hook up socket to character
}

A few functions in there deserve highlighting: fileno, exec, and for_fd. Those are the functions that do all the heavy lifting and, not surprisingly, the ones that are implemented differently on different platforms.
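The "persist to a file" and "restore from file" steps in the outline could be filled in along these lines. This is only a sketch: the helper names, the JSON format, and the sample character names are all assumptions, and it keys the map by character name rather than by the character object.

```ruby
require "json"
require "tempfile"

# Hypothetical helpers; the real MUD may store this state differently.
def save_connections(path, fd_by_name)
  File.write(path, JSON.generate(fd_by_name))
end

def load_connections(path)
  JSON.parse(File.read(path))  # keys come back as strings, fds as integers
end

# Example round trip with made-up names and descriptor numbers.
state = Tempfile.new("hotboot")
save_connections(state.path, { "gandalf" => 84, "frodo" => 92 })
restored = load_connections(state.path)
restored.each_pair { |name, fd| puts "#{name} -> fd #{fd}" }
```

Only plain integers are written out, so the file means nothing unless the restarted process still has those descriptors open.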

The first thing I noticed with my simple implementation is that after a hotboot, my character would seem connected, but wouldn't see any output from the MUD, and the MUD wouldn't see any output from it. Somehow the sockets weren't getting hooked up properly.

Compared to when I first wrote this code, I know a little bit more about how OSes work, or at least Windows. I now know that the "magic" I alluded to above is really just allowing inheritable handles. exec seems to be doing it, but it doesn't call into CreateProcess, so just to eliminate variables, I switched the implementation to use spawn instead, and pass :close_others=>false. I wanted to verify that in fact handles were being inherited.
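For comparison, here is a minimal sketch of a spawn-based handoff on a platform where it works as expected, such as Linux. It is not the MUD's code: the one-line child script and the message are made up. It also makes one assumption worth flagging: recent Ruby versions mark descriptors close-on-exec by default, so the sketch maps the descriptor explicitly in the spawn options rather than relying on :close_others=>false alone.

```ruby
require "socket"
require "rbconfig"

# Hypothetical stand-in for the MUD: one listener and one connected player.
server   = TCPServer.new("127.0.0.1", 0)          # port 0 = pick a free port
player   = TCPSocket.new("127.0.0.1", server.addr[1])
accepted = server.accept

# The child rebuilds a socket from the descriptor number passed in ARGV.
child_code = <<~RUBY
  require "socket"
  sock = TCPSocket.for_fd(ARGV[0].to_i)
  sock.write("still connected")
  sock.close
RUBY

fd = accepted.fileno
# Mapping fd => IO in the spawn options keeps that descriptor open in the child.
pid = Process.spawn(RbConfig.ruby, "-e", child_code, fd.to_s, fd => accepted)
accepted.close        # the parent no longer needs its copy
Process.wait(pid)

reply = player.read   # reads until the child's end of the connection closes
player.close
server.close
```

The player's connection never drops here: the child writes to it using nothing but the descriptor number it was handed.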

Of course, I used my favorite debugger, windbg, for everything. Here's a bunch of debugger spew.

Breakpoint 1 hit
0:000> gu
eax=000003c0 ebx=7ffd9000 ecx=01424a20 edx=0000001e esi=00000000 edi=00000000
eip=76466cc8 esp=0022f118 ebp=0022f118 iopl=0         nv up ei pl zr na pe nc
cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000             efl=00000246
76466cc8 5d              pop     ebp
0:000> !handle 3c0 ff
Handle 3c0
  Type          File

OK, so that handle represents the connected socket. I would expect it to also be there in the child process. Interestingly, at this point Ruby's fileno method that I was using above returns something like 0x54, which is totally different than 0x3c0.

Breakpoint 4 hit
0:005> kP1
ChildEBP RetAddr  
0267ee00 100a7a92 kernel32!CreateProcessA(
   int bInheritHandles = 1,

Confirmed handles should be inherited. Now, in the child process:

0:006> !handle 3c0
Handle 3c0
  Type          File

Well, there it is. Why the heck isn't it being connected properly? I started to suspect something in Ruby's implementation that depended on file descriptors rather than Windows HANDLEs.

I used the very good MSDN Winsock samples to help whip together a quick version of the hotboot code in C++ for comparison. My test app calls accept, then launches another process that writes back to the same handle number from the accept call. And not too surprisingly, it just works as I'd expect. So the Ruby implementation is doing something else weird.

Since I don't have source for Ruby's library, it took putting the for_fd call in an infinite loop and breaking in the debugger and examining the stack to figure out what it's doing to convert between the HANDLE and the file descriptor.

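Of course, if I'd known at the start about the existence of _get_osfhandle and _open_osfhandle, I could have spared myself that trouble. _open_osfhandle turns a Windows HANDLE into a C runtime file descriptor, and _get_osfhandle goes the other way. So it seems that the value fileno reports is the descriptor that _open_osfhandle produced for the socket returned from accept, and for_fd does the inverse, calling _get_osfhandle on the fd to turn it back into a HANDLE.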

So I changed my implementation of my C++ app to mimic this, and it still doesn't quite work. I was able to get the child process to write data to the socket, but not read from it. It seems _open_osfhandle on a socket may not work exactly as advertised. This is the part that might be a Windows bug, but I haven't confirmed.

So now what? Well, sadly, I basically give up. Maybe if I find a proper implementation that works on Windows, I could consider contributing to the Ruby source and get this fixed properly. I wonder if anyone else has run into this problem before.


Aparna said...

i have no idea what you're talking about but this reminded me of how we'd yell out "THE END" in the middle of movies at really awkward points
