20110515

I'm getting too old for PCs

Well, let me qualify that a bit: I'm getting too old for going on adventures and screwing around with my PC. I'm starting to realize that I just want my PC to work. I don't want to have to babysit it. I don't want to have to break it open, install cards, fiddle with jumpers on the motherboard, tweak settings in my BIOS, and so on. I just spent a bunch of time and money that I could have used on something else to get my years-old computer limping along again. I made some stupid mistakes, and since I haven't seen appropriate coverage of them online, I'm going to write up what happened, in the hope that someone else will benefit from my wasted time.

I hate IDE

The trouble started about 3-4 months ago, when I noticed one of my hard drives was going slow. How did I notice? Well, when I tried to play video off of the drive, it could barely maintain the framerate, and when I tried to copy files off of it, I found that my CPU was pegged servicing interrupts. It also turned out that my DVD drive wasn't working. It wouldn't play any CDs, and accessing it was very slow - about 15X slower than it should be.

This particular drive is one of two IDE/ATA devices (as opposed to SATA) that are holdovers from my older computers, so from past experience I knew that this was probably an issue with DMA.

IDE drives can operate in basically two different modes: DMA and PIO. In DMA mode, the majority of the work required for reading from or writing to the drive is offloaded to a separate piece of hardware: the DMA controller. This allows the CPU to be basically idle while it waits for disk accesses. In PIO mode, the DMA controller is not used, and the CPU has to take an active, hands-on approach whenever it needs to access the hard drive. Not only is this slower, but it also hogs the CPU's attention.
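
To put rough numbers on that difference, here is a toy model in Python. It is only a sketch under simplifying assumptions, not a description of any real controller: PIO is modeled as one CPU operation per 16-bit word, DMA as one setup step plus one completion interrupt per request. The function names and cost model are invented for illustration.

```python
# Toy model of how often the CPU has to get involved in a disk read.
# Assumptions (illustrative only): PIO moves one 16-bit word per CPU
# operation; DMA costs the CPU one setup step and one completion
# interrupt per request, regardless of the request size.

WORD_BYTES = 2  # the classic ATA data register is 16 bits wide


def pio_cpu_operations(num_bytes: int) -> int:
    """CPU operations needed when the CPU copies every word itself."""
    return -(-num_bytes // WORD_BYTES)  # ceiling division


def dma_cpu_operations(num_requests: int) -> int:
    """CPU operations when the controller moves the data on its own."""
    return 2 * num_requests  # program the transfer + handle one interrupt


if __name__ == "__main__":
    size = 64 * 1024  # one 64 KiB read
    print("PIO:", pio_cpu_operations(size))  # PIO: 32768
    print("DMA:", dma_cpu_operations(1))     # DMA: 2
```

Even in this crude model, the CPU's share of the work grows with the amount of data in PIO mode and stays flat in DMA mode, which matches the pegged CPU described above.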

Sure enough, when I went into my device manager and looked at the advanced settings of my two ATA channels, the devices had slipped somehow into PIO mode. This is weird, since they had definitely been working in DMA mode before.

Again from past experience and research into this problem, I suspected a bad cable. IDE devices are notoriously finicky about how they are wired up and configured, and about the type and quality of cable used. If you're not careful, it's very easy to accidentally use the wrong kind of IDE cable, which will reliably land you in PIO mode. But I had been careful about this before, so it surprised me that it suddenly stopped working.

So, dutifully, I cracked open my computer case, replaced the apparently fine looking IDE cable with a spare I had around, and booted up again. Or, well, I tried to boot up again. This leads me to the next problem.

Avoid dubious motherboards

My motherboard is kind of crappy. It's an Abit AB9 Pro. Sure, the specs on paper are fine, but it has been buggy in some way or another for a while. Now, I don't know for certain if it's the motherboard, but for a long time it's had issues booting up. It'll turn on and off repeatedly. It'll turn on and do nothing. I'll have to power cycle it several times to get a good boot. And of course, this started happening to me real bad while I was fixing my hard drives. I probably power cycled it 20 times in a row before giving up.

So I went to the web site for the motherboard to look for BIOS updates. I found that I was using BIOS version 15 while the latest release was something like 22. Wow! And some of the fixes in the meantime looked like they might address my problems! This looked promising.

Flashing a BIOS is always a pain, because it almost always involves booting into DOS, not any kind of Windows. Even though I have a USB floppy drive and spare floppies that work, the Windows "Create MSDOS boot disk" option put so many files on the floppy that there wasn't enough room for the BIOS update itself! So I had to use a USB thumbdrive and mkbt to format it. Cool, I was ultimately able to flash my BIOS.

Now my crazy booting problems were not as bad - it didn't catastrophically fail to even POST every time (only sometimes). But there was another problem: Windows was crashing during startup every time. So, maybe it is pretty much just as bad.

Diagnosing a blue screen

When Windows is doing something, then suddenly stops and reboots without warning, it's usually a blue screen of death, also known as a "BugCheck", named after the piece of code that executes to handle it. I assumed this was a bugcheck, so I set about diagnosing it as such.

First things first, upon rebooting after the bugcheck, Windows was nice enough during startup to give me the option to run the "Recovery Environment", which did a bunch of analysis for what seemed like forever, then told me--surprise!--that it found a bugcheck.

While in the recovery environment, I was able to open a command prompt and copy the MEMORY.DMP file (which contains the crash dump information - a blue screen of death is just a special type of crash, usually in a driver) to another computer via a thumb drive. There, I opened it in my favorite debugger, windbg.

I'm not necessarily looking for a full set of information to debug the crash; I'm unlikely to know enough of the code here to diagnose it that deeply. I'm looking for which DLL or driver is the likely suspect. So I fix my symbols and get a stack trace:

0: kd> kc
Call Site
nt!KeBugCheckEx
nt!PspUnhandledExceptionInSystemThread
nt! ?? ::NNGAKEGL::`string'
nt!_C_specific_handler
nt!RtlpExecuteHandlerForException
nt!RtlDispatchException
nt!KiDispatchException
nt!KiExceptionDispatch
nt!KiPageFault
dxgkrnl!memmove
dxgkrnl!DxgkCddSetGammaRamp
cdd!PresentWorkerThread
nt!PspSystemThreadStartup
nt!KiStartSystemThread

The likely culprit module is dxgkrnl: its two frames (dxgkrnl!memmove and dxgkrnl!DxgkCddSetGammaRamp) sit directly below the page fault. Everything higher in the stack is code in Windows to handle the crash. The memmove function copies memory from one place to another. Here, dxgkrnl.sys is probably reading or writing memory it doesn't own. That could come from a pretty wide variety of problems.
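
The "find the first frame that isn't Windows handling the crash" step can be sketched in a few lines of Python. This is a heuristic under a stated assumption: every frame in the kernel image nt at the top of the stack is treated as crash-handling code, and the first frame from another module is reported as the suspect. The function name is made up for illustration, and the frame list is copied from the kc output above.

```python
# Find the first stack frame that does not belong to the kernel image "nt".
# Treating "first non-nt module" as the suspect is a heuristic, not a
# diagnosis.

KERNEL_MODULE = "nt"

STACK = """\
Call Site
nt!KeBugCheckEx
nt!PspUnhandledExceptionInSystemThread
nt! ?? ::NNGAKEGL::`string'
nt!_C_specific_handler
nt!RtlpExecuteHandlerForException
nt!RtlDispatchException
nt!KiDispatchException
nt!KiExceptionDispatch
nt!KiPageFault
dxgkrnl!memmove
dxgkrnl!DxgkCddSetGammaRamp
cdd!PresentWorkerThread
nt!PspSystemThreadStartup
nt!KiStartSystemThread
"""


def first_suspect_frame(stack_text):
    """Return (module, symbol) of the topmost frame outside the kernel, or None."""
    for line in stack_text.splitlines():
        if "!" not in line:
            continue  # skip headers such as "Call Site" and blank lines
        module, symbol = line.split("!", 1)
        if module.strip() != KERNEL_MODULE:
            return module.strip(), symbol.strip()
    return None


if __name__ == "__main__":
    print(first_suspect_frame(STACK))  # ('dxgkrnl', 'memmove')
```

This only names a module to go search for. As the rest of this story shows, the module on the stack is not necessarily the thing that is actually broken.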

However, I now have an idea about the module at fault: dxgkrnl.sys! I look it up on the web and find that a crash in it could have almost any cause. Some sites do suggest, however, running a memory test. This I straightaway try to do, using the recovery environment's memory test utility.

It fails! The memory test hangs at 21% every single time! I try swapping out DIMMs of memory, switching their order around, and so on. Still, it fails again and again. I come to the seemingly only logical conclusion: there's something wrong with my motherboard.

Swapping motherboards

I get my buddy to help me pick out a replacement motherboard, and what the heck, I throw in a new power supply too, in case my current 460 watt one isn't enough and was causing some of the boot problems.

Swapping a motherboard is kind of a pain. You have to unplug everything inside the case, then wire it all back up. For this generation of motherboards, it is quite a lot, including tiny audio wires and little jumpers. It took 4-5 hours straight to get everything wired up in there.

The moment of truth! Will it work? I boot up, fiddle with my BIOS settings a bit, and boot Windows.

It crashes. Again. Basically the same way as before. I try the memory test, and it fails at 21% again! I swap around my RAM, to no avail. That was an expensive way to figure out my motherboard wasn't at fault. Wow, now what?

A glimmer of hope

In desperation, I start swapping video cards, unplugging USB devices, and so on. Of course, none of this works. Eventually I'm back in the recovery environment, screwing around with whatever, checking that all my drives are there intact, when I notice something.

I do a dir e:\ (dir lists the files in some folder), and see a Windows directory (so that is my system drive - fine). Then I do a dir f:\ and see another Windows directory. Why are there two copies of Windows on different drives in my computer?!

I get a sinking feeling in my gut. Suddenly, like a bolt from the blue, things become clear. I have been booting the wrong version of Windows this whole time.

Know what you've got in your box

Some months ago, my old netbook died. In an effort to get data off the drive, I pulled out its sole hard drive and plugged it into this computer. When I flashed the BIOS on my old motherboard, it reset all the settings about how the drives in the computer were ordered. As it so happens, this is of the utmost importance.

When a computer needs to boot up, it looks to a special, specific place for something called a "master boot record". This is a small piece of boot code plus a table of the partitions on that drive, which together tell the computer where to find the operating system to start. The MBR is stored in the very first sector of "drive 0"--whatever BIOS thinks is the "first" drive in your computer. If this ordering of drives gets rearranged in some way, then your computer will be looking for the MBR in the wrong place. Not that it is difficult to fix, but in this case, it helped disguise a problem pretty badly. Also note that this only becomes a concern if you have multiple physical hard drives in your computer. I have three. So... yeah, susceptible.
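
For the curious, here is a minimal Python sketch of what is inside that first sector. It relies on the classic MBR layout: 512 bytes, with four 16-byte partition entries starting at byte 446 and the signature bytes 0x55 0xAA at the end. The function names and the synthetic example sector are made up for illustration. Reading the sector off a real disk needs administrator rights and is not shown.

```python
import struct

SECTOR_SIZE = 512
PARTITION_TABLE_OFFSET = 446  # boot code occupies the bytes before this
ENTRY_SIZE = 16
BOOT_SIGNATURE = b"\x55\xaa"  # last two bytes of a valid MBR


def parse_mbr(sector):
    """Return the non-empty partition entries of a classic MBR sector."""
    if len(sector) != SECTOR_SIZE or sector[510:512] != BOOT_SIGNATURE:
        raise ValueError("not a valid MBR sector")
    partitions = []
    for index in range(4):
        start = PARTITION_TABLE_OFFSET + index * ENTRY_SIZE
        entry = sector[start:start + ENTRY_SIZE]
        status, part_type = entry[0], entry[4]
        # Start sector and length are little-endian 32-bit values.
        first_lba, sector_count = struct.unpack_from("<II", entry, 8)
        if part_type == 0:
            continue  # type 0 marks an unused slot
        partitions.append({
            "index": index,
            "active": status == 0x80,  # 0x80 flags the bootable partition
            "type": part_type,
            "first_lba": first_lba,
            "sector_count": sector_count,
        })
    return partitions


def make_example_sector():
    """Build a synthetic MBR with one active partition (values are made up)."""
    sector = bytearray(SECTOR_SIZE)
    entry = bytearray(ENTRY_SIZE)
    entry[0] = 0x80  # active (bootable) flag
    entry[4] = 0x07  # partition type used by NTFS
    struct.pack_into("<II", entry, 8, 2048, 204800)
    sector[PARTITION_TABLE_OFFSET:PARTITION_TABLE_OFFSET + ENTRY_SIZE] = entry
    sector[510:512] = BOOT_SIGNATURE
    return bytes(sector)


if __name__ == "__main__":
    print(parse_mbr(make_example_sector()))
```

Note that nothing in the sector says which computer it belongs to. The firmware simply runs whatever boot code it finds on the drive it considers first.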

So here's how the drives are broken down:

Drive 0: Old netbook hard drive. Contains MBR pointing to E: as Windows.
Drive 1: Data drive. No MBR, no Windows installations.
Drive 2: My comp's Windows drive. Has MBR pointing to F: as Windows.

So this whole time I've been booting Windows (and the Recovery Environment!) from my netbook's drive. Of course it didn't work! My netbook was set up, well, for my netbook's hardware. It didn't properly handle suddenly running in a drastically different hardware environment. This also explains why the memory test wasn't working; it was probably trying to use some settings specific to my netbook that weren't compatible with my real computer.
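
Here is a simplified Python model of why the drive order mattered. It assumes the firmware walks the drives in BIOS order and boots the first one that carries an MBR. The drive names are labels I made up, and the "before the flash" order is hypothetical, since I never recorded what it was; only the "after" order comes from the list above.

```python
# Which drive does the firmware boot from? A simplified model: try drives in
# BIOS order and boot the first one that carries an MBR.

DRIVES = {
    "netbook": {"has_mbr": True, "windows": "E:"},
    "data": {"has_mbr": False, "windows": None},
    "desktop": {"has_mbr": True, "windows": "F:"},
}


def pick_boot_drive(bios_order):
    """Return the name of the first drive in BIOS order that has an MBR."""
    for name in bios_order:
        if DRIVES[name]["has_mbr"]:
            return name
    return None


# Order after the BIOS flash, matching the drive list above.
AFTER_FLASH = ["netbook", "data", "desktop"]
# Hypothetical order before the flash (the real one was never written down).
BEFORE_FLASH = ["desktop", "data", "netbook"]

if __name__ == "__main__":
    print("before:", pick_boot_drive(BEFORE_FLASH))
    print("after: ", pick_boot_drive(AFTER_FLASH))
```

In this model the same three drives boot a different Windows purely because of the order the BIOS lists them in.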

Funny noises? IDE drives AGAIN?

So I took out my netbook's drive, fixed my MBR, and booted into the correct Windows. By the way, the correct version of the Recovery Environment actually fixed this automatically, so nice job there. Hooray! Well, kind of hooray, since there were still problems.

Mainly, neither of my IDE drives (the hard drive and DVD drive that started this whole mess) was found. And I noticed that my CPU was pegged running system interrupts. Finally, the computer was making a funny, high-pitched sound constantly.

At my job, I work on performance problems in programs running on Windows, so I decided to tackle the interrupts problem first. I did the first thing anyone should do when something is taking a lot of CPU time: take an xperf trace.

Once again, I'm not necessarily looking for the root cause, but rather for the faulty module that can hint at what the real problem is. Looking at the Interrupt CPU summary table, I see... our good friend dxgkrnl.sys and some Unknown modules. That's not too helpful. dxgkrnl.sys, as we established earlier, could be indicative of all kinds of things, so I can't really get any information from this.

I happened upon a different trick, though, which actually did point the finger much better. Instead of looking at the Interrupt CPU summary table, I looked at the CPU Sampling By Process summary table, at the Idle process. This is weird, because usually the system idle process is used to account for time where the CPU isn't doing anything. It actually seems like it is a catch-all for CPU time spent outside of any real process (including idle time).

Interesting. This points the finger at intelppm.sys, nvlddmkm.sys, and ataport.sys. I know from past experience or looking them up that these are, respectively, the Intel processor power management driver; the nVidia graphics driver; and the Windows IDE driver.

That last one was particularly interesting to me, since I was also having problems with my hard drives not showing up. Plus, I had already updated my drivers for my motherboard and video card, so I didn't expect those to be the issue.
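
The grouping trick is easy to reproduce on any list of CPU samples. The Python sketch below assumes the samples have already been exported as (process, module, weight) rows; that row format, the function name, and all the numbers are made up purely for illustration and are not taken from my actual trace.

```python
from collections import defaultdict


def cpu_by_module(samples, process="Idle"):
    """Sum sample weights per module for one process, heaviest first."""
    totals = defaultdict(float)
    for sample_process, module, weight in samples:
        if sample_process == process:
            totals[module] += weight
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Made-up rows for illustration only: (process, module, weight).
EXAMPLE_SAMPLES = [
    ("Idle", "ataport.sys", 40.0),
    ("Idle", "intelppm.sys", 25.0),
    ("explorer.exe", "ntdll.dll", 5.0),
    ("Idle", "ataport.sys", 10.0),
    ("Idle", "nvlddmkm.sys", 15.0),
]

if __name__ == "__main__":
    for module, weight in cpu_by_module(EXAMPLE_SAMPLES):
        print(module, weight)
```

The point is the filter-then-group shape: restrict to the Idle process, then rank modules by how much sampled time landed in them.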

Screwing around with IDE drives

Another funny thing about IDE drives: not only does the cable you use to connect them matter, but also the manner in which it's wired up. An IDE cable has three connectors: one for the motherboard, one for the "master" drive, and one for the "slave" drive. You set pins on each hard drive indicating whether it is the master or slave, and your cable has to connect up to match that.

By the way, the whole master/slave thing, as I understand it, is to tell the drives something about what order they're allowed to talk in, since they're sharing the same cable. Or something like that.

Anyway, the way I'd connected my drives was correct for my previous motherboard, but somehow I had to reverse the ordering of the master and slave connectors in the cable for this new one. Why can't they standardize these things? As soon as I fixed that, all my problems evaporated. No high CPU Interrupt time, my drives showed up correctly, and so on.

Even the high-pitched whining stopped after a couple days. Maybe the power supply needed to get broken in or something?

So my computer is working now. It feels like a hollow victory. If I'd been more attentive and had more perseverance, I wouldn't have jumped at buying a new motherboard. But now I know a bit better, and hopefully you do too.
