It’s no secret that we’d rather be working on Linux than Windows here at Anchor. Not only is it much more annoying to actually get anything done on Windows, it also just breaks in opaque and unexplained ways. O Windowes, let me count the ways in which you are broken! Here is one such problem we ran into yesterday.
Hard drive failure is a fact of life when you run servers, by sheer virtue of the fact that you have hundreds of them. To mitigate the risk and reduce unscheduled downtime, we use Windows’ built-in software RAID feature. It’s not an enterprise solution, but it gets the job done. What’s important is staying online and not losing data.
Did I mention that trying to monitor a Windows box is a nightmare? A colleague of mine wrote a script that lets us keep a watchful eye on Windows RAID volumes, and it’s a lifesaver. A recently deployed machine developed a broken mirror, which we were able to act on immediately. We removed the dodgy mirror and prepared a replacement (we always have plenty of spares, of course). Allow me now to re-enact this scene…
Windows (sounding almost efficient): The driver has detected that device \Device\Harddisk1\DR9 has predicted that it will fail
Sysadmin: Thanks, Windows, I’ll get right on that. You didn’t say whether that was SMART, or just voodoo, but whatever, it’s good to know.
The bad drive is removed and a replacement installed in the hotswap drive bay
Sysadmin: Okay, Windows, do your stuff. “Scan for new hardware”, please.
Sysadmin: Ahem, Windows, “Scan for new hardware” and find my drive.
Windows: ‘Ey there, chaps. Do what now, you say? AIEEEEGRH!!
The server stops responding entirely, necessitating a touch of the reset button
Needless to say, we’re rather unimpressed, and have to call the customer to let them know why their server has just dropped offline.
A quick check of the logs is in order. It’s also frustrating that there’s no sane way to scroll through log entries in Windows with something like a text editor, or to “tail” a log as it’s updated in real time.
09:36 – The previous system shutdown at 9:21:23 AM on 8/10/2008 was unexpected.
Okay, it went down at 09:21, which is correct. Now if we look back in time a little…
09:21 – dmio: Harddisk1 write error at block 1953524618 due to disk removal
*sigh* And this is after the disk was removed cleanly…