Recovering from Windows software RAID failure
Array health states
RAID volumes can be in one of four states:
- Healthy - the default. Everything is A-OK.
- Rebuild - the system is rebuilding the array, and is thus at risk of data loss until the array has completely rebuilt. This will occur when the array is initially built, and while recovering from a failure.
- Failed RAID - redundancy has failed, but the array is recoverable. You will need to replace the failed disk, or reactivate a disk with errors, to rebuild the array.
- Failed - the array has failed. No rebuilding is possible and you have lost all data on the array. Replace the failed hardware and recover the data from backups.
When a RAID error has occurred, you will generally find the machine in one of two situations: a completely failed disk, or a disk with errors. When errors have been detected, the mirror will still be marked as Healthy, but because of the errors Windows will treat the array as At Risk. In either case, you must replace the faulty disk.
Log into the machine and enter the Disk Management console.
Right-click on the failed disk and select Remove Mirror (RAID-1 only).
- Select and remove the failed disk from the array.
Enter the Computer Management console and open the Windows Device Manager. Uninstall the disk that has failed.
- Physically remove the disk and replace it with an equivalent model.
Enter the Disk Management console and Scan for changes. Roll 1d6 for a Detect Disk skill check; you need at least a 3 to pass. If you fail, you'll need to reboot the machine so that it picks up the hardware change.
Right-click and Remove the old (ghost) disk from the Disk Management console if it is still visible. The disk will be marked with a Missing label.
Important: you must zero out the first sector of the new disk once the hardware has been picked up by the operating system. Windows will not install its bootloader onto the disk if it finds old GRUB installations or other random data on there. Do this even for arrays that you are not going to boot from. Grab a copy of Boot Sector Explorer (BSE), and use it to write the zero file (512 bytes of 0x00) to the new disk. Dismiss any Invalid boot sector warnings by clicking the Yes button on the warning dialogues. Still using BSE, verify that the first sector of this disk is completely zeroed out.
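If you don't have a zero file to hand, the following is a minimal Python sketch for producing one and for checking a dumped sector. It only handles ordinary files: writing the zero file to the disk itself is still done with BSE as described above. The filename zero.bin and the function names are made up for illustration.

```python
import os
import tempfile

SECTOR_SIZE = 512  # size of the first sector, per the text above


def write_zero_file(path):
    """Create the zero file: exactly 512 bytes of 0x00."""
    with open(path, "wb") as f:
        f.write(bytes(SECTOR_SIZE))


def is_zeroed(sector):
    """Return True if the buffer is one full sector containing only 0x00."""
    return len(sector) == SECTOR_SIZE and sector.count(0) == SECTOR_SIZE


if __name__ == "__main__":
    # zero.bin is a hypothetical filename; put it wherever suits you.
    path = os.path.join(tempfile.gettempdir(), "zero.bin")
    write_zero_file(path)
    with open(path, "rb") as f:
        print(is_zeroed(f.read()))
```

The same is_zeroed check can be run against a copy of the first sector saved out to a file, as a second opinion on the visual check.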
Click Rescan disks from the Action menu in the Disk Management console to refresh the operating system's cached perspective of your disks. The disk you zeroed out earlier should now be marked as Unknown and Not initialized. By closing and re-opening the Disk Management console, you should be able to trigger a disk initialisation wizard. Dismiss the wizard.
Still using the Disk Management console, right-click on the new disk and select Initialize disk. Convert the disk to a Dynamic Disk.
- Recreate the array if you had removed it earlier (applies to RAID-1). Add the new disk into the array.
(Boot arrays only) Using BSE, verify that the MBR has been correctly initialised with NT bootloader code. You are looking for the strings Invalid partition table and Missing operating system. Compare the MBRs of both disks; there will be a 32-bit disk signature field that differs, but not much else.
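If you would rather not compare the two MBRs by eye, the check can be sketched in Python. This assumes you have saved the first sector of each disk to a file, and that the 32-bit disk signature sits at offset 0x1B8, which is where the standard MBR layout puts it. The function names are made up for illustration. Any offsets it reports need a human look; they are not necessarily errors.

```python
SECTOR_SIZE = 512
SIG_OFFSET = 0x1B8  # 32-bit disk signature in a standard MBR
SIG_LENGTH = 4
EXPECTED_STRINGS = (b"Invalid partition table", b"Missing operating system")


def has_nt_boot_code(mbr):
    """True if both NT bootloader error strings appear in the sector."""
    return all(s in mbr for s in EXPECTED_STRINGS)


def differing_offsets(mbr_a, mbr_b):
    """List byte offsets that differ, ignoring the disk signature field."""
    if len(mbr_a) != SECTOR_SIZE or len(mbr_b) != SECTOR_SIZE:
        raise ValueError("each MBR dump must be exactly 512 bytes")
    signature = range(SIG_OFFSET, SIG_OFFSET + SIG_LENGTH)
    return [
        i for i in range(SECTOR_SIZE)
        if i not in signature and mbr_a[i] != mbr_b[i]
    ]
```

Read each dump with open(path, "rb").read() and pass the two byte strings in. An empty list from differing_offsets means the sectors match apart from the signature.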
Open C:\boot.ini in Notepad and ensure it reads like the following:
[boot loader]
timeout=30
default=multi(0)disk(0)rdisk(0)partition(1)\WINDOWS
[operating systems]
multi(0)disk(0)rdisk(0)partition(1)\WINDOWS="Windows Server 2003 Standard" /fastdetect /NoExecute=OptOut
multi(0)disk(0)rdisk(1)partition(1)\WINDOWS="Windows Server 2003 secondary plex" /fastdetect /NoExecute=OptOut
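As a rough sanity check, the file can also be inspected with a short Python sketch. It assumes the simple layout shown above and only confirms two things: that the default entry is listed under [operating systems], and that there is an entry for both rdisk(0) and rdisk(1). The function names are made up for illustration.

```python
import re

SAMPLE = r"""[boot loader]
timeout=30
default=multi(0)disk(0)rdisk(0)partition(1)\WINDOWS
[operating systems]
multi(0)disk(0)rdisk(0)partition(1)\WINDOWS="Windows Server 2003 Standard" /fastdetect /NoExecute=OptOut
multi(0)disk(0)rdisk(1)partition(1)\WINDOWS="Windows Server 2003 secondary plex" /fastdetect /NoExecute=OptOut
"""


def parse_boot_ini(text):
    """Return (default ARC path, list of ARC paths under [operating systems])."""
    section, default, entries = None, None, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(";"):
            continue
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1].lower()
            continue
        # Split on the first "=" only; the description and switches follow it.
        key, sep, value = line.partition("=")
        if not sep:
            continue
        if section == "boot loader" and key.strip().lower() == "default":
            default = value.strip()
        elif section == "operating systems":
            entries.append(key.strip())
    return default, entries


def covers_both_plexes(text):
    """True if default is a listed entry and rdisk(0) and rdisk(1) both appear."""
    default, entries = parse_boot_ini(text)
    rdisks = set()
    for entry in entries:
        rdisks.update(re.findall(r"rdisk\((\d+)\)", entry))
    return default in entries and {"0", "1"} <= rdisks


if __name__ == "__main__":
    print(covers_both_plexes(SAMPLE))
```

To check the real file, read C:\boot.ini into a string and pass it to covers_both_plexes in place of SAMPLE.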
Wait for the array to rebuild. Server 2003's I/O scheduler isn't as clever as what you get with Linux, so performance will be reduced until you reach full redundancy.