RAID Hard Disk Replacement Guide

One of the core benefits of RAID is protection against data loss in the event of hard disk failure. When a failure does occur, however, it's important to ensure that replacing the failed drive does not itself risk data loss or cause outages. This article provides our guidelines for safely restoring a degraded RAID array.

Warning about diagnosing failed disks

In the event of ANY sign of hard disk failure, we advise you swap that suspect disk out as soon as possible.

Do NOT attempt to perform any diagnostics or testing of the failed drive. Simply accept that the disk is in a suspect state and arrange to have it replaced as soon as possible.

Diagnostics and testing on a failed drive should only be done once the drive has been removed from your array and server. This is to ensure no accidental data loss happens due to user error.

General Information

All modern hard disks reserve part of the disk for remapping bad sectors. If an error occurs on a sector write, the drive firmware will (if enabled) automatically and transparently remap the sector to a good part of the disk and continue as normal. If there are no free good sectors left for remapping, an error is returned to the disk controller.
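
The remapping behaviour described above can be illustrated with a toy model. This is only a sketch for intuition, not how any real drive firmware is implemented: the `ToyDisk` class, its spare pool layout, and the sector numbers are all hypothetical.

```python
class ToyDisk:
    """Toy model of bad-sector remapping (hypothetical, not real firmware)."""

    def __init__(self, sectors, spares, bad=()):
        self.data = {}            # physical sector -> payload
        self.remap = {}           # logical sector -> spare physical sector
        self.bad = set(bad)       # physical sectors that fail on write
        # Spare sectors live just past the normal sector range (assumption).
        self.free_spares = list(range(sectors, sectors + spares))

    def write(self, logical, payload):
        physical = self.remap.get(logical, logical)
        while physical in self.bad:
            if not self.free_spares:
                # No spare sectors left: the error reaches the controller.
                raise IOError(f"write error on sector {logical}: no spare sectors left")
            physical = self.free_spares.pop(0)
            self.remap[logical] = physical   # transparent to the caller
        self.data[physical] = payload


disk = ToyDisk(sectors=8, spares=1, bad={3, 5})
disk.write(3, b"a")       # silently remapped to the only spare sector
print(disk.remap)         # {3: 8}
try:
    disk.write(5, b"b")   # spare pool is exhausted, so the write fails
except IOError as exc:
    print(exc)
```

The first bad write succeeds from the caller's point of view, which is why a disk can look healthy until its spare sectors run out.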

Sector defect mapping is part of the SCSI and SAS standards. On IDE drives, defect management is handled opaquely by the drive itself, and it doesn't seem possible to retrieve the defect list (unless manufacturer-specific low-level tools exist).

Many errors can be corrected by rewriting the sector or performing a low-level format, which would obviate the need to swap out the hard disk. However, it has been demonstrated that performing diagnostics and testing on a live production server complicates matters and can lead to data loss if the procedure is not followed precisely.

Thus it is our current policy to recommend swapping out the device and performing any testing on another machine.

General Procedure

  1. Verify that there are current and complete backups for the server. If there are no current or complete backups, then get that sorted out first.

  2. Determine which disk failed. It is absolutely imperative that you get this step right, because removing the wrong disk from a degraded array can destroy it. Record these details about the disk:

    • Device name
    • Model
    • Disk serial number
    • Warranty or Asset tag for your server
  3. Determine which RAID arrays the failed disk is a member of (it may be a member of multiple arrays).
  4. Remove the failed disk from every RAID array it is a member of.
  5. Schedule a time with Anchor to remove the failed disk from the chassis. We recommend doing this outside peak usage, as we have often seen a disk swap cause a SCSI bus lockup or controller problem that requires a reboot.

    • We can do the drive swap for you, or you can come on-site and swap the drive yourself.
  6. For software RAID, if the failed hard disk is the first boot drive, you may have complications (especially if a cold swap is required). Ensure all secondary boot loaders are installed and that you have rescue media with you.
  7. Swap out the failed disk with the new disk.
  8. Label the old disk as failed with a Post-it note.
  9. Partition the new disk as required.
  10. Add the new disk to the RAID arrays as applicable and start the re-silvering process.
  11. Repair the boot loader if applicable.
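
As an illustration of steps 3 and 4, the sketch below assumes Linux software RAID (md), which this guide does not require; hardware RAID controllers and other software RAID systems use their own tools. It parses text in the format of `/proc/mdstat` to find every array containing a partition of the failed disk, then builds (but does not run) the matching `mdadm --fail` and `mdadm --remove` command lines. The sample data, device names, and function names are hypothetical.

```python
import re

# Hypothetical sample in /proc/mdstat format; on a real host, read the file.
SAMPLE_MDSTAT = """\
Personalities : [raid1]
md1 : active raid1 sdb2[1] sda2[0](F)
      976630336 blocks super 1.2 [2/1] [_U]

md0 : active raid1 sdb1[1] sda1[0]
      524224 blocks [2/2] [UU]

unused devices: <none>
"""


def arrays_containing(mdstat_text, disk):
    """Return {array: [member partitions of `disk`]} from mdstat-style text."""
    # Match the disk itself or its partitions (sda1, nvme0n1p2), not sdaa1.
    member_of_disk = re.compile(rf"^{re.escape(disk)}p?\d*$")
    result = {}
    for line in mdstat_text.splitlines():
        match = re.match(r"^(md\S*)\s*:\s*(.*)$", line)
        if not match:
            continue
        array, rest = match.groups()
        members = re.findall(r"(\S+?)\[\d+\]", rest)   # e.g. "sda2[0](F)" -> "sda2"
        hits = [m for m in members if member_of_disk.match(m)]
        if hits:
            result[array] = hits
    return result


def removal_commands(arrays):
    """Build command strings only; nothing is executed here."""
    cmds = []
    for array, members in arrays.items():
        for part in members:
            cmds.append(f"mdadm --manage /dev/{array} --fail /dev/{part}")
            cmds.append(f"mdadm --manage /dev/{array} --remove /dev/{part}")
    return cmds


affected = arrays_containing(SAMPLE_MDSTAT, "sda")
for cmd in removal_commands(affected):
    print(cmd)
```

Printing the commands rather than executing them leaves room to check each device name against the details recorded in step 2 before anything is removed from an array.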

Specific Procedures

More specific procedures are provided for certain RAID configurations:
