Live Server Hard Disc Drive Upgrade Process

This document describes the procedure for changing the hard discs in a live dedicated server to increase total system capacity, for servers utilising RAID 1.

Also see the Ext3 to LVM conversion process.

Prerequisites

Commands needed:

  • mdadm
  • rsync
  • sshd
  • pivot_root
  • chroot

Procedure

  1. Obtain new hard discs of the appropriate type, capacity and speed. When calculating the capacity needed, take into account existing hard drive space usage (user data, OS software, and swap space).
  2. Assign an Asset ID to the new discs. Label each disc with 'AID' and the asset ID on top of the drive, in texter (permanent marker) only.
  3. HOT SWAP DISCS ONLY: Find spare drive cages that match the dedicated server that the new discs are for. Place new discs into the drive cages.
  4. Format and check new discs: On a spare managed server:
    1. Place the new discs in the server and power it on.
    2. SCSI discs:
      1. For Adaptec SCSI BIOS, press Control-A on bootup to go into SCSI BIOS.
      2. For each new drive:
        • Perform low-level drive format.
        • Perform surface scan.
    3. IDE discs:
      1. For each drive: badblocks -vw /dev/hdX (NB: -w is a destructive write-mode test; only use it on discs carrying no data)
    If any disc fails, get it replaced.
  5. Create emergency floppy boot disc for dedicated server: Refer to Bootdisc_creation procedure. Set root password to something.
  6. Contact client at least 24 hours prior to altering hardware configuration and do not perform any work until work is approved. Work MUST be done outside business hours.
  7. Start a temporary text file on your workstation. Copy current partitioning layout, MD device layout, and mount points.
  8. SWAP on RAID 1 is not entirely stable. First check whether there is
     sufficient free RAM on the machine, eg by running free:

                 total       used       free     shared    buffers     cached
    Mem:        523864     520752       3112          0      59788     395192
    -/+ buffers/cache:      65772     458092
    Swap:      1991928       1088    1990840
    In this example there is 458092 KiB of memory that can be used if there is no caching/buffers. There is 1088 KiB of swap currently being used. The amount of free memory MUST be at least twice the amount of swap being used. If there is sufficient free memory then swap must be disabled to ensure machine stability during the disc enlargement:
            swapoff -a                  (hot-swap SCSI only)
            vi /etc/fstab               (comment out the swap entries)
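The free-memory check above can be scripted. This is a sketch only: it reads /proc/meminfo (whose field names are stable across kernels) rather than parsing `free` output, and applies this procedure's 2x rule; the variable names are ours.

```shell
# Sketch: is it safe to disable swap, per the 2x rule above?
# Available memory here = MemFree + Buffers + Cached, matching the
# "-/+ buffers/cache" figure from free.
swap_used_kib=$(awk '/^SwapTotal:/ {t=$2} /^SwapFree:/ {f=$2} END {print t-f}' /proc/meminfo)
mem_avail_kib=$(awk '/^MemFree:/ {m+=$2} /^Buffers:/ {m+=$2} /^Cached:/ {m+=$2} END {print m}' /proc/meminfo)
if [ "$mem_avail_kib" -ge $((2 * swap_used_kib)) ]; then
    echo "safe: ${mem_avail_kib} KiB available >= 2 x ${swap_used_kib} KiB swap in use"
else
    echo "NOT safe to disable swap yet"
fi
```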
  9. KILL smartd . KILL IT AGAIN. If smartd is running when you try to remove the scsi device from the kernel, nothing will happen, everything will think you still have the old device in, and you'll think you've swapped disks on the wrong machine.
  10. Replace one hard disc, starting with the last hard disc (eg hdb, SCSI ID 1). First ensure that the disc that will remain in the array is in a good state (read-only test): badblocks -vv /dev/X. Hot swap SCSI only:
    1. Set RAID members on second disc as failed and remove from array.
                      DRIVE=sdb
                      MD_DEVICES=`awk '/^md[0-9]+/ { print $1 }' /proc/mdstat`
                      for md_device in $MD_DEVICES
                      do
                              partition=`grep ^$md_device /proc/mdstat | sed -e "s/.*\(${DRIVE}[0-9]\+\).*/\1/"`
                              mdadm --fail /dev/$md_device /dev/$partition
                              mdadm --remove /dev/$md_device /dev/$partition
                      done
    2. Confirm that drive is not being used:

                                      grep $DRIVE /proc/mdstat
                                      mount | grep $DRIVE
    3. Find out SCSI controller, channel, device, and logical unit number (LUN) via kernel boot up messages (/var/log/dmesg) and /proc/scsi/scsi
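The controller/channel/id/lun values can be pulled out of /proc/scsi/scsi with a small awk program. A sketch, with a sample line embedded for illustration; on the live machine, feed it the real file and pick the entry matching the disc you just failed out of the array (the helper name is ours).

```shell
# Sketch: turn a /proc/scsi/scsi "Host:" line into "controller channel id lun".
parse_scsi_line() {
    awk '/^Host:/ {gsub("scsi", "", $2); print $2, $4, $6, $8}'
}
echo "Host: scsi0 Channel: 00 Id: 01 Lun: 00" | parse_scsi_line
# prints: 0 00 01 00
```

The four fields can then be read straight into the $controller $channel $device $lun variables used in the next sub-step.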
    4. Tell Linux to remove the SCSI disc from the kernel's list of drives on the bus

                                      blockdev --flushbufs /dev/X
                                      echo "scsi remove-single-device $controller $channel $device $lun" >/proc/scsi/scsi
    5. Confirm that drive is removed from list of devices

                                      cat /proc/scsi/scsi
    6. Physically remove disc now. Label old drive. Place old hard disc in anti-static bag and place in equipment storage area for wiping/testing queue.
    7. Place new hard disc in machine.
    8. KILL smartd . KILL IT AGAIN. If smartd is running when you try to remove the scsi device from the kernel, nothing will happen, everything will think you still have the old device in, and you'll think you've swapped disks on the wrong machine.
    9. Tell Linux that disc is attached to SCSI bus now.

                                      echo "scsi add-single-device $controller $channel $device $lun" >/proc/scsi/scsi
      (this will take a while (1-2 minutes) to return)
    10. Confirm that drive is now in list of devices

                                      cat /proc/scsi/scsi
    IDE/non-hot-swappable SCSI: Power off the machine. Open it up and replace the old hard disc with the new disc. Place the old hard disc in an anti-static bag. Power on the machine.
  11. Partition the new disc to the customer's needs; consult the partitioning procedures. fdisk /dev/whatever (NB: remember to use partition type fd for RAID autodetect). Verify with cat /proc/partitions.
  12. Create failed RAID array from new partitions
    1. Check /proc/mdstat and find available MD device minor number
    2. Create new MD device with a failed member:

                                      mdadm --create --level=raid1 --raid-devices=NUMBER_OF_DISCS \
                                              /dev/md${FREE_MINOR} $partition [$other_partition..] missing
      NUMBER_OF_DISCS is what final value will be (typically 2).
    3. Confirm with /proc/mdstat that new MD device is active
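The "find an available minor number" search in sub-step 1 can be sketched as a small helper (the function name is ours; pass it /proc/mdstat on the live machine):

```shell
# Sketch: print the lowest md minor number not already in use,
# judging by the mdN lines in an mdstat-style file.
find_free_minor() {
    awk '/^md[0-9]+/ {n = $1; sub(/^md/, "", n); used[n] = 1}
         END {for (i = 0; i in used; i++) ; print i}' "$1"
}
FREE_MINOR=$(find_free_minor /proc/mdstat)
```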
  13. Make new filesystems/swap

                    for md_device in $new_md_devices
                    do
                            mke2fs -j -L $label /dev/$md_device
                            # or mkswap /dev/whatever
                    done
    NB: Ensure the new labels differ from the old ones if labels are used in /etc/fstab or /boot/grub/grub.conf. Take note of the new labels.
  14. Mount new MD partitions
    1. mkdir /newroot
    2. Mount new root:

                                      mount /dev/md${MINOR_OF_NEWROOT} /newroot
    3. Mount all other points:

                                      cd /newroot
                                      for mount_point in $new_mount_points
                                      do
                                              mkdir /newroot/$mount_point
                                              # Check permissions are correct on mount point! (eg chmod 1777 tmp)
                                              mount /dev/md${WHATEVER} /newroot/$mount_point
                                      done
    NB: Be sure to include any bind mounts/other virtual fs's.
  15. Copy data across

                    cd /
                    for mount_point in $old_mount_points
                    do
                            cp -ax $mount_point /newroot/
                    done
  16. Shutdown all services except SSH

                    echo "Doing drive upgrade" > /etc/nologin

    Manually shutdown all services except SSH via:

                            ps aux and service whatever stop

    check with:

                            netstat -ln
    and
                            ps aux
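To double-check that nothing but SSH is still listening, the netstat output can be filtered. A sketch, assuming sshd is on port 22 (the helper name is ours); pipe `netstat -lnt` through it and expect no output:

```shell
# Sketch: print listening TCP sockets other than sshd's.
# Skips the two netstat header lines; column 4 is the local address.
non_ssh_listeners() {
    awk 'NR > 2 && $4 !~ /:22$/ {print $4}'
}
netstat -lnt | non_ssh_listeners
```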
  17. rsync all data

                    for mount_point in $old_mount_points
                    do
                            rsync -avnxX --delete $mount_point/ /newroot/$mount_point/
                            # Check that it does what you think it is doing
                            rsync -avxX --delete $mount_point/ /newroot/$mount_point/
                    done
  18. pivot_root to new root:

                    mkdir /newroot/oldroot
                    cd /newroot
                    pivot_root . oldroot
                    exec chroot . /bin/bash <dev/console >dev/console 2>&1
    If pivot_root is not available, you will need to set up the boot loader and reboot instead.
  19. Removal of processes in old root

                    telinit u    (on a 2.2.x kernel, make init exec chroot instead)
                    service sshd restart
                    login via SSH and logout old window
    
                    # Obsolete echo $hex_value_of_new_root_dev > /proc/sys/kernel/real-root-dev
  20. unmount old root

                    cd /oldroot
                    # To find any strays. There will probably be a whole bunch of kernel threads.
                    fuser -mv /oldroot
                    cat /proc/mounts    # for info
    
                    mdadm -S /dev/$old_md_devices
    
                    mount /dev/whatever /oldroot -o remount,ro  # if /oldroot cannot be unmounted
  21. Fix /etc/fstab to match the new devices being used. Fix /etc/raidtab likewise.
  22. Fix /etc/mtab to show the new devices as being mounted (needed to fool mkinitrd).
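The device renames in /etc/fstab and /etc/mtab can be done with sed. A sketch only: the helper name is ours, the old/new devices must come from the temporary text file made in step 7, and it assumes GNU sed and a space after the device field (adjust the pattern for tabs).

```shell
# Sketch: replace one device name with another at the start of a line,
# keeping a .bak copy of the file. Usage: rename_dev OLD NEW FILE
rename_dev() {
    sed -i.bak "s|^$1\([[:space:]]\)|$2\1|" "$3"
}
# eg: rename_dev /dev/md0 /dev/md4 /etc/fstab
#     rename_dev /dev/md0 /dev/md4 /etc/mtab
```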
  23. Swap to the new drive. If you were able to unmount /oldroot, remove the old drive using the hot-swap procedure from step 10; otherwise you will need to fix the boot loader and reboot in order to remove the drive.
  24. boot loader
    1. If grub being used

                                      cd /boot/grub
                                      vi device.map
                                      vi grub.conf
      
                                      $ grub
                                      grub> device (hd0) /dev/X       # Set first BIOS drive to Linux device /dev/X. SCSI BIOS assigns order based on SCSI ID.
                                                                      # Not sure if there is a standard for IDE BIOS.
                                      grub> root (hd0,0)              # /boot is on first partition of first BIOS disk.
                                      grub> setup (hd0)               # Install the boot loader into MBR of first BIOS disk.
                                      grub> quit
      
                                      check boot loader is installed:
      
                                      dd if=/dev/X bs=512 count=1 | strings # this will have GRUB in the output
    2. If LILO is being used: grrrr. Much pain. You will need to play with the root, boot, and device/bios directives. Check the boot loader is installed:
                                      dd if=/dev/X count=1 | strings
      (this will have LILO in the output)
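The dd check above can be wrapped in a reusable test. A sketch (the helper name is ours; grep -a stands in for the strings|grep pipeline so it works with binary input):

```shell
# Sketch: succeed if the given signature appears in the first 512-byte
# sector of a device or image. Usage: check_boot_sig DEVICE GRUB|LILO
check_boot_sig() {
    dd if="$1" bs=512 count=1 2>/dev/null | grep -aq "$2"
}
# eg: check_boot_sig /dev/sda GRUB && echo "boot loader present"
```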
  25. mkinitrd. Modern versions of RHL should have no need to modify the initrd. Old versions of mkinitrd fail with 'All of your loopback devices are in use!' when /tmp is on tmpfs; in that case run the script from the kernel package's post-install.
  26. copy partition table:

                    sfdisk -d /dev/sdb | sfdisk /dev/sda

    hotadd new partitions

                    mdadm -a /dev/mdWHATEVER /dev/sdWHATEVER
                    echo 10000000 > /proc/sys/dev/raid/speed_limit_max
    Boot loader again: NB wait until the /boot partition's RAID has finished syncing. Then:
      • mkinitrd
      • reboot for a final check
      • fix up the boot disc
      • rm /etc/nologin
      • rmdir /oldroot /newroot
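Waiting for the resync can be automated by polling /proc/mdstat. A sketch (the helper names are ours; the keywords match what /proc/mdstat prints during a rebuild):

```shell
# Sketch: md_busy succeeds while any array in an mdstat-style file is
# resyncing or recovering; wait_for_sync blocks until all are clean.
md_busy() {
    grep -Eq 'resync|recovery' "$1"
}
wait_for_sync() {
    while md_busy /proc/mdstat; do
        sleep 30
    done
}
```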
  27. If kernel upgrades necessary, upgrade, as per instructions in procedures/kernel-upgrade-checklist.
  28. If the partition layout has changed, and the machine is being backed up to Ark, correct the disklist to reflect the new partitioning layout for this machine.