Understanding the boot process on Linux servers

Intro

Hardware failure is a fact of life when working with computers. When your business revolves around keeping servers running smoothly at all times, the ability to diagnose and fix failures isn't optional. At Anchor we take pride in our analytical skills, and use them to make things work.

This information is great when you have time to diagnose a faulty server, but it probably won't help you when a frontline machine falls over. The best thing to do is get everything running on a spare system. You do keep spares on hand, don't you..?

We'll run through the typical bootup sequence of one of our rackmount dedicated servers running Red Hat Enterprise Linux. The exact configuration varies, but these are typically equipped with two CPUs, at least a few gigabytes of RAM and SCSI hard drives in a RAID-1 configuration. We'll discuss the possible failure modes at each step and what can be done about them. The discussion of the early stages of the boot sequence assumes some low-level knowledge of computer hardware architecture.

CPU/Memory

  • Assuming you have a working CPU, it will start up in a "clean slate" state and start executing instructions from the BIOS ROM.
  • All motherboards capable of taking more than one CPU will have numbered sockets. For the system to start, there must be a CPU in the first socket ("CPU 0").

  • The motherboard is hardwired to initialise the first CPU. The other CPUs are brought up by the first CPU, hence the requirement for a working CPU in the first socket.
  • CPUs installed incorrectly (physically) will not be available to the system.
  • CPUs with a poorly-installed heatsink and/or fan will either not work at all or will trip the thermal shutdown sensors. While this can possibly damage the hardware, nowadays most CPUs should attempt to throttle down their speed to prevent this.
  • CPUs of different models will generally not work together in a multi-CPU system. The system may refuse to boot with the non-matching CPU, or will refuse to "start" the extra CPU/s. Check the model number and sSpec for more details, if you're using Intel gear.

  • Memory on SuperMicro servers (and most other brands) must often be placed in specific slots. They can be very picky! Incorrectly pairing the slots or inserting into the wrong ones can cause less than expected (or no) memory to be available to the system. This is due to the complex memory configurations available on server motherboards, allowing nifty features such as sparing (failover memory if failure is detected) and mirroring (like RAID-1 for your RAM).
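
If the machine will boot at all, it's worth confirming what the hardware actually presented to the OS. A quick sanity check from a running system (assuming dmidecode is installed; output formats vary between versions):

    # Count the logical CPUs the kernel brought up
    grep -c '^processor' /proc/cpuinfo

    # Compare installed vs. detected memory; dmidecode reads the SMBIOS tables
    dmidecode --type memory | grep -i size
    free -m

    # CPU model and status details, handy for checking matched pairs
    dmidecode --type processor | grep -iE 'version|status'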

BIOS

Description

  • N.B. Everything in a PC is an address on a bus. Addresses do not overlap. Addresses generally aren't "assigned" so much as "claimed" (this is relevant for the BIOS Boot Specification (BBS), etc).

  • BIOS is a chunk of machine code sitting in a ROM, though nowadays it's flash memory.
  • The start of the BIOS resides on a bus at a fixed memory address. On PCs, this is 0xffff0 (just before the 1MiB mark).

  • It's an architectural detail, but the BIOS sits on the LPC bus, which carries all the simple I/O like keyboard+mouse, serial, parallel, etc. The system is hardwired such that the BIOS is at the right address.
  • The BIOS is raw executable machine code. During bootup it does a number of things:
    • Self-tests on hardware, chipset, etc. This is the POST (Power-On Self-Test)

    • Device enumeration
    • Device initialisation
  • As a program (executable code accessible in "memory"), the BIOS provides two general types of functionality:
    • POST (as above)
    • Runtime services to the OS. This is an API for stuff like device access (ie. Basic I/O). It's sufficient to help the OS boot, then the OS does direct access via its device drivers once they're loaded.
  • The BIOS handles user customisation by keeping settings in the CMOS (the small chunk of battery-backed memory attached to the motherboard).
  • The part we care about most is that it detects bootable devices and boots an OS off an appropriate one.
    • Note: When using hardware RAID devices, make sure you set the bootable flag on the partition holding your /boot file system.
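
As a concrete illustration (the device and partition numbers here are examples only), the flag can be checked and set from a rescue environment with fdisk or parted:

    fdisk -l /dev/sda              # the bootable partition shows a '*' in the Boot column
    parted /dev/sda set 1 boot on  # set the boot flag on partition 1 non-interactively

Interactively, the same thing is fdisk /dev/sda, then 'a' to toggle the flag on the /boot partition, then 'w' to write the table.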

Timeline

  1. Onboard clock generators and CPU come up in a known state. Assume they're running correctly.
  2. CPU jumps to 0xffff0 and starts executing.

    1. This is at the top of the first megabyte of address space, just 16 bytes below the 1MiB mark. There's not enough room here for the whole BIOS, so this is invariably a machine-code jump (JMP) to another memory address where the BIOS code really is.
  3. As soon as convenient, BIOS copies itself to a location in RAM, then fiddles the program counter so execution continues from RAM, as this is faster than hitting the ROM.
  4. BIOS performs system selftests; CPU, DMA, timers, PICs, etc.
  5. Buses are given a hardware reset to initialise them.
  6. Looks for a video BIOS starting at 0xc0000. A valid BIOS starts with the magic number 0xAA55 (stored as the byte sequence 0x55 0xAA due to little-endianness).

    1. If present, initialises the video BIOS. This is when your screen turns on and displays the model no. and logo, etc.
  7. Starts scanning the space from 0xc8000 to 0xdf800 (meaning the BBS area ends at 0xe0000) in a similar manner on 2KiB boundaries, looking for other BIOSes to initialise. This is when SCSI cards and PXE-bootable network cards are discovered. Control passes to the secondary BIOS which can initialise devices, or run interactive frontends (eg. SCSI BIOS, low level format tools). The secondary BIOS registers bootable devices with the main BIOS via the BIOS Boot Specification API. Once each secondary BIOS has run, control is passed back to the main BIOS to continue scanning or booting.

    1. Before each option-BIOS is run, a checksum test is performed on it. If any of these fail, the main BIOS should produce a "helpful" error message on the screen.
  8. The BIOS checks the memory location at 0x472. If 0x1234 is found, this is a warm boot, and further POST tests are skipped.

    1. During POST, status updates are written to "port 80". On some motherboards, this is a 2-digit 7-segment LED display that can tell you what the system is up to, and where failures are occurring.
  9. The BIOS is now ready to think about starting an OS. Based on defaults or preferences stored in the CMOS memory, the BIOS picks a device and attempts to boot it.
  10. To boot a device, the BIOS copies the first sector (a 512-byte chunk) to memory location 0x7c00 (31KiB in memory) and starts executing it. Before considering it a valid boot sector, the BIOS checks that it ends with the standard 0xAA55 signature.

  11. For things like PXE-booting there's more complexity involved, but the important point is that this 512-byte boot sector is an executable chunk of code that can load the next step in getting the system booted. We continue assuming that the chosen bootsector belongs to a hard drive. This bootsector is called the Master Boot Record, and applies to the whole disk.
  12. A standard HDD boot sector has 446 bytes of bootloader code, 64 bytes of DOS-style partition table (four 16-byte entries) and 2 bytes of 0xAA55 signature (see the example after this list).

  13. This bootloader code varies between OS, but its purpose is to find the first bootable partition on the disk and invoke its boot record (a partition has a bootsector as well, correctly termed a Volume Boot Record). For old DOS-type systems, the partition table would have one of the four partitions flagged as "Active". The MBR code would use the data in the partition table to find its location on disk and invoke the VBR.
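
If you want to see this structure for yourself, the MBR is easy to pull apart on a running system. A rough sketch (replace /dev/sda with your actual boot disk):

    # Grab the first 512-byte sector
    dd if=/dev/sda of=/tmp/mbr.bin bs=512 count=1

    # The last two bytes should be the 55 aa signature
    hexdump -C /tmp/mbr.bin | tail -2

    # The 64-byte partition table starts at offset 446
    dd if=/tmp/mbr.bin bs=1 skip=446 count=64 2>/dev/null | hexdump -C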

GRUB

GRUB is a boot loader: the first piece of software loaded by the BIOS when a computer starts. Its role is to transfer control to an operating system kernel. The kernel then in turn initialises the rest of the operating system.

GRUB consists of 3 stages.

Stage1 is stored in the MBR of the physical boot media. The MBR is the first 512 byte sector of the device. This limits what features can be provided by the stage1 GRUB install. Stage1 can load stage1.5 or go directly to stage2 of GRUB.

Stage1.5 is located in the first 30 kilobytes of a device following the MBR. The role of stage1.5 is to load stage2. Stage1.5 exists as a convenience. Stage1 is only smart enough to point to a disk address where stage1.5 can be found. Stage1.5 groks filesystems, so it can find stage2 even if its on-disk location changes. If stage1.5 is not installed, stage1 will have to point directly to stage2 and will need to be kept updated, much like LILO.

When installed, stage1.5 lives in the spare 30KiB of space between the MBR and the start of the first partition. This is possible because old-school fdisk leaves the rest of the first track free (63 x 512-byte sectors per track, one sector used by the MBR).

Stage2 is where the boot loader presents an interface to the user. Typically this is a menu that presents a list of available kernels and boot options saved in the configuration file /boot/grub/grub.conf. Stage2 also provides a command shell for users to make edits to parameters prior to booting the kernel. All commands available in the command shell can be used in the configuration file and vice-versa.
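
For reference, a minimal grub.conf stanza looks something like this (the kernel version and root device are purely illustrative):

    default=0
    timeout=5

    title Red Hat Enterprise Linux (2.6.18-8.el5)
            root (hd0,0)
            kernel /vmlinuz-2.6.18-8.el5 ro root=/dev/VolGroup00/LogVol00
            initrd /initrd-2.6.18-8.el5.img

Note that the kernel and initrd paths are relative to (hd0,0), i.e. this assumes /boot is a separate partition.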

The boot process can be interrupted at any stage by many problems; a summary of these follows:

  • Stage1 Errors
    • GRUB stage1 is contained in the MBR of the boot disk (the first 512 bytes).
      • Typically any errors reported in stage 1 of GRUB's start up will be due to physical media problems
      • Physical media problems could be caused by:
        • Failing boot media (Run smartctl/badblocks on it in another machine if it is a HDD);

        • Bad cabling or poor connection (unlikely but possible);
        • Corruption of the MBR (boot into a recovery OS and reinstall GRUB; see the example after this list).
  • Stage1.5 Errors
    • All of stage1 possible causes;
    • The stage1.5 image has been damaged (some RAID controllers are known to write data to this space);
    • Corruption of the stage2 image;
    • Filesystem corruption.
  • Stage2 Errors
    • Stage2 can be interrupted by any of the possible causes of an earlier stage's failure;
    • Configuration errors in /boot/grub/grub.conf are the primary cause of stage2 start-up failures;

    • Physical boot drives have been moved in the chassis, causing GRUB to reference incorrect drive names;
    • A newly installed kernel package may have incorrectly edited the GRUB configuration file;
    • A complete list of error codes can be found at http://www.gnu.org/software/grub/manual/html_node/Stage2-errors.html

    • Many of the Stage2 errors can be fixed by loading the GRUB command shell, reached by pressing 'c' at the GRUB boot loader.
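
To reinstall GRUB after MBR corruption (as mentioned under the stage1 errors above), the GRUB shell from a rescue boot is usually enough. A rough sketch, assuming /boot lives on the first partition of the first disk:

    grub> find /grub/stage1      # or /boot/grub/stage1 if /boot isn't a separate partition
    grub> root (hd0,0)           # the partition found above
    grub> setup (hd0)            # writes stage1 to the MBR and embeds stage1.5
    grub> quit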

Kernel

Information on the kernel and initrd goes here

Initscripts

Init

Once the kernel has finished loading, it launches init, the first userspace process. Its config file is /etc/inittab. The first thing init does is set the runlevel, which should be 3 on any of our RedHat-based machines. After this it will launch /etc/rc.d/rc.sysinit, which takes care of the higher-level startup functions.
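
The relevant lines of a stock RedHat /etc/inittab look roughly like this:

    # default runlevel
    id:3:initdefault:
    # run once at boot, before entering any runlevel
    si::sysinit:/etc/rc.d/rc.sysinit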

Stuff that can go wrong

  • Deleting inittab or rc.sysinit
  • Screwing up inittab or rc.sysinit

rc.sysinit

This bash script (which you should never touch) does a lot. The list below has been heavily abridged, given the hundreds of boot-time parameters it sets up and the scripts it runs.

  • Sets the hostname, clock and network config
  • Mounts /proc and /sys

  • Probes USB controllers
  • Creates a huge number of functions used in init scripts
  • Checks the state of SElinux and assumes Enforcing mode if undetermined
  • modprobes everything and loads additional modules
  • Sets up LVM
  • Runs an fsck if forced (see the note after this list)
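
On RedHat the usual way to force that fsck on the next boot is a flag file that rc.sysinit looks for (a sketch; the exact behaviour varies between releases):

    touch /forcefsck   # rc.sysinit adds -f to the fsck options if this exists, and cleans it up afterwards
    reboot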

Stuff that can go wrong

  • In EL5 it looks like RedHat sanely uses /bin/bash as the shell instead of the more commonly symlinked /bin/sh. Mucking up the system shells would cause the script to be unusable.

  • There are many files sourced from /etc/sysconfig. Removal/buggering of these files could be anything from trivial (eg. spamassassin) to a datacentre trip (evil low-level stuff).

  • /etc/sysconfig/network-scripts/if* and /etc/sysconfig/network in particular will be boned either by bad cfengine config, by a terrible manual config, or a combination of the two. This is unfortunately only noticed after a reboot in most cases.

  • Additionally loaded modules have a chance to panic the kernel, although you'd presumably see this beforehand.
  • SElinux policies/file permissions are FUBARed and the system is defaulting to Enforcing mode. Matt ran into this when attempting to set up raid10 by copying over files post-install.

rcX.d

The system now calls /etc/rc.d/rc $RUNLEVEL to run whatever scripts are defined. These scripts reside in /etc/rc.d/rc$RUNLEVEL.d/ and are (or should be) symlinks to scripts in /etc/init.d/. They should start with either an S (start-order when entering this runlevel) or a K (reverse kill-order when leaving this runlevel) and have a number to indicate priority. There are far too many application problems to list specifically, since this is just about the scripts. On RedHat/Fedora systems the chkconfig command is used to set up the symlinks automatically.
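
For example (httpd is just a stand-in for any service):

    chkconfig --list httpd               # which runlevels it currently starts in
    chkconfig --level 35 httpd on        # create S symlinks in rc3.d and rc5.d
    ls -l /etc/rc.d/rc3.d/ | grep httpd  # verify the resulting symlink and its priority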

Stuff that can go wrong

  • Due to dodgy chkconfig (or manual symlinking) you're trying to start a service like S09foo-network before a service that it depends on, like S10network, for basic functionality.

  • Bad initscripts that return "OK" before the process is actually loaded, causing dependent services to fail.
  • Poorly-designed or conflicting commands added to rc.local hanging the system.

  • Bad scripts that hang or issue an incorrect exit status. Debian has several of these.
  • Dud symlinks (left behind by poor package removal scripts) pointing at init scripts that no longer exist.
  • Incorrect chkconfig runlevel parameters trying to start services that require an X session when no X server is present at a given runlevel.

  • Incorrect permissions (including SElinux permissions) on the script.

More system commands

This is where the responses to Control-Alt-Delete, a UPS-notified power failure, and power recovery are set.
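
These are plain inittab entries too; the stock RedHat ones look roughly like this:

    ca::ctrlaltdel:/sbin/shutdown -t3 -r now
    pf::powerfail:/sbin/shutdown -f -h +2 "Power Failure; System Shutting Down"
    pr:12345:powerokwait:/sbin/shutdown -c "Power Restored; Shutdown Cancelled"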

Stuff that can go wrong

  • Changing the command used or removing /sbin/shutdown could lead to problems. I don't know why you'd ever do this.

  • Incorrect setup of whatever is monitoring your UPS could lead to lots of fun reboots.

Start terminals and respawn processes

init now spawns six virtual terminals (gettys), assuming you're in runlevel 2-5 and haven't messed with it. These are set to respawn when they die. Depending on config it may also spawn serial terminals. By default, RedHat looks like it respawns xdm in runlevel 5, which none of our servers should be set to use.
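
The corresponding inittab entries look like this (the serial console line is optional, and the baud rate is just an example):

    1:2345:respawn:/sbin/mingetty tty1
    2:2345:respawn:/sbin/mingetty tty2
    # ...and so on through tty6
    6:2345:respawn:/sbin/mingetty tty6
    # optional serial console
    S0:2345:respawn:/sbin/agetty -L 9600 ttyS0 vt100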

Stuff that can go wrong

  • Some systems without a (supported) serial port will constantly respawn the serial getty, which immediately dies. This is likely a cfengine issue on our systems and shouldn't halt the system.
  • Dumb editing of /etc/inittab can potentially respawn a process indefinitely. If this process produces some sort of noticeable load on the system it could make the machine unusable after a short time.

  • If you remove the terminal app (mingetty/agetty) or wholesale copy a config from another server (like Ubuntu, which spawns getty) then you will be left with an unusable machine and a trip to the datacentre.