Burning Down The Computer
Patrick Kelso investigates the smoky world of stress testing new dedicated server hardware at Anchor.
Go on, admit it: you've thought about it yourself. Wouldn't it be satisfying to set your computer alight? Sadly, that is not what this article is about (although keep an eye out for future articles :). "Burning in" is the term for testing new managed server hardware for faults before putting it to use in a live environment. This is done by running stress-testing software for an extended period, in our case about a week.
Whenever we get new server hardware, we run a complete burn-in to ensure that it is up to our high standards. If the hardware fails at any point, we send it back to the supplier. Running the process is easy, although setting it up isn't.
First, when the new server is turned on, we boot it off the network, which lets us boot multiple machines at once without needing 20+ bootable disks. (For information on setting up Linux network booting, check out the Linux Terminal Server Project, linked from the bottom of this page.) The first test we run is called memtest; it thoroughly checks the computer's memory and runs for about a day.
If the server passes memtest, it is restarted and booted into a custom Red Hat kickstart that installs a bare Red Hat environment along with the Cerberus Test Control System, special software that runs numerous tests on all the hardware in the system.
Cerberus performs several tasks to test the CPU. It compiles the Linux kernel over and over again, runs complicated mathematical problems (how long does it take you to work out whether 3,214,235,409,234,472,020,393,848,453 is prime?), and runs some code specifically written to run the CPU at its hottest.
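To give a feel for the kind of arithmetic involved, here is a minimal Python sketch of a Miller-Rabin probable-prime test. It is only an illustration: the function name is_probable_prime and the rounds parameter are our own assumptions, and it is not the code Cerberus actually runs.

```python
import random


def is_probable_prime(n: int, rounds: int = 20) -> bool:
    """Miller-Rabin test: False means composite, True means probably prime."""
    if n < 2:
        return False
    # Cheap trial division by a few small primes first.
    for p in (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37):
        if n == p:
            return True
        if n % p == 0:
            return False
    # Write n - 1 as d * 2**s with d odd.
    d, s = n - 1, 0
    while d % 2 == 0:
        d //= 2
        s += 1
    rng = random.Random(0)  # fixed seed so repeated runs do identical work
    for _ in range(rounds):
        a = rng.randrange(2, n - 1)  # random base in [2, n - 2]
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False  # this base proves n is composite
    return True


if __name__ == "__main__":
    # The number from the article; run it to find out.
    print(is_probable_prime(3_214_235_409_234_472_020_393_848_453))
```

A single call like this finishes almost instantly, so a stress test would presumably loop over many large candidates to keep the CPU busy for hours.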
Cerberus writes large volumes of data to the hard drives repeatedly to ensure that the drive platters are functional. It will also delete and move files, and check the disks for errors.
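The write-then-check idea can be sketched in a few lines of Python. This is an assumption-laden illustration rather than what Cerberus does internally: the function name write_and_verify, the scratch file name burnin.tmp, and the block sizes are all hypothetical.

```python
import hashlib
import os
import tempfile


def write_and_verify(directory: str, blocks: int = 8, block_size: int = 1 << 20) -> bool:
    """Write random data to a scratch file, read it back, and compare hashes."""
    path = os.path.join(directory, "burnin.tmp")  # hypothetical scratch file name
    written = hashlib.sha256()
    try:
        with open(path, "wb") as f:
            for _ in range(blocks):
                chunk = os.urandom(block_size)
                written.update(chunk)
                f.write(chunk)
            f.flush()
            os.fsync(f.fileno())  # ask the OS to push the data out to the disk
        read_back = hashlib.sha256()
        with open(path, "rb") as f:
            while chunk := f.read(block_size):
                read_back.update(chunk)
        return written.digest() == read_back.digest()
    finally:
        # Always clean up the scratch file, even if something went wrong.
        if os.path.exists(path):
            os.remove(path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as scratch:
        print("disk check passed:", write_and_verify(scratch))
```

One caveat: the read-back may be served from the operating system's cache rather than the platters, so a real burn-in would write far more data than the machine has RAM and repeat the cycle many times.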
If, after a week, the server is still running (and not smoking) and hasn't crashed, it is considered good enough for use as a production machine. If it fails a test anywhere along the way, it is packed up and returned to the supplier for replacement. Web servers that have survived this process should certainly survive anything you can throw at them.