What’s the big idea with: Plug and play hypervisors?

September 20, 2012 Technical, General

Here at the Anchor Internet Laboratory we’ve been discussing ideas for new deployments of our VPS infrastructure. One that we’re excited about is the idea of “plug and play” hardware.

Plug and play what?

Deploying more hardware capacity takes time. New servers need to be burn-in tested, tracked in the asset management system, installed, configured, and integrated with other systems. None of it is difficult; it just takes time. We’ve got pretty much fully automated provisioning of new VPSes, but the hypervisors that run them need hands-on love.

We think we can make this a lot better.

We’ve been looking at Ceph for shared storage. The key benefit of shared storage for VPS infrastructure is that it decouples the running of VMs (on Compute nodes) from their disks (on Storage nodes).

This would allow us to scale CPU and RAM capacity separately from I/O capacity. That’s a big win if you can make it practical and easy to do so.

Who needs disks?

Focusing purely on the Compute nodes, it was quickly apparent to us that a hypervisor doesn’t really do much with disks. Once the OS is booted you start running your VMs. If the virtual disks are somewhere else, like on shared storage, all that’s left is the VM configs, some state, and logs.

So, we said, we could netboot the hypervisors. Putting your root filesystem on NFS is a long-standing tradition when deploying thin clients. We reasoned that if you store your VM configs elsewhere and ship your logs to a centralised syslog server, your hypervisor basically becomes a thin client.

Who needs an identity?

We thought we could go one better though: the hypervisors could also be anonymous. Hypervisors provide CPU and RAM to run virtual machines, but we don’t really care about who they are. So long as they have enough resources to spare, any hypervisor will do when a VM needs to be started.
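As a rough sketch of what “any hypervisor will do” means for placement, the Python below picks the first node with enough spare CPU and RAM. The Hypervisor class, its fields, and pick_hypervisor are hypothetical names made up for illustration; they are not part of any existing system.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Hypervisor:
    ident: str        # bookkeeping only (e.g. a MAC address); never used for placement
    free_cpus: int
    free_ram_mb: int


def pick_hypervisor(pool: Iterable[Hypervisor], cpus: int, ram_mb: int) -> Optional[Hypervisor]:
    """Return the first hypervisor with enough spare capacity, or None if none fits."""
    for hv in pool:
        if hv.free_cpus >= cpus and hv.free_ram_mb >= ram_mb:
            return hv
    return None


pool = [
    Hypervisor("hv-a", free_cpus=2, free_ram_mb=4096),
    Hypervisor("hv-b", free_cpus=8, free_ram_mb=32768),
]
chosen = pick_hypervisor(pool, cpus=4, ram_mb=8192)
```

Note that the decision looks only at spare resources. The identifier is carried along so the caller knows where the VM ended up, nothing more.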

An anonymous hypervisor has no distinguishing features, which means all anonymous hypervisors are identical. If they’re all identical it means they can boot the same operating system image. Not having to maintain an operating system on every single hypervisor, of which there are many, is a massive win.

The plan

There’s no doubt there’ll be a fair bit of work involved, but if it’s done right, provisioning a new VPS hypervisor becomes a 4-step procedure:

  1. Receive new hardware, add it to asset-tracking, label it, rack it up, etc.
  2. Add a unique identifier to the hypervisor database (e.g. MAC address or Dell service tag)
  3. Power up the server; it PXE-boots and downloads a hypervisor OS image
  4. The OS configures itself, gets on the network, and reports in to the cluster’s command-and-control system, ready to start running VMs
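Steps 2 and 4 above could look something like the following on the command-and-control side. This is a minimal in-memory sketch under our own assumptions: HypervisorRegistry, add_identifier, and check_in are hypothetical names, and a real system would use a proper database and an authenticated network protocol.

```python
class HypervisorRegistry:
    """Toy stand-in for the hypervisor database and the check-in endpoint."""

    def __init__(self):
        self._known = set()   # identifiers entered by a human in step 2
        self._ready = {}      # identifier -> resources reported in step 4

    @staticmethod
    def _normalise(ident):
        # Assumption: identifiers compare case-insensitively (MACs, service tags).
        return ident.strip().lower()

    def add_identifier(self, ident):
        """Step 2: record a new machine's MAC address or service tag."""
        self._known.add(self._normalise(ident))

    def check_in(self, ident, cpus, ram_mb):
        """Step 4: a freshly booted hypervisor reports in. Unknown machines are refused."""
        key = self._normalise(ident)
        if key not in self._known:
            return False
        self._ready[key] = {"cpus": cpus, "ram_mb": ram_mb}
        return True

    def ready(self):
        """Identifiers of hypervisors that have reported in, sorted for stable output."""
        return sorted(self._ready)


registry = HypervisorRegistry()
registry.add_identifier("00:1A:2B:3C:4D:5E")
accepted = registry.check_in("00:1a:2b:3c:4d:5e", cpus=16, ram_mb=65536)
rejected = registry.check_in("de:ad:be:ef:00:01", cpus=16, ram_mb=65536)
```

The identifier is only a gatekeeper here: it decides whether a machine is allowed to join, and after that the cluster cares about its reported resources, not who it is.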

This level of dynamism has some great benefits. Obviously, new hardware can be brought online quickly and with minimal human intervention. And because everything is centralised in the OS image, config changes and package updates become the same basic operation: update the image, live-migrate the running VMs elsewhere, and reboot the hypervisor.
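To make the migrate-then-reboot cycle concrete, here is a small planning sketch in Python. It only produces an ordered list of actions; it does not talk to any real hypervisor. The function name, the action tuples, and the “move to the least-loaded other node” rule are all our own simplifying assumptions, and it ignores CPU and RAM capacity entirely.

```python
def plan_rolling_reboot(placement):
    """Plan a rolling reboot.

    placement maps hypervisor name -> list of VM names currently running there.
    Returns a list of ("migrate", vm, src, dst) and ("reboot", hv) tuples such
    that every hypervisor is empty at the moment it is rebooted.
    """
    state = {hv: list(vms) for hv, vms in placement.items()}  # work on a copy
    actions = []
    for hv in sorted(state):
        others = [name for name in sorted(state) if name != hv]
        for vm in list(state[hv]):
            if not others:
                raise ValueError("nowhere to migrate %s from %s" % (vm, hv))
            # Assumption: pick the other hypervisor hosting the fewest VMs.
            dst = min(others, key=lambda name: (len(state[name]), name))
            state[hv].remove(vm)
            state[dst].append(vm)
            actions.append(("migrate", vm, hv, dst))
        actions.append(("reboot", hv))
    return actions


actions = plan_rolling_reboot({"hv-a": ["vm1", "vm2"], "hv-b": ["vm3"]})
```

With two hypervisors, the plan drains hv-a onto hv-b, reboots hv-a, then drains everything back the other way before rebooting hv-b.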

Virtual appliances have been a “thing” since virtualisation really took off, so it amuses us that we could soon be running appliance VMs on appliance hypervisors. 🙂