Pacemaker and Corosync for virtual machines
In the previous post we talked about using Corosync and Pacemaker to create highly available services. Subject to a couple of caveats, this is a good all-round approach.
The caveats are what we’ll deal with today. Sometimes you’re dealing with software that won’t play nice when moved between systems, like a certain Enterprise™ database solution. Sometimes you can’t feasibly decompose an existing system into neat resource tiers to HA-ify it. And sometimes, you just want HA virtual machines! This can be done.
If the solution to our problem is to run everything on a single server, so be it. We then virtualise that server, and make it highly available.
Once again, it’s important to remember that we’re guarding against a physical server going up in smoke. There’s no magic scalability here, and ideally the HA subsystem never actually does anything, except when there’s a major problem.
As per our standard setup, we’re using KVM for virtualisation as it’s been mainlined into the Linux kernel.
A highly available VM is really simple: it comprises just two Pacemaker resources:
- DRBD for replicated storage
- Running of the VM itself
After this, everything else is pretty standard – DRBD needs to start before the VM, and stop after it. The VM itself is managed by a new type of Pacemaker resource, and that’s it. The start/stop/monitor actions in the resource agent script call out to the libvirt library and let it handle the hard work.
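As a sketch of what this looks like in practice, here is a crmsh configuration along those lines. The resource names, the DRBD resource name `vm0`, and the libvirt domain XML path are all hypothetical – substitute your own. The `ocf:linbit:drbd` and `ocf:heartbeat:VirtualDomain` resource agents are the standard ones for this job:

```shell
# Hypothetical example: a DRBD-backed VM called "vm0".
crm configure <<'EOF'
primitive drbd_vm0 ocf:linbit:drbd \
    params drbd_resource="vm0" \
    op monitor interval="29s" role="Master" \
    op monitor interval="31s" role="Slave"
ms ms_drbd_vm0 drbd_vm0 \
    meta master-max="1" master-node-max="1" \
         clone-max="2" clone-node-max="1" notify="true"
primitive vm_vm0 ocf:heartbeat:VirtualDomain \
    params config="/etc/libvirt/qemu/vm0.xml" \
           hypervisor="qemu:///system" \
    op start timeout="120s" \
    op stop timeout="120s" \
    op monitor interval="10s" timeout="30s"
# The VM must run where DRBD is primary, and only after promotion.
colocation vm0_on_drbd inf: vm_vm0 ms_drbd_vm0:Master
order drbd_before_vm0 inf: ms_drbd_vm0:promote vm_vm0:start
EOF
```

Note the colocation and ordering constraints at the end: they encode the “DRBD needs to start before the VM” rule described above.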
How failure is handled
At this point, the VM is an opaque black box. As long as libvirt reports that the VM is running, Pacemaker won’t do anything. This is good because it means you can treat the VM as a normal machine; apply all your usual monitoring for the Enterprise database app, and kick it as usual when it breaks. A BSoD or kernel panic is nothing special either: the VM is still running.
The failure case that we do care about is if one of the KVM hosts stops working. If the VM monitor action times out or the host stops responding, the standby node in the clustered pair will notice, possibly STONITH the bad node, and take over the running of your VM.
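For STONITH to be possible, each node needs a fencing resource that can power-cycle its peer. A hedged sketch using the `external/ipmi` STONITH plugin follows – the hostnames, IPMI address, and credentials are all made up, and your hardware may want a different fencing agent entirely:

```shell
# Hypothetical fencing device for node1, controlled over IPMI.
crm configure primitive fence_node1 stonith:external/ipmi \
    params hostname="node1" ipaddr="192.0.2.10" \
           userid="admin" passwd="secret" \
    op monitor interval="60s"
# A node should never be responsible for fencing itself.
crm configure location fence_node1_not_on_node1 fence_node1 -inf: node1
```

The location constraint matters: if node1 is the one misbehaving, it can’t be trusted to shoot itself.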
It’s important to know what this means for your VM: Pacemaker will attempt to cleanly shut down the VM, then yank the virtual power cord if that fails. This means that when the VM comes up on the standby node it will have to deal with an unclean shutdown, which can take a long time if a fsck/chkdsk is needed. HA cannot help you in this scenario!
Things to note
Pacemaker adds an extra layer of fun if you forget (or don’t know) that it’s keeping an eye on things: it’ll keep restarting a VM that you’re trying to shut down, unless you tell it to stop managing it. This doesn’t happen on an ordinary “services deployment” because Pacemaker will hand off resources when it shuts down. Watching an unwitting sysadmin deal with this is like playing with a roly-poly toy. 🙂
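The fix is to put the resource into unmanaged mode before you touch the VM (the resource name `vm_vm0` here is hypothetical). These are standard crmsh commands:

```shell
# Stop Pacemaker from acting on this one resource...
crm resource unmanage vm_vm0
# ...shut the VM down and do your maintenance...
# ...then hand control back:
crm resource manage vm_vm0

# Alternatively, put the whole cluster into maintenance mode:
crm configure property maintenance-mode=true
```

Cluster-wide maintenance mode is the blunter instrument; unmanaging a single resource is usually what you want for poking at one VM.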
Summary and evaluation
While not without limitations, HA for whole VMs can be very convenient. At its extreme, it allows you to offer high availability for servers that you don’t even have login access for.
One catch is that it can be expensive – each KVM host needs enough RAM and disk space to run all of the VMs in the event of a failure. If you have many VMs on an HA pair there’s a lot of unused computing capacity, which carries a large upfront capital cost.
The Linux HA suite offers robust solutions, but doesn’t always ensure the best utilisation of resources. Next time we’ll start talking about high-availability through load-balancing and redundancy, which can be a very nice way to get the scalability and availability that you need if you’re willing to make substantial changes to your application architecture.