Pacemaker and Corosync for HA services

Now that we’ve got our terminology sorted out, we can talk about real deployments. Our most common HA deployments use the Linux HA suite, with multiple services managed by pacemaker. This is roughly the “stack” that we referred to in the first post in the series.

We’ve already covered the resources involved, so we’ll focus on the important bit:
What happens when something goes wrong?

Normal operation

Recall that on our hypothetical HA database server, we’ve got the following managed resources (a configuration sketch follows the list):

  • DRBD storage
  • The filesystem
  • Floating IP address for the service
  • The DB service itself
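
In crm shell syntax, that stack looks roughly like the sketch below. The resource names, device paths and addresses are made up for illustration, and we’re pretending the database is PostgreSQL; your configuration will differ:

    # DRBD device, promoted to Primary on exactly one node
    primitive p_drbd ocf:linbit:drbd \
        params drbd_resource=r0 \
        op monitor interval=20s role=Slave \
        op monitor interval=10s role=Master
    ms ms_drbd p_drbd \
        meta master-max=1 clone-max=2 notify=true

    # Filesystem on top of the DRBD device
    primitive p_fs ocf:heartbeat:Filesystem \
        params device=/dev/drbd0 directory=/srv/db fstype=ext4 \
        op monitor interval=20s

    # Floating IP address for the service
    primitive p_ip ocf:heartbeat:IPaddr2 \
        params ip=192.0.2.10 cidr_netmask=24 \
        op monitor interval=20s

    # The database itself
    primitive p_db ocf:heartbeat:pgsql \
        op monitor interval=20s

    # Keep the group on whichever node holds the DRBD Primary role,
    # and start things in the right order
    group g_db p_fs p_ip p_db
    colocation col_db_on_drbd inf: g_db ms_drbd:Master
    order ord_drbd_before_db inf: ms_drbd:promote g_db:start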

Each resource has its own monitor action, provided by its Resource Agent (RA). Roughly speaking, an RA is a script that implements a common interface between pacemaker and the resources it can manage. It looks a lot like an initscript, but with a more rigorously defined interface.
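
For a feel of what that interface looks like, here’s a deliberately minimal sketch of an RA in shell. Real agents from the resource-agents package also implement meta-data and validate-all actions, source the OCF shell function library instead of defining exit codes by hand, and do far more careful checking:

    #!/bin/sh
    # Minimal, heavily simplified resource agent sketch.
    # These exit codes normally come from the OCF shell function library.
    OCF_SUCCESS=0
    OCF_ERR_GENERIC=1
    OCF_NOT_RUNNING=7

    case "$1" in
        start)
            # start the daemon here, and only report success
            # once it is actually up
            exit $OCF_SUCCESS
            ;;
        stop)
            # stop the daemon here; a failure to stop is critical
            exit $OCF_SUCCESS
            ;;
        monitor)
            # check the daemon here: OCF_SUCCESS if it is healthy,
            # OCF_NOT_RUNNING if it is cleanly stopped,
            # anything else on failure
            exit $OCF_NOT_RUNNING
            ;;
        *)
            exit $OCF_ERR_GENERIC
            ;;
    esac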

The monitor action is straightforward: pacemaker runs it regularly (20 seconds is a typical interval), and it either reports that the resource is running fine, reports that it isn’t running, or times out. So long as pacemaker keeps hearing good news, nothing exciting happens.
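
A handy property of the RA interface is that you can run the same monitor action by hand that pacemaker runs on its timer, and inspect the exit code yourself. On a typical install the agents live under /usr/lib/ocf; the path, agent and address below are just examples:

    # run the IPaddr2 monitor action by hand
    # exit code 0 means the address is up, 7 means it isn't
    OCF_ROOT=/usr/lib/ocf \
    OCF_RESKEY_ip=192.0.2.10 \
        /usr/lib/ocf/resource.d/heartbeat/IPaddr2 monitor
    echo $?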

Before we go too much further, let’s quickly discuss what “monitor” means.

Monitoring cluster resources

Each resource needs some sort of monitoring to be useful. Pacemaker doesn’t care how it works, so long as it happens. “Success” in the monitor action means (a shell sketch of these checks follows the list):

  • For a DRBD device we check that the kernel module is loaded, and that the local node is in either the DRBD Primary or Secondary role
  • A filesystem must be mounted (check /proc/mounts). We can optionally also check that the filesystem is writeable
  • An IP address is bound to an interface
  • A database must answer a basic SELECT query over a standard client connection
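
Stripped of error handling, those checks boil down to things you could run at a shell prompt yourself (resource names, paths and addresses are examples only):

    # DRBD: module loaded, and the local node has a sensible role
    lsmod | grep -q '^drbd '
    drbdadm role r0            # expect Primary/... or Secondary/...

    # Filesystem: the mountpoint appears in /proc/mounts
    grep -q ' /srv/db ' /proc/mounts

    # IP address: the floating address is bound to an interface
    ip -o addr show | grep -q '192.0.2.10'

    # Database: a trivial query over a normal client connection
    psql -h 192.0.2.10 -U monitor -c 'SELECT 1;'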

Monitoring is pretty straightforward, but it’s important (and sometimes difficult) to write monitoring actions that accurately reflect the state of the resource, without depending on the correct functioning of an unrelated component.

An example of this would be a network fault causing problems for an NFS mount, which affects your ability to read the status of a local (pacemaker-managed) filesystem.

Recovering from monitoring failures

So what happens if a monitor action fails? If the resource isn’t running, pacemaker will try to run the start action to bring it up. If the monitor action times out, it will try to cleanly stop the resource and then start it again, possibly on the other cluster node.
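
How aggressively pacemaker does this is tunable per operation and per resource. As a rough illustration in crm shell syntax (the values are arbitrary), you can tell it what to do when a monitor fails, how many local failures to tolerate before moving the resource to the other node, and when to forget old failures:

    primitive p_db ocf:heartbeat:pgsql \
        op monitor interval=20s timeout=30s on-fail=restart \
        meta migration-threshold=3 failure-timeout=10min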

This makes for a resilient system that tries to repair itself in the face of failure. Things get more interesting if recovery also fails, and that’s where STONITH steps in.

When recovery fails

All of the stop/start/monitor actions have built-in timeouts, and pacemaker treats a timeout as a failure to be handled too. We’ve already seen that a monitor timeout translates to a stop and a start. A timeout on start isn’t a big deal: pacemaker can simply try the start again. A failure or timeout on stop, however, is considered critical.
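
That escalation shows up in the configuration too: operations carry explicit timeouts, and the on-fail setting controls the reaction. When fencing is configured, the default reaction to a failed stop is to fence the node. A sketch of our hypothetical database primitive again, with illustrative values:

    primitive p_db ocf:heartbeat:pgsql \
        op start interval=0 timeout=120s \
        op stop interval=0 timeout=120s on-fail=fence \
        op monitor interval=20s timeout=30s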

A broken resource that can’t be stopped gracefully needs to be taken down by force. We’ve already covered this pretty well in the first article, so we won’t dwell on it; suffice it to say that you’ll incur a bit of downtime while the cluster sorts things out and brings services up again.
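
For completeness, a fencing device is configured in much the same way as any other resource. This sketch uses the external/ipmi STONITH plugin; the hostnames, addresses and credentials are placeholders:

    primitive st_node1 stonith:external/ipmi \
        params hostname=node1 ipaddr=10.0.0.101 userid=admin passwd=secret \
        op monitor interval=60s
    # don't run a node's own fencing device on that node
    location loc_st_node1 st_node1 -inf: node1
    property stonith-enabled=true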

Summary and evaluation

The deployment we’ve described is reliable and well-behaved. Because each resource is self-contained and independent, any problems are usually straightforward to diagnose and repair.

We’ve found that most services can be decomposed into a similar stack of resources – in the end it’s just a daemon being started up on a server, and sometimes the server it runs on changes.

There are some services that don’t play nicely this way, though, and sometimes you want to manage something bigger, like a VM. We’ll cover this in our next post on high availability deployments.