Anatomy of an HA stack

In what we plan to be a small series of articles about our high availability deployments, we thought we’d start by defining the key components in the stack and how they work together.

In future we’ll cover some of the more specific details and things that need to be taken into consideration when deploying such a system. For now we’ll talk about the bits that we use, and why we use them.

Type of deployment

A highly available system is also highly complex, so it’s important to know just what problem you’re trying to solve when you take on that burden.

Our systems are designed to deal with the total failure of a server chassis. This is very low-level and was chosen because it provides the greatest flexibility when dealing with various software stacks.

To be clear, this is not at all like a clustered application, which is written to run on multiple servers at once. In our setup the active server can fail, and the standby server will step in to take the load.

A high-availability deployment can be as large and complex as you want, but we like to keep things simple. Some nomenclature:

  • We’ll only talk about two-server deployments, which covers almost every system we manage
  • In a two-server setup, one is “active” and the other is “standby”
  • Together, these form an “HA cluster”
  • Each server in the cluster is a “node”

Hardware

Because we’re dealing with whole-machine failure scenarios, we use servers with identical specifications to build the cluster.

Each chassis must be powerful enough to shoulder the full load on its own, as there’s no expectation to share the load within the cluster.

The scenario

Now that we’ve got the basics out of the way, we’ll present a fairly common use case for such a setup: a highly-available PostgreSQL database server.

Note that we’ve not trying to use replication here, that’s for solving a different problem. Replication could be used to effect the same outcome in this scenario, but it introduces a different sort of complexity and more work to repair things when the active server fails.

Total hardware failures aren’t terribly common. The point of HA here is to mitigate the risk of extended downtime if things go bad, and squeeze out an improved uptime figure. As a bonus, routine maintenance can be carried out on the cluster servers with minimal disruption to services.

Corosync

At its most basic, running a cluster is a matter of ensuring all the members are talking to each other and on the same page, then sending messages to negotiate who should be running a particular cluster-managed service.

Corosync is the messaging layer of the cluster, effectively holding everything together. It handles membership of the cluster and ensures that problems are detected very quickly. This information is communicated up the stack to the Cluster Resource Manager (crm) in Pacemaker, whose job it is to actually do something about it.
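
As a quick sanity check, you can ask both layers what they think of the cluster from any node. The exact output varies between versions, but these are the usual commands:

  # Show the health of Corosync's communication ring(s) on this node
  corosync-cfgtool -s

  # One-shot summary of membership and resources, as Pacemaker sees them
  crm_mon -1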

Pacemaker

Making use of the Corosync cluster engine, it’s Pacemaker’s job to actually take care of the managed resources in the cluster.

While we tell Corosync about the nodes in the cluster, we tell Pacemaker what resources to run in the cluster, and how it should be done.

A resource is just anything that Pacemaker can manage. While it can be almost anything you like, typical examples are DRBD devices, filesystem mounts, IP addresses, etc.

Just starting resources isn’t enough though – we need to make sure that resources are started in the correct order, and on the right node. This is where constraints come in (eg. start A before B, and B before C). For example, we can’t mount a filesystem until the underlying DRBD block device is up and running. Similarly, we can’t start a daemon that listens on the network until its IP address is brought up on the same machine.
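
To give a flavour of it, here’s roughly how those two filesystem-on-DRBD rules might look in the crm shell. The resource names are hypothetical at this point; we’ll get to the actual resources shortly.

  # Don't mount the filesystem until the DRBD device has been promoted...
  crm configure order o_drbd_before_fs inf: ms_drbd_pgsql:promote fs_pgsql:start

  # ...and only ever mount it on the node where DRBD is primary
  crm configure colocation c_fs_on_drbd inf: fs_pgsql ms_drbd_pgsql:Master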

Resources

Now that we have the management components out of the way, we can talk about the building blocks of actually running an HA database on the cluster.

Without constraints, HA resources are effectively independent. To be useful to us, we build them into a stack. Resources higher in the stack necessarily depend on resources further down the stack, as described in the previous section.

The stack isn’t part of Pacemaker’s config, it’s purely conceptual. In action, we’ll push the whole stack of resources between cluster nodes.

In rough order that they appear in the stack, we’ll look at the DRBD storage, filesystem, IP addresses, and the database daemon.

DRBD

DRBD stands for Distributed Replicated Block Device, and can be thought of as RAID-1 over a network. DRBD provides us with a block device that is guaranteed to be identical at both ends, giving us a form of shared storage between two cluster nodes.

Because DRBD presents a generic block device to the system, it can be formatted with a filesystem and used exactly as you would any other storage medium.

A DRBD device is a Pacemaker-managed resource, with the constraint that it can only be used on one cluster node at a time (the one that will run the database daemon).
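
In crm shell terms, that might look something like the sketch below, assuming the DRBD resource is named “pgsql” in DRBD’s own configuration (the monitor intervals are only indicative):

  # The DRBD device, managed by the ocf:linbit:drbd resource agent
  crm configure primitive drbd_pgsql ocf:linbit:drbd \
      params drbd_resource=pgsql \
      op monitor interval=29s role=Master \
      op monitor interval=31s role=Slave

  # Run it on both nodes, but only ever promote one side to primary
  crm configure ms ms_drbd_pgsql drbd_pgsql \
      meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true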

The DRBD device must be up and promoted before we can mount the filesystem that lives on it, which is the next step.

Filesystem

Before using a DRBD device for the first time we create a filesystem, usually a vanilla ext3 or ext4. Once prepared, we can then have Pacemaker manage the mounting/unmounting at /var/lib/pgsql.
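
A sketch of those two steps, reusing the hypothetical names from earlier and assuming the DRBD device is /dev/drbd0:

  # One-off: create the filesystem on whichever node is currently DRBD primary
  mkfs.ext4 /dev/drbd0

  # From then on, Pacemaker owns the mounting and unmounting
  crm configure primitive fs_pgsql ocf:heartbeat:Filesystem \
      params device=/dev/drbd0 directory=/var/lib/pgsql fstype=ext4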

The filesystem can only be mounted on one cluster node at a time, which Pacemaker will guarantee. A cluster filesystem could be mounted on both nodes simultaneously, but that would provide no benefit in this scenario.

The filesystem must be mounted after DRBD is started, and before we attempt to start the Postgres daemon.

IP address

To provide a consistent entry point to the database, we create a special “floating” HA IP address that will always be present on the active cluster node.

Like the other resources, the IP address can only be used on one cluster node at a time. Pacemaker will handle this for us.
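
It’s just another primitive as far as Pacemaker is concerned; something like this sketch, with a placeholder address:

  # A floating IP address that always lives on the active node
  crm configure primitive ip_pgsql ocf:heartbeat:IPaddr2 \
      params ip=192.0.2.10 cidr_netmask=24 \
      op monitor interval=30s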

The IP address can be brought online at any time (eg. while the DRBD device and filesystem are being prepared), but it must be up before Postgres is started.

STONITH

The last component in the cluster is STONITH, which stands for “Shoot The Other Node In The Head”.

Things will sometimes break or malfunction in an HA cluster; this is expected. Some types of failure are tolerable (eg. by retrying), while others are more critical. A communications failure is the latter.

STONITH exists to solve a problem called “split-brain”. If the two nodes can’t talk to each other, they can’t be sure who’s at fault. Because it’s their job to make sure all the resources are running, they’ll both want to take the “active” role in the cluster. This is the split-brain.

A split-brain situation is dangerous because both nodes will attempt to use resources that can’t be shared. If a communications failure is preventing us from asking the other node to gracefully let go of (“stop”) a resource, we use the nuclear option and switch off power to the other node.
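
Fencing devices are themselves configured as Pacemaker resources. As a sketch, here’s what an IPMI-based fence device for one node might look like; the names and credentials are placeholders, and the right STONITH plugin depends entirely on your hardware:

  # A fence device that can power off node "alpha" via its DRAC/BMC
  crm configure primitive st_alpha stonith:external/ipmi \
      params hostname=alpha ipaddr=192.0.2.100 userid=stonith passwd=secret interface=lan

  # Never let alpha run the device that's meant to shoot it
  crm configure location l_st_alpha st_alpha -inf: alpha

  # And make sure fencing is enabled cluster-wide
  crm configure property stonith-enabled=true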

As an example, assume we have two nodes Alpha and Beta that manage an ext3 filesystem.

  1. The filesystem is currently mounted on Alpha
  2. A clumsy datacentre technician is moving some cabling and inadvertently unplugs the switch carrying the cluster traffic
  3. Alpha thinks that Beta has crashed. This is no big deal, the filesystem is still mounted
  4. Beta thinks that Alpha has crashed, oh no! We need to unmount the filesystem on Alpha and mount it locally on Beta
  5. The network is down, so Beta can’t ask Alpha to unmount the filesystem
  6. Beta invokes a STONITH action on Alpha, it’s the only way to be sure! Alpha’s DRAC receives a poweroff command and promptly shuts down, hard
  7. Beta now mounts the filesystem. It needs a fsck because it wasn’t cleanly unmounted, but we’re up and running again shortly

STONITH is obviously a very violent operation, so we want to make sure it only kicks in when things have really gone bad and we’re out of options to get the resources started. We guard against spurious fencing by running redundant links for our cluster traffic, so a single failed switch or cable won’t split the cluster.


That wraps up our introduction to HA. In the near future we’ll talk more about how HA clusters are used as part of a larger system, and the kinds of considerations you need to make when adding one to your architecture. If anything is unclear or you just have a burning question, feel free to leave a comment.