Holy time-travellin’ DRBD, batman!

Here at Anchor we’ve developed High-Availability (HA) systems for our customers to ensure they remain online in the event of catastrophic hardware failure. Most of our HA systems involve the use of DRBD, the Distributed Replicated Block Device. DRBD is like RAID-1 across a network.

We’d like to share some notes on a recent issue that involved a DRBD volume jumping into a time-warp and rolling back four months. If you run your own DRBD setup, you’ll want to know about this. The chances of hitting the same problem are slim, but it’s not hard to avoid.

We have a script for Nagios that checks the health of your DRBD volumes; it was basically the go-to default check_drbd script on Nagios Exchange. The script is meant to ensure that both ends are in sync and that the connection is up.

The volume in question is the backing store for a virtual machine (VM) guest. One day, after an otherwise-ordinary cluster failover event, it was noticed that the VM’s disks had reverted to a state from November of the previous year. The monitoring had never tripped; what the heck was going on?

Our sysadmins started digging. Pacemaker generates a lot of logging output; this was one time it came in useful. Here’s the sequence of events (this assumes some familiarity with DRBD and, ideally, Pacemaker’s cluster management functions):

  1. Everything was working fine
  2. A blip on the cluster caused the active server (server A) to attempt a fence action on the volume on the standby (server B)
  3. The fence action failed for some reason
  4. Server A says “Hmm, okay, whatever”, and stops sending DRBD updates to server B
  5. The DRBD connection remains up and running
  6. Server A’s monitoring script says “I’m the Primary node so I’m up-to-date, and the connection is up: OK”
  7. Server B’s monitoring script says “I’m the Secondary node, my data is ‘consistent’ (not half-synced), and the connection is up: OK”
  8. Everything looks okay and no one is aware that server B’s copy of the data is slipping further and further out of date
  9. Eventually a full-on cluster failover occurs; server B receives the call to action and goes right ahead, as it knows its data is consistent (it represents a known point in time), but not that it’s badly outdated
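For context, the “fence action” in step 2 refers to DRBD’s resource-level fencing hooks. A setup along these lines is typical when DRBD sits under Pacemaker (an illustrative snippet, not our exact config; the handler scripts shown are the stock ones shipped with DRBD’s Pacemaker integration):

```
resource r0 {
    disk {
        # Fence only this DRBD resource, not the whole node
        fencing resource-only;
    }
    handlers {
        # Run on the Primary when replication to the peer is interrupted;
        # it asks the cluster to forbid promotion of the (possibly stale) peer
        fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
        # Clears that constraint once the peer has fully resynced
        after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
    }
}
```

In our case it was the fence-peer handler that failed, leaving server A distrustful of server B but with no cluster-level constraint to show for it.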

In short, a corner case in DRBD’s workings got the volume into a bad state that went undetected by the monitoring script. This allowed old data to replace new data during a failover.

When server A attempted the fencing action and failed, it knew something was wrong. It couldn’t tell what, but it didn’t trust it any more, so it stopped sending new data to server B.

Each server knows a little bit about the disk at the other end, thanks to the DRBD connection working just fine. Server A knew its disk was good but noted server B’s disk as “DUnknown” – something dodgy going on. Server B thought its own disk was fine (correct: it hadn’t received the fencing request) and knew server A’s disk was fine (server A is automatically trusted as the Primary node).
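On DRBD 8.x all of this is visible in /proc/drbd: the ds: field shows the local disk state first, then the peer’s. Server A’s view during the incident would have looked something like this (an illustrative line, not the actual capture):

```
 0: cs:Connected ro:Primary/Secondary ds:UpToDate/DUnknown C r-----
```

Note the nasty combination: cs:Connected looks healthy, but the peer’s disk state after the slash is DUnknown.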

Server B’s disk being marked “DUnknown” (as seen from server A) is what Nagios didn’t see, and it should’ve been a warning bell. Server B willingly took over after the failover because everything looked fine; it was just that server A had been really, really quiet for the last few months. As the new Primary node it promptly pushed its copy of the volume back to server A, steamrolling four months of changes in the process.

Our immediate fix for this was to improve the monitoring script. The remote peer’s disk state is now taken into account, and the script was heavily restructured to improve readability and aggregate data in a more structured manner. We’ll be able to push the improvements to GitHub once we’ve cleaned it up a little further.
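The gist of the fix can be sketched like this (a simplified illustration in Python, not the actual plugin; it assumes the /proc/drbd status format used by DRBD 8.x):

```python
import re

def check_drbd(proc_drbd_text):
    """Return a Nagios-style (exit_code, message) for a /proc/drbd dump.

    The crucial bit: the peer's disk state (the part after the '/' in
    the ds: field) is checked as well as our own. A peer stuck in
    DUnknown means replication has silently stopped, even if the
    connection state still says Connected.
    """
    problems = []
    for m in re.finditer(r'cs:(\S+) ro:(\S+) ds:(\S+)', proc_drbd_text):
        cstate, roles, dstates = m.groups()
        local_disk, _, peer_disk = dstates.partition('/')
        if cstate != 'Connected':
            problems.append('connection is %s' % cstate)
        if local_disk != 'UpToDate':
            problems.append('local disk is %s' % local_disk)
        if peer_disk != 'UpToDate':
            problems.append('peer disk is %s' % peer_disk)
    if problems:
        return (2, 'CRITICAL: ' + ', '.join(problems))
    return (0, 'OK')
```

Fed a line like `ds:UpToDate/DUnknown`, this goes critical on the peer’s disk state even though the connection still shows Connected, which is exactly the blind spot that bit us.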

EDIT: It’s been published now: https://github.com/anchor/nagios-plugin-drbd

We’re also further investigating the fencing actions for DRBD. Building fault-tolerant systems is hard, which is why you employ defense-in-depth strategies – it may be that the fencing actions also need defensive measures.