Holy time-travellin’ DRBD, Batman!

Here at Anchor we’ve developed High-Availability (HA) systems for our customers to ensure they remain online in the event of catastrophic hardware failure. Most of our HA systems involve the use of DRBD, the Distributed Replicated Block Device. DRBD is like RAID-1 across a network.

We’d like to share some notes on a recent issue that involved a DRBD volume jumping into a time-warp and rolling back four months. If you run your own DRBD setup, you’ll want to know about this. The chances that you hit the same problem are slim, but it’s not hard to avoid.


We have a script for Nagios that checks the health of our DRBD volumes; it was basically the go-to check_drbd script from Nagios Exchange. The script is meant to ensure that both ends are in sync and that the connection is up.
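
For illustration, here’s a rough Python sketch of the sort of logic such a check performs (the actual plugin differs, and the details here are simplified): parse /proc/drbd, look at the connection state and the local disk state, and report OK if both look healthy.

    #!/usr/bin/env python
    # Rough sketch (not the actual plugin) of the sort of logic the old
    # check performed: connection state and *local* disk state only.
    import re
    import sys

    OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

    def check_drbd(proc_path="/proc/drbd"):
        try:
            with open(proc_path) as f:
                status = f.read()
        except IOError:
            print("UNKNOWN: cannot read %s" % proc_path)
            return UNKNOWN

        # A resource line looks roughly like:
        #  0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r----
        m = re.search(r"cs:(\S+) ro:(\S+)/(\S+) ds:(\S+)/(\S+)", status)
        if not m:
            print("UNKNOWN: no DRBD resources found")
            return UNKNOWN
        cs, my_role, peer_role, my_disk, peer_disk = m.groups()

        if cs != "Connected":
            print("CRITICAL: connection state is %s" % cs)
            return CRITICAL
        if my_disk not in ("UpToDate", "Consistent"):
            print("CRITICAL: local disk state is %s" % my_disk)
            return CRITICAL

        # peer_disk is parsed but never consulted -- which is exactly how
        # a "DUnknown" peer can go unnoticed (more on that below).
        print("DRBD OK: %s %s/%s, local disk %s" % (cs, my_role, peer_role, my_disk))
        return OK

    if __name__ == "__main__":
        sys.exit(check_drbd())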

The volume in question is the backing store for a virtual machine (VM) guest. One day, after an otherwise ordinary cluster failover event, we noticed that the VM’s disks had reverted to a state from November the previous year. The monitoring had never tripped, so what the heck was going on?

Our sysadmins started digging. Pacemaker generates a lot of logging output, and this was one time it came in useful. The timeline below assumes some familiarity with DRBD and (ideally) Pacemaker’s cluster management functions:

  1. Everything was working fine
  2. A blip on the cluster caused the active server (server A) to attempt a fence action on the volume on the standby (server B)
  3. The fence action failed for some reason
  4. Server A says “Hmm, okay, whatever”, and stops sending DRBD updates to server B
  5. The DRBD connection remains up and running
  6. Server A’s monitoring script says “I’m the Primary node so I’m up-to-date, and the connection is up: OK”
  7. Server B’s monitoring script says “I’m the Secondary node, my data is ‘consistent’ (not half-synced), and the connection is up: OK”
  8. Everything looks okay and no one is aware that server B’s copy of the data is slipping further and further out of date
  9. Eventually a full-on cluster failover occurs; server B receives the call to action and goes right ahead, because it knows its data is consistent (it represents a known point in time) but has no idea that it’s badly out of date

In short, a corner case in DRBD’s workings got the volume into a bad state that went undetected by the monitoring script. This allowed old data to replace new data during a failover.

When server A attempted the fencing action and failed, it knew something was wrong. It couldn’t tell what, but it no longer trusted server B, so it stopped sending new data across.

Each server knows a little bit about the disk at the other end, thanks to the DRBD connection working just fine. Server A knew its disk was good but noted server B’s disk as “DUnknown” – something dodgy going on. Server B thought its own disk was fine (correct: it hadn’t received the fencing request) and knew server A’s disk was fine (server A is automatically trusted as the Primary node).
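
To make that concrete: the ds: field in /proc/drbd is reported as local/peer, so the two servers can hold quite different views at the same time. The lines below are purely illustrative (not captured from this incident, and the exact states are guesses based on the description above), but they show the asymmetry:

    # Purely illustrative resource lines, one per server -- NOT captured
    # from the incident; the states are guesses based on the write-up.
    views = {
        # Server A: its own disk is fine, but after the failed fence it no
        # longer trusts the peer, so the peer's disk shows as DUnknown.
        "server-a": "0: cs:Connected ro:Primary/Secondary ds:UpToDate/DUnknown C",
        # Server B: its own data is consistent and it still trusts the Primary.
        "server-b": "0: cs:Connected ro:Secondary/Primary ds:Consistent/UpToDate C",
    }

    for host, line in sorted(views.items()):
        local_disk, peer_disk = line.split("ds:")[1].split()[0].split("/")
        print("%s sees local=%s, peer=%s" % (host, local_disk, peer_disk))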

That “DUnknown” state (server A’s view of server B’s disk) is what Nagios didn’t see, and it should’ve been a warning bell. Server B willingly took over after the failover because everything looked fine; it’s just that server A had been really, really quiet for the last few months. As the new Primary node it promptly pushed its copy of the volume back to server A, steamrolling four months of changes in the process.


Our immediate fix for this was to improve the monitoring script. The remote peer’s disk state is now taken into account, and the script has been heavily restructured to improve readability and aggregate data in a more structured way. We’ll be able to push the improvements to GitHub once we’ve cleaned it up a little further.
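
The published plugin (linked in the edit below) is the real thing; purely as a minimal sketch, and assuming the same /proc/drbd parsing as above, the key new condition looks something like this:

    # Minimal sketch of the key new condition (not the published plugin):
    # the peer's half of the ds: field is no longer ignored.
    import re

    OK, WARNING, CRITICAL = 0, 1, 2

    def check_peer_disk(proc_status):
        """Map the peer's disk state from /proc/drbd to a Nagios-style code."""
        m = re.search(r"ds:(\S+)/(\S+)", proc_status)
        if not m:
            return CRITICAL, "no ds: field found"
        peer_disk = m.group(2)
        if peer_disk in ("DUnknown", "Outdated", "Inconsistent"):
            # The missing warning bell: we can't vouch for the freshness of
            # the other end's data, so failing over to it would be dangerous.
            return CRITICAL, "peer disk state is %s" % peer_disk
        if peer_disk != "UpToDate":
            return WARNING, "peer disk state is %s" % peer_disk
        return OK, "peer disk state is UpToDate"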

EDIT: It’s been published now: https://github.com/anchor/nagios-plugin-drbd

We’re also further investigating the fencing actions for DRBD. Building fault-tolerant systems is hard, which is why you employ defense-in-depth strategies – it may be that the fencing actions also need defensive measures.

  • oliver (http://paperairoplane.net)

    I can’t remember if I ever actually implemented this, but the DRBD remote invalidator sounds like it would rectify this exact situation. Where there is still connectivity between hosts, the primary would invalidate the secondary and no failover would be possible until the secondary had been forcibly resynced.

    I guess with multi-tenant hypervisors, a brute force fencing mechanism like STONITH is out of the question though ;)

    • Barney Desmond

      Yep, that’s pretty much the idea. We have the resource-fencing stuff, it was rolled out pretty recently (in the last several months or so). I’d need to go digging, but I suspect that’s part of what went wrong. I could, however, be confusing this with another feature whereby a VM running HA services can STONITH another VM on another host, through the power of libvirt.

      You’re right about STONITH on hypervisors being inappropriate, it’s kind of hard to get out of the “Just STONITH it!” mindset when you move from HA web/DB clusters to HA hypervisors. :)

  • Kavan @ LINBIT (http://www.linbit.com)

    We’re sorry to hear that you experienced this in your clusters. We (LINBIT – the DRBD folks) want to assure you and anyone else out there who’s watching that this shouldn’t happen in a properly configured DRBD system. As the developers of DRBD, we of course have best practice information that can help prevent this from happening.

    Barney, please contact us to forward your logs or any other relevant information you’d be willing to provide – team_us@linbit.com

    • Barney Desmond

      Hi Kavan, thanks for the comment. I’ve dropped you an email at that address with some details that may be interesting.

  • Martin Loschwitz

    Barney, just a question that came to my mind when reading your blog post: you mention that you people at Anchor have deployed HA setups with some sort of fencing in place, and you also mention Pacemaker logs, so I assume you are using Pacemaker. From a technical point of view, I would expect Pacemaker to be the very first thing to notice a problem like the one you’ve described. Have you configured your DRBD resources in Pacemaker with the monitoring action enabled? And weren’t there any failed resource entries for this? (Failed resource entries can be checked with standard monitoring tools in Nagios, too, btw.)

    Best regards
    Martin

    • Barney Desmond

      Hi Martin, some answers to your questions regarding the Pacemaker monitoring.

      1. Yes, we do have monitoring enabled. The DRBD RA wouldn’t have picked up this problem though (I believe that’s scored in drbd_update_master_score() in the ocf:linbit:drbd RA)

      2. No, there were no failed resource entries. Even if the RA could detect it, Nagios showed no problems in our Pacemaker monitoring check.
