bugfix Archives - AWS Managed Services by Anchor

Inquisitio interrupta

By | Technical | 2 Comments

A customer’s Django app has been giving us hell for a little while now, something we’ve recently tracked down to dodgy signal-handling in some MySQL library code. Despite it only showing up a dozen times in every 600 million queries or so, we’ve nailed it! It turns out the bug has been hanging around, on and off, for the better part of ten years now – that’s a long time! While it sounds simple on paper, it’s gone unfixed for so long because it only manifests in very specific conditions. Arriving at this conclusion was something of a surprise, but definitely worthwhile, so come join us on an adventure with surprises at every turn. The problem was first noticed by the customer, who gets a mail when an exception goes uncaught….

Hopping mad: bitten by the leap second bug

By | Technical | No Comments

It’s been an exciting weekend for those running Linux systems, with many popular websites taken offline as a result of a kernel bug relating to the addition of a leap second at the end of June. This follows hot on the heels of an outage that saw many Instagram users tragically unable to upload pictures of their breakfast. There’s no shortage of stories in the media about the problem if you care to look around, but we were rather disappointed by the lack of coverage with much substance. We’ve also not yet found an explanation that can be understood by someone who isn’t intimately familiar with the internals of the Linux kernel. So we put on our detective hat and started looking. Update: if you are familiar with the innards…

Answers for DRBD time-travel issues

By | Technical | No Comments

A little update on a DRBD problem we wrote about at the start of April, in which we lost a few months of data during a cluster failover. Linbit got in touch with us to offer assistance, and we were happy to be enlightened. We had a good idea of what had happened, but no idea why. It seems that a race condition was introduced in version 8.3.9, when the fence-peer script was changed to run asynchronously. The engineering team explained that if the connection is reestablished while the script runs, the peer’s disk-state may get overwritten with stale information. This was fixed in 8.3.11, and of course we’re running version 8.3.10 on the cluster in question. We’d like to thank Linbit for their assistance…