Tag

analysis Archives - AWS Managed Services by Anchor

Inquisitio interrupta

By | Technical | 2 Comments

A customer’s Django app has been giving us hell for a little while now, and we’ve recently tracked the problem down to dodgy signal-handling in some MySQL library code. Despite it only showing up a dozen times in every 600 million queries or so, we’ve nailed it! It turns out the bug has been hanging around, on and off, for the better part of ten years now – that’s a long time! While it sounds simple on paper, it’s gone unfixed for so long because it only manifests in very specific conditions. Arriving at this conclusion was something of a surprise, but definitely worthwhile, so come join us on an adventure with surprises at every turn. The problem was first noticed by the customer, who gets a mail when an exception goes uncaught…


The sysadmin’s essential diagnostic toolkit

By | Technical | 4 Comments

We’ve had a number of people ask us recently what sort of procedures and tricks we use when hunting down problems on systems we maintain, as a lot of the work can seem magical at times. While there are no short answers to these sorts of questions (you could fill many, many pages with the topic), we thought we’d share some of the most commonly-used tools with you. You may already know a couple of them, which is great. Their real value comes from knowing how and when to use them.

strace

This is the big one for us, because it so frequently tells us exactly what we need to know. strace hooks into a process using the kernel’s ptrace facility and prints a list of all the syscalls made by…


Hunting down unexpected behaviour in Corosync’s IP address selection

By | Technical | No Comments

Update from 2012-05-24: The Corosync devs have addressed this and a patch is in the pipeline. The effect is roughly as described below: the linked list is built by appending to the tail, and an exact IP address match for bindnetaddr is preferred (which was intended all along but got lost along the way). Rejoicing all round!

We’ve been looking at some of Corosync’s internals recently, spurred on by one of our new HA (highly-available) clusters spitting the dummy during testing. What we found isn’t a “bug” per se (we’re good at finding those), but a case where the correct behaviour isn’t entirely clear. We thought the findings were worth sharing, and we hope you find them interesting even if you don’t run any clusters yourself.

Disclaimer: We’d like to emphasise…
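To make the update's description a little more concrete, here is a minimal Python sketch of the selection behaviour it outlines. This is not Corosync's code (Corosync is written in C). The function name `pick_bind_address`, the input format, and the fall-back to the first address on a matching network are all assumptions made for illustration. Only the two ideas from the update are taken from the source: candidates are kept in the order they were appended, and an exact match for bindnetaddr wins.

```python
import ipaddress


def pick_bind_address(bindnetaddr, interface_addrs):
    """Hypothetical helper: choose a local address for a given bindnetaddr.

    interface_addrs is a list of strings such as "10.0.0.5/24", in the
    order the system reported them (an assumed input format).
    """
    wanted = ipaddress.ip_address(bindnetaddr)

    # Appending to the tail keeps candidates in their original order.
    candidates = []
    for text in interface_addrs:
        candidates.append(ipaddress.ip_interface(text))

    # First pass: prefer an exact IP address match for bindnetaddr.
    for iface in candidates:
        if iface.ip == wanted:
            return str(iface.ip)

    # Second pass (assumed fall-back): the first candidate whose network
    # contains bindnetaddr, in list order.
    for iface in candidates:
        if wanted in iface.network:
            return str(iface.ip)

    return None
```

With two addresses on the same network, passing one of them as bindnetaddr selects that address, while passing the network address falls back to whichever candidate was appended first.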


Bugfixing the in-kernel megaraid_sas driver, from crash to patch

By | Technical | 2 Comments

Today we bring you a technical writeup for a bug that one of our sysadmins, Michael Chapman, found a little while ago. This was causing KVM hosts to mysteriously keel over and die, obviously causing an outage for all VM guests running on the system. The bug was eventually traced to the megaraid_sas driver and the patch has made it to the kernel as of version 3.3. As you can imagine, not losing a big stack of customer VMs at a time, possibly at any hour of the day, is a pretty exciting prospect. This will be a very tech-heavy post but if you’ve ever gone digging into kernelspace (as a coder, or someone on the ops side of the fence) we hope it’ll pique your interest. We’ll talk about…
