Tag: debugging

A gentle intro to bughunting

Technical | One Comment

A lot of the bughunting that we talk about here is pretty involved, and requires in-depth knowledge of the systems and conventions in play. That’s not exactly conducive to learning if you’re just trying to get started, so we thought we’d take the opportunity to walk through a small bug that we found the other day. It’s really basic, limited to userspace, and only needs a couple of common tools. We’ll assume that you know a little bit of C and have used gdb to poke around your own code before, but are stuck when it comes to real-world problems. Something that we deploy heavily at Anchor is daemontools, written by Dan Bernstein (a.k.a. DJB). It keeps services running and works in conjunction with a few other small utilities…
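If you’ve not met daemontools before, its heart is a supervisor: a small process that starts your service, waits for it to die, and starts it again. As a rough illustration of that idea, here’s a toy restart loop in C (our sketch of the concept, not DJB’s actual code; the program name is ours):

```c
/* toy_supervise.c -- a toy restart loop in the spirit of daemontools'
 * supervise (an illustration only, not DJB's code).
 * Build: cc -o toy_supervise toy_supervise.c
 * Usage: ./toy_supervise /path/to/service [args...]
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s program [args...]\n", argv[0]);
        return 1;
    }

    for (;;) {
        pid_t pid = fork();
        if (pid < 0) {
            perror("fork");
            return 1;
        }
        if (pid == 0) {
            /* Child: become the service. */
            execvp(argv[1], &argv[1]);
            perror("execvp");
            _exit(127);
        }
        /* Parent: block until the service dies, then restart it. */
        int status;
        waitpid(pid, &status, 0);
        fprintf(stderr, "service exited, restarting in 1 second\n");
        sleep(1); /* throttle restarts, as supervise does */
    }
}
```

The real supervise does much more (status files, control via svc, and so on), but the restart loop above is the essence of it.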

Read More

Out-Tridging Tridge

Technical | 2 Comments

Andrew Tridgell’s rsync utility is widely used for pushing files around between servers. Anyone can copy files across the network; what makes rsync special is that it compares the files on each end and only transfers the differences, instead of pushing the whole file across the wire when only a few bytes need to be updated. We use rsync to back up servers every single day, though we recently found a few big files where rsync was going to take an eternity to perform what should’ve been several hours of work. So, we busted out the butterfly nets and went to catch us some bugs.

rsync in a nutshell

Most servers don’t change a great deal on a day-to-day basis, so transferring just the differences is a very smart…
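The trick that makes difference-only transfer cheap is rsync’s weak ‘rolling’ checksum, which can slide along a file one byte at a time in constant time. Here’s a small C sketch of that rolling update, modelled on the checksum described in the rsync technical report (the function names and demo window size are our own):

```c
/* rollsum.c -- sketch of rsync's weak rolling checksum.
 *   a = sum of the bytes in the window            (mod 2^16)
 *   b = sum of a over every prefix of the window  (mod 2^16)
 *   digest = a | (b << 16)
 * Build: cc -o rollsum rollsum.c
 */
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

struct rollsum {
    uint32_t a, b;
    size_t len; /* window length */
};

/* Compute the checksum of an initial window from scratch: O(len). */
static void rollsum_init(struct rollsum *s, const uint8_t *buf, size_t len)
{
    s->a = 0;
    s->b = 0;
    s->len = len;
    for (size_t i = 0; i < len; i++) {
        s->a += buf[i];
        s->b += (uint32_t)(len - i) * buf[i];
    }
    s->a &= 0xffff;
    s->b &= 0xffff;
}

/* Slide the window one byte (drop 'out', take in 'in'): O(1). */
static void rollsum_rotate(struct rollsum *s, uint8_t out, uint8_t in)
{
    s->a = (s->a - out + in) & 0xffff;
    s->b = (s->b - (uint32_t)s->len * out + s->a) & 0xffff;
}

static uint32_t rollsum_digest(const struct rollsum *s)
{
    return s->a | (s->b << 16);
}

int main(void)
{
    const uint8_t data[] = "the quick brown fox jumps over the lazy dog";
    const size_t len = sizeof(data) - 1; /* exclude the trailing NUL */
    const size_t win = 8;                /* demo window size */
    struct rollsum s;

    rollsum_init(&s, data, win);
    for (size_t k = 0; k + win < len; k++) {
        printf("offset %2zu: %08x\n", k, rollsum_digest(&s));
        rollsum_rotate(&s, data[k], data[k + win]);
    }
    return 0;
}
```

Because the window can roll byte-by-byte, one side can cheaply scan its whole copy of a file for blocks the other side already has, so only the non-matching ranges need to travel over the wire.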

Read More

A most unusual temporal issue

Technical | No Comments

Most of us have suffered time-related woes, dealing with servers that just can’t keep their clock straight even with NTP. Well, we’ve just come across a new one, and NTP isn’t going to save the day: what happens when you have multiple misbehaving clocks? We first noticed the problem when reloading the firewall on the host, a KVM server that had been virtualised from hardware. We use filtergen for ruleset generation, and one of its post-reload scripts restarts fail2ban, an anti-bruteforce tool. One part of fail2ban’s initscript makes a call to sleep 1 to give things time to settle. We noticed that the invocation of sleep was taking much more than a second. Something wasn’t right, and the automatic 15-second rollback was kicking in, having not received explicit…
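If you suspect this kind of misbehaviour, it’s easy to measure: time the sleep yourself against a monotonic clock. A minimal sketch in C (ours, not fail2ban’s initscript; if the clock you measure with is itself suspect, sanity-check the result from an unaffected machine):

```c
/* timed_sleep.c -- measure how long sleep(1) really takes.
 * Build: cc -o timed_sleep timed_sleep.c   (add -lrt on older glibc)
 */
#include <stdio.h>
#include <time.h>
#include <unistd.h>

/* Seconds on the monotonic clock (not subject to NTP steps). */
static double now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
    double start = now();
    sleep(1);
    printf("sleep(1) actually took %.3f seconds\n", now() - start);
    return 0;
}
```

On a healthy box this prints something a hair over 1.000 seconds; on a guest with broken timekeeping it can come out wildly longer.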

Read More

Inquisitio interrupta

Technical | 2 Comments

A customer’s Django app has been giving us hell for a little while now, something we’ve recently tracked down to dodgy signal handling in some MySQL library code. Despite showing up only a dozen times in every 600 million queries or so, we’ve nailed it! It turns out the bug has been hanging around, on and off, for the better part of ten years now. That’s a long time! While it sounds simple on paper, it’s gone unfixed for so long because it only manifests under very specific conditions. Arriving at this conclusion was something of a surprise, but definitely worthwhile, so come join us on an adventure with surprises at every turn. The problem was first noticed by the customer; they get an email when an exception goes uncaught…
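Without spoiling the story: the classic shape of this kind of bug is a blocking syscall returning early with errno set to EINTR when a signal arrives, and library code treating that as a hard failure instead of retrying. A minimal sketch of the defensive pattern in C (our illustration, not the MySQL library’s actual code):

```c
/* eintr_retry.c -- the defensive pattern for interruptible syscalls.
 * A library that skips the EINTR check turns a harmless signal
 * (SIGCHLD, SIGALRM, ...) arriving mid-read into a spurious error.
 */
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

ssize_t read_retry(int fd, void *buf, size_t count)
{
    ssize_t n;
    do {
        n = read(fd, buf, count);
    } while (n == -1 && errno == EINTR); /* interrupted: just try again */
    return n; /* bytes read, 0 on EOF, or -1 with a real errno */
}
```

Robust code wraps every interruptible syscall this way (or installs its handlers with SA_RESTART); miss one spot and you get exactly this sort of once-in-many-millions failure.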

Read More

The sysadmin’s essential diagnostic toolkit

Technical | 4 Comments

We’ve had a number of people ask us recently what sort of procedures and tricks we use when hunting down problems on systems we maintain, as a lot of the work can seem magical at times. While there are no short answers to these sorts of questions (you could fill many, many pages with the topic), we thought we’d share some of the most commonly used tools with you. You may already know a couple of them, which is great; their real value comes from knowing how and when to use them.

strace

This is the big one for us, because it so frequently tells us exactly what we need to know. strace hooks into a process using the kernel’s ptrace facility and prints a list of all the syscalls made by…
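As a taste of how that works under the hood: a tracer attaches with ptrace(2), and the kernel then stops the tracee at every syscall entry and exit so the tracer can read its registers. Here’s a stripped-down sketch in C (Linux x86-64 only, error handling omitted; real strace decodes names, arguments and return values too):

```c
/* mini_strace.c -- print the syscall numbers a child process makes.
 * Linux x86-64 only; error handling omitted for brevity.
 * Build: cc -o mini_strace mini_strace.c
 * Usage: ./mini_strace /bin/ls
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/ptrace.h>
#include <sys/user.h>
#include <sys/wait.h>

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s program [args...]\n", argv[0]);
        return 1;
    }

    pid_t child = fork();
    if (child == 0) {
        ptrace(PTRACE_TRACEME, 0, NULL, NULL); /* ask to be traced */
        execvp(argv[1], &argv[1]);
        perror("execvp");
        _exit(127);
    }

    int status;
    waitpid(child, &status, 0); /* child stops at the execvp */
    while (!WIFEXITED(status)) {
        struct user_regs_struct regs;
        ptrace(PTRACE_GETREGS, child, NULL, &regs);
        /* orig_rax holds the syscall number; each syscall is
         * reported twice, once on entry and once on exit. */
        printf("syscall %llu\n", (unsigned long long)regs.orig_rax);

        ptrace(PTRACE_SYSCALL, child, NULL, NULL); /* run to next stop */
        waitpid(child, &status, 0);
    }
    return 0;
}
```

All you get here is raw syscall numbers; strace’s value lies in turning those into readable calls with decoded arguments, which is what makes it so useful in practice.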

Read More

Bugfixing the in-kernel megaraid_sas driver, from crash to patch

Technical | 2 Comments

Today we bring you a technical writeup of a bug that one of our sysadmins, Michael Chapman, found a little while ago. It was causing KVM hosts to mysteriously keel over and die, taking down every VM guest running on the system. The bug was eventually traced to the megaraid_sas driver, and the patch has made it into the kernel as of version 3.3. As you can imagine, not losing a big stack of customer VMs at a time, possibly at any hour of the day, is a pretty exciting prospect. This will be a very tech-heavy post, but if you’ve ever gone digging into kernelspace (as a coder, or someone on the ops side of the fence) we hope it’ll pique your interest. We’ll talk about…

Read More