Tag

bug Archives - AWS Managed Services by Anchor

A most unusual temporal issue

By | Technical | No Comments

Most of us have suffered time-related woes, dealing with servers that just can’t keep their clocks straight even with NTP. Well, we’ve just come across a new one, and NTP isn’t going to save the day – what happens when you have multiple misbehaving clocks? We first noticed the problem when reloading the firewall on the host, a KVM server that had been virtualised from hardware. We use filtergen for ruleset generation, and one of its postreload scripts restarts fail2ban, an anti-bruteforce tool. One part of fail2ban’s initscript makes a call to sleep 1 to give things time to settle. We noticed that the invocation of sleep was taking much more than a second. Something wasn’t right and the automatic 15-second rollback was kicking in, having not received explicit…
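The symptom described here, a one-second sleep that takes far longer than a second, can be checked with a small timing sketch. This is only an illustration and not the diagnosis from the post: the function name measure_sleep is hypothetical, and it simply compares the requested sleep against the elapsed time reported by the wall clock and by the monotonic clock. On a guest whose clocks misbehave, either figure may itself be unreliable, so a large disagreement is a hint rather than proof.

```python
import time


def measure_sleep(seconds):
    """Sleep for `seconds` and report the elapsed time on two clocks.

    `measure_sleep` is a hypothetical helper for illustration only.
    """
    wall_start = time.time()        # wall clock: may be stepped or slewed
    mono_start = time.monotonic()   # monotonic clock: never goes backwards
    time.sleep(seconds)
    return {
        "requested": seconds,
        "wall_elapsed": time.time() - wall_start,
        "monotonic_elapsed": time.monotonic() - mono_start,
    }


if __name__ == "__main__":
    # Mirror the initscript's `sleep 1` and print what each clock saw.
    for name, value in measure_sleep(1.0).items():
        print(f"{name}: {value:.3f}s")
```

Running this a few times on a suspect host and on a known-good one gives a rough comparison of how far the sleep overshoots.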

Pulling apart Ceph’s CRUSH algorithm

By | Technical | One Comment

As you’ve probably noticed, we’ve been evaluating Ceph in recent months for our petabyte-scale distributed storage needs. It’s a pretty great solution and works well, but it’s not the easiest thing to set up and administer properly. One of the bits we’ve been grappling with recently is Ceph’s CRUSH map. In certain circumstances, which aren’t clearly documented, it can fail to do the job and lead to a lack of guaranteed redundancy.

How CRUSH maps work

The CRUSH map algorithm is one of the jewels in Ceph’s crown, and provides a mostly deterministic way for clients to locate and distribute data on disks across the cluster. This avoids the need for an index server to coordinate reads and writes. Clusters with index servers, such as the MDS in Lustre, funnel…

Bughunting in Ceph’s radosgw: ETags

By | Technical | No Comments

RADOS Gateway (henceforth referred to as radosgw) is an add-on component for Ceph, large-scale clustered storage now mainlined in the Linux kernel. radosgw provides an S3-compatible interface for object storage, which we’re evaluating for a future product offering. We’ve spent the last few days digging through radosgw source trying to nail some pesky bugs. For once, the clients don’t appear to be breaking spec; it’s radosgw itself. We’re using DragonDisk as our S3-alike client – what works? PUTing and GETing files works, obviously. Setting the Content-Type metadata returns a failure, and renaming a directory almost works – it gets duplicated to the new name, but the old copy hangs around. Wireshark to the rescue! We started pulling apart packet dumps, and it quickly became evident that setting Content-Type on…

Why does Percona cause SSL issues in Postfix?

By | Technical | No Comments

When we came across this misbehaviour a little while ago we didn’t think too much of it. We unearthed the cause and worked around it, but ultimately dismissed it as an odd one-off. Since then it’s cropped up a few more times, and it doesn’t look like it’s going away, so we thought we’d share a little more about it and what we’re doing to fix it.

The segfault

We use Postfix as our MTA on almost every Linux server in the company. As Anchor’s resident Postfix guru I often get asked for assistance when it comes to troubleshooting Postfix problems. I’ve seen plenty of them, so I can spot the most common issues and teach people how to fix them. Sometimes you run into something you haven’t seen before,…

Inquisitio interrupta

By | Technical | 2 Comments

A customer’s Django app has been giving us hell for a little while now, something we’ve recently tracked down to dodgy signal-handling in some MySQL library code. Despite it only showing up a dozen times in every 600 million queries or so, we’ve nailed it! It turns out the bug has been hanging around, on and off, for the better part of ten years now – that’s a long time! While it sounds simple on paper, it’s gone unfixed for so long because it only manifests in very specific conditions. Arriving at this conclusion was something of a surprise, but definitely worthwhile, so come join us on an adventure with surprises at every turn. The problem was first noticed by the customer, who gets a mail when an exception goes uncaught…

Hopping mad: bitten by the leap second bug

By | Technical | No Comments

It’s been an exciting weekend for those running Linux systems, with many popular websites taken offline as a result of a kernel bug relating to the addition of a leap second at the end of June. This follows hot on the heels of an outage that saw many Instagram users tragically unable to upload pictures of their breakfast. There’s no shortage of stories in the media about the problem if you care to look around, but we were rather disappointed by the lack of coverage with much substance. We’ve also not yet found an explanation that can be understood by someone who isn’t intimately familiar with the internals of the Linux kernel. So we put on our detective hat and started looking. Update: if you are familiar with the innards…

WordPress 2.7, now with fewer absurd bugs

By | Technical | No Comments

I went ahead and upgraded the installation of WordPress we use for this blog from 2.6 to 2.7 – you won’t notice anything, mind you, but we get a completely different admin interface under the hood. Keeping things up to date is always a good idea from a security standpoint, but I also wanted to address an odd issue that wasn’t present in my own personal installation of 2.7. I’d noticed a little while ago that the font-colour controls in the editor didn’t seem to work. I could select the text and apply the colour, but the change disappeared once I saved. Looking at the HTML, something odd was afoot: <span style="#990000">lorem ipsum dolor</span> Definitely not the expected behaviour: the “color:” was being stripped out of the style…
