Inquisitio interrupta

A customer’s Django app has been giving us hell for a little while now, something we’ve recently tracked down to dodgy signal-handling in some MySQL library code. Despite only showing up a dozen times in every 600 million queries or so, we’ve nailed it!

It turns out the bug has been hanging around, on and off, for the better part of ten years now – that’s a long time! While it sounds simple on paper, it’s gone unfixed for so long because it only manifests in very specific conditions. Arriving at this conclusion was something of a surprise, but definitely worthwhile, so come join us on an adventure with surprises at every turn.

The problem was first noticed by the customer, they get a mail when an exception goes uncaught. Not a huge number considering the amount of traffic they get, but persistent enough to be a concern, and happening a couple of times a day.

This was given to David, one of our sysadmins, along with backtraces from the customer’s code. The gist of it was communicated in the very first lines of the mail:

Subject: [Django] Unexpected PAException

OperationalError at /upload.json
(2013, 'Lost connection to MySQL server during query')

The mysql server is already rigorously monitored by Nagios and was showing no signs of skipping a beat in the last couple of months. So, we thought, our standard monitoring isn’t cutting it. What we need is monitoring overkill that will watch the server like a hawk, over a long enough period of time to gather data. A real-world long-held MySQL connection tester would do that nicely.

A proof-of-concept implementation was hacked together and pushed into production, really simple: runs in the foreground, makes a single connection and doesn’t ever disconnect, running a simple (but legitimate) query every second.

The damned thing was rock solid for days in a row. While it was returning useful stats about the server, we gave it a little more thought and considered that this might be the wrong way to go about it, and this is where the red herrings started popping up.

We did, however, discover along the way that the problem was not:

  • The network
  • Processes being targeted and killed on the client
  • Active connections timing out
  • MySQL losing state or otherwise dropping connections while we weren’t looking
  • Memory leaks in the client libraries

We first thought the constant queries could be acting as a “keepalive” for mysql, so we tried pulling back on the frequency a bit. Mysql’s wait_timeout parameter would normally cover this and throw a “server has gone away” error, but it was worth a try for a related issue. No such luck, even running a couple dozen instances of the long-connection tester with longer and longer idle pauses. Scratch that from the list, we’re not hitting a connection timeout.

We also looked at mysql’s connection limit. The customer does several thousand queries per second, and our monitoring only has very rough granularity on checking for that – it could well be spiking when we weren’t looking.

To inspect this we leveraged our long-connection tester to return the number of current connections. A couple of quick samples (SHOW STATUS WHERE Variable_Name = 'Threads_connected';) showed it fluctuating pretty wildly, so a big spike isn’t out of the question. If the server gets bogged down and can’t turn requests around fast enough, the number will start rising until the backlog is cleared.

Pumping the data through gnuplot, we plotted the results over an hour or so (the dense graph), and over a minute (the line graph). The number of connections, though highly variable, was looking reasonably well behaved. While we’d spotted a few “Too many connections” errors in the customer’s logs, they weren’t a regular occurrence and didn’t line up with the real problem. Scratch that from the list then.

The timing of the problems so far hadn’t seemed significant. They tended to cluster around the same time across the web-frontends (promising), but for no good reason. It was some time, about a week by this stage, before someone pointed out that while our servers were using AEST (UTC+10), the customer’s Django logs that we’d done all the fancy analysis of were using EDT (UTC-4) timezone.

Suddenly it made sense to see errors logged in the mid-afternoon – early morning cron jobs! We started tearing apart everything in cron.daily, which fires at 06:25 every day, to look for possible causes. It hadn’t been obvious before because the clustering of data points was pretty loose – the effects of cronjobs aren’t necessarily immediate.

The cron.daily tasks tend not to do too much, though logrotate does spray SIGHUP liberally to tell daemon processes to re-open their logfiles (a very common idiom). It’s a bit aggressive, but not unheard of for a SIGHUP to kill off a process that’s in the middle of something.

We went looking. This should be easy to test for… and turned up nothing. We were already testing during peak hour (when we’d expect to see the most problems) but the problem didn’t show itself. This was tantalisingly close, but somehow not quite there. Moving the cron.daily window to an less-loaded period might fix the problem, but only maybe. It would also be sweeping the problem under the rug, and that just doesn’t jive with us.

We would’ve been stuck here if we didn’t get any other ideas. But we did. Another of our sysadmins recalled seeing odd behaviour during long-running mysql import jobs. It’s pretty standard to run long tasks inside a screen session to avoid losing work if you lose your shell, and it was noted that when doing this, everything is fine until you detach the session, then the connection always drops when you reattach. If you don’t detach (and subsequently reattach) the screen, or don’t run the import inside screen, it works fine. This was mighty suspicious and a ridiculous brand of failure as-described, but a lead nonetheless.

After mulling over this for a bit David considered a possible scenario: Something in cron.daily (logrotate is a likely culprit) is sending an otherwise benign signal to the gunicorn_django processes running the app. This kills the active mysql connections that Django is holding open. Only connections with active queries are negatively affected, disrupting the page rendering and returning an error to the client’s browser.

How did this relate to screen sessions? When you reattach a screen session, it sends a SIGWINCH to the current foreground process to tell it that the window size might have changed – this would count as an “otherwise benign” signal that might be mishandled in the customer’s application stack. Gunicorn was never going to receive an unexpected SIGWINCH, but something buried in the mysql client code might just choke when it arrives.

Coupled with years of hard-won experience, it was time to go source diving (if you’d like to play along at home, grab a copy of the mysql 5.5 source). There were a number of ways this problem could be manifesting, but he had a pretty good idea about one particular way this might happen.

Most programs need to make a lot of syscalls to the kernel in order to get work done, and mysql is no exception. Syscalls are sometimes interrupted by a signal from another process, and once the signal has been dealt with, the syscall returns. Because the syscall was in the middle of something when the signal arrived, it won’t have finished-up cleanly.

Sometimes this is okay and the syscall can be retried, and sometimes the syscall can’t be safely retried (eg. non-idempotent actions like writing data). To deal with this, the kernel returns an error from the syscall and communicates basic details of the status through errno.

It’s a fairly common mistake in programming to write optimistic code that doesn’t check for errors and just assumes that everything worked correctly all the time. In this case the code is similarly naive, but assumes that any error is necessarily fatal, which just isn’t true.

The events to produce the breakage go something like this:

  1. The gunicorn_django processes, managed by daemontools, are happily running the customer’s app
  2. 06:25am rolls around and cron fires up logrotate for the day
  3. logrotate does its thing and rotates all the gunicorn_django logs
  4. As a postrotate action, logrotate asks daemontools to send a SIGHUP to all the gunicorn_django processes so they’ll open new log files
  5. The mysql client library code in the gunicorn_django process is in the middle of a syscall. The SIGHUP interrupts the syscall and an error is returned
  6. The mysql client code receives the error, freaks out, and reports a lost connection

Why is this wrong? The code in question is receiving a signal during a read() syscall, and it’s okay to retry a read() if we were interrupted for a signal. There are plenty of errors that should be bailed on, but this isn’t one of them.

This is thankfully pretty easy to workaround with a patch, and testing began in earnest:

--- a/vio/viosocket.c   2012-07-19 17:54:19.238520119 +1000
+++ b/vio/viosocket.c   2012-07-19 17:59:40.831071280 +1000
@@ -53,7 +53,12 @@
   r = recv(vio->sd, buf, size,0);
   errno=0;                                     /* For linux */
-  r = read(vio->sd, buf, size);
+  while (1) {
+    r = read(vio->sd, buf, size);
+    if ((r == (size_t) -1) && (errno == EINTR))
+       continue;
+    break;
+  }
 #endif /* __WIN__ */
 #ifndef DBUG_OFF
   if (r == (size_t) -1)

This simply says that if we get an error during the read (a return value of -1) and the error was a signal interruption (EINTR), then ignore it and keep reading. Brilliant!

You might well ask why we didn’t see this more often and why David had so much difficulty reproducing it. It comes down to a connection being in just the right state for this to occur, and the behaviour allowed by the POSIX standard.

We mentioned earlier that a read() syscall can be safely retried. The exact semantics are that:

  • If some bytes have already been read: just return the number of bytes that were read, it’s normal behaviour to return fewer bytes than were requested
  • If no bytes have yet been read: we are blocked on the read, return -1 and set errno accordingly. We can’t return 0 because this would falsely indicate the end-of-file

So this problem only arises when the client is blocked on a read() call, waiting for the server to start returning data. We only block on a read while actively making a request, and the “no bytes sent yet” condition is only true for a short period while the server parses the requests and starts processing it.

This is why the problem couldn’t be reliably reproduced.

For the sake of science, David thoughtfully put together a little snippet of code so the signal-handling issue can be easily demonstrated. You know, in case you don’t happen to have a site that takes 150 million hits every day.

#include <stdio.h>
#include <unistd.h>
#include <signal.h>

void signalhandlerdudes(int i) {

int main() {
        char buf[BUFSIZ];
        struct sigaction sa;

        sa.sa_handler = signalhandlerdudes;
        sigaction(SIGWINCH, &sa, NULL);

        while(1) {
                int ret;

                ret = read(0, buf, BUFSIZ);
                printf("read() returned %dn", ret);

        return 0;

If you have the libc header files installed you can run gcc -Wall signalhandlerdudes.c to compile it. When run, it’ll read input from stdin and report the number of bytes from each read. If you resize the terminal window that it’s running in (or manually send a SIGWINCH to the process) while waiting for input, it’ll report -1 bytes, which is what the mysql code gets snagged on.

This isn’t the only instance of poor signal handling, digging around the MySQL lists turns up a few cases of poor behaviour when the errno is EINTR. Some go completely without attention, like this one that has had a patch and suggested fix for six years now.

In our case we’re well positioned to patch the package and roll it out to the customer’s machines. The mysql client libraries are in use in some Python virtualenvs (statically compiled), but those are easy to deal with. In the meantime we’ve reported this on the mysql lists and its been acknowledged as a regression – mysql 5.0 and 5.7 are apparently unaffected.

That’s another bug licked, after a couple of weeks of intensive searching, data collection, hypothesising and testing on David’s part. We’re proud to have such talent on staff to be able to get to the bottom of such problems and fix them for our customers. What started as a relatively small issue affecting perhaps 20 users each day out of many thousands turned out to be a deeply-rooted bug that can cause big problems – and you can see it for yourself by starting a long-running query on the commandline and resizing the window.

Reckon you could hunt this one down? We’re hiring.