Deep Bug Hunting

January 20, 2009

As practically anyone who has spent more than a couple of hours of their life using a computer can attest, software is not the most reliable of human creations. Most software is very, very buggy, to the point that small errors and irritations happen so often that most people’s reaction is either to ignore them, or to apply an almost-instinctual workaround.

The causes of most bugs that aren’t ignorable sit solidly at the “trivial” end of the scale, too — a config problem, usually, or at worst a simple logic bug, where the symptoms and logging information point quite clearly to the source of the problem. However, every once in a while, something really vicious comes along, and it’s those bugs that make people really, really glad that they’ve got a solid technical team behind them, rather than some fly-by-night box-shifting outfit.

This blog post describes a problem that has plagued Anchor for quite some time now, and how I fought to fix it properly, once and for all. It is, essentially, a big brag about how clever I am, with some (hopefully) interesting observations on how to go about diagnosing a particular class of tricky bug.

Step 1: Describe the problem

The problem report initially came in as “the reseller control panel displays blank pages for certain resellers, instead of the login page they expect”. All things considered, that’s a pretty good problem description. It’s got all the elements you need:

  • how to reproduce the problem (look at the reseller control panel for one of the resellers that’s shown as having a problem);
  • what the customer is seeing (so you know that you’ve reproduced the problem accurately); and
  • what they’re expecting to see (so you know when you’ve fixed it).

This is a crucial first step in any troubleshooting endeavour — get the particulars of the “crime”. It can be a surprisingly difficult one, too, when dealing with some people. But when it comes to the crunch, you can’t start debugging until you’ve got this info, either by direct report or by inferring reality from whatever info the bug reporter did give you.

If you don’t have this info, you don’t know how to make the problem occur, you don’t really know what you’re trying to fix, and you won’t know when you’re done.

Step 2: Identify the at-fault actor

Or, in slightly less elevated terms, “work out what you’re going to have to attach the debugger to”. This isn’t so much a matter of solving the problem as finding out who is causing it, without (necessarily) quite getting into the “why” just yet.

Sometimes, this is trivial, to the point that it isn’t even an issue. But with a modern web service there are many different elements at play, and you can’t start really poking at things to diagnose the problem until you know what it is you have to poke. Is it a browser/display-level problem? (Dodgy CSS games can produce a blank page, for example.) Is the network transporting a correct and complete request/response pair? If a complete, but blank, response is being sent over the network, who is generating it? It could be the webserver itself, or an underlying CGI / web application doing the dirty work.
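
As a sketch of that process of elimination, a raw HTTP request is a quick way to rule the browser in or out (the hostname here is hypothetical):

    # -v shows the full request/response exchange on the wire.
    curl -v http://resellers.example.com/
    # A complete set of response headers with an empty body points at the
    # server side; a reset or truncated connection tells a different story.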

In this case, after a bit of faffing around, it became clear that the cause of the problem was Apache segfaulting. How did I find this out? Because, after a few false starts, I looked in the Apache error log and found “child pid XXXXX exit signal Segmentation fault”. Proving, once again, that the error logs are the place you should go to first, not after you’ve spent half an hour or so running tcpdump and strace.
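
For the record, the check itself is a one-liner; only the log path needs guessing, as it varies between distributions:

    # Log location is distribution-dependent; adjust to taste.
    grep 'exit signal' /var/log/httpd/error_log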

Knowing where the problem is coming from is very important, as it gives you somewhere to focus your debugging attention. However, in this case, the unbridled joy of discovery was tempered by the knowledge that segfaults are rarely simple to track down in software as large, complicated, and widely used as Apache. All the simple bugs will have been found and fixed already, meaning that this one was likely to be weird and tricky.

Step 3: Generate a minimal reproducible test case

What kept me out of the pits of despair when I found out that I was dealing with a segfault is that the problem was 100% reliable — it happened every time I performed a very simple action, immediately after Apache started. For deeply technical reasons (memory layout, timing, and other environmental details can vary between runs), two seemingly identical invocations of the same program might not both produce a segfault. Thankfully, that wasn’t the case here, so tracking down the problem might be a tedious exercise, but not a frustrating hunt for a bug that manifests itself seemingly at random, making you think you’ve fixed it when in fact it was just hiding.

In general, to make your life easier, you want a test case that demonstrates the problem consistently, without requiring a lot of manual effort to create the pre-conditions every time. If you’ve got to spend 20 minutes setting up the test environment every time you want to retry the test, you’ll spend weeks nailing the problem down. Also, if the test sometimes doesn’t show the problem, you can never be quite sure that you’ve fixed it — it might just not have shown up this time around.

In this particular case, since the segfault was reliably reproducible, but quite dependent on the versions of all the software involved as well as the configuration, it was easier to run a separate instance of the software on the same machine, with the same configuration, than it was to set up a wholly separate test system. This was a little risky, as a screw-up could cause some nasty problems, but after a little hand-wringing, it was deemed worth the risk.

So, what I ended up with was a file tree with all the configuration files from the “live” installation, with the port numbers changed, along with a little script to start the program that I wanted to debug with all the options I wanted, including one to use the alternate config file location. Since Apache has a lengthy startup procedure (which running under the debugger would have slowed down considerably), I opted to start the program, wait for it to finish its initialisation, and only then attach the debugger to it (gdb --pid=XXXX ftw).
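
To make that concrete, here’s a sketch of the sort of wrapper involved. The paths are hypothetical, and the config tree is a straight copy of the live one with only the listening ports changed:

    #!/bin/sh
    # Start a second Apache instance against the copied config tree.
    # -f points httpd at an alternate config file.
    /usr/sbin/httpd -f /srv/apache-debug/conf/httpd.conf

    # ...and once startup has finished, attach to the running instance:
    # gdb --pid=XXXX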

Sidebar: Use Open Source Software

Pretty much everything I did to fix this problem myself was only possible because Anchor bases all its infrastructure on Open Source software, like Linux, Apache, and so on. If this same problem had been present in a proprietary stack, then there would have been very little hope of identifying the problem — and no hope at all of fixing it — other than to beg the vendor for a fix, which I’m fairly certain wouldn’t have happened, given the age of the system involved and the obscurity of the bug. In fact, we tried some time ago to get the system vendor to take responsibility for the bug (since it was, after all, in the software they provided and purportedly supported), but they wanted nothing to do with the problem and basically refused to look into it, so the only reason this problem is solved at all is because we had the source. The impact on our customers, particularly the periodically recurring nature of the problem, meant that this bug was not something we could ignore. How the hell can you base a customer-satisfying business on systems you can’t do anything to fix? I don’t get it.

Step 4: Produce debugging-enabled versions of programs and libraries

Compilers exist because computers and people understand very different languages. A compiler translates human-friendly languages into computer-friendly languages. Debugging symbols act as a translation medium in the other direction, allowing people to comprehend what’s going on inside the computer.
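
To see what that buys you in practice, compare a backtrace from a stripped binary with one from a symbol-enabled build. This is an illustrative mock-up (the function names and frames are invented), not output from the actual session:

    (gdb) bt        # stripped binary: raw addresses, and not much else
    #0  0x4019f3c2 in ?? ()
    #1  0x4019a11e in ?? ()

    (gdb) bt        # with debugging symbols: names, files, line numbers
    #0  dodgy_handler (conn=0x0) at handler.c:123
    #1  process_request (r=0x81d4f28) at request.c:456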

The debugging info for a given program is often twice the size of the program itself, so if everything shipped with debugging symbols, programs on disk (and in memory) would be three times the size they are now, which has implications for memory usage and startup times. So instead, you get the incomprehensible version, which is fine for most people, but useless for debugging. There is a nifty workaround for this problem available on modern distributions: you can ship the debugging information separately from the program itself and have the debugger load it. That option didn’t exist in this case, due to the age of the system.
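
For the curious, the mechanics of that split look like this on a system with reasonably modern binutils and gdb (a sketch; distribution packaging tools normally automate it):

    objcopy --only-keep-debug myprog myprog.debug    # peel off the symbols
    strip --strip-debug myprog                       # leave a lean binary
    objcopy --add-gnu-debuglink=myprog.debug myprog  # record where they went
    # gdb will then find myprog.debug by itself when debugging myprog.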

In the absence of debuginfo packages, to get a debugging version of a piece of software, you need to rebuild. Typically, this should just involve making sure that the -g and -O0 options are passed to the compiler (-g for “generate debugging info” and -O0 for “no optimisation”)[1], that there aren’t any other -O flags being passed, and that strip isn’t being called anywhere[2].
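
For an autoconf-style package such as Apache, that typically boils down to something like the following (variable names differ between build systems, so treat this as a sketch):

    CFLAGS="-g -O0" ./configure
    make
    # Then eyeball the build output for stray -O2 flags or strip
    # invocations before trusting the result.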

Unfortunately, “typically” and “actually”, like most other applications of theory and practice, aren’t all that close. I actually spent more than a day getting a debugging version of OpenSSL, and more than a little bit of time on mod_ssl. (Apache itself was the work of a few minutes, thankfully.) The time involved was exacerbated by my insistence on doing things “the right way”, by backporting a generalised method of producing auxiliary “debuginfo” packages (which contain the debugging information as separate files) to the utterly ancient (though still, theoretically, vendor-supported) version of the Linux distribution we’re running on the machine that was having the problem. Once I got that stupid idea out of my head, it was only a couple of hours (and some swearing at stupid, incorrect GCC documentation) to produce symbol-enabled versions of the appropriate libraries.

Step 5: Debug the program

I used to teach introductory computer science, and I never came up with a way to “teach” debugging to a class of students. The mechanics of using the debugger were easy to impart, and it was fairly simple to walk through finding a bug in a student’s program with them, and after a while they got the hang of it themselves. But I never worked out how to convey to a class of students the delicate art of finding problems in their programs in the general sense.

As a result of that failure, you’re largely on your own when tracking down a random bug. However, here are a few “rules of thumb” that might help you if you’re hunting a bug similar to mine:

  • First up, get a backtrace of exactly where the segfault occurs. Identify the memory access that is causing the problem, then walk up the call stack until you work out where the bad address is really coming from. (A sketch of the gdb commands involved follows this list.)
  • NULL pointers (and their cousins, NULL pointer offsets, caused by accessing a field in a struct whose base address is 0x0) are the easiest to work out, but most of the time you can see which of several addresses is the faulty one with a bit of printing in the debugger.
  • If you’re lucky, the source of the dodgy memory access will be obvious as soon as you look at the code[3]. In that instance, fixing the problem should be relatively straightforward, too.
  • More often, though, the value that’s being passed down the chain and causing the problem is just being read out of a variable somewhere, and was actually set erroneously somewhere else. In that instance, you’ve got the unenviable job of working out the ultimate source of the cruft. Here I can offer no general advice, except perhaps gdb watchpoints and a lot of grepping through the source code and setting of breakpoints at likely points in the code. If you get discouraged and the problem seems impossible, just remember that the computer isn’t doing this to piss you off: everything it does is fully described by the code you’re looking at, even if the code isn’t obvious about what it’s doing.
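
Here’s roughly what that process looks like at the gdb prompt. The commands are the real workhorses; the frame contents and variable names are hypothetical:

    (gdb) run                       # or attach with gdb --pid=XXXX
    Program received signal SIGSEGV, Segmentation fault.
    (gdb) bt                        # full backtrace of the crash
    (gdb) frame 2                   # hop up to a more informative frame
    (gdb) print conn                # inspect the suspect pointer...
    $1 = (conn_rec *) 0x0           # ...NULL, in this particular sketch
    (gdb) watch mystruct->field     # stop when the value is next written
    (gdb) continue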

These rules are ridiculously general. Every situation is wildly different. The only thing I’m sure of is that the more code you’ve written and stared at, the easier debugging is, simply because you’ve made (and seen) more boneheaded mistakes, and therefore can recognise the sorts of wrong things that programmers think of.

Step 6: Fix the problem

What irritates me the most about debugging a problem like this is that once you’ve found the problem, the fix is typically trivial — a slight adjustment to the type of a variable, or a small change to the way a function is called.

But once that’s done, you’ve got the entertaining job of feeding the fix back to the community, so that:

  • Others can benefit from your effort. The software you’re using is only as good as it is because lots of other people have spent time, like you, tracking down bugs, fixing them, and sending patches back.
  • The next release of the software doesn’t have the same bug again, so you don’t have to continually fix the problem again and again in every new release.
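
Generating the patch itself is the easy bit; a unified diff (hypothetical filenames below) attached to the project’s bug tracker or mailing list is usually all that’s asked for:

    diff -u some_file.c.orig some_file.c > segfault-fix.patch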

Adieu

Hopefully you’ve now got a slightly better idea of how you might go about finding a tricky segfault bug of your own. If it all looks horribly daunting, don’t worry too much — not everyone has to know how to fix these bugs; you’ve just got to hire professionals who do, rather than numbskulls who think that installing a cracked version of Plesk on the Windows box in the corner of their bedroom makes them a hosting company.


[1] Turning off optimisation is a bit of a risk, since some bugs only appear at certain optimisation levels (and may $DEITY have mercy on your soul if you hit one of those!), but optimised code doesn’t quite map to the debugging symbols and source code you’re working from, which makes your job sooooo much harder.

[2] The strip command is used to remove the debugging information from a binary, and is usually run at the end of the packaging process to make sure that no debugging-enabled programs “leak” out into the wider world, which is of course exactly the opposite of what we want in this particular case.

[3] Incorrect pointer arithmetic, for example, is a nice way to end up in segfault country. I found another segfault in Apache a while ago because it was calculating an offset in memory against a char variable; chars aren’t guaranteed to be unsigned by default, so the offset went backwards for character values > 127 (but only on platforms where chars are signed). Whoops…
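
A minimal sketch of that class of bug, for the curious (this is not the actual Apache code):

    /* On platforms where plain char is signed, any byte value above 127
     * becomes a negative int when used as an array index. */
    char lookup_bad(const char *table, char c)
    {
        return table[c];                /* c == '\xc3' indexes table[-61]! */
    }

    char lookup_good(const char *table, char c)
    {
        return table[(unsigned char)c]; /* always indexes 0..255 */
    }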