A gentle intro to bughunting

September 23, 2013 Technical, General

A lot of the bughunting that we talk about here is pretty involved, and requires in-depth knowledge of the systems and conventions in play. It’s not exactly conducive to learning if you’re just trying to get started, so we thought we’d take the opportunity to walk through a small bug that we found the other day. It’s really basic, is limited to userspace, and only needs a couple of common tools. We’ll assume that you know a little bit of C and have used gdb to poke around your own code before, but are stuck when it comes to real-world problems.

Something that we deploy heavily at Anchor is daemontools, written by Dan Bernstein (aka. DJB). It keeps services running and works in conjunction with a few other small utilities to get the job done. The unix-fu is strong in this one: DJB writes small pieces of software that do one thing very well, and they’re designed to hook together to produce a robust stack.

daemontools is usually used together with multilog for logging, which writes timestamps in the tai64n format. tai64n timestamps aren’t human readable, but DJB includes a little helper app, tai64nlocal, to convert them into your local timezone in ISO format. In keeping with the unix philosophy, tai64nlocal is really simple: it reads lines from standard input, attempts to convert any tai64n timestamps that it finds, then pushes the line to standard output.

One of our code-cutters, Christian, was using tai64nlocal while inspecting some Unicorn logs (another app that we’re quite fond of), when it spontaneously segfaulted. A segmentation fault is caused when the program attempts to access memory that it doesn’t own, commonly as a result of using the return value of a function call without sanity-checking it. That’s unexpected, particularly from DJB software, which is usually very sturdy. This was enough to pique our curiosity, we’d never seen tai64nlocal segfault before…

We start by running dmesg to get some details of the segfault, as the OS logs an entry when a segfault occurs.

tai64nlocal[13319]: segfault at 14 ip 000000000040061f sp 00007fff828fcb80 error 4 in tai64nlocal[400000+2000]

This doesn’t tell us anything immediately, but we can see that the Instruction Pointer was at 0x40061f when the segfault occurred. If debugging symbols are available (the -g flag was given during compilation, and the binary hasn’t been stripped) we can use this to look up the line number in the source that caused the segfault.

As an example of how this would look on a binary with debugging symbols:

[email protected]:~/dev/segfault%  addr2line -e segfault-test 0x40061f

That’d take us right to the bug! We’ve got no such luck on our distro binaries, as they’re stripped to save space in distribution, so we get an unknown answer:

[email protected]:~%  addr2line -e /usr/bin/tai64nlocal 0x40061f

We’re going to need to try something different – let’s bust out gdb, the GNU Debugger, we’ll have to inspect the raw assembly (ASM). Fire up gdb and enable disassembly of subsequent lines so you can see what’s about to be executed.

[email protected]:~/dev/segfault%  gdb /usr/bin/tai64nlocal
GNU gdb (GDB) 7.4.1-debian
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
For bug reporting instructions, please see:
Reading symbols from /usr/bin/tai64nlocal...(no debugging symbols found)...done.

(gdb) set disassemble-next-line on

Now tai64nlocal is loaded up, but not yet running. We know where the segfault occurred, so let’s inspect the ASM around that address. We’re using the eXamine command, specifying that we’d like to see 5 Instructions:

(gdb) x/5i 0x40061f
   0x40061f:	mov    0x14(%rax),%esi
   0x400622:	mov    $0x601ae0,%edi
   0x400627:	mov    %rax,0x2014da(%rip)        # 0x601b08
   0x40062e:	add    $0x76c,%esi
   0x400634:	movslq %esi,%rsi

The first mov instruction is what causes the segfault, but we’d actually like to see a little before that, so we’ll subtract some bytes and look again:

(gdb) x/5i 0x40061f-10
   0x400615:	sbb    $0x2014f6,%eax
   0x40061a:	callq  0x4004e0 <[email protected]>
   0x40061f:	mov    0x14(%rax),%esi
   0x400622:	mov    $0x601ae0,%edi
   0x400627:	mov    %rax,0x2014da(%rip)        # 0x601b08

Aha, there’s a call to a function named localtime right before the segfault. If you’re familiar with calling conventions, you’ll know that the CPU register rax is used for the return value of calls. The segfaulting instruction appears to make use of rax, which is a good clue that something unexpected happened in the call to localtime.

Let’s have a quick look at what the localtime() function returns. If you have the development manpages installed (the manpages-dev package on Debian-type systems, and man-pages on Redhat-types) you can simply run man 3 localtime to see for yourself.

The Synopsis tells us that localtime returns a pointer to a struct tm, and the Description gives a breakdown as follows:

/* You can also find this in the kernel source tree at include/linux/time.h */
struct tm {
    int tm_sec;         /* seconds */ 
    int tm_min;         /* minutes */ 
    int tm_hour;        /* hours */
    int tm_mday;        /* day of the month */
    int tm_mon;         /* month */
    int tm_year;        /* year */ 
    int tm_wday;        /* day of the week */
    int tm_yday;        /* day in the year */
    int tm_isdst;       /* daylight saving time */

Without knowing exactly how a struct tm is used, it’s already evident that it’s a data structure for storing components of timestamps. With this knowledge, let’s have a look at the source code, which is readily available from DJB’s site. There’s no surprises in the layout here, the relevant code is tucked away in admin/daemontools-0.76/src/tai64nlocal.c and is nice and short. There’s only one mention of localtime, on line 55 (as of version 0.76); experienced coders will already know what the problem is just from viewing the source. 🙂

Nevertheless, we’re going to poke around and see what’s going on, so let’s get cracking in gdb. We set a breakpoint on the segfaulting mov instruction, specifying its address directly, then begin execution:

(gdb) break *0x40061f
Breakpoint 1 at 0x40061f
(gdb) run
Starting program: /usr/bin/tai64nlocal

gdb will now appear to hang here – tai64nlocal is waiting for input. We know it’s expecting a log entry with a tai64n timestamp, and the source makes it clear that we can get into that codepath by feeding it a line that begins with an ‘@’ symbol. We’ll give it something lazy but vaguely plausible, and see what happens:


Breakpoint 1, 0x000000000040061f in ?? ()
=> 0x000000000040061f:	8b 70 14	mov    0x14(%rax),%esi

Huzzah, we’ve hit our breakpoint. This instruction is rendered in AT&T syntax, which accesses the memory address stored in register rax, offset by 0x14, and writes that into register esi. Let’s inspect the value of rax, which should be a pointer to a struct tm data structure.

(gdb) info registers rax
rax            0x0	0

Uh-oh, that doesn’t look right! A pointer is meant to be a memory address, and an address of zero would make that a null pointer. We’re definitely not going to find a usable struct tm there. You’ll also recall that the kernel logged an address of 14 in the segfault entry, which correlates with the attempted access here.

Let’s look at that data structure again, with some slightly different annotations this time:

struct tm {
    int tm_sec;         /* offset 0x00 */
    int tm_min;         /* offset 0x04 */
    int tm_hour;        /* offset 0x08 */
    int tm_mday;        /* offset 0x0c */
    int tm_mon;         /* offset 0x10 */
    int tm_year;        /* offset 0x14 */
    int tm_wday;        /* offset 0x18 */
    int tm_yday;        /* offset 0x1c */
    int tm_isdst;       /* offset 0x20 */

The very next line of tai64nlocal.c refers to tm_year, which is at offset 0x14 in the struct tm – that cements the direct cause of the segfault:

out(num,fmt_ulong(num,1900 + t->tm_year));

The next question is why we’re dealing with a null pointer; the manpage for localtime tells us:

…returns the value described, or NULL in case an error was detected.

We’ll leave it as an exercise for the reader to discover exactly why an error occurred with the input we provided (What does the expected input look like? How is it parsed? How does it know when to stop reading?). In any case, it’s a simple matter to deal with localtime() freaking out and returning a NULL value.

      t = localtime(&secs);
      if (t == NULL) break;

This is now the same as if we’d hit an unexpected character earlier during the initial parsing. If localtime hits an error condition we just bail out of the timestamp-handling code, and print the rest of the characters on the line verbatim. This sort of sanity-checking is especially important in C, which is more than happy to let you shoot yourself in the foot. Now that we have a fix, we pass a bug report and the patch upstream to be integrated into the official source tree.

As for the original manifestation of the segfault, it looks like a corrupted line in a logfile was the cause, which matches up neatly with our analysis here.

Further reading