The sysadmin’s essential diagnostic toolkit

June 29, 2012 – Technical

We’ve had a number of people ask us recently what sort of procedures and tricks we use when hunting down problems on systems we maintain, as a lot of the work can seem magical at times. While there are no short answers to these sorts of questions (you could fill many, many pages on the topic), we thought we’d share some of the most commonly used tools with you.

You may already know a couple of them, which is great. Their real value comes from knowing how and when to use them.

strace

This is the big one for us, because it so frequently tells us exactly what we need to know. strace hooks into a process using the kernel’s ptrace facility and prints a list of all the syscalls made by the process. Because processes can’t do much except crunch numbers without making syscalls, you very quickly get insight into just what a process is doing.

Common things like opening a file, mapping some memory, writing to a file descriptor, accessing the network – all these require syscalls. The output of strace tends to be very verbose, but you don’t need to understand everything for it to be helpful. Most of the time, depending on your problem, you’re looking for just a couple of important syscalls.
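
If you already know which syscalls matter for your problem, strace’s -e option can cut the noise right down. A quick sketch – my_program here is just a stand-in for whatever you’re debugging:

riri@ayakashi:~% strace -e trace=open,connect my_program

Only the open() and connect() calls are printed, which makes a failed file access or network connection much easier to spot.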

As a really simple and contrived example, let’s say that you have a program (possibly named cat) that’s meant to read a file and print it out, but for some reason is completely silent and doesn’t tell you why it failed. You might run it with strace and look for signs of the problem near the end of the output, like so:

riri@ayakashi:~% strace cat my_file
...
... lots of output here as the shell sets up the execution environment ...
...
close(3)                                = 0
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 0), ...}) = 0
open("my_file", O_RDONLY)          = -1 ENOENT (No such file or directory)
close(1)                                = 0
exit_group(1)                           = ?

We’ve trimmed a little bit of the output, but you can see that my_file doesn’t exist because the call to open() fails, and the program terminates without saying anything.

This is a heavily simplified example to demonstrate how you can get a peek under the hood and see what’s really happening when things don’t work as expected. A more realistic use-case might be to inspect a live process and see why it’s not going anywhere. Note that in this example, we use the -p flag to attach to an already-running process (PID 2501) and see what it’s in the middle of.

riri@ayakashi:~% strace -p 2501
Process 2501 attached - interrupt to quit
flock(42, LOCK_EX

The process is blocked while waiting for an exclusive lock on file-descriptor 42. This may tell us what we need to know, or we may need to do a little digging in /proc to reveal what it’s waiting for. Once you’re familiar with your systems, the name of the syscall alone will often tell you what you need to know to continue your diagnosis.
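
When you do need to dig, the fd entries under /proc will tell you exactly what file-descriptor 42 refers to. A quick sketch against the same PID – the lock file shown here is made up for illustration:

riri@ayakashi:~% ls -l /proc/2501/fd/42
lrwx------ 1 riri riri 64 Jun 29 11:30 /proc/2501/fd/42 -> /var/lock/someapp.lock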

lsof

lsof is another firm favourite in the toolbox thanks to its versatility. lsof is short for “list open files”, which is exactly what it does. Without any arguments it’ll tell you about everything on the system, but that’s often not too useful.

We’re usually asking one of two questions when we use lsof:

  1. What files does this process have open?
  2. What files are open inside a given directory?

In the first instance we want to know what files are being read/written while the process is running – often they’re log files. This is great when you know a process should be logging its activity, but you can’t find the logfile because the config is a horrible mess of XML files. As a bonus, network connections also appear in the output from lsof, making it easy to look for activity.

lsof also prints all the libraries in use by the process, but these can usually be ignored. The 4th column lists file-descriptor numbers, which is what we really care about. Here’s part of the output from the pidgin IM client:

riri@ayakashi:~% lsof -p 4534
COMMAND  PID USER    FD  TYPE SIZE/OFF     NODE NAME
pidgin  4534 riri    0r   CHR      0t0      963 /dev/null
pidgin  4534 riri    1u   REG    10367  3962676 /home/riri/.xsession-errors (nfs1.shirakiin.net:/home)
pidgin  4534 riri    2u   REG    10367  3962676 /home/riri/.xsession-errors (nfs1.shirakiin.net:/home)
pidgin  4534 riri    3r  unix      0t0  1253085 socket
pidgin  4534 riri    4u  unix      0t0  1253086 socket
pidgin  4534 riri    7r  IPv4      0t0      TCP ayakashi.shirakiin.net:35883->ldap.shirakiin.net:ldap (ESTABLISHED)
pidgin  4534 riri    8w  unix      0t0  1253100 socket
pidgin  4534 riri    9r  unix      0t0  1253108 socket
pidgin  4534 riri   10w  IPv4      0t0      TCP ayakashi.shirakiin.net:33796->msg.shirakiin.net:xmpp (ESTABLISHED)
pidgin  4534 riri   15u  FIFO      0t0  1253113 pipe
pidgin  4534 riri   16u  FIFO      0t0  1253113 pipe
pidgin  4534 riri   24u  IPv4      0t0      TCP ayakashi.shirakiin.net:40821->msg.shirakiin.net:xmpp-client (ESTABLISHED)
pidgin  4534 riri   30u   REG    23771 22667516 /home/riri/.purple/logs/jabber/riri@msg.shirakiin.net/coffee@conference.msg.shirakiin.net.chat/2012-06-29.112723+1000EST.txt (nfs1.shirakiin.net:/home)
pidgin  4534 riri   31u   REG    26468 26479347 /home/riri/.purple/logs/jabber/riri@msg.shirakiin.net/youkai@conference.msg.shirakiin.net.chat/2012-06-29.112723+1000EST.txt (nfs1.shirakiin.net:/home)

In addition to identifying the active network connections, lsof handily tells us which logfiles are open, right at the end of the list.
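
If it’s only the network activity you’re after, lsof can be limited to internet sockets for that process. A quick sketch using the same PID – the -a flag ANDs the selection criteria together, and -i selects network sockets:

riri@ayakashi:~% lsof -a -p 4534 -i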

As mentioned, lsof can also be used to find open files in a directory. This is useful when the directory is actually a mountpoint for a filesystem, and we’re trying to find out what’s stopping us from unmounting the filesystem.

riri@ayakashi:~% umount /mnt/usb_drive
umount: /mnt/usb_drive: device is busy.
        (In some cases useful info about processes that use
         the device is found by lsof(8) or fuser(1))


riri@ayakashi:~% lsof +D /mnt/usb_drive
COMMAND   PID USER   FD   TYPE SIZE/OFF NAME
zsh     15660 riri  cwd    DIR    12288 /mnt/usb_drive
vim     15750 riri  cwd    DIR    12288 /mnt/usb_drive
vim     15750 riri   5u    REG    16384 /mnt/usb_drive/.love_letter.tex.swp

Whoops! I forgot I was in the middle of writing that…
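
As the hint from umount suggests, fuser(1) can answer the same question from a slightly different angle. A quick sketch against the same mountpoint – -v gives verbose, human-readable output and -m selects the mounted filesystem:

riri@ayakashi:~% fuser -vm /mnt/usb_drive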

ldd

ldd tells you which shared libraries a binary depends on. We use it pretty rarely but it’s handy to confirm your suspicions sometimes – let’s look at a real case we saw recently.

A customer’s server runs Postfix, like almost all of our machines, and was failing to deliver mail to some destinations. The Postfix logs showed that the smtp client process was bombing out, and a little more diagnosis revealed that the problem only occurred when the destination supported TLS encryption.

Google eventually led us to a post on the postfix-users mailing list describing the exact same symptoms. Wietse suggested checking the library dependencies using ldd, which revealed the same apparent cause as well: the Percona mysql daemon.

riri@ayakashi:~% ldd /usr/libexec/postfix/smtp
	libldap-2.4.so.2 => /lib64/libldap-2.4.so.2
	liblber-2.4.so.2 => /lib64/liblber-2.4.so.2
	libpcre.so.0 => /lib64/libpcre.so.0
	libmysqlclient.so.16 => /usr/lib64/libmysqlclient.so.16   <-- this one here
	libm.so.6 => /lib64/libm.so.6
	libsasl2.so.2 => /usr/lib64/libsasl2.so.2
	libssl.so.10 => /usr/lib64/libssl.so.10
	libcrypto.so.10 => /usr/lib64/libcrypto.so.10
	libdl.so.2 => /lib64/libdl.so.2
	... lots more here ...

riri@ayakashi:~% rpm -qf /usr/lib64/libmysqlclient.so.16
Percona-Server-shared-compat-5.5.24-rel26.0.256.rhel6.x86_64

Too easy! That pretty much nailed it as far as we were concerned, based on the rest of the thread. However, to verify the hypothesis, we’ll use the next scheduled maintenance window to try some vanilla mysql packages and see if the problem goes away. Until then, we’ve disabled opportunistic encryption for outbound mail and everything is working nicely.

netstat

netstat is a handy little tool for inspecting network socket activity. It’s fairly well known thanks to its inclusion in Windows, but it’s much more useful on Linux.

We mostly use netstat to inspect listening ports on a system. One of its lesser-known features is the ability to report the process ID and name behind each listening socket – an example will best demonstrate this.

We suspect this machine is running a hidden IRC server and participating in a botnet, thanks to a tip-off from our netsec division. The flags we use (-tnlp) respectively mean: TCP, Numeric output, Listening sockets only, Process details.

root@misaka:~# netstat -tnlp
Active Internet connections (only servers)
Proto Local Address    Foreign Address    State       PID/Program name
tcp   127.0.0.1:3306   0.0.0.0:*          LISTEN      1890/mysqld
tcp   0.0.0.0:21       0.0.0.0:*          LISTEN      1713/vsftpd
tcp   0.0.0.0:22       0.0.0.0:*          LISTEN      1863/sshd
tcp   127.0.0.1:25     0.0.0.0:*          LISTEN      2202/master
tcp   0.0.0.0:5666     0.0.0.0:*          LISTEN      1743/nrpe
tcp   0.0.0.0:6667     0.0.0.0:*          LISTEN      1720/.zz

That last process looks mighty suspicious! Poking through procfs, we can find the location of the executable and uproot it properly.

root@misaka:~# ls -lh /proc/1720/exe
lrwxrwxrwx 1 root root 0 Jun 27 21:12 /proc/1720/exe -> /tmp/a/.zz

Find any other mess it might’ve left behind, kill the process and then nuke it all from orbit. Mission accomplished.
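
For the mess-finding part, a couple more procfs entries are worth a look before you send the kill. A quick sketch against the same PID:

root@misaka:~# ls -l /proc/1720/cwd
root@misaka:~# ls -l /proc/1720/fd

The first shows the process’s current working directory, the second everything it still has open.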

tail

I know what you’re thinking, “tail isn’t that cool, it’s like having head but upside-down”. We say otherwise.

tail is invaluable for watching logfiles while you’re testing: invoke it with the -f flag and it follows the end of the file as new lines are written.

It’s probably most commonly used here at Anchor to hunt down mail problems. Once running, we’ll go ahead and send test mails through the system, watching for them to be processed and logged.

tail -f /var/log/maillog
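
With that running in one terminal, a test message can be injected from another. A quick sketch, assuming a local MTA and the mailx client are installed – the address is just an example:

riri@ayakashi:~% echo "test body" | mail -s "delivery test" postmaster@example.com

The corresponding log entries should show up in the tailed maillog almost immediately.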

tail can be used to follow multiple logfiles simultaneously as well, which is helpful for processes that split their output across multiple files based on loglevel (errors vs. info) or destination (e.g. Apache’s access_log and error_log). Used in the right situations, tail-following makes diagnosis much, much faster and more convenient for the sysadmin.
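
Following several files at once is just a matter of listing them – tail prefixes each chunk of output with the name of the file it came from. A sketch using Apache’s usual log locations (paths vary between distros):

riri@ayakashi:~% tail -f /var/log/httpd/access_log /var/log/httpd/error_log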


There you have it, that’s our Top-5 for Linux systems. Unfortunately they can’t solve your problems for you, but being proficient in their usage will go a long way when it comes to diagnosing a problem without an obvious cause.

Disagree? Do you think there are other tools that should’ve taken a spot on this list? Leave us a comment or get in touch, we’d love to hear from you.

4 Comments

  • SaveTheRbtz says:

    You can go deeper:
    0) gdb(1): Lets you get inside an app
    1) mincore(2): Helps to debug page cache problems
    2) perf(1): Debugs general performance problems
    3) SystemTap: No need to rebuild the kernel just for simple instrumentation (printfs)
    4) /proc/ and /sys/: “Not that cool” but provide good metrics for optimizing system performance

  • M0thr4 says:

    My favourite, for troubleshooting disk space issues:

    du -sk * | sort -nr | head

    (shows the top ten filesystem hogs in any directory)

  • M0thr4 says:

    The ordering of a top 5 is always going to provoke controversy, but I was surprised not to see vmstat in there.

    🙂

    • Davy Jones says:

      vmstat (and iostat along with it) would both be pretty high up there, but we find they’re not quite as useful. If nothing else, we’ve found that being a sysadmin tends to hone your senses – you can practically feel it in your bones when a server is swapping itself to death. Don’t need no vmstat for that! 😉

      Good call on the du – the popular alternative around here is du --max=1 /path/to/foo