Nagios plugins: a two minute hate

Published January 20th, 2009 by matt

If you asked one of your friends how their parents were doing (assuming that their parents were nominally alive, and that you had an appropriate degree of friendship that permitted such social intercourse), and they replied “I’m not sure, I haven’t seen them in a while”, is it reasonable for you to reply with a statement of your condolences, on the assumption that they’re dead?

No, of course not. That would be foolish. Whilst it is possible that the reason why your friend hasn’t seen them in a while is because they’re dead, and it’s possible that they’ve died without your friend being aware of it, there is no practical reason to believe that your friend’s parents are, in fact, deceased. The number and probability of the non-death-related reasons why your friend hasn’t seen their parents in a while far outweigh the death-related reasons.

Given this fairly straightforward logic, why do Nagios plugins insist that practically any inability to check whether a service is OK or not results in a critical alert? Network error? That’s critical. Plugin timeout? That’s critical. Criticising the false critical? Oh, you better believe that’s critical.

A critical alert should mean “OMG, this is down, you need to have a look at this”. It should not mean “hmm, the machine might be a bit loaded at the moment and isn’t responding quite quickly enough for my liking”. If you want to be alerted in that instance, then you can tell Nagios to alert you for “unknown” events. Making it impossible for an alerting system to distinguish between “your disk is full!” and “I couldn’t find out whether your disk is full” is ridiculously annoying.

As it stands, the ability to respond to actual problems in a timely manner is greatly diminished by these false alerts. Your choice is either to get woken up for hundreds of false alarms for every actual, needs-to-be-dealt-with problem, or to retry your service checks so many times (to reduce the chance of a false positive) that customers notice a breakage before your monitoring system does. Either way, it’s annoying, pointless, and makes big dents in the utility of your monitoring system.

So, my self-appointed task for the train ride home: patch a few critical checks to produce an unknown when they don’t know if the service is down, rather than assuming the worst and freaking everyone out with premature notice of their parents’ demise.
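
To make the idea concrete, here is a minimal sketch of a check that keeps “it’s down” and “I couldn’t find out” apart. It is not one of the actual patches. The exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) are the standard Nagios plugin convention; the function names `classify`, `check_tcp` and `main` are made up for illustration, and treating “connection refused” as the only genuinely critical outcome is my own assumption.

```python
import socket

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def classify(error):
    """Map the outcome of a connection attempt to a Nagios status.

    `error` is None on success, otherwise the exception that was raised.
    """
    if error is None:
        return OK, "connection accepted"
    if isinstance(error, ConnectionRefusedError):
        # The host answered and told us nothing is listening: really down.
        return CRITICAL, "connection refused"
    # Timeouts, DNS failures, unreachable networks: we could not find out.
    return UNKNOWN, "could not determine state: %s" % (error,)


def check_tcp(host, port, timeout=10.0):
    """Try to open a TCP connection and classify what happened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return classify(None)
    except OSError as error:  # covers timeouts and DNS errors too
        return classify(error)


def main(argv):
    """Print a one-line status and return the exit code for Nagios."""
    status, message = check_tcp(argv[1], int(argv[2]))
    print("%s - %s" % (LABELS[status], message))
    return status
```

To use it as a plugin you would finish the file with `import sys` and `sys.exit(main(sys.argv))`, then call it with a host and a port. The important part is the last branch of `classify`: anything the check cannot interpret comes back as UNKNOWN, and you decide in Nagios whether unknowns are worth a page.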

Posted in WTF


New co-location suite in Global Switch

Published January 20th, 2009 by Davy Jones

On the back of a very strong 2008 for Anchor and despite the doom and gloom that has been flooding the media we’ve taken the plunge and decided to double our rack capacity in Global Switch’s Sydney facility.

We’ve picked up capacity for about 29 standard (600mm wide) racks but have decided to do a fitout with the larger 750mm x 1070mm APC server racks. The large racks should make server installation and cable management that bit easier, as well as helping with cooling. One of the nice fringe benefits of the APC racks is the oh-so-simple mounting of the managed power rails (which we use on all our co-location). That saves having to fiddle with custom mounting brackets.

We’re hoping this new space will see us through for some time although early indications seem to suggest that the space will sell fast!

We’re madly completing the fitout and expect to be live within about 2 weeks from now. Photos will be posted as work progresses.


Deep Bug Hunting

Published January 20th, 2009 by matt

As practically anyone who has spent more than a couple of hours of their life using a computer can attest, software is not the most reliable of human creations. Most software is very, very buggy, to the point that small errors and irritations happen so often that most people’s instinctual reaction is either to ignore them, or just apply an almost-instinctual workaround.

The causes of most bugs that aren’t ignorable are solidly at the “trivial” end of the scale, too: a config problem, usually, or at worst a fairly trivial logic bug, where the symptoms and logging information point quite clearly to the source of the problem. However, every once in a while, something really vicious comes along, and it’s those bugs that make people really, really glad that they’ve got a solid technical team behind them, rather than some fly-by-night box-shifting outfit.

This blog post describes a problem that has plagued Anchor for quite some time now, and how I fought to fix it properly, once and for all. It is, essentially, a big brag about how clever I am, with some (hopefully) interesting observations on how to go about diagnosing a particular class of tricky bug.

Step 1: Describe The Problem

The problem report initially came in as “the reseller control panel displays blank pages for certain resellers, instead of the login page they expect”. All things considered, that’s a pretty good problem description. It’s got all the elements you need:

  • how to reproduce the problem (look at the reseller control panel for one of the resellers that’s shown as having a problem);
  • what the customer is seeing (so you know that you’ve reproduced the problem accurately); and
  • what they’re expecting to see (so you know when you’ve fixed it).

This is a crucial first step in any troubleshooting endeavour — get the particulars of the “crime”. It can be a surprisingly difficult one, too, when dealing with some people. But when it comes to the crunch, you can’t start debugging until you’ve got this info, either by direct report or by inferring reality from whatever info the bug reporter did give you.

If you don’t have this info, you don’t know how to make the problem occur, you don’t really know what you’re trying to fix, and you won’t know when you’re done.

Step 2: Identify the at-fault actor

Or, in slightly less elevated terms, “work out what you’re going to have to attach the debugger to”. This isn’t so much a matter of solving the problem as finding out who is causing it, without (necessarily) quite getting into the “why” just yet.

Sometimes, this is trivial, to the point that it isn’t even an issue. But with a modern web service, there’s many different elements at play, and you can’t start really poking at things to diagnose the problem until you know what it is you have to poke. Is it a browser/display level problem? (Dodgy CSS games can produce a blank page, for example). Is the network transporting a correct and complete request/response pair? If a complete, but blank, response is getting sent over the network, who is generating it — it could be the webserver itself, or an underlying CGI / web application doing the dirty work.

In this case, after a bit of faffing around, it became clear that the cause of the problem was Apache segfaulting. How did I find this out? Because, after a few false starts, I looked in the Apache error log and found “child pid XXXXX exit signal Segmentation fault”. Proving, once again, that the error logs are the place you should go to first, not after you’ve spent a half hour or so running tcpdump and strace.

Knowing where the problem is coming from is very important, as it gives you somewhere to focus your debugging attentions on. However, in this case, the unbridled joy of discovery was tempered by the knowledge that segfaults are rarely simple to track down in software as large, complicated, and widely used as Apache. All the simple bugs will have been found and fixed already, meaning that this one was likely to be weird and tricky.

Step 3: Generate a minimal reproducible test case

What kept me out of the pits of despair when I found out that I was dealing with a segfault is that the problem was 100% reliable — it happened every time I performed a very simple action, immediately after Apache started. For deeply technical reasons, two seemingly identical invocations of the same program might not produce a segfault both times. Thankfully, this wasn’t the case here, so tracking down the problem might be a tedious exercise, but not a frustrating hunt for a bug that manifested itself seemingly randomly, making you think you fixed it when in fact it was just hiding.

In general, to make your life easier, you want a test case that demonstrates the problem consistently, without requiring a lot of manual effort to create the pre-conditions every time. If you’ve got to spend 20 minutes setting up the test environment every time you want to retry the test, you’ll spend weeks nailing the problem down. Also, if the test sometimes doesn’t show the problem, you can never be quite sure that you’ve fixed the problem — it might just have not shown up this time around.
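
As a rough illustration of checking that a test case is consistent, the sketch below runs a command several times and counts how many runs died from a segmentation fault. It assumes a standalone program on a POSIX system, where Python reports a process killed by a signal as a negative return code. The name `count_segfaults` is made up, and it would not have worked as-is for Apache, where the crash happens in a child process and only shows up in the error log.

```python
import signal
import subprocess


def count_segfaults(command, runs=20):
    """Run `command` repeatedly and count the runs killed by SIGSEGV."""
    crashes = 0
    for _ in range(runs):
        result = subprocess.run(
            command,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        # On POSIX, death by signal N is reported as return code -N.
        if result.returncode == -signal.SIGSEGV:
            crashes += 1
    return crashes
```

If the count equals the number of runs, you have a reliable test case. If it is somewhere in between, a single clean run after your “fix” proves very little.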

In this particular case, since the segfault was reliably reproducible, but quite dependent on the versions of all the software involved as well as the configuration, it was easier to run a separate instance of the software on the same machine, with the same configuration, than it was to set up a wholly separate test system. This was a little risky, as a screw-up could cause some nasty problems, but after a little hand-wringing, it was deemed worth the risk.

So, what I ended up with was a file tree with all the configuration files from the “live” installation, with the port numbers changed, along with a little script to start the program that I wanted to debug with all the options I wanted, including one to use the alternate config file location. Since Apache has a lengthy startup procedure (which running in the debugger would have slowed down considerably), I opted to start the program, wait for it to finish its initialisation, and only then attach the debugger to it (gdb --pid=XXXX ftw).

Sidebar: Use Open Source Software

Pretty much everything I did to fix this problem myself was only possible because Anchor bases all their infrastructure on Open Source software, like Linux, Apache, and so on. If this same problem had been present in a proprietary stack, then there would have been very little hope of identifying the problem — and no hope at all of fixing it — other than to beg the vendor for a fix, which I’m fairly certain wouldn’t have happened due to the age of the system involved and the obscurity of the bug. In fact, we tried some time ago to get the system vendor to take responsibility for the bug (since it was, after all, in the software they provided and purportedly provided support for), but they wanted nothing to do with the problem, and basically refused to look into it, so the only reason this problem is solved at all is because we had the source. The impact on our customers, particularly the periodically recurring nature of the problem, meant that this bug was not something we could ignore. How the hell can you base a customer-satisfying business on systems you can’t do anything to fix? I don’t get it.

Step 4: Produce debugging-enabled versions of programs and libraries

Compilers exist because computers and people understand very different languages. A compiler translates human-friendly languages into computer-friendly languages. Debugging symbols serve to act as a translation medium the other way, to allow people to comprehend what’s going on in the computer.

The debugging info for a given program is often twice the size of the program itself, so if everything came with the debugging symbols the size of the program on disk (and in memory) would be three times larger than it is, which has implications for memory usage and startup times. So instead, you get the incomprehensible version, which is fine for most people, but is useless for debugging. There is a nifty workaround for this problem available on modern distributions, in that you can ship the debugging information separately from the program itself and have the debugger load it, but that option didn’t exist in this case due to the age of the system.

In the absence of debuginfo packages, to get a debugging version of a piece of software, you need to rebuild. Typically, this should just involve making sure that the -g and -O0 options are passed to the compiler (-g for “generate debugging info” and -O0 for “no optimisation”)[1], that there aren’t any other -O flags being passed, and that strip isn’t being called anywhere[2].

Unfortunately, “typically” and “actually”, like most other applications of theory and practice, aren’t all that close. I actually spent more than a day getting a debugging version of openssl, and more than a little bit of time on mod_ssl. (Apache itself was the work of a few minutes, thankfully). The time involved was exacerbated by my insistence on doing things “the right way”, by backporting a generalised method of producing auxiliary “debuginfo” packages (which contain the debugging information as separate files) to the utterly ancient (though still, theoretically, vendor-supported) version of the Linux distribution we’re running on the machine that was having the problem. Once I got that stupid idea out of my head, it was only a couple of hours (and some swearing at stupid, incorrect GCC documentation) to produce symbol-enabled versions of the appropriate libraries.

Step 5: Debug the program

I used to teach introductory computer science, and I never came up with a way to “teach” debugging processes to a class of students. The mechanics of using the debugger were easy to impart, and it was fairly simple to walk through finding a bug in a student’s program with them, and after a while they got the hang of it themselves, but I never worked out how to inform a class of students on the delicate art of finding problems in their programs using a debugger in the general sense.

As a result of my failure, you’re largely on your own if you’re tracking down a random bug of your own. However, here are a few “rules of thumb” that might help you if you’re tracking down a bug similar to mine:

  • First up, get a backtrace of exactly where the segfault occurs. Identify the memory access that is causing the problem, then walk up the call stack until you work out where the bad address is really coming from.
  • NULL pointers (and their cousins, the NULL pointer offset, caused by accessing a field in a struct whose base address is 0x0) are the easiest to work out, but most of the time you can see which of several addresses is the faulty one with a bit of printing in the debugger.
  • If you’re lucky, the reason for the source of the dodgy memory access will be obvious as soon as you look at the code[3]. In that instance, fixing the problem should be relatively straightforward, too.
  • More often, though, the value that’s being passed down the chain and causing the problem is just being read out of a variable somewhere, and the value was actually set erroneously somewhere else. In that instance, you’ve got the unenviable job of working out the ultimate source of the cruft. Here I can offer no general advice, except perhaps gdb watchpoints and a lot of grepping through the source code and setting of breakpoints at likely points in the code. If you get discouraged and the problem seems impossible, just remember that the computer isn’t doing this to piss you off, and everything it does is fully described by the code you’re looking at, even though the code probably isn’t obvious in what it’s doing.

These rules are ridiculously general. Every situation is wildly different. The only thing I’m sure of is that the more code you’ve written and stared at, the easier debugging is, simply because you’ve made (and seen) more boneheaded mistakes, and therefore can recognise the sorts of wrong things that programmers think of.

Step 6: Fix the problem

What irritates me the most about debugging a problem like this is that once you’ve found the problem, the fix is typically trivial — a slight adjustment to the type of a variable, or slightly modifying the way a function is called.

But once that’s done, you’ve got the entertaining job of feeding the fix back to the community, so that:

  • Others can benefit from your effort. The software you’re using is only as good as it is because lots of other people have spent time, like you, tracking down bugs, fixing them, and sending patches back.
  • The next release of the software doesn’t have the same bug again, so you don’t have to continually fix the problem again and again in every new release.

Adieu

Hopefully you’ve now got a slightly better idea of how you might go about finding a tricky segfault bug of your own. If it all looks horribly daunting, don’t worry too much. Not everyone has to know how to fix these bugs; you’ve just got to hire professionals who do, rather than numbskulls who think that installing a cracked version of Plesk on the Windows box in the corner of their bedroom makes them a hosting company.


[1] Turning off optimisation is a bit of a risk, since some bugs only appear in certain optimisation levels (and may $DEITY have mercy on your soul if you hit one of those!), but optimised code doesn’t quite map to the debugging symbols and source code you’re working from, which makes your job sooooo much harder.

[2] The strip command is used to remove the debugging information from a binary, and is usually run at the end of the packaging process to make sure that no debugging-enabled programs “leak” out into the wider world, which is of course completely backwards for what we want to do in this particular case.

[3] Incorrect pointer arithmetic, for example, is a nice way to end up in segfault country. I found another segfault in Apache a while ago because it was calculating an offset in memory against a char variable, but chars aren’t guaranteed to be unsigned by default, and hence it was going backwards for character values > 127 (but only on platforms where chars are signed). Whoops…
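
For anyone who hasn’t been bitten by the signed-char problem in [3], here is a small simulation of the arithmetic. It is only a model in Python, not the Apache code in question, and the helper names `as_signed_char` and `table_index` are invented for the illustration.

```python
def as_signed_char(byte_value):
    """Interpret a byte (0-255) the way a signed 8-bit char would."""
    return int.from_bytes(bytes([byte_value]), "little", signed=True)


def table_index(base, byte_value, char_is_signed):
    """Compute base + offset, where the offset comes from a char variable."""
    if char_is_signed:
        offset = as_signed_char(byte_value)
    else:
        offset = byte_value
    return base + offset
```

With unsigned chars, `table_index(0, 200, False)` gives 200, as intended. With signed chars, the same byte is read as -56, so `table_index(0, 200, True)` lands 56 bytes before the start of the table, which is exactly the “going backwards” described above.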

Posted in FTW


sock.receive()

Published January 20th, 2009 by Barney Desmond

Well, I’m impressed. My socks arrived, and just in time for my trip overseas, where I fully expect to deal with lots of snow. Delivery took one week, though one might even argue they’re a bit too serious about expediency. One pair arrived yesterday in its own box, followed by three more individual packages today.

Suddenly, socks! Thousands of them!

I see they’ve embraced a package-based model for their postal architecture. Not a bad move, but you have to consider the risks, like out-of-order arrival (not a big problem for socks), lossage (high retransmission cost), jitter (particularly important for sockscription customers), etc. Latency rocks, though!

Posted in FTW


Keeping your kernel output safe

Published January 20th, 2009 by matt

Keeping logs of the operation of your system(s) is really important; when something goes wrong in the middle of the night, a good log can give you all the information you need to diagnose and fix the problem before it happens again.

One piece of your system’s output that’s quite crucial to keep, but which is often forgotten, is your kernel’s dmesg output. This is all of the messages that come direct from the kernel, from “filesystem mounted” to “Aiee! Penguins on the SCSI bus!” or “lp0: on fire”. While the first of those isn’t so important to keep for posterity, when the kernel crashes, you really do want to capture that output, but you can’t use your system’s normal logging because when the kernel dies, it usually takes userland processes like syslogd with it.

Enter: the netconsole. This is a nice little kernel module that’s been around for years, but isn’t widely used, probably because (a) nobody knows about it, and (b) it’s not so simple to set up. While this blog post is a feeble attempt to solve (a), Karsten M. Self has done a good job of assisting with (b) in a recent post to the linux-elitists mailing list. I encourage everyone to take a look.
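
The netconsole module sends kernel messages as UDP packets to another machine, so something on that machine has to catch them. A syslog daemon or netcat is the usual answer; the sketch below is just a bare-bones Python stand-in to show how little is involved. Port 6666 is, as far as I know, netconsole’s default target port, so adjust it to whatever you configured. The function names and the log line format are my own invention.

```python
import socket


def open_listener(bind_addr="0.0.0.0", port=6666):
    """Open a UDP socket to receive netconsole packets."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((bind_addr, port))
    return sock


def format_message(sender_ip, data):
    """Turn one received packet into a log line prefixed by the sender."""
    text = data.decode("utf-8", errors="replace").rstrip("\n")
    return "%s %s" % (sender_ip, text)


def receive_one(sock):
    """Block until one packet arrives and return it as a log line."""
    data, (sender_ip, _sender_port) = sock.recvfrom(65535)
    return format_message(sender_ip, data)


def serve(sock, log_file):
    """Append every received message to log_file, flushing each time."""
    while True:
        log_file.write(receive_one(sock) + "\n")
        log_file.flush()
```

Flushing after every line matters here: the whole point is to have the last few messages safely on another machine’s disk when the sender falls over.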

Posted in FTW


Linux Conf Au 2009 Hobart – Day 1

Published January 20th, 2009 by oliver

What a feeling to be back amongst the sights, sounds and ah yes the smells of Linux Conf Au once more. It doesn’t seem so long ago that I was in Melbourne enjoying the fruits of the last conference. So what does this year have to offer? I am fortunate enough to primarily enjoy presentations that also most benefit my role as a Systems Administrator. There are plenty of great presentations though on other topics such as mobile/embedded devices and multimedia. The quality is high but I don’t tend to enjoy these as much.

The first day (for me at least) offered a few ups and a few downs. “Is Parallel Programming Hard, And If So, Why?” by Paul McKenney was more of a philosophical look into the reasoning around parallel programming and despite not really diving into the “how” side of things was enjoyable.

A couple of talks on systems provisioning and automation left me desiring a bit more though. I feel like what was presented was stuff that we all should have known years ago. Still it is good for people to be pushing Kickstart and Puppet which are both worthy tools, but that doesn’t mean I can’t hunger for the next big development.

“Security-Enhanced PostgreSQL” was certainly an eye opener, but our experiences with SELinux (around which the SEPostgreSQL project is based) lead me to believe the integration will take quite some time to be completely usable in all scenarios, and being a security product it will have to meet that criterion before it gains widespread acceptance. Something to keep an eye on though.

The final presentation I attended was “Rails Deployment In The Enterprise” by Robert Postill. I am no developer, and both Ruby and Rails have been used worldwide with great success but somehow I have managed to avoid getting to know either of them. This presentation added fuel to the fire that is already telling everyone “if you haven’t looked into Rails yet, you really need to now” – and this is completely true.

Building web applications holds a lot in common with production line automation. We’ve really progressed beyond building the same tools and parts again and again to make websites – Rails stops the need for reinventing the wheel and I was able to appreciate that finally today. Coincidentally during the presentation I was able to create a simple blog-style app using Rails that I had been meaning to do as part of some server testing, so the impact for me was doubled. Even as a non-developer I can appreciate it, and that bodes well.

Aside from the presentations, there is a general feeling of inspiration surrounding this entire event. The air is charged with the collective intelligence and open-source passion of hundreds of enthusiasts for the same cause. I’m looking forward to not only the rest of the scheduled conference but working on problems with a fresh mind in the spare time here and being inspired to do new things.


….So you think you have a spam problem?

Published January 20th, 2009 by Keiran Holloway

Earlier today we started seeing multiple monitoring alerts from our network monitoring station suggesting that two mail servers which we manage were under considerable amounts of system load.

This became so bad that email began to be delayed, and on some occasions clients attempting to connect via POP and IMAP were timing out, meaning that mail could not be retrieved. This is strange behavior, and something that is really only going to occur when the system is under a considerable amount of load.

After some investigation, it appeared that the vast majority of the mail was destined for one specific domain, and addressed to some really suspect email addresses which were never likely to exist, such as: 559611098.73168680243309@domainname.com. Email was coming in at such a fast rate that it not only saturated the primary mail server with inbound connections but was also starting to saturate the secondary mail exchange, effectively becoming a denial of service attack aimed specifically at one of our customers. The connections appeared to be originating from a number of large network blocks based primarily in Russia and Ukraine. Once this was identified, we added some clever rules to our network to block this traffic and all services were restored.

Once this issue was resolved, a post-mortem was carried out and some staggering numbers were discovered. During one hour of this attack we saw somewhere in the vicinity of 96,000 messages destined for this one domain, all addressed to non-existent email addresses.

Doing the maths, that works out at 1,600 messages per minute, or about 26 SPAM messages PER SECOND!
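
For anyone who wants to check the sums, here they are spelled out. The only input is the roughly 96,000 messages in an hour from the post-mortem; the variable names are just for illustration.

```python
messages_per_hour = 96_000

# 60 minutes in the hour, then 60 seconds in each minute.
messages_per_minute = messages_per_hour / 60
messages_per_second = messages_per_minute / 60

print("%.0f per minute, %.1f per second" % (messages_per_minute, messages_per_second))
```

Strictly speaking that is a shade under 27 per second, so “26” is, if anything, being kind to the spammers.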

On this basis, the next time I hear someone say “I have a spam problem!!!” after receiving 3 or 4 unwanted messages, I am probably going to have a chuckle to myself and think, you’ve got nothing! :)
