Know Thy Enemy

Published August 4th, 2009 by matt

Long before any code gets written or any servers deployed, a quiet yet crucial job is being performed. The poor tech who is doing this work won’t get much credit, and almost certainly none of the glory, but if this job isn’t done properly, then none of what gets done later will be of much use.

I am, of course, talking about… requirements gathering (bom bom bommmmmm).

In the case of project starbug, the requirements gathering work sits at the “fairly straightforward” end of the spectrum — but it’s by no means easy. What makes the job easier than average is that the site is currently operational, and our primary job is making sure that the new server farm that we’re building will (a) match what is currently running (in terms of system setup), and (b) have enough capacity for future growth.

The system configuration to support the customer’s application is fairly easy to achieve for this project: the customer knows their software and what it requires really, really well, and we’ve got the existing setup to examine to try and work out how the pieces fit together if it isn’t immediately obvious. The main requirement here is to make sure that all the requirements are documented thoroughly. Yeah, writing docs isn’t the glamorous end of the job, but it is important, and is something that pays dividends down the line. More on that in another article, though.

The capacity issue is a lot trickier. The new architecture we’re building is completely different to the current architecture (which isn’t scaling well, hence why it’s being left behind), so it’s hard to draw any direct performance metrics by just looking at what hardware is already in use (especially since the current setup uses virtualisation a little too heavily, which makes comparisons based on hardware spec even harder).

Based on a cursory examination of the bottlenecks in the existing system, along with previous knowledge of the system behaviour, I decided that the primary bottleneck of the system is disk I/O. This site isn’t your typical large-scale website; it does a lot more file management than is typical. As a result, the key thing we need to ensure in our new hardware setup is that there is sufficient disk I/O capacity.

Memory constraints (large app servers, mostly) take a close second in the “what is going to kill us here” stakes, as the current infrastructure is using somewhere north of 100GB of RAM (spread across all the various servers that are being used). We want to provision this plus some extra, as moah RAMs == moah disk caching, and moah disk caching == better effective disk I/O. Win all round!

CPU, on the other hand, is practically never an issue. The servers run a lot of separate processes, but they’re almost always waiting on stuff coming from the disk, so with the current state of the art in server CPUs being quad core, we really shouldn’t have a CPU bottleneck.

Although I said earlier that memory ran a close second to disk I/O for the title of “biggest performance bottleneck”, there was really no competition for which one of these was going to keep me up at night. Solving the memory problem is easy: modern chassis can easily accommodate 32GB (or more) of RAM. There was never any doubt that we’d be using at least a half dozen machines, so stocking them all with 32GB of RAM should be plenty.

No, the worry was always going to be the file I/O capacity, and making sure that we had both the speed we needed, as well as the storage capacity. While the site doesn’t need petabytes of storage, it does need a decent amount of space, and it all needs to be pretty quick. What’s annoying (but understandable) about storage systems is that you can either have a lot of capacity (1.5TB SATA drives are common as MCSEs) or you can have a lot of speed (15k SAS drives max out at 300GB). We could get the storage space we needed with 300GB drives, but will it be quick enough?

To try and make some sort of an apples-to-apples comparison, I needed to have a number that represented how much I/O was being done at present, and which could be compared to what our new hardware infrastructure is capable of.

In the end, what I went with was running the sar tool on a number of the existing machines to try and get an idea of how much disk I/O is being requested by the machines. There are a number of things that might make this comparison inaccurate, but in the end I decided that there wasn’t really any better metric.

The key thing was to try and get the statistics at the same “layer” of the stack in both cases — in this case, when the kernel passes the I/O request off to the disk (or RAID controller, in this case). The benefits of this are that it’s a single statistic to compare, and it’s not ridiculously impossible to synthesise a load at this level for benchmarking purposes (obviously, running the live site on the new infrastructure to benchmark the new hardware isn’t a real winning strategy). When all’s said and done, though, these benchmarks are an estimate, and are unlikely to be completely accurate. That needs to be kept in mind when doing the hardware estimations later on.
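As a rough sketch of how a pile of per-interval I/O samples might be boiled down to figures that can be compared against benchmark results: the post doesn’t say which summary figures were actually used, so the average, peak and 95th percentile here are assumptions, as is the hypothetical summarise_io helper and the idea that the request rates have already been copied out of the sar reports into a plain list of numbers.

```ruby
# Hypothetical helper: summarise a list of I/O request rates
# (requests per second, one value per sampling interval).
def summarise_io(samples)
  raise ArgumentError, 'need at least one sample' if samples.empty?

  sorted = samples.sort
  # Nearest-rank 95th percentile, using integer arithmetic only.
  rank = (95 * sorted.length + 99) / 100 - 1

  {
    average: sorted.inject(0.0) { |sum, value| sum + value } / sorted.length,
    peak: sorted.last,
    p95: sorted[rank]
  }
end

# Example with made-up sample values.
summary = summarise_io([100, 200, 300, 400])
puts summary.inspect
```

Sizing against the peak or a high percentile rather than the average is the more cautious choice, since the average hides exactly the busy periods that hurt.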

All of this information gathering and benchmarking takes a pile of effort, but without it there’s no chance whatsoever that any sizeable infrastructure will be correct for the job it needs to do. I was surprised in this case at how little hardware we ended up needing; on a system I sized previously, however, the initial guesstimates turned out to be an order of magnitude too low (the system ended up with some thirty-odd servers instead of the five initially ordered). Without a comprehensive analysis of the reality of the situation, you’re either going to end up with a poorly performing site, or else a pile of unused hardware.

Infrastructure development as performance art

Published August 3rd, 2009 by matt

Anchor recently signed a new customer. This is not normally news, but then again, this is not a normal customer. They’re fairly sizeable, and need a large scale dedicated infrastructure to handle their request volume.

Because of the scale of the development, and some of the novel approaches we’re going with, we’ve decided to blog about the experience of setting it all up. In effect, we’ll be doing the development of this infrastructure in public. Over the next couple of months, as everything comes together, I’ll be regularly writing up what we’re doing, how we’re doing it, and the good, the bad, and the ugly. Some details will need to be obscured, for customer confidentiality reasons, but as much information will be made public as we possibly can. If you’ve never been involved in a big infrastructure project, hopefully you’ll be able to get a feel for what goes into something like this.

We’re code naming this sizeable effort “Project Starbug”, and all the blog posts in this series will be tagged with “project starbug”, for ease of identification. Follow along, and watch the adventure unfold…

View from the top

Published March 19th, 2009 by matt

The venerable top tool is still immensely useful for seeing who is consuming all your CPU and memory. However, it’s not so good at showing who is eating your disk IO or network bandwidth.

Unsurprisingly, people have run with the top concept and produced a wide range of other tools:

  • iotop, to show the consumption of disk IO (which we’ve previously covered in detail);
  • iftop, for your network;
  • htop, an enhanced top with bargraphs and other “Sysadmin 2.0” features;
  • mytop, for when there are queries that are killing your MySQL server.

All top tools, and worthy of a look.

(sorry, couldn’t resist)

Global connectivity monitoring

Published March 19th, 2009 by matt

If you manage a network on the Internet, you are committing to providing connectivity to practically the entire world, while only having direct control over your local connectivity. Worse still, you usually only have good visibility into local network conditions, which makes knowing about (as well as investigating and resolving) connectivity problems from other parts of the world a massive pain.

Clever people on the Internet, though, have already noticed this problem and are here to help. My network tool of the week is traceroute.org, which offers a huge list of publicly available traceroute servers sorted by country. You give one of these traceroute servers an IP address or hostname, and they’ll show you how they got to it from wherever they are. If the utility of that isn’t immediately obvious…

There are also lists of BGP looking glasses and a bunch of other handy info too, but just the ability to see a traceroute to your network from Azerbaijan is worth the price of admission and more.

The 800lb Gorilla Knows Where Your Website Lives

Published February 11th, 2009 by matt

If you run a website for commercial purposes, you know that the only way it’s going to provide you with benefit is if people actually visit it. Regardless of what sort of site it is (brochure, company promotion, online store, etc), if nobody’s actually loading the pages, your site may as well not exist. One way or another, you need to drive people to your site.

There are many different ways to get people to visit your site, and different strategies work for different sites. TV advertising, for example, has become popular in the last couple of years, to entice people to just visit the site. Other traditional forms of advertising, as well as online banner or text ads, are also popular. Putting your website name on your stationery and vehicles, and promoting your website to your existing customers works well in certain markets, too.

However, I am fairly confident that regardless of what industry you’re in, search engines make up a fair portion of your incoming traffic. More than that, though, it’s almost certain that one search engine in particular is driving the majority of your search engine-sourced traffic: Google.

For example, the operators of the popular developer question-and-answer forum Stack Overflow recently published some statistics on their sources of traffic:

Currently, 83% of our total traffic is from search engines, or rather, one particular search engine:

Search Engine   Visits
Google          3,417,919
Yahoo           9,779
Live            5,638
Search          2,961
AOL             1,274
Ask             1,186
MSN             1,177
Altavista       202
Yandex          191
Seznam          103

[...] Google delivers 350x the traffic to Stack Overflow that the next best so-called “search engine” does. Three hundred and fifty times!

These numbers are almost certainly skewed more heavily towards Google than your average website, because the sort of people who benefit from Stack Overflow (software developers) are also the sort of people most likely to use Google over another search engine, but even if 100 times more people in the general population used a different search engine (and let’s face it, that’s not particularly likely), Google would still account for three-and-a-half times the incoming traffic of the next best search engine.

As a result of this massive traffic skew, if your business relies on search engine traffic, the main search engine you need to be targeting is Google. While there are seemingly endless parades of shonky search engine optimisers who will submit your website to “thousands of search engines”, the simple fact is that if all these thousands of search engines are providing you with the same proportion of traffic as “Seznam” (the 10th search engine on Stack Overflow’s top 10 list), then you’d need to be listed on over 33,000 search engines to match Google’s traffic contribution. Or, you could just make sure Google likes you, instead, for far less effort and expense.
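The arithmetic behind those ratios is easy to check. A small sketch, using only the visit counts quoted above (the variable names are made up for illustration):

```ruby
# Visit counts as quoted from the Stack Overflow figures above.
visits = { 'Google' => 3_417_919, 'Yahoo' => 9_779, 'Seznam' => 103 }

# How many times more traffic Google sends than the runner-up.
google_vs_yahoo = visits['Google'].to_f / visits['Yahoo']

# The same ratio if the runner-up had 100 times as many users.
google_vs_yahoo_x100 = google_vs_yahoo / 100

# How many Seznam-sized search engines it would take to match Google.
seznam_equivalents = (visits['Google'].to_f / visits['Seznam']).ceil

puts format('Google vs Yahoo: %.1fx', google_vs_yahoo)
puts format('Google vs a 100x bigger Yahoo: %.1fx', google_vs_yahoo_x100)
puts "Seznam-sized engines needed: #{seznam_equivalents}"
```

The first figure comes out at roughly 350, the second at roughly three and a half, and the third at a little over 33,000, matching the numbers in the text.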

This reliance by the world’s online population on one search engine isn’t necessarily healthy, though. As Whimsley describes in his excellent article, Mr Google’s Guidebook, Google has fundamentally changed the way the Web works, and in many ways it now dictates how websites are designed and marketed. The very fact that we are talking about “making sure Google likes you” and optimising your website for the Google indexer strongly suggests that Google is, in fact, “more a master than a servant”.

Philosophical arguments aside, though, you can’t afford to ignore Google if you want your online business presence to succeed and work for you. What can you do?

First off, I’d like to discuss the use of professional SEO (Search Engine Optimisation). While there are a few firms out there who do a decent job, it is a huge market for lemons. It is incredibly difficult to assess the actual value that an SEO is going to give you, in advance.

As a technical person, I’ve dealt with implementing the recommendations of a lot of dodgy SEO people over the years, and it’s not pretty. A lot of what SEO “experts” recommend are things that Google themselves have specifically debunked, like the virtual hosts vs dedicated IP address myth. Other times it’s doing things that Google specifically warns against, like buying links to boost your PageRank.

In several cases, I’ve seen a client of mine hire the services of a shyster, who has done everything that Google advises against, to provide a short-term benefit. The customer’s site shoots to the top of the Google rankings, the customer is pleased, and pays the SEO a big chunk of money. Some short time (less than a day, in one infamous case) later, the customer’s site disappears from Google’s index entirely. The SEO doesn’t care — they’ve got their money and are onto the next victim — but the customer’s website reputation is in ruins, as Google has detected all of the dodgy work, and has blacklisted the site from their indexes. Cleaning up from this mess can cost you many thousands of dollars directly, as well as lost revenue from people not being able to find you. In many cases, it can easily kill your business completely.

The simple fact is that there’s no real secret to SEO. Google is quite open in many ways about how it ranks pages, and what benefits and harms a site’s rank. It has a whole part of its main site dedicated to disseminating information to webmasters about how to do better in the site rankings, and what to avoid doing. You don’t need a professional SEO to tell you these things: there’s nothing secret about it all, and not even anything particularly difficult. However, if you do decide to hire a professional to help, here are a few tips to make sure you don’t end up doing more harm than good:

  • Avoid anyone who talks about “thousands of search engines”. While Google isn’t the only search engine out there, there isn’t more than a half dozen or so that actually matter on an individual basis. Most of the reputable work that is done to improve your ranking in these mainstream search engines will also automatically help other search engines, too.
  • If you know any other online business owners personally, ask them if they’ve had any SEO work done, and get recommendations. If their search rankings have been consistently improved for three to six months after the SEO has been paid and left, then there’s less likelihood that they’re a fly-by-night shyster, and they may be worth using for your business.
  • Never, ever let an SEO modify your site content directly. Not only might they do deeply disreputable things to your site’s content without giving you any way to easily check what they’ve done, but if their work conflicts with your site designer’s work or processes, it might cost you a lot of money to fix. Having the person who did your site layout and content work review any SEO recommendations can also act as a filter against the worst excesses of a bad SEO.
  • For every recommendation that an SEO makes, ask for a citation regarding the legitimacy of the recommendation. If the SEO can’t show you where on a search engine’s site it recommends doing a certain thing, then the chances are it’s a dodgy practice.
  • Do some research of your own on any recommendation you feel might not be above board. Don’t take the SEO’s word for it that it won’t cause you problems down the line.
  • If an SEO says they’ve got “secret” information about how a search engine works, run like hell. Nobody’s better at keeping secrets than Google (they’ve got 10,000 employees, yet nobody outside the company has any idea how many servers they use; is that good secret-keeping, or what?). The chances of a given SEO really having secret information are very slim indeed, and even if they do, the search engines can always change the way they do things to punish your site for gaming the system. It’s just not worth it.
  • If possible, have a trusted technical person (such as your website designer, or your hosting company) review the recommendations of the SEO. While it might cost you an extra couple of hundred dollars to have this checking done, what is the cost to your business if your site was blacklisted by the major search engines for doing dodgy things?
  • Try and get a longer-term contract for an SEO’s services, one that involves periodic payments over a 3-6 month period after the initial optimisation work has been done. This will tend to discourage the shysters, as their business model is one of “do some quick work, boost rankings temporarily, grab the cash, and get out before the whole thing falls apart”. A trustworthy company is far more likely to be happy with a longer-term relationship.

Whether you hire a professional or go it alone, it’s good to educate yourself a little about what sort of things the search engines recommend. Some excellent resources on this subject include:

  • Google Webmaster Central — the start page for anything related to improving your site in Google’s eyes.
  • The Google Webmaster help center — a collection of helpful articles about the how, what, and why of designing a site to be Google-friendly.
  • The Google Webmaster blog — chock full of interesting articles and tips for webmasters.
  • The Google Webmaster Dashboard: a fantastic resource that lets you peer into all the information that Google has about your site, like how many links there are from other parts of the web to the various pages on your site, whether the crawler has had problems finding some of your site info, how your sitemaps are helping, the effects of robots.txt changes, and removing pages from the index that you don’t want showing up in search results.
  • The Google Webmaster Forum — where you can ask questions of other website owners (and Google employees), and find out loads of useful information on topics that you’ve probably never even thought about.

Yes, all of those links are Google-specific, because Google makes all this info easily and clearly available, and you get the most bang for your buck by targeting Google. Most good site design ideas will help with other search engines, too, so following Google’s advice will benefit you in general.

Aggregating RRD data from multiple files

Published February 2nd, 2009 by matt

The RRD (Round-Robin Database) file format is a beautiful piece of work. It is used for storing time-series data in a (storage and CPU time) efficient form, with a fixed file size, and with some great support tools to retrieve, manipulate, and graph the data in various ways.

One problem you tend to hit every now and then, though, is that you want to aggregate the data from multiple separate RRD files into one monster graph. The simple method might be to put all the data into one RRD file, but that doesn’t work in the case where you can’t always collect all the data at once — RRD requires that you insert values for all your data sources at the same time.

Now, since we use Cacti for data collection at Anchor, in theory we should just be able to tell Cacti to do this. However, its interface is utter balls, and it always seems to take 10 times as long to do something as it should, so I tend to script this sort of thing instead of trying to fight Cacti. Also, if you don’t use Cacti (you lucky person, you), then you might need to know how to do this.

Recently, we needed to know the aggregate current draw from all the racks in our data centre. We’ve got APC managed power rails in every rack, and we already collect the current data from these devices, but then it’s stored in one RRD file for each power rail. So, we needed to aggregate this data into one big graph, and take some values out of it for management’s edification. Since there’s not a lot of info out there on aggregating lots of RRDs together, I thought I’d put down some notes on the subject.

The standard form of doing a graph in RRD is like this:

DEF:power=rack1.rrd:apc_current:AVERAGE
CDEF:kw=power,240,*,1000,/
VDEF:avg=power,AVERAGE
VDEF:avg_kw=kw,AVERAGE
LINE:power#ff0000
GPRINT:avg:Average\ current\ is\ %9.2lfA
GPRINT:avg_kw:Average\ nominal\ power\ is\ %9.2lfkW

This just takes the apc_current data source from the file rack1.rrd and stores it in the variable power. Then we scale the data source into kW (line 2), take the average of all the data points for both of those, then draw a line for the current, and print the average values we calculated. All pretty simple stuff, and if you work with RRD files at all, you’re probably quite familiar with this sort of thing.

What isn’t as common knowledge is that there’s nothing special about the DEF statement above — you can repeat that as many times as you like, and you can point to as many different files as you need. So if you’ve got, say, ten RRD files with current values in them, you can just do:

DEF:power1=rack1.rrd:apc_current:AVERAGE
DEF:power2=rack2.rrd:apc_current:AVERAGE
DEF:power3=rack3.rrd:apc_current:AVERAGE
...
DEF:power8=rack8.rrd:apc_current:AVERAGE
DEF:power9=rack9.rrd:apc_current:AVERAGE
DEF:power10=rack10.rrd:apc_current:AVERAGE

This will define separate variables for the apc_current data source in each of the files. This also works, incidentally, if you’ve got multiple data sources in each file (like, say, incoming bytes and outgoing bytes).

Once you’ve got your data sources mapped, it’s a fairly simple matter of adding them all together:

CDEF:power=power1,power2,+,power3,+,...,power9,+,power10,+

The rest of the definition stays the same.

What makes for a slightly more exciting time is when you don’t know, in advance, how many files you’re going to have to merge together. This happens whenever the user gets to specify what data gets included — the script we’ve got here asks you which racks you want to aggregate the data for, and I’ve done bandwidth graphs in the past which showed all of a customer’s IP addresses in one graph. In this case, you need a bit of code, and here’s some Ruby that I use to generate the RPN expression above to add all of the values together:

# Generate an RPN (Reverse Polish notation) sum of
# the strings given in list.
# A single-element list is supported, with the
# expected lack of addition operator.
def to_rpn_sum(list)
  if list.length == 1
    list[0]
  else
    x = list.dup
    (x.length - 1).times { |i| x.insert(i * 2 + 2, '+') }
    x.join(',')
  end
end

Glue that together with the code to create your list of RRD files, something to write out all the DEF lines (and keep a record of what variable names you use) and you’re pretty much done.
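A minimal sketch of that glue, for illustration only: the aggregate_args and rpn_sum names and the power1, power2, ... variable naming are assumptions made for this example, not the script actually in use. It carries its own small RPN helper so that it runs standalone.

```ruby
# Hypothetical stand-in for to_rpn_sum: "a,b,+,c,+" from ["a", "b", "c"].
def rpn_sum(names)
  names.first.to_s + names.drop(1).map { |name| ",#{name},+" }.join
end

# Hypothetical helper: build the DEF lines for each RRD file, plus the
# CDEF that adds them all together into a single variable.
def aggregate_args(rrd_files, ds_name, total_var = 'power')
  var_names = []
  args = rrd_files.each_with_index.map do |file, index|
    var = "#{total_var}#{index + 1}"
    var_names << var # keep a record of the variable names used
    "DEF:#{var}=#{file}:#{ds_name}:AVERAGE"
  end
  args << "CDEF:#{total_var}=#{rpn_sum(var_names)}"
end

puts aggregate_args(%w[rack1.rrd rack2.rrd rack3.rrd], 'apc_current')
```

The resulting list can be followed by the usual VDEF, LINE and GPRINT arguments and handed to the graphing command.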

Posted in FTW

The Value of Commercial Software Support

Published February 2nd, 2009 by matt

Here at Anchor, we’re often asked to install commercially-supported software products by our customers. Most commonly, it’s Linux distributions, but hosting control panels, app servers, and various other pieces of paraphernalia all get the treatment fairly regularly.

The internal opinion on the subject is that most commercial support agreements for software aren’t worth the paper they’re written on (a problem made much worse by the fact that you can’t wipe your backside on an e-mail). A recently-concluded saga with a certain prominent North American vendor of Linux distributions has done nothing but reinforce this opinion, to the point that a rant is the only way to deal with the insanity.

In March 2006, we got a problem report from a customer that an aspect of our hosting services was not operating correctly. We investigated, and determined that the problem was that the vendor-provided webserver was crashing. Since this system was covered by a support agreement, we lodged a bug report with the vendor.

The log for this report in the vendor’s bug tracker reads like a primer for “how not to provide tech support 101”, with various people from the vendor commenting on the issue and asking for information that had already been provided, and generally tripping over each other to dodge and weave and avoid investigating and fixing the problem.

We also enjoyed the repeated use of a wonderful stalling tactic: demanding the provision of a large (> 650MB) dump of system information before investigating the problem. In addition to the practical problems of uploading a CD’s worth of data over Australian-grade ADSL uplinks to a flaky FTP server on the other side of the world, this dump contained various customer confidential information, which made it a gamble to upload. It also contained nothing of actual use in diagnosing the problem. (I know for a fact that the info dump was unnecessary, because the problem was eventually fixed — by us — without needing anything in that file, but instead entirely using the information we originally provided).

Overall, the entire bug report documented a thoroughly unhelpful exchange, spanning several months, with the guy on our side of the keyboard getting obviously more and more frustrated as the weeks went by. I wasn’t involved in the original bug at all, but even I got worked up reading over the log.

Eventually, in July 2006, we gave up on the vendor, worked out a very ugly and kludgy workaround ourselves, and closed the bug in disgust, hoping that the problem would never rear its ugly head ever again. It did, repeatedly, but each time the kludge was folded, mutilated, and spindled some more to provide further relief, because the idea of going back to the vendor was just too horrible to contemplate.

Things remained in this state of critical stability until a couple of weeks ago, when the problem once again became the focus of our attention. The difference this time was that the bug report landed on my desk, and I was flush with success after finding another Apache segfault bug (this one a security vulnerability) late last year. I figured I could dive in and find the bug.

It turned out to not be quite so easy as the previous one, but after about two and a half days of digging and poking, I did manage to unearth the source of the bug. It was, as it turns out, entirely due to a coding mistake in the vendor-provided webserver, and it was entirely diagnosable with the data that was originally provided in our bug report of 2006.

Things took an ugly turn at this point, though. Despite the vendor having expressed no interest in finding and fixing the bug in their software, I decided to send the patch to them, in the interests of being a good OSS citizen. Their reaction was utterly incomprehensible:

  • Despite being told in the original message that “the attached patch fixes the problem”, they asked “I would like to know if the patch you have uploaded solves your issue” — like I’d upload a known-broken patch, and say it fixes the problem. Sheesh.
  • They again asked for the gigantic system info dump, which we’d previously told them we couldn’t provide.
  • They also claimed that, since the OS release in question would be going out of support in around 6 months’ time, it would be very unlikely that a patched release of the webserver would be forthcoming.

So, in other words, if you’re running a commercially-supported software product, for which you’ve paid quite a considerable sum of money, you can expect that the “supported period” will be shorter than your contract promises, you’ll be given the runaround, the vendor will do anything they can to avoid having to actually do anything, you’ll be asked idiotic questions that anyone with a fundamental grasp of the English language would be able to answer from the existing bug log, and even when you do the vendor’s job for them and fix the problem yourself, they’ll still persist in jerking you around. And somehow, somehow, that’s better than saving the money and just being able to fix problems yourself, when and how you need to?

Sorry, but screw that for a game of skittles. I’m not against paying people for assistance, but if I pay you for assistance, I’d really appreciate it if I actually got some.

I’m having trouble recalling a situation in which I’ve actually gotten a consistently good experience out of a software support organisation. This isn’t an isolated incident: it seems like every time a problem is reported in a piece of commercially-supported software, the relevant vendor deems it more cost-effective to avoid the issue rather than fix it. That this seems to actually work (since people still keep paying for “support” when they don’t get any) is a sad indictment of consumers of IT services, while the fact that nearly all commercial software vendors are willing to screw their customers over is a horrible, soul-destroying realisation.

While the plural of anecdote isn’t data, my experiences, and those of the rest of the Anchor staff, really only suggest one thing: software “support” contracts aren’t really worth an awful lot, in the absence of real, strong performance guarantees. (Why nobody will give you an effective performance guarantee is left as an exercise for the reader.)

That isn’t to say that paying for software is never recommended. If the software you want to run is a commercial product, then there’s only one option — pay for it. Copyright infringement isn’t cool. Personally, I’ve not been the least bit interested in a commercial software product — other than Wii games — in the last 10 years, but I’m weird. Other people have differing opinions on the subject.

However, when you’re making the decision to buy a commercial software product, bear in mind that all you’re paying for, in practice, is the right to use the software. Any support services you are promised are unlikely to be of any value whatsoever. In fact, if the software product isn’t Open Source, then its value is actually lower, because nobody except the vendor can fix problems you come across, and the chances are that the vendor will not fix the problem for you. Ouch.

This might sound like a weird statement coming from a company that makes some of its revenue from servicing software. While we’re a hosting company, it doesn’t take very long for some customers to get out of their depth and need some specialist assistance in getting something running on their server, and we’ve got the expertise on-staff to help with those sorts of things, for a suitable fee.

The difference between what Anchor does and what most software companies do is that we’re not selling software, just expertise. We also have no ability to lock you into using a particular piece of software or service, and hence if we don’t provide a good service, there is nothing stopping you from going to someone else next time. That tends to keep us on our toes.

But that doesn’t mean that our support level couldn’t decrease in the future, so it’s important that our customers don’t accept bad service — from us, but also from anyone else.

Everyone, both customer and service provider, needs to have high standards, and demand those high standards from their suppliers and customers. There’s way too much laissez faire in the IT industry.

Nagios plugins: a two minute hate

Published January 20th, 2009 by matt

If you asked one of your friends how their parents were doing (assuming that their parents were nominally alive, and that you had an appropriate degree of friendship that permitted such social intercourse), and they replied “I’m not sure, I haven’t seen them in a while”, is it reasonable for you to reply with a statement of your condolences, on the assumption that they’re dead?

No, of course not. That would be foolish. Whilst it is possible that the reason why your friend hasn’t seen them in a while is because they’re dead, and it’s possible that they’ve died without your friend being aware of it, there is no practical reason to believe that your friend’s parents are, in fact, deceased. The number and probability of the non-death-related reasons why your friend hasn’t seen their parents in a while far outweigh the death-related reasons.

Given this fairly straightforward logic, why do Nagios plugins insist that practically any inability to check whether a service is OK or not results in a critical alert? Network error? That’s critical. Plugin timeout? That’s critical. Criticising the false critical? Oh, you better believe that’s critical.

A critical alert should mean “OMG, this is down, you need to have a look at this”. It should not mean “hmm, the machine might be a bit loaded at the moment and isn’t responding quite quickly enough for my liking”. If you want to be alerted in that instance, then you can tell Nagios to alert you for “unknown” events. Making it impossible for an alerting system to distinguish between “your disk is full!” and “I couldn’t find out whether your disk is full” is ridiculously annoying.

As it stands, the ability to respond to actual problems in a timely manner is greatly diminished by these false alerts. Your choice is either to get woken up by hundreds of false alarms for every actual, needs-to-be-dealt-with problem, or to retry your service checks so many times (to reduce the chance of a false positive) that a real outage goes unreported for long enough that customers notice the problem before your monitoring system does. Either way, it’s annoying, pointless, and makes big dents in the utility of your monitoring system.

So, my self-appointed task for the train ride home: patch a few critical checks to produce an unknown when they don’t know if the service is down, rather than assuming the worst and freaking everyone out with premature notice of their parents’ demise.
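As a rough illustration of the behaviour I’m after (this is a sketch, not one of the actual plugins being patched), here is a tiny TCP check in Python. The exit codes 0 to 3 are the standard Nagios plugin ones; the function names, and the policy that only an outright refused connection counts as critical, are assumptions made for the example.

```python
import socket

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify_failure(exc):
    """Map a connection error to a Nagios status (hypothetical policy)."""
    if isinstance(exc, ConnectionRefusedError):
        # The host answered and refused us: the service really is down.
        return CRITICAL
    # Timeouts, DNS failures, unreachable networks and so on: we simply
    # could not find out, so say so instead of assuming the worst.
    return UNKNOWN


def check_tcp(host, port, timeout=5.0):
    """Return (exit_code, message) for a simple TCP connect check."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return OK, f"OK - {host}:{port} accepted a connection"
    except OSError as exc:
        status = classify_failure(exc)
        label = "CRITICAL" if status == CRITICAL else "UNKNOWN"
        return status, f"{label} - {host}:{port}: {exc}"
```

A real plugin would print the message and exit with the returned code; whether “unknown” then pages anyone is up to the Nagios notification options.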

Posted in WTF


Deep Bug Hunting

Published January 20th, 2009 by matt

As practically anyone who has spent more than a couple of hours of their life using a computer can attest, software is not the most reliable of human creations. Most software is very, very buggy, to the point that small errors and irritations happen so often that most people’s reaction is either to ignore them, or to apply an almost-instinctual workaround.

Most bugs that aren’t ignorable sit solidly at the “trivial” end of the scale, too: a config problem, usually, or at worst a fairly trivial logic bug, where the symptoms and logging information point quite clearly to the source of the problem. However, every once in a while, something really vicious comes along, and it’s those bugs that make people really, really glad that they’ve got a solid technical team behind them, rather than some fly-by-night box-shifting outfit.

This blog post describes a problem that has plagued Anchor for quite some time now, and how I fought to fix it properly, once and for all. It is, essentially, a big brag about how clever I am, with some (hopefully) interesting observations on how to go about diagnosing a particular class of tricky bug.

Step 1: Describe The Problem

The problem report initially came in as “the reseller control panel displays blank pages for certain resellers, instead of the login page they expect”. All things considered, that’s a pretty good problem description. It’s got all the elements you need:

  • how to reproduce the problem (look at the reseller control panel for one of the resellers that’s shown as having a problem);
  • what the customer is seeing (so you know that you’ve reproduced the problem accurately); and
  • what they’re expecting to see (so you know when you’ve fixed it).

This is a crucial first step in any troubleshooting endeavour — get the particulars of the “crime”. It can be a surprisingly difficult one, too, when dealing with some people. But when it comes to the crunch, you can’t start debugging until you’ve got this info, either by direct report or by inferring reality from whatever info the bug reporter did give you.

If you don’t have this info, you don’t know how to make the problem occur, you don’t really know what you’re trying to fix, and you won’t know when you’re done.

Step 2: Identify the at-fault actor

Or, in slightly less elevated terms, “work out what you’re going to have to attach the debugger to”. This isn’t so much a matter of solving the problem as finding out who is causing it, without (necessarily) quite getting into the “why” just yet.

Sometimes, this is trivial, to the point that it isn’t even an issue. But with a modern web service, there are many different elements at play, and you can’t start really poking at things to diagnose the problem until you know what it is you have to poke. Is it a browser/display level problem? (Dodgy CSS games can produce a blank page, for example.) Is the network transporting a correct and complete request/response pair? If a complete, but blank, response is getting sent over the network, who is generating it? It could be the webserver itself, or an underlying CGI / web application doing the dirty work.

In this case, after a bit of faffing around, it became clear that the cause of the problem was Apache segfaulting. How did I find this out? Because, after a few false starts, I looked in the Apache error log and found “child pid XXXXX exit signal Segmentation fault”. Proving, once again, that the error logs are the place you should go to first, not after you’ve spent a half hour or so running tcpdump and strace.
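If you’d rather not eyeball the log, a few lines of Python will pull the segfaulting child PIDs out for you. This is only a sketch: the regular expression is built from the message quoted above, and the function name is made up for the example.

```python
import re

# Matches the message quoted above and captures the child PID.
SEGFAULT_RE = re.compile(r"child pid (\d+) exit signal Segmentation fault")


def find_segfaults(lines):
    """Return the PIDs of Apache children logged as having segfaulted."""
    pids = []
    for line in lines:
        match = SEGFAULT_RE.search(line)
        if match:
            pids.append(int(match.group(1)))
    return pids
```

Feed it an open file object (or any iterable of log lines); an empty result means the problem is probably somewhere other than a crashing child.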

Knowing where the problem is coming from is very important, as it gives you somewhere to focus your debugging attentions on. However, in this case, the unbridled joy of discovery was tempered by the knowledge that segfaults are rarely simple to track down in software as large, complicated, and widely used as Apache. All the simple bugs will have been found and fixed already, meaning that this one was likely to be weird and tricky.

Step 3: Generate a minimal reproducible test case

What kept me out of the pits of despair when I found out that I was dealing with a segfault is that the problem was 100% reliable — it happened every time I performed a very simple action, immediately after Apache started. For deeply technical reasons, two seemingly identical invocations of the same program might not produce a segfault both times. Thankfully, this wasn’t the case here, so tracking down the problem might be a tedious exercise, but not a frustrating hunt for a bug that manifested itself seemingly randomly, making you think you fixed it when in fact it was just hiding.

In general, to make your life easier, you want a test case that demonstrates the problem consistently, without requiring a lot of manual effort to create the pre-conditions every time. If you’ve got to spend 20 minutes setting up the test environment every time you want to retry the test, you’ll spend weeks nailing the problem down. Also, if the test sometimes doesn’t show the problem, you can never be quite sure that you’ve fixed the problem — it might just have not shown up this time around.

In this particular case, since the segfault was reliably reproducible, but quite dependent on the versions of all the software involved as well as the configuration, it was easier to run a separate instance of the software on the same machine, with the same configuration, than it was to set up a wholly separate test system. This was a little risky, as a screw-up could cause some nasty problems, but after a little hand-wringing, it was deemed worth the risk.

So, what I ended up with was a file tree with all the configuration files from the “live” installation, with the port numbers changed, along with a little script to start the program that I wanted to debug with all the options I wanted, including one to use the alternate config file location. Since Apache has a lengthy startup procedure (which running in the debugger would have slowed down considerably), I opted to start the program, wait for it to finish its initialisation, and only then attach the debugger to it (gdb --pid=XXXX ftw).
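The port-changing part of that is easy to script. The sketch below is not the script I actually used; it assumes the only ports that matter are on Apache Listen directives, and the function name and the offset of 8000 are invented for the example. VirtualHost addresses and anything else mentioning a port would still need attention by hand.

```python
import re

# A Listen directive: optional address (IPv4, hostname or [IPv6]) then a port.
LISTEN_RE = re.compile(
    r"^([ \t]*Listen[ \t]+(?:\S+:)?)(\d+)(.*)$",
    re.MULTILINE | re.IGNORECASE,
)


def shift_listen_ports(config_text, offset=8000):
    """Return config_text with every Listen port moved up by `offset`."""
    def bump(match):
        new_port = int(match.group(2)) + offset
        return f"{match.group(1)}{new_port}{match.group(3)}"

    return LISTEN_RE.sub(bump, config_text)
```

Commented-out directives are left alone, since the pattern only matches lines that start with Listen.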

Sidebar: Use Open Source Software

Pretty much everything I did to fix this problem myself was only possible because Anchor bases all their infrastructure on Open Source software, like Linux, Apache, and so on. If this same problem had been present in a proprietary stack, then there would have been very little hope of identifying the problem — and no hope at all of fixing it — other than to beg the vendor for a fix, which I’m fairly certain wouldn’t have happened due to the age of the system involved and the obscurity of the bug. In fact, we tried some time ago to get the system vendor to take responsibility for the bug (since it was, after all, in the software they provided and purportedly provided support for), but they wanted nothing to do with the problem, and basically refused to look into it, so the only reason this problem is solved at all is because we had the source. The impact on our customers, particularly the periodically recurring nature of the problem, meant that this bug was not something we could ignore. How the hell can you base a customer-satisfying business on systems you can’t do anything to fix? I don’t get it.

Step 4: Produce debugging-enabled versions of programs and libraries

Compilers exist because computers and people understand very different languages. A compiler translates human-friendly languages into computer-friendly languages. Debugging symbols serve to act as a translation medium the other way, to allow people to comprehend what’s going on in the computer.

The debugging info for a given program is often twice the size of the program itself, so if everything came with the debugging symbols, the size of the program on disk (and in memory) would be three times what it is now, which has implications for memory usage and startup times. So instead, you get the incomprehensible version, which is fine for most people, but is useless for debugging. There is a nifty workaround for this problem available on modern distributions, in that you can ship the debugging information separately from the program itself and have the debugger load it, but that option didn’t exist in this case due to the age of the system.

In the absence of debuginfo packages, to get a debugging version of a piece of software, you need to rebuild. Typically, this should just involve making sure that the -g and -O0 options are passed to the compiler (-g for “generate debugging info” and -O0 for “no optimisation”)[1], that there aren’t any other -O flags being passed, and that strip isn’t being called anywhere[2].
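The flag juggling described above boils down to something like this. It’s a sketch with a made-up function name, and it assumes the flags arrive as one whitespace-separated string such as a CFLAGS value; real build systems have a habit of hiding their flags in more than one place.

```python
def debug_cflags(cflags):
    """Return compiler flags suitable for a debugging build."""
    kept = []
    for flag in cflags.split():
        if flag.startswith("-O"):
            continue  # drop every optimisation level; -O0 is added below
        if flag == "-g":
            continue  # avoid a duplicate; -g is added below
        if flag == "-s":
            continue  # -s asks the linker to strip symbols, which defeats the point
        kept.append(flag)
    return " ".join(kept + ["-g", "-O0"])
```

For example, debug_cflags("-O2 -Wall -pipe") gives "-Wall -pipe -g -O0". This does nothing about a strip call later in the packaging process; that still has to be hunted down separately.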

Unfortunately, “typically” and “actually”, like most other applications of theory and practice, aren’t all that close. I actually spent more than a day getting a debugging version of openssl, and more than a little bit of time on mod_ssl. (Apache itself was the work of a few minutes, thankfully). The time involved was exacerbated by my insistence on doing things “the right way”, by backporting a generalised method of producing auxiliary “debuginfo” packages (which contain the debugging information as separate files) to the utterly ancient (though still, theoretically, vendor-supported) version of the Linux distribution we’re running on the machine that was having the problem. Once I got that stupid idea out of my head, it was only a couple of hours (and some swearing at stupid, incorrect GCC documentation) to produce symbol-enabled versions of the appropriate libraries.

Step 5: Debug the program

I used to teach introductory computer science, and I never came up with a way to “teach” debugging processes to a class of students. The mechanics of using the debugger were easy to impart, and it was fairly simple to walk through finding a bug in a student’s program with them, and after a while they got the hang of it themselves, but I never worked out how to inform a class of students on the delicate art of finding problems in their programs using a debugger in the general sense.

As a result of my failure, you’re largely on your own if you’re tracking down a random bug of your own. However, here are a few “rules of thumb” that might help you if you’re tracking down a bug similar to mine:

  • First up, get a backtrace of exactly where the segfault occurs. Identify the memory access that is causing the problem, then walk up the call stack until you work out where the bad address is really coming from.
  • NULL pointers (and their cousins, the NULL pointer offset, caused by accessing a field in a struct whose base address is 0x0) are the easiest to work out, but most of the time you can see which of several addresses is the faulty one with a bit of printing in the debugger.
  • If you’re lucky, the reason for the source of the dodgy memory access will be obvious as soon as you look at the code[3]. In that instance, fixing the problem should be relatively straightforward, too.
  • More often, though, the value that’s being passed down the chain and causing the problem is just being read out of a variable somewhere, and the value was actually set erroneously somewhere else. In that instance, you’ve got the unenviable job of working out the ultimate source of the cruft. Here I can offer no general advice, except perhaps gdb watchpoints and a lot of grepping through the source code and setting of breakpoints at likely points in the code. If you get discouraged and the problem seems impossible, just remember that the computer isn’t doing this to piss you off, and everything it does is fully described by the code you’re looking at, even though the code probably isn’t obvious in what it’s doing.
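The NULL pointer offset case is worth a small illustration. When the base address of a struct is 0x0, the faulting address is simply the offset of the field that was being accessed, so a small fault address can be mapped straight back to a field. The sketch below uses Python’s ctypes to do the lookup; the Connection struct and its fields are entirely hypothetical, standing in for whatever struct the crashing code was using.

```python
import ctypes


class Connection(ctypes.Structure):
    # Hypothetical struct, invented for illustration only.
    _fields_ = [
        ("fd", ctypes.c_int),
        ("flags", ctypes.c_int),
        ("hostname", ctypes.c_char_p),
        ("ssl_ctx", ctypes.c_void_p),
    ]


def field_at_offset(struct_type, fault_address):
    """Name the field a NULL-based access at fault_address would touch."""
    for field in struct_type._fields_:
        name = field[0]
        descriptor = getattr(struct_type, name)
        start = descriptor.offset
        if start <= fault_address < start + descriptor.size:
            return name
    return None  # too far from zero to be a field of this struct
```

So if gdb reports a fault at an address equal to Connection.ssl_ctx.offset, the code was almost certainly dereferencing a NULL Connection pointer to get at ssl_ctx.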

These rules are ridiculously general. Every situation is wildly different. The only thing I’m sure of is that the more code you’ve written and stared at, the easier debugging is, simply because you’ve made (and seen) more boneheaded mistakes, and therefore can recognise the sorts of wrong things that programmers think of.

Step 6: Fix the problem

What irritates me the most about debugging a problem like this is that once you’ve found the problem, the fix is typically trivial — a slight adjustment to the type of a variable, or slightly modifying the way a function is called.

But once that’s done, you’ve got the entertaining job of feeding the fix back to the community, so that:

  • Others can benefit from your effort. The software you’re using is only as good as it is because lots of other people have spent time, like you, tracking down bugs, fixing them, and sending patches back.
  • The next release of the software doesn’t have the same bug again, so you don’t have to continually fix the problem again and again in every new release.

Adieu

Hopefully you’ve now got a slightly better idea of how you might go about finding a tricky segfault bug of your own. If it all looks horribly daunting, don’t worry too much: not everyone has to know how to fix these bugs. You’ve just got to hire professionals who do, rather than numbskulls who think that installing a cracked version of Plesk on the Windows box in the corner of their bedroom makes them a hosting company.


[1] Turning off optimisation is a bit of a risk, since some bugs only appear in certain optimisation levels (and may $DEITY have mercy on your soul if you hit one of those!), but optimised code doesn’t quite map to the debugging symbols and source code you’re working from, which makes your job sooooo much harder.

[2] The strip command is used to remove the debugging information from a binary, and is usually run at the end of the packaging process to make sure that no debugging-enabled programs “leak” out into the wider world, which is of course completely backwards for what we want to do in this particular case.

[3] Incorrect pointer arithmetic, for example, is a nice way to end up in segfault country. I found another segfault in Apache a while ago because it was calculating an offset in memory against a char variable, but chars aren’t guaranteed to be unsigned by default, and hence it was going backwards for character values > 127 (but only on platforms where chars are signed). Whoops…
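The signed char trap in [3] can be simulated in Python, which has no char type of its own, by reinterpreting a byte with the struct module. This is a model of the arithmetic only, not the Apache code in question; the function names and the base address of 1000 are invented.

```python
import struct


def as_signed_char(byte_value):
    """Reinterpret a byte (0-255) as a signed 8-bit char."""
    return struct.unpack("b", bytes([byte_value]))[0]


def as_unsigned_char(byte_value):
    """Reinterpret a byte (0-255) as an unsigned 8-bit char."""
    return struct.unpack("B", bytes([byte_value]))[0]


def table_index(base, byte_value, char_is_signed):
    """Compute base + char offset the way C would on each kind of platform."""
    if char_is_signed:
        return base + as_signed_char(byte_value)
    return base + as_unsigned_char(byte_value)
```

With a base of 1000 and a byte of 200, the unsigned platform lands on 1200 while the signed one lands on 944, behind the start of the table. Bytes of 127 and below behave identically on both, which is exactly why this kind of bug hides so well.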

Posted in FTW


Keeping your kernel output safe

Published January 20th, 2009 by matt

Keeping logs of the operation of your system(s) is really important; when something goes wrong in the middle of the night, a good log can give you all the information you need to diagnose and fix the problem before it happens again.

One area of your system that’s quite crucial to keep, but which is often forgotten, is your kernel’s dmesg output. This is all of the messages that come direct from the kernel, from “filesystem mounted” to “Aiee! Penguins on the SCSI bus!” or “lp0: on fire”. While the former isn’t so important to keep for posterity, when the kernel crashes, you really do want to capture that output, but you can’t use your system’s normal logging because when the kernel dies, it usually takes userland processes like syslogd with it.

Enter: the netconsole. This is a nice little kernel module that’s been around for years, but isn’t widely used, probably because (a) nobody knows about it, and (b) it’s not so simple to set up. While this blog post is a feeble attempt to solve (a), Karsten M. Self has done a good job of assisting with (b) in a recent post to the linux-elitists mailing list. I encourage everyone to take a look.
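The receiving end, at least, is not complicated, because netconsole just fires the kernel’s messages at another machine as UDP packets. Here is a minimal sketch of a listener in Python. It assumes netconsole’s default target port of 6666, the function names are made up, and it is no substitute for a proper syslog daemon listening on the log host.

```python
import socket


def open_receiver(bind_addr="0.0.0.0", port=6666):
    """Open a UDP socket for incoming netconsole packets."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((bind_addr, port))
    return sock


def receive_one(sock, log_file, bufsize=65535):
    """Wait for one packet, append it to log_file, and return its text."""
    data, (sender_ip, _sender_port) = sock.recvfrom(bufsize)
    text = data.decode("utf-8", errors="replace").rstrip("\n")
    # Prefix each line with the sender so several machines can share a log.
    log_file.write(f"{sender_ip} {text}\n")
    log_file.flush()  # the whole point is not losing the last few lines
    return text
```

Call receive_one in a loop with a file opened for appending, and the dying words of a crashing kernel end up somewhere that survives the crash.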

Posted in FTW
