Channelling your rage

Published February 3rd, 2012 by Barney Desmond

Getting notifications when servers break is always annoying. At Anchor we use Nagios, a very popular monitoring solution. “Friggen nagios!” is a pretty common cry.

If you get a lot of notifications in quick succession, your Rage meter starts to build up. When it hits 100% you unleash a special attack and reboot the server.

Rachel's gauge is at 100%, circled in blue crayon. She can now reboot the server with her Static Iris

That’s pretty cool, but it turns out that customers don’t like reboots as much as we do, so we looked at ways to reduce the rage. One great way to do this is with better documentation; we call it Ragewiki.


Making use of the notes_url parameter, we provide a link to our wiki documentation directly from Nagios’ web interface. There’s one page for each service, with precise instructions on how to diagnose and fix common problems, as well as a brief description of what the service actually does.

So now when you get that SMS at 3am (PROBLEM – ntype on fundle is CRITICAL), you don’t spend 20 minutes flailing through A Brief History of Time, as told by H.P. Serverbox.


To sweeten the deal a bit, we also allow for host-specific instances of a service, which might need extra-special instructions. We also have a page full of terse legacy documentation that we’d like to fall back on in case the new docs haven’t been written yet. We think it’s a cute little hack, so we’d like to share it with you.

The possibilities are limited only by your imagination; we just went for the most straightforward option. You could always link to a big red button that reboots the server straight away. :)

  1. Give every service a URL in the Ragewiki, using the notes_url argument. We attach this to the generic service template so that every single service automatically gets a link.
    # RageWiki ftw
    notes_url /ragewiki/$HOSTNAME$/$SERVICEDESC$

    You’ll notice that we’ve parameterised the URL so that each host-service pair gets a unique link.

  2. Prepare a rewrite map to check for existence of docs
    The notes_url points at the Apache instance on the Nagios server itself. Apache captures any request starting with /ragewiki/, extracts the hostname and service name, then builds a suitable redirect.

    Because we want to support per-host pages that may exist, we use a RewriteCond and a smart RewriteMap to check whether the page exists, then redirect accordingly. We use moin as our documentation wiki, with HTTP access control in front of that.

    RewriteLock /var/lock/rewrite.lock
    RewriteMap RageWiki "prg:/usr/bin/xargs -n1 -d '\\\\n' /usr/bin/HEAD -sd -H 'Authorization: Basic EncodedUsernameAndPassword'"

    You may want to read up on Apache’s RewriteMap functionality to make sense of this. The short version: it contacts the wiki and returns the HTTP status line for the suggested page. A 200- or 300-series status code is considered a success – the page exists and should be used.

  3. Finally, use the RewriteMap and generate a suitable redirect
    This is a basic set of cascading rewrites; the first success terminates further processing.

    # Server-specific docs: /servers/$HOSTNAME/$SERVICENAME
    RewriteCond ${RageWiki:https://magic.ponies.anchor.net.au/servers/$1/$2} ^[23]\d\d
    RewriteRule ^/ragewiki/([^/]+)/(.+)$ https://magic.ponies.anchor.net.au/servers/$1/$2 [R,L]
    
    # Whole lotta BGP goin' on (with variable check names, a variant of generic docs)
    RewriteCond ${RageWiki:https://magic.ponies.anchor.net.au/Nagios/Services/bgp} ^[23]\d\d
    RewriteRule ^/ragewiki/[^/]+/bgp[_-].+$ https://magic.ponies.anchor.net.au/Nagios/Services/bgp [R,L]
    
    # Generic docs for normal services: /Nagios/Services/SERVICENAME
    RewriteCond ${RageWiki:https://magic.ponies.anchor.net.au/Nagios/Services/$1} ^[23]\d\d
    RewriteRule ^/ragewiki/[^/]+/(.+)$ https://magic.ponies.anchor.net.au/Nagios/Services/$1 [R,L]
    
    # Catch any checks without docs, and send them to the fallback page.
    # Funky regexes to pass the failed service name through to the fallback page.
    # FIXME: Can we use a positive-lookbehind in these things? Would make it slightly tidier.
    RewriteRule ^/ragewiki/([^/]+)$    https://magic.ponies.anchor.net.au/CommonNagiosServiceCheckReference#$1 [NE,R,L]
    RewriteRule ^/ragewiki/.*/([^/]+)$ https://magic.ponies.anchor.net.au/CommonNagiosServiceCheckReference#$1 [NE,R,L]
    

    Special cases with varied names, like our BGP checks, are easily handled by dropping a custom regex into the chain. It’s best if your service names have a consistent format that can be readily pared back to a basic name, but this method is fine for the occasional odd case.
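
If you want to sanity-check the order of the cascade without poking a live Apache, the same logic can be sketched in a few lines of Python. This is only an illustration of the rules above, not something we run in production: the names `resolve`, `status_ok` and `page_exists` are made up for this example, and `page_exists` stands in for the RewriteMap lookup against the wiki.

```python
import re
from typing import Callable, Optional

# Wiki base URL, taken from the rewrite rules above.
WIKI = "https://magic.ponies.anchor.net.au"


def status_ok(status_line: str) -> bool:
    """Same test as the RewriteCond pattern: a 2xx or 3xx status counts as success."""
    return re.match(r"^[23]\d\d", status_line) is not None


def resolve(path: str, page_exists: Callable[[str], bool]) -> Optional[str]:
    """Return the redirect target for a /ragewiki/ request path.

    page_exists is a hypothetical stand-in for the RewriteMap lookup: it
    takes a wiki URL and says whether that page is there.
    """
    # 1. Server-specific docs: /servers/HOSTNAME/SERVICENAME
    m = re.match(r"^/ragewiki/([^/]+)/(.+)$", path)
    if m:
        url = f"{WIKI}/servers/{m.group(1)}/{m.group(2)}"
        if page_exists(url):
            return url

    # 2. BGP checks with variable names all share one page
    if re.match(r"^/ragewiki/[^/]+/bgp[_-].+$", path):
        url = f"{WIKI}/Nagios/Services/bgp"
        if page_exists(url):
            return url

    # 3. Generic docs for normal services: /Nagios/Services/SERVICENAME
    m = re.match(r"^/ragewiki/[^/]+/(.+)$", path)
    if m:
        url = f"{WIKI}/Nagios/Services/{m.group(1)}"
        if page_exists(url):
            return url

    # 4. Fallback page, with the last path component passed through as the anchor
    m = (re.match(r"^/ragewiki/([^/]+)$", path)
         or re.match(r"^/ragewiki/.*/([^/]+)$", path))
    if m:
        return f"{WIKI}/CommonNagiosServiceCheckReference#{m.group(1)}"

    # Not a /ragewiki/ request at all
    return None
```

Feed `resolve` a set of wiki URLs you know exist (for example via `some_set.__contains__`) and you can check which page a given host-service pair lands on before touching the real rewrite rules.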


Too easy! To give you an idea of what we think good Ragewiki docs look like:

  • What servers does this apply to?
  • Summarise what the nagios check is for (one sentence!)
  • What’s the impact of a failure? Customer visible? Websites are down? Etc.
  • A short procedure on how to confirm the notification and diagnose it further
  • A procedure on how to fix it

That’s it; the page should only be a couple of screens long at the most. If you can’t include all the necessary information, it’s best to put it on a separate page and link to it. We specifically don’t include information about How It Works because it detracts from fixing problems faster.

Ragewiki works great for us, so we’d be interested in hearing your thoughts and comments. It’d also be cool to know if other people have reached the same goal, but in a different way.


The Zen of Documentation Maintenance

Published August 6th, 2009 by matt

Given that you’ve been suddenly and completely convinced of the need for documentation in my previous post, the question still remains: how does one make documentation appear on a consistent and ongoing basis?

If you’re really, really lucky, you’ve been spared the painful experience of putting up a wiki somewhere (or, worse, forking out a pile of cash for a “knowledge management system”), sticking some info into it at random, and then… nothing. You planted the seeds of a documentation tree, so why isn’t it growing, and flowering, and solving all of your problems for now and forever?

For Project Starbug, we’re creating a whole new infrastructure, more-or-less from scratch. This is the easiest possible environment to make work, because you’re not constrained by what is already in place (and that you can’t afford to get rid of), and the whole thing isn’t in production so there’s no need to get freaked out by the thought of taking a major site off the Internet due to making an ill-advised change — and, most relevantly to this discussion, there’s no giant mass of undocumented… stuff that needs to be picked apart and documented. There’s nothing more deadly to motivation than the idea that when you’ve got this bit documented, there’s only 350,000 other bits to go.

So, if we didn’t want to end up with a shiny, new, incomprehensible and undocumented system, we needed to start focusing on documentation right off the bat and build the documentation alongside the rest of the system. This, in turn, meant that we needed to have something easy to work with, well structured, and above all ready to go before anything else could really kick off.

What to use was a no-brainer. Wikis are straightforward to access and edit, and there’s very little downside to them. We use moin internally for our documentation extensively, so it wasn’t a hard sell to spin up another copy of the wiki software to contain all of the documentation for this project. Most widely-used wiki engines these days are on much the same level, though, and it’s really just a matter of preference which one to use — mostly based around the language you’re most comfortable using (Python == Moin, PHP == MediaWiki, Perl == twiki, Ruby == instiki, Java == something useless and enterprisey), because you really want to be able to write plugins and extensions. One day I’d love to try ikiwiki, because that means I can edit wiki pages without even needing to open my web browser, which will be a particularly special kind of bliss.

Why did we use a separate wiki, though, and not an extension of our existing one? We want to communicate with the customer as well as we possibly can, and the content of the wiki is like a big, persistent communications nexus. Giving the customer (especially this customer, who really knows their stuff) direct access to read all the internal procedures and technical information relating to the management of their infrastructure is a massive boon to communication. Who knows when they might see something we’ve written and say, “Hey, that’s not right!” and fix it? We’re the system administration experts, not the experts in their application, so it makes perfect sense to have them as tightly integrated as possible into the management of the whole infrastructure.

Though we may have made it over “Documentation Hurdle #1”, the race had barely even begun. Plenty of well-intentioned doc projects have gotten something started, and then withered on the vine. The key is to make sure that the documentation stays maintained and keeps up with the growth of the infrastructure and its constant changes. The most important way to do this is to identify the reasons why people don’t keep a reasonably usable documentation repository maintained, and remove those reasons, leaving no possible excuse not to write docs. It needs to be easier to write docs than to not write them, otherwise they’ll get forgotten in the pressure of the moment, and playing catch-up is painful and annoying.

In the next article, we’ll examine why people don’t write docs as often as they know they should, and how to create a “documentation culture” in your team.
