Channelling your rage
Published February 3rd, 2012 by Barney Desmond
Getting notifications when servers break is always annoying. At Anchor we use Nagios, a very popular solution. “Friggen nagios!” is a pretty common cry.
If you get a lot of notifications in quick succession, your Rage meter starts to build up. When it hits 100% you unleash a special attack and reboot the server.

Rachel's gauge is at 100%, circled in blue crayon. She can now reboot the server with her Static Iris
That’s pretty cool, but it turns out that customers don’t like reboots as much as we do, so we looked at ways to reduce the rage. One great way to do this is with better documentation; we call it Ragewiki.
Making use of the notes_url parameter, we provide a link to our wiki documentation directly from Nagios’ web interface. There’s one page for each service, with precise instructions on how to diagnose and fix common problems, as well as a brief description of what the service actually does.
So now when you get that SMS at 3am (PROBLEM – ntype on fundle is CRITICAL), you don’t spend 20 minutes flailing through A Brief History of Time, as told by H.P. Serverbox.
To sweeten the deal a bit, we also allow for host-specific instances of a service, which might need extra-special instructions. We also have a page full of terse legacy documentation that we’d like to fall back on in case the new docs haven’t been written yet. We think it’s a cute little hack, so we’d like to share it with you.
The possibilities are up to your own imagination; we just went for the most straightforward option. You could always link to a big red button that reboots the server straight away.
- Give every service a URL in the Ragewiki, using the notes_url argument. We attach this to the generic service template so that every single service automatically gets a link.
    # RageWiki ftw
    notes_url /ragewiki/$HOSTNAME$/$SERVICEDESC$
You’ll notice that we’ve parameterised the URL so that each host-service pair is unique.
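To make the parameterisation concrete, here is a minimal Python sketch of the substitution. It is only an illustration: `expand_notes_url` is a hypothetical helper, it does plain text replacement, and it does not try to mirror any escaping Nagios itself may apply. The host and service names come from the SMS example earlier in the post.

```python
# Template as used in the generic service definition.
NOTES_URL = "/ragewiki/$HOSTNAME$/$SERVICEDESC$"


def expand_notes_url(template: str, host: str, service: str) -> str:
    """Fill in the two macros with plain text replacement (illustrative only)."""
    return template.replace("$HOSTNAME$", host).replace("$SERVICEDESC$", service)


# The "ntype on fundle" notification would link here.
fundle_url = expand_notes_url(NOTES_URL, "fundle", "ntype")
print(fundle_url)
```

Every host-service pair ends up with its own path under /ragewiki/, which is what lets the web server decide later which wiki page to send you to.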
- Prepare a rewrite map to check for existence of docs
This URL will refer to the Apache instance on the nagios server itself. It captures the request starting with /ragewiki/, extracts the hostname and servicename, then builds a suitable redirect.
Because we want to support per-host pages that may exist, we use a RewriteCond and a smart RewriteMap to check whether the page exists, then redirect accordingly. We use moin as our documentation wiki, with HTTP access control in front of that.
    RewriteLock /var/lock/rewrite.lock
    RewriteMap RageWiki "prg:/usr/bin/xargs -n1 -d '\\\\n' /usr/bin/HEAD -sd -H 'Authorization: Basic EncodedUsernameAndPassword'"
You may want to read up on Apache’s RewriteMap functionality to make sense of this. The short version: it contacts the wiki and returns the HTTP status line for the suggested page. A 200- or 300-series status code is considered a success – the page exists and should be used.
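If the xargs one-liner is hard to picture, the following Python sketch shows the shape of the conversation a prg: map has with Apache: one lookup key per line comes in, exactly one line goes back out, and output is flushed after each answer. It is not what we run. `fake_lookup` and the wiki.example URLs are hypothetical stand-ins for the real HTTP request, so the sketch runs without a network.

```python
import io
import re
from typing import Callable, Iterable, TextIO

# Same test the RewriteCond lines apply: a 2xx or 3xx status means "page exists".
SUCCESS = re.compile(r"^[23]\d\d")


def page_exists(status_line: str) -> bool:
    return SUCCESS.match(status_line) is not None


def serve(keys: Iterable[str], lookup: Callable[[str], str], out: TextIO) -> None:
    """Answer each lookup key (one URL per line) with exactly one line."""
    for key in keys:
        out.write(lookup(key.strip()) + "\n")
        out.flush()  # the web server is waiting on this reply


def fake_lookup(url: str) -> str:
    """Hypothetical stand-in for the HEAD request against the wiki."""
    known = {"https://wiki.example/Nagios/Services/ntp": "200 OK"}
    return known.get(url, "404 Not Found")


out = io.StringIO()
serve(
    [
        "https://wiki.example/Nagios/Services/ntp\n",
        "https://wiki.example/Nagios/Services/smtp\n",
    ],
    fake_lookup,
    out,
)
replies = out.getvalue().splitlines()
print(replies)
```

In a real map program the keys would be read from standard input and `lookup` would make the HTTP request; keeping the lookup injectable just makes the sketch easy to try out.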
- Finally, use the RewriteMap and generate a suitable redirect
This is a basic set of cascading rewrites; the first success will terminate further processing.
    # Server-specific docs: /servers/$HOSTNAME/$SERVICENAME
    RewriteCond ${RageWiki:https://magic.ponies.anchor.net.au/servers/$1/$2} ^[23]\d\d
    RewriteRule ^/ragewiki/([^/]+)/(.+)$ https://magic.ponies.anchor.net.au/servers/$1/$2 [R,L]

    # Whole lotta BGP goin' on (with variable check names, a variant of generic docs)
    RewriteCond ${RageWiki:https://magic.ponies.anchor.net.au/Nagios/Services/bgp} ^[23]\d\d
    RewriteRule ^/ragewiki/[^/]+/bgp[_-].+$ https://magic.ponies.anchor.net.au/Nagios/Services/bgp [R,L]

    # Generic docs for normal services: /Nagios/Services/SERVICENAME
    RewriteCond ${RageWiki:https://magic.ponies.anchor.net.au/Nagios/Services/$1} ^[23]\d\d
    RewriteRule ^/ragewiki/[^/]+/(.+)$ https://magic.ponies.anchor.net.au/Nagios/Services/$1 [R,L]

    # Catch any checks without docs, and send them to the fallback page.
    # Funky regexes to pass the failed service name through to the fallback page.
    # FIXME: Can we use a positive-lookbehind in these things? Would make it slightly tidier.
    RewriteRule ^/ragewiki/([^/]+)$ https://magic.ponies.anchor.net.au/CommonNagiosServiceCheckReference#$1 [NE,R,L]
    RewriteRule ^/ragewiki/.*/([^/]+)$ https://magic.ponies.anchor.net.au/CommonNagiosServiceCheckReference#$1 [NE,R,L]
Special cases with varied names, like our BGP checks, are easily handled by dropping a custom regex into the chain. It’s best if your service names have a consistent format that can be readily pared back to a basic name, but this method is fine for the occasional odd case.
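For anyone who finds the order of the cascade easier to follow as ordinary code, here is a rough Python model of the same decision. It is a sketch, not something we deploy: `resolve` is a hypothetical function, `exists` stands in for the RewriteMap lookup, and the host names web1, web2 and rtr1 are made up for the example.

```python
import re
from typing import Callable, Optional

WIKI = "https://magic.ponies.anchor.net.au"
FALLBACK = WIKI + "/CommonNagiosServiceCheckReference"


def resolve(path: str, exists: Callable[[str], bool]) -> Optional[str]:
    """Return the redirect target for a /ragewiki/ request, or None if no rule applies."""
    m = re.match(r"^/ragewiki/([^/]+)/(.+)$", path)
    if m:
        host, service = m.groups()
        # Most specific first: per-host page, then the BGP special case, then generic docs.
        candidates = [f"{WIKI}/servers/{host}/{service}"]
        if re.match(r"^bgp[_-].+$", service):
            candidates.append(f"{WIKI}/Nagios/Services/bgp")
        candidates.append(f"{WIKI}/Nagios/Services/{service}")
        for url in candidates:
            if exists(url):
                return url
    # Nothing documented: pass the last path component to the fallback page as an anchor.
    m = re.match(r"^/ragewiki/(?:.*/)?([^/]+)$", path)
    if m:
        return f"{FALLBACK}#{m.group(1)}"
    return None


# Pretend only these wiki pages exist (hypothetical data).
pages = {
    f"{WIKI}/servers/web1/http",
    f"{WIKI}/Nagios/Services/bgp",
    f"{WIKI}/Nagios/Services/ntp",
}
for request in ["/ragewiki/web1/http", "/ragewiki/rtr1/bgp_peer_a",
                "/ragewiki/web2/ntp", "/ragewiki/web2/smtp"]:
    print(request, "->", resolve(request, pages.__contains__))
```

Adding another special case is a matter of slotting one more candidate into the list at the right priority, which is the same move as dropping a custom regex into the rewrite chain.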
Too easy! To give you an idea of what we think good Ragewiki docs look like:
- What servers does this apply to?
- Summarise what the nagios check is for (one sentence!)
- What’s the impact of a failure? Customer visible? Websites are down? Etc.
- A short procedure on how to confirm the notification and diagnose it further
- A procedure on how to fix it
That’s it; the page should only be a couple of screens long at the most. If you can’t include all the necessary information, it’s best to put it on a separate page and link to it. We specifically don’t include information about How It Works because it detracts from fixing problems faster.
Ragewiki works great for us, so we’d be interested in hearing your thoughts and comments. It’d also be cool to know if other people have reached the same goal, but in a different way.