Monitoring systems using Nagios
Nagios is an open-source network and system monitoring tool. At Anchor, we use Nagios to keep an eye on all of our web hosting and dedicated server infrastructure 24x7, and to let us know if anything goes wrong. We'll go through how Nagios works, what it does, and how it helps you maximise the availability of critical services.
Why is monitoring needed?
It's an unfortunate fact that computers aren't maintenance-free. A machine can get overloaded, have a partition fill up or run out of swap space, and a service can fall over and stop. These things can all happen to a server in normal use, and they all require human intervention. While we build and configure our systems to be as robust as possible, monitoring is an important asset that lets us know when there's a problem, which is why we can guarantee 99.8% uptime. Even if you don't have uptime requirements, using Nagios is a great way to get a better idea of what your systems are doing.
How does Nagios work?
- Nagios is installed on a central monitoring server.
- We tell Nagios which hosts, and which services on those hosts, we want to monitor.
- Nagios polls the hosts and services periodically, checking for "alive-ness". For a server, this means it must respond to ICMP ping requests. For a service, such as an HTTP server, Nagios checks that it can make a successful connection. The frequency of these checks is configurable.
- If a host or service fails to respond, or returns an unexpected reply, Nagios considers it to be in a Critical state and alerts the configured contacts by email or SMS.
- Whoever responds to the alert Acknowledges it via a web interface. This lets other system administrators know that someone is working on it.
- Once the problem is fixed, Nagios detects this on a subsequent check and returns the host or service to an Okay state.
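
To make the polling step more concrete, here is a minimal sketch of a service check in Python. It is an illustration rather than one of the stock Nagios plugins: the function name `check_tcp` and the wording of the status lines are assumptions. What it does follow is the standard Nagios plugin convention, in which a check reports its result through an exit code (0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN) plus one line of output.

```python
import socket

# Exit codes defined by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_tcp(host, port, timeout=5.0):
    """Return (exit_code, status_line) for a simple TCP connection test.

    Hypothetical helper: the name and message format are illustrative.
    """
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return OK, f"TCP OK - connected to {host}:{port}"
    except OSError as exc:
        # Covers refused connections, timeouts and name resolution failures.
        return CRITICAL, f"TCP CRITICAL - {host}:{port}: {exc}"
```

A wrapper script would print the status line and exit with the returned code, for example by calling `check_tcp("www.example.com", 80)` and passing the code to `sys.exit()`. Nagios then decides the state of the service from that exit code.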
How Anchor uses and customises Nagios
One of Nagios' great strengths is its customisability. Checks can be performed by any script or executable you have available. For services that aren't directly accessible to the monitoring station, the Nagios Remote Plugin Executor (NRPE) application lets you execute checks by proxy. The results are passed back to the monitoring station as though it had done them itself.
One great thing about this is that it can be used for non-internet services as well; you can monitor just about anything with NRPE and a little bit of scripting. We exploit this to keep a close eye on parts of the system that wouldn't normally be available for inspection, such as:
- Available diskspace
- Whether the firewall ruleset is up to date
- System load
- Size of the mailqueue
- RAID health
- Swapfile usage
- MySQL/PostgreSQL/MSSQL databases
Some of these can be handled in other ways, but NRPE offers certain benefits. For example, it may not be desirable to have database servers listening on non-local addresses; NRPE gets around this by running the check on the database server itself. Hardware RAID controllers tend to have poor status monitoring, only sending out an email in the event of a disk failure. It's far too easy for that email to get lost or missed, and it doesn't give you the option of positive checks. By running the vendor-supplied tools as a check, you can test the state as often as desired and know for sure when there's a problem.
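
As an example of the kind of local check that NRPE could run on a monitored host, here is a rough Python sketch of a free disk space check. The thresholds (warn below 20% free, critical below 10% free) and the function names are assumptions chosen for illustration, not Anchor's actual settings. The exit codes follow the Nagios plugin convention.

```python
import shutil

# Exit codes defined by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def classify_free_space(free_percent, warn_below=20.0, crit_below=10.0):
    """Map a free-space percentage to an exit code (thresholds are assumed)."""
    if free_percent < crit_below:
        return CRITICAL
    if free_percent < warn_below:
        return WARNING
    return OK


def check_disk(path="/", warn_below=20.0, crit_below=10.0):
    """Return (exit_code, status_line) for the filesystem containing path."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    if usage.total == 0:
        # Some pseudo-filesystems report no capacity at all.
        return UNKNOWN, f"DISK UNKNOWN - no capacity reported for {path}"
    free_percent = 100.0 * usage.free / usage.total
    code = classify_free_space(free_percent, warn_below, crit_below)
    return code, f"DISK {LABELS[code]} - {free_percent:.1f}% free on {path}"
```

Because the check runs on the monitored machine itself, it can see things that are invisible from the network, and only the exit code and status line travel back to the monitoring station.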
The following is a short summary of what we've done to extend Nagios' notification facilities. For specific implementation details, you can check out our page on Nagios notification options.
Email alerts and reducing false positives
Alerts are configurable much like the service and host checks. By default, Nagios will send an email to any contacts listed as being responsible for a given service or host. To reduce false alarms, this will only happen once a certain number of checks have come back Critical. Most of our checks are run every 30 seconds, and these must fail several times before Nagios will start alerting us.
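
The idea of only alerting after several failed checks can be sketched in a few lines of Python. This is a simplified model, not Nagios' own code: Nagios expresses the threshold through its `max_check_attempts` setting, while the class name `FailureCounter` and the default of three attempts are assumptions. The sketch also alerts only once per outage, whereas Nagios can be configured to re-notify at intervals.

```python
class FailureCounter:
    """Track consecutive failures and signal when an alert is due.

    Hypothetical helper illustrating the "fail N times before alerting" rule.
    """

    def __init__(self, max_attempts=3):
        self.max_attempts = max_attempts
        self.consecutive_failures = 0
        self.alerted = False

    def record(self, check_ok):
        """Record one check result; return True if an alert should go out now."""
        if check_ok:
            # Any successful check resets the count and clears the outage.
            self.consecutive_failures = 0
            self.alerted = False
            return False
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.max_attempts and not self.alerted:
            self.alerted = True
            return True
        return False
```

With checks every 30 seconds and a threshold of three, a single dropped packet never pages anyone, while a real outage still raises an alert within a minute or so of the first failure.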
The problem with email is that it doesn't get our attention quickly enough. Delivery is largely instantaneous, but an email sits unread until someone looks at their inbox, and when one of our core routers has a problem, we need to know right away. If we were watching our inboxes continuously we'd never get any work done. For quick alerts, nothing beats your phone signalling the receipt of an SMS. Similarly, for waking you up in the middle of the night and disrupting your sleep, nothing beats your phone signalling the receipt of an SMS...
Instead of email, Nagios can run a script to handle the alert. We use a web-based SMS gateway to send messages to staff members, ensuring a timely response to non-trivial problems. The gateway charges a small fee for each SMS sent, typically at a price competitive with regular telcos.
Anchor's staff members are usually logged into our office's Jabber server. As a further enhancement, we've modified our SMS-sending scripts to first check whether you're logged in to Jabber. If so, they'll send you an instant message instead of an SMS. This has a number of advantages:
- Much cheaper, given that about half a dozen staff receive SMS alerts
- Easier to ignore or switch off if you're concentrating on an important task
- Keeps the office quieter during the working day
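
The "Jabber first, SMS otherwise" decision can be sketched as follows. This is only an outline of the routing logic: the three callables `is_online`, `send_im` and `send_sms` are hypothetical stand-ins for a Jabber presence lookup, a Jabber message sender and an SMS gateway client, none of which are shown here.

```python
def dispatch_alert(contact, message, is_online, send_im, send_sms):
    """Deliver an alert by instant message if possible, otherwise by SMS.

    is_online, send_im and send_sms are caller-supplied functions
    (hypothetical placeholders for real Jabber and SMS gateway code).
    Returns the name of the channel that was used.
    """
    if is_online(contact):
        send_im(contact, message)
        return "jabber"
    send_sms(contact, message)
    return "sms"


# Example wiring with in-memory stubs instead of real services.
outbox = []
online_now = {"alice"}
for person in ("alice", "bob"):
    dispatch_alert(
        person,
        "web1 HTTP is CRITICAL",
        is_online=lambda c: c in online_now,
        send_im=lambda c, m: outbox.append(("jabber", c, m)),
        send_sms=lambda c, m: outbox.append(("sms", c, m)),
    )
```

Passing the senders in as arguments keeps the routing decision separate from the delivery code, which also makes it easy to test without sending real messages.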
The next improvement we made after Jabber alerts was to configure an escalation path. Before this, everyone would receive SMS alerts, meaning everyone got woken up when something broke late at night. We have to respond to alerts 24x7, so we decided it would only be sane to have an on-call roster to cover the hours outside of 8am to 6pm when we're not in the office. This means much less stress for sysadmins, happier families at home, and less uncertainty when dealing with alerts.
Nagios allows you to have a schedule of contacts for services and hosts, and to specify fallbacks if an alert goes unacknowledged for too long. In our case, the on-call staff member has 15 minutes to respond. If the alert is still unacknowledged after that, all regular sysadmins will receive an SMS alert (in addition to a re-alert for the on-call staff member). If this also goes unanswered, an SMS is sent to the technical director and all previous recipients. Alerts for high priority failures, such as core infrastructure, will always be sent to all staff members.
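
The escalation path described above can be modelled as a small function that answers "who should be alerted right now?". This is a sketch of the policy, not Nagios configuration. The 15-minute first step comes from the text; the 30-minute second step is an assumption, since the delay before the technical director is alerted isn't stated. All names are made up.

```python
def recipients(minutes_unacknowledged, on_call, sysadmins, director, all_staff,
               high_priority=False, first_escalation=15, second_escalation=30):
    """Return the list of people to alert for an unacknowledged problem.

    second_escalation=30 is an assumed value; only the 15-minute step
    is given in the text.
    """
    if high_priority:
        # Core infrastructure failures always go to everyone.
        return list(all_staff)
    result = [on_call]
    if minutes_unacknowledged >= first_escalation:
        # Add the regular sysadmins; the on-call person is re-alerted too.
        result += [s for s in sysadmins if s not in result]
    if minutes_unacknowledged >= second_escalation and director not in result:
        result.append(director)
    return result
```

For example, with `on_call="ann"` and `sysadmins=["ann", "bob", "cat"]`, an alert that has been open for 20 minutes goes to ann, bob and cat, and the director is only added once the second threshold passes.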
Another great use for Nagios' scripting extensibility is to monitor web applications. A standard check of the HTTP port might well return Okay, but this is meaningless if your application is having trouble connecting to the database, or can't access the files it needs. A slightly modified check can be used to fetch the page content and make sure that a specified text string is present, such as the title of your site, or some other expected content.
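
A content check of this kind might look like the following Python sketch. It uses only the standard library, and the function names and status messages are assumptions for illustration. The exit codes follow the Nagios plugin convention (0 for OK, 2 for CRITICAL).

```python
import urllib.request

# Exit codes defined by the Nagios plugin convention.
OK, CRITICAL = 0, 2


def evaluate_page(body, expected):
    """Return (exit_code, status_line) depending on whether expected is in body."""
    if expected in body:
        return OK, f"HTTP OK - found {expected!r}"
    return CRITICAL, f"HTTP CRITICAL - {expected!r} not found in page"


def check_page(url, expected, timeout=10.0):
    """Fetch url and check that the expected text appears in the response."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
    except OSError as exc:
        # Covers connection errors, timeouts and HTTP error statuses.
        return CRITICAL, f"HTTP CRITICAL - {url}: {exc}"
    return evaluate_page(body, expected)
```

Calling `check_page("http://www.example.com/", "Example Domain")` would return OK only if the page loads and contains that text, so a site that answers on port 80 but serves a database error page is still reported as Critical.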
If you're an application developer, we'd strongly recommend having some sort of test page for your site. It need not be anything complex; it just has to test the basic functionality and make sure everything is okay. This has the added benefit of making troubleshooting easier for you when something goes wrong, and a little checklist would be ideal. Nintendo used to have a status page at http://www.nintendo.com/status that would read "A-OK", but it seems they've since taken it down. As an alternative example:
- Tempfile directory writeable
- Connected to database
- Everything OK
If Nagios can't find the string "Everything OK", it should assume there's a problem (so the page should only print "Everything OK" if none of the tests failed!). If your application has a status page like this, we can set up monitoring for you.
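
On the application side, a status page along these lines could be generated by something like the sketch below. It is written in plain Python with no web framework, and the function names and the "yes"/"NO" wording are assumptions. The important property is that the string "Everything OK" is emitted only when every individual test passes.

```python
import os
import tempfile


def tempdir_writeable():
    """Example test: can the process write to the system temp directory?"""
    return os.access(tempfile.gettempdir(), os.W_OK)


def build_status_page(checks):
    """Run each (label, test) pair and return the status page as text.

    A test passes if it returns a true value; a test that raises an
    exception counts as a failure.
    """
    lines = []
    all_passed = True
    for label, test in checks:
        try:
            passed = bool(test())
        except Exception:
            passed = False
        all_passed = all_passed and passed
        lines.append(f"{label}: {'yes' if passed else 'NO'}")
    # Only print the magic string when nothing failed.
    lines.append("Everything OK" if all_passed else "PROBLEM DETECTED")
    return "\n".join(lines)
```

A real application would add its own tests, such as a function that opens a database connection, and serve the returned text at a fixed URL for the monitoring check to fetch.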