Nagios plugins: a two minute hate

January 20, 2009 · Technical

If you asked one of your friends how their parents were doing (assuming that their parents were nominally alive, and that you had an appropriate degree of friendship that permitted such social intercourse), and they replied “I’m not sure, I haven’t seen them in a while”, is it reasonable for you to reply with a statement of your condolences, on the assumption that they’re dead?

No, of course not. That would be foolish. Whilst it is possible that the reason your friend hasn’t seen them in a while is that they’re dead, and it’s possible that they’ve died without your friend being aware of it, there is no practical reason to believe that your friend’s parents are, in fact, deceased. The non-death-related reasons why your friend hasn’t seen their parents in a while far outweigh the death-related ones, in both number and likelihood.

Given this fairly straightforward logic, why do Nagios plugins insist that practically any inability to check whether a service is OK or not results in a critical alert? Network error? That’s critical. Plugin timeout? That’s critical. Criticising the false critical? Oh, you better believe that’s critical.

A critical alert should mean “OMG, this is down, you need to have a look at this”. It should not mean “hmm, the machine might be a bit loaded at the moment and isn’t responding quite as quickly as I’d like”. If you want to be alerted in that instance, then you can tell Nagios to notify you for “unknown” events as well. Making it impossible for an alerting system to distinguish between “your disk is full!” and “I couldn’t find out whether your disk is full” is ridiculously annoying.
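
For reference, a plugin’s verdict is nothing more than its exit code: 0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN. Whether an UNKNOWN actually pages anyone is then a per-service decision via notification_options; a definition along these lines (the host, template and command arguments are made up) opts in to hearing about them:

    define service {
        use                   generic-service       ; hypothetical template
        host_name             web01                 ; hypothetical host
        service_description   Disk Space
        check_command         check_disk!20%!10%!/
        notification_options  w,u,c,r               ; include the u to be told about UNKNOWNs too
    }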

As it stands, these false alerts greatly diminish your ability to respond to actual problems in a timely manner. Your choice is either to get woken up by hundreds of false alarms for every actual, needs-to-be-dealt-with problem, or to retry your service checks so many times (to reduce the chance of a false positive) that customers notice a real outage before your monitoring system does. Either way, it’s annoying, pointless, and makes big dents in the utility of your monitoring system.
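
The retry knob in question is the max_check_attempts / retry_interval pair on the service (Nagios 3 directive names; the numbers below are deliberately silly). Stretch them far enough to paper over flaky checks and you have stretched your time-to-detection by the same amount:

    ; in the same service definition as above
    check_interval       5    ; minutes between routine checks
    retry_interval       5    ; minutes between re-checks after a non-OK result
    max_check_attempts   10   ; non-OK results needed before a HARD state and a notification
    ; worst case, a genuine outage goes unannounced for the better part of an hour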

So, my self-appointed task for the train ride home: patch a few checks to return “unknown” when they can’t tell whether the service is down, rather than assuming the worst and freaking everyone out with premature notice of their parents’ demise.
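
The change itself is small. As a sketch of the shape of the fix (not one of the real plugins; the behaviour and messages are illustrative), a TCP check ought to reserve CRITICAL for a definite answer and exit UNKNOWN when it simply couldn’t get one:

    /*
     * Sketch only: reserve CRITICAL for a definite "nothing is listening"
     * answer, and exit UNKNOWN when we simply couldn't find out.
     */
    #include <stdio.h>
    #include <string.h>
    #include <errno.h>
    #include <unistd.h>
    #include <netdb.h>
    #include <sys/socket.h>

    /* Standard Nagios plugin return codes */
    #define STATE_OK       0
    #define STATE_WARNING  1
    #define STATE_CRITICAL 2
    #define STATE_UNKNOWN  3

    int main(int argc, char **argv)
    {
        if (argc != 3) {
            printf("UNKNOWN - usage: %s <host> <port>\n", argv[0]);
            return STATE_UNKNOWN;
        }

        struct addrinfo hints, *res;
        memset(&hints, 0, sizeof(hints));
        hints.ai_socktype = SOCK_STREAM;

        /* Can't even resolve the name: we have no idea whether the service
         * is up or down, so say so rather than crying wolf. */
        if (getaddrinfo(argv[1], argv[2], &hints, &res) != 0) {
            printf("UNKNOWN - could not resolve %s\n", argv[1]);
            return STATE_UNKNOWN;
        }

        int sock = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
        if (sock < 0) {
            printf("UNKNOWN - socket(): %s\n", strerror(errno));
            freeaddrinfo(res);
            return STATE_UNKNOWN;
        }

        /* A real plugin would also wrap this in its own timeout (alarm())
         * and exit UNKNOWN, not CRITICAL, when it fires. */
        if (connect(sock, res->ai_addr, res->ai_addrlen) != 0) {
            /* Connection refused is a real answer: the host responded and
             * nothing is listening on that port.  That's CRITICAL.  An
             * unreachable network or a timeout is not an answer at all. */
            int state = (errno == ECONNREFUSED) ? STATE_CRITICAL : STATE_UNKNOWN;
            printf("%s - connect() to %s:%s failed: %s\n",
                   state == STATE_CRITICAL ? "CRITICAL" : "UNKNOWN",
                   argv[1], argv[2], strerror(errno));
            close(sock);
            freeaddrinfo(res);
            return state;
        }

        printf("OK - %s:%s is accepting connections\n", argv[1], argv[2]);
        close(sock);
        freeaddrinfo(res);
        return STATE_OK;
    }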

One Comment

  • rodjek says:

    Rather than patching each service check, I have a NEB module that hooks into NEBCALLBACK_SERVICE_CHECK_DATA and NEBCALLBACK_HOST_CHECK_DATA and rewrites the output of the checks (based on rules contained in a config file). More time to implement, but you don’t run into issues when upgrading and/or adding new checks.
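
For readers who haven’t poked at the event broker: the skeleton of such a module looks roughly like the sketch below, written against the Nagios 3.x NEB API. The rewrite rule is a single hard-coded illustration standing in for rodjek’s config-driven rules, and making sure Nagios actually honours the modified result is glossed over here.

    /*
     * Skeleton NEB module.  The header list and build flags are approximate;
     * follow the helloworld.c example shipped in the Nagios source for the
     * exact set, then load the module via broker_module= in nagios.cfg.
     */
    #include "nebmodules.h"
    #include "nebcallbacks.h"
    #include "nebstructs.h"
    #include "broker.h"

    NEB_API_VERSION(CURRENT_NEB_API_VERSION);

    static void *module_handle = NULL;

    /* Invoked for every service-check event Nagios brokers out. */
    static int handle_service_check(int callback_type, void *data)
    {
        nebstruct_service_check_data *c = (nebstruct_service_check_data *)data;

        /* Only look at completed checks, not "about to run" events. */
        if (c->type != NEBTYPE_SERVICECHECK_PROCESSED)
            return 0;

        /* Illustrative rule: a check that timed out didn't find anything
         * out, so treat its CRITICAL as UNKNOWN.  A real module must also
         * ensure the modified result is fed back to Nagios, which is
         * version-dependent and omitted here. */
        if (c->early_timeout && c->return_code == 2 /* CRITICAL */)
            c->return_code = 3;                      /* UNKNOWN  */

        return 0;
    }

    int nebmodule_init(int flags, char *args, nebmodule *handle)
    {
        module_handle = handle;
        neb_register_callback(NEBCALLBACK_SERVICE_CHECK_DATA, module_handle,
                              0, handle_service_check);
        /* A host-check hook (NEBCALLBACK_HOST_CHECK_DATA) would be
         * registered the same way. */
        return 0;
    }

    int nebmodule_deinit(int flags, int reason)
    {
        neb_deregister_callback(NEBCALLBACK_SERVICE_CHECK_DATA,
                                handle_service_check);
        return 0;
    }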