Nagios notification options

As we wrote in our article about advanced use of Nagios for system monitoring, we use a number of scripts and tools to help us receive and respond to notifications in a timely and efficient manner. These methods are already in use across our dedicated server and virtual private server products. Here, we'll share some of them.

SMS (Short Messaging Service)

Betraying its age, Nagios has a "pager" field in its contact definitions. Chances are you aren't using a real pager, but it's ideal for a mobile phone number. Seeing as pretty much everything is defined as an arbitrary command string, you can make use of this in whatever way is most convenient. We use an SMS gateway provider, specifically Internode's NodeText gateway.

SMS gateways

SMS web-gateway providers will typically offer a form-based service that allows for easy scripting. Internode uses a standard HTTP form that can accept parameters via GET or POST. As an example, you can use curl to do this:

curl -d 'user=username@txt.internode.on.net' -d 'pass=something_secret' -d 'dest=61412456789' -d 'msg=this is the message' -d 'timezone=Australia/Sydney' https://txt.on.net/cgi-bin/sms_tcp.cgi
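
One thing to watch with plain -d is that the message is sent as-is, so an alert containing characters such as & or = could be misread as extra form fields. A minimal sketch of a wrapper that uses curl's --data-urlencode option instead; the function name send_sms_curl and the SMS_DRY_RUN variable are made up for this example, and the form fields are simply the ones shown above.

```shell
# Sketch only: send_sms_curl and SMS_DRY_RUN are illustrative names.
# The credentials are placeholders, as in the example above.
send_sms_curl() (
    dest=$1
    msg=$2
    # With SMS_DRY_RUN=1, print the command instead of contacting the gateway.
    if [ "${SMS_DRY_RUN:-0}" = "1" ]; then
        runner=echo
    else
        runner=
    fi
    $runner curl --silent --show-error \
        --data-urlencode 'user=username@txt.internode.on.net' \
        --data-urlencode 'pass=something_secret' \
        --data-urlencode "dest=$dest" \
        --data-urlencode "msg=$msg" \
        --data-urlencode 'timezone=Australia/Sydney' \
        https://txt.on.net/cgi-bin/sms_tcp.cgi
)
```

Running it with SMS_DRY_RUN=1 is a cheap way to check the arguments before spending an SMS.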

GSM modem

If you send a lot of notifications, a cheaper alternative may be to invest in a standalone GSM modem. The modem is effectively a computer-controlled mobile phone, which can be used to send SMS messages. The hardware will set you back a few hundred dollars (one suitable model would be the Intercel SAM2W), and a SIM is practically free nowadays. A number of telcos (e.g. Virgin and Vodafone) offer cheap plans with lots of included SMS per month. This makes for a cost-effective solution, and has the very nice advantage of still working in the face of catastrophic network failure.
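
As a rough illustration of what driving such a modem involves, GSM modems generally accept the standard AT commands for sending an SMS in text mode. The sketch below only prints the command sequence; the function name gsm_sms_commands is hypothetical, and a real script would need to wait for the modem's responses between steps.

```shell
# Sketch: emit the text-mode SMS command sequence for a GSM modem.
# gsm_sms_commands is a made-up helper; it writes to stdout only.
gsm_sms_commands() (
    number=$1
    message=$2
    # AT+CMGF=1 selects text mode (as opposed to PDU mode).
    printf 'AT+CMGF=1\r'
    # AT+CMGS starts a message to the given number...
    printf 'AT+CMGS="%s"\r' "$number"
    # ...and the body is terminated with Ctrl-Z (0x1A, octal 032).
    printf '%s\032' "$message"
)
```

In practice the output would go to the modem's serial device (something like /dev/ttyS0, which is an assumption here), and existing tools such as Gammu or SMS Server Tools handle the prompt-and-response handling for you.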

Integrating with Nagios

  • The default notification in Nagios is notify-by-email, and lets you drop in numerous details about the notification. Nagios performs its own keyword replacement before executing the result in a shell.

    define command {
        command_name    notify-by-email
        command_line    /usr/bin/printf "%b" "***** Nagios *****\n\nNotification Type: $NOTIFICATIONTYPE$\nHost: $HOSTNAME$\nState: $HOSTSTATE$\nAddress: $HOSTADDRESS$\nInfo: $HOSTOUTPUT$\n\nDate/Time: $LONGDATETIME$\n" | /usr/bin/mail -s "** $NOTIFICATIONTYPE$ Host Alert: $HOSTNAME$ is $HOSTSTATE$ **" $CONTACTEMAIL$
    }
  • We'll add another command to handle sending an SMS:

    define command{
        command_name notify-by-sms
        command_line /usr/local/sbin/send_sms --number $CONTACTPAGER$ --message "$NOTIFICATIONTYPE$ - $SERVICEDESC$ on $HOSTNAME$ is $SERVICESTATE$ : $SERVICEACKCOMMENT$ : $SERVICEOUTPUT$"
    }
  • Of course we need the actual send_sms script to do the job for us. We could call curl directly from the notify-by-sms command, but it's a bit messy, and isn't as flexible if we want to customise the message a bit. We use a simple perl script to add a timestamp and perform some other cleanup.

    • send_sms

      #!/usr/bin/perl
      
      use strict;
      use warnings;
      
      # You might need to install extra packages to use these modules
      # CPAN is one option, though OS packages might be more convenient
      # Under Debian, you'll need at least libwww-perl and libcrypt-ssleay-perl
      use LWP::UserAgent;
      use Crypt::SSLeay;
      use Getopt::Long;
      
      sub get_date {
        (my $sec, my $min, my $hour, my $mday, my $mon, my $year, my $wday, my $yday, my $isdst) = localtime(time());
        return sprintf("%02d-%02d %02d:%02d", $mon + 1, $mday, $hour, $min);
      }
      
      sub usage {
        print "--message (-m) 'Your message'\n";
        print "--number (-n) 614XXXXXXXX\n";
        print "--test (-t) 0|1\n\n";
        print "The 'test' flag is optional, Internode will send an SMS by default.\n";
      }
      
      sub send_sms($$$) {
        my $number = shift;
        my $message = shift;
        my $test = shift;
      
        my $ua = LWP::UserAgent->new(env_proxy => 1,
                                     keep_alive => 1,
                                     timeout => 20,
                                    );
      
        my $url = 'https://txt.on.net/cgi-bin/sms_tcp.cgi';
      
      
        my $resp = $ua->post($url,
                             [ user => 'username@txt.internode.on.net',
                               pass => 'your_password',
                               dest => $number,
                               msg => "$message",
                               timezone => "Australia\/Sydney",
                               test => $test,
                             ],
                            );
      
      
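        # Response code and diagnostic output is passed as a single string output from the page
        my($retval, $response) = $resp->content =~ m/Status: (\d\d) (.*)/;
      
        # If the reply can't be parsed (e.g. an HTTP error), report failure
        if (!defined($retval)) {
          print "Unexpected response: " . $resp->status_line . "\n";
          return 1;
        }
      
        print "$response\n";
      
        return $retval;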
      }
      
      my $number;
      my $message;
      my $test = 0;
      if (!GetOptions('number=s' => \$number,
                      'message=s' => \$message,
                      'test:i' => \$test,
                      'help' => sub { usage(); exit 0; }
                     )) {
        exit 1;
      }
      
      if (!defined($message)) {
        print "You need to specify a message to send\n";
        exit 1;
      }
      
      if (!defined($number)) {
        print "You need to specify a number to send to\n";
        exit 1;
      }
      
      # Add date in to the message so we know when it was sent (it may get delayed along the way)
      $message = get_date() . " " . $message;
      
      # Truncate the message to 160 characters, we don't want to span two SMSes
      $message = substr($message, 0, 160);
      
      # Translate the leading zero in the mobile number to 61 (international-style, but without leading plus-sign)
      $number =~ s/^0/61/;
      
      # Send the message
      my $r = send_sms($number, $message, $test);
      
      exit($r);
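  • Now we can use this in a contact definition. Assuming you already have notification-capable services defined, this will "just work". Note that the contact below also names a host-notify-by-sms command for host alerts; define it in the same way as notify-by-sms, using the host macros (such as $HOSTSTATE$ and $HOSTOUTPUT$) in place of the service ones.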

    define contact {
        contact_name tomoya
        alias Tomoya Okazaki
        service_notification_period 24x7
        service_notification_options w,u,c,r
        service_notification_commands notify-by-sms
        host_notification_period 24x7
        host_notification_options d,u,r
        host_notification_commands host-notify-by-sms
        pager 0411999999
        email tomoya.okazaki@clannad.or.jp
    }
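
The number and length handling in send_sms is easy to check in isolation before wiring it into Nagios. Below is a small shell equivalent of those two steps, for illustration only; the function names are hypothetical and the Perl script remains the thing Nagios actually calls.

```shell
# Illustrative shell equivalents of two steps from send_sms above.
normalise_number() (
    # Replace a single leading 0 with the country code 61.
    printf '%s\n' "$1" | sed 's/^0/61/'
)

truncate_sms() (
    # Keep at most 160 characters so the message fits in one SMS.
    printf '%s\n' "$1" | cut -c1-160
)

normalise_number 0411999999   # prints 61411999999
```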

Instant Messaging

SMS is great for getting attention, but it's overkill when you're at work and can be contacted in some other way. We already have a local Jabber server in the office with a chat session, so the natural alternative was to set up a robot that can send us messages. As well as being cheaper than sending an SMS, it integrates well with our workflow. It's easy to ignore if you're working on something else, but sends messages where you expect to see them when you want them.

Components

As an additional feature, we decided that Nagios should attempt to send an instant message if you're at your desk, otherwise you'll get an SMS. To do this, the Jabber integration needs a couple of components.

Presence-sensing robot

The robot presents itself as a "Buddy" so it can see your away-status. We usually set our Jabber client to report us as being Away if we're idle for more than five minutes. The robot is a perl script that polls for information from the server about who's signed in and what their status is.

The jabber robot runs independently and simply updates a set of files according to who's Online according to the server. How you start this is up to you. We use our configuration management system to ensure the process is running, but an initscript may be more appropriate for other circumstances.

  • jabber_robot

    #! /usr/bin/perl -w
    
    use Net::XMPP;
    use Data::Dumper;
    use strict;
    
    use constant SERVER    => 'jabber.local';
    use constant PORT      => 5222;
    use constant USERNAME  => 'nagiosbot';
    use constant PASSWORD  => 'some_password';
    use constant RESOURCE  => 'bot';
    use constant NOTIFY_DIR => '/var/lib/nagios-buddies-available';
    
    # SIGKILL cannot be caught, so no handler is installed for it.
    $SIG{HUP} = \&Stop;
    $SIG{TERM} = \&Stop;
    $SIG{INT} = \&Stop;
    
    my $Con;
    
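    # Check paths.
    # NOTIFY_DIR() is written with parentheses because file tests such as -d
    # would otherwise treat the bareword as a filehandle name.
    if (! -d NOTIFY_DIR()) {
            mkdir(NOTIFY_DIR(), 0700) or
                die "ERROR: Unable to mkdir  '" . NOTIFY_DIR . "'\n";
    }
    chdir(NOTIFY_DIR()) or die "ERROR: Unable to chdir to '" . NOTIFY_DIR . "'\n";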
    
    EventHandler();
    
    sub ConnectClient
    {
            $Con = new Net::XMPP::Client();
    
            $Con->RosterDB();
            $Con->PresenceDB();
            #$Con->SetPresenceCallBacks(available => \&InPresence,
            #                          unavailable => \&InPresence);
    
            my $status = $Con->Connect(hostname => SERVER, port => PORT);
            if (!(defined($status)))
            {
                print STDERR "ERROR: Jabber server is down or connection was not allowed.\n";
                print STDERR "        ($!)\n";
                exit(1);
            }
    
            my @result = $Con->AuthSend(username => USERNAME,
                                        password => PASSWORD,
                                        resource => RESOURCE);
            if ($result[0] ne "ok")
            {
                print STDERR "ERROR: Authorisation failed: $result[0] - $result[1]\n";
                exit(2);
            }
    
            $Con->PresenceSend();
    
            return $Con;
    }
    
    sub EventHandler
    {
            while(1) {
                    ConnectClient();
                    $Con->RosterRequest();
    
                    while(defined($Con->Process(60))) {
    
                            foreach my $jid ($Con->RosterDBJIDs()) {
                                    UpdateUserStatus($jid);
                            }
                    }
    
                    print STDERR "ERROR: The connection was killed...\n";
            }
    }
    
    sub Stop
    {
        #print "Exiting...\n";
        $Con->Disconnect();
        exit(0);
    }
    
    sub UpdateUserStatus
    {
            my ($jid) = @_;
    
            # Get username.
            my $username = $jid->GetUserID();
            # Strip characters that are unsafe in a filename.
            $username =~ s:[/,]::g;
    
            if (IsOnline($jid)) {
                    open(TMP, '>>', $username) or warn "ERROR: Unable to open file '$username'\n";
                    close(TMP);
                    utime(undef, undef, $username) or warn "ERROR: Unable to update timestamp of '$username'\n";
            } elsif (-f $username) {
                    unlink($username) or warn "ERROR: Unable to delete presence status for '$username'\n";
            }
    }
    
    sub IsOnline
    {
            my ($jid) = @_;
            my $Pres = $Con->PresenceDBQuery($jid);
    
            if (!(defined($Pres))) {
                    return 0;
            }
    
            my $show = $Pres->GetShow();
            if ($show) {
                    # Idle, or away, or dnd, etc.
                    return 0;
            }
    
            return 1;
    }
    
    sub Availability
    {
            my ($jid) = @_;
            my $Pres = $Con->PresenceDBQuery($jid);
    
            if (!(defined($Pres))) {
                    return 'Unavailable';
            }
    
            my $show = $Pres->GetShow();
            if ($show) {
                    # Idle, or away, or dnd, etc.
                    return "Away ($show)";
            }
    
            return 'Available';
    }
    
    sub InPresence
    {
            my ($sid, $Pres) = @_;
    
            $Con->PresenceDBParse($Pres);
    
            my $from = $Pres->GetFrom();
            my $type = $Pres->GetType();
            my $status = $Pres->GetStatus();
            my $show = $Pres->GetShow();
            print "===\n";
            print "Presence\n";
            print "  From $from\n";
            print "  Type: $type\n";
            print "  Status: $status\n";
            print "  Show: $show\n";
            print "===\n";
            print $Pres->GetXML(),"\n";
            print "===\n";
    }
    
    sub InIQ
    {
            my $sid = shift;
            my $iq = shift;
    
            my $from = $iq->GetFrom();
            my $type = $iq->GetType();
            my $query = $iq->GetQuery();
            my $xmlns = $query->GetXMLNS();
            print "===\n";
            print "IQ\n";
            print "  From $from\n";
            print "  Type: $type\n";
            print "  XMLNS: $xmlns\n";
            print "===\n";
            print $iq->GetXML(),"\n";
            print "===\n";
    }
    
    sub InMessage
    {
            my $sid = shift;
            my $message = shift;
    
            my $type = $message->GetType();
            my $fromJID = $message->GetFrom("jid");
    
            my $from = $fromJID->GetUserID();
            my $resource = $fromJID->GetResource();
            my $subject = $message->GetSubject();
            my $body = $message->GetBody();
            print "===\n";
            print "Message ($type)\n";
            print "  From: $from ($resource)\n";
            print "  Subject: $subject\n";
            print "  Body: $body\n";
            print "===\n";
            print $message->GetXML(),"\n";
            print "===\n";
    }
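
Because the robot's only output is a directory of timestamped files, its state can be inspected from the shell. A minimal sketch that lists the users currently considered online, assuming the same NOTIFY_DIR as the robot; the function name list_online_buddies and the two-minute default are choices made for this example.

```shell
# Sketch: list users whose presence file was touched recently.
# NOTIFY_DIR matches the robot's setting; list_online_buddies is a made-up name.
NOTIFY_DIR=${NOTIFY_DIR:-/var/lib/nagios-buddies-available}

list_online_buddies() (
    max_age_minutes=${1:-2}
    # -mmin -N matches files modified less than N minutes ago (GNU/BSD find).
    find "$NOTIFY_DIR" -maxdepth 1 -type f -mmin "-$max_age_minutes" \
        -exec basename {} \; | sort
)
```

Running list_online_buddies by hand is a quick way to confirm the robot is alive: if the list stays empty while people are clearly at their desks, the robot has probably lost its connection.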

Message-sending script

The next step is a script that does the actual work: sending a message to a user in Jabber.

  • notify_via_jabber

    #!/usr/bin/perl -w
    #
    # Author David Cox
    # Created from various code examples found on the web
    # Last Modified 08/06/2002
    # Feel free to use or modify as needed to suit your needs
    #
    # Modified by Anchor
    
    use strict;
    use Net::Jabber qw(Client) ;
    use Net::Jabber qw(Message) ;
    use Net::Jabber qw(Protocol) ;
    use Net::Jabber qw(Presence) ;
    
    my $len = scalar @ARGV;
    
    if ($len != 2) {
       die "Usage...\n notify [jabberid] [message]\n";
    }
    
    use constant SERVER    => 'jabber.local';
    use constant PORT      => 5222;
    use constant USER      => 'nagiosbot';
    use constant PASSWORD  => 'some_password';
    use constant RESOURCE  => 'bot';
    use constant MESSAGE   => $ARGV[1];
    #######################################################
    # MAXWAIT is used because the send message function didn't seem to
    # like being called too fast. The message would not be sent unless I waited
    # a second or so. You can experiment with it but I just went with 2 seconds.
    #######################################################
    use constant MAXWAIT   => 2;
    
    my $connection = Net::Jabber::Client->new();
    $connection->Connect( "hostname" => SERVER,"port" => PORT )  or die "Cannot connect ($!)\n";
    
    my @result = $connection->AuthSend( "username" => USER,"password" => PASSWORD,"resource" => RESOURCE );
    if ($result[0] ne "ok") {
     die "Ident/Auth with server failed: $result[0] - $result[1]\n";
    }
    $connection->PresenceSend();
    
    die "Uh-oh - something has gone wrong with the connection\n"
      unless(defined($connection->Process(2)));
    
    my $message = Net::Jabber::Message->new();
    $message->SetMessage( "to"      => $ARGV[0],
                          "subject" => "Notification",
                          "type"    => "chat",
                          "body"    => MESSAGE);
    
    $connection->Send($message);
    die "Uh-oh - something has gone wrong with the send\n"
      unless(defined($connection->Process(2)));
    sleep(MAXWAIT);
    
    $connection->Disconnect();
    exit;

Pulling it together

Now to connect it all to Nagios. The script first needs to determine whether you're online, then send the message by Jabber if you are, or by SMS if you're not. Some of the paths are hard-coded and dependent on the previous scripts.

  • notify_via_jabber_or_sms

    #!/bin/bash
    #
    # Send notifications via Jabber or SMS if person is not available.
    
    NOTIFY_DIR=/var/lib/nagios-buddies-available
    JABBER_SCRIPT=/usr/local/sbin/notify_via_jabber
    SMS_SCRIPT=/usr/local/sbin/send_sms
    
    help() {
            echo "Usage: $0 jabber_id mobile_number message" 1>&2
            exit 1
    }
    
    notify_via_jabber() {
            local jabber_id=$1
            local message=$2
    
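            local jabber_name=$(echo "$jabber_id" | cut -d@ -f1)
            local status_file="$NOTIFY_DIR/$jabber_name"
    
            # Reference file dated two minutes ago; a status file newer than
            # this means the robot has seen the buddy online recently.
            TMPFILE=`mktemp -t timestamp.XXXXXX` || return 1
            touch -d 'now - 2 minutes' "$TMPFILE" || return 1
    
            if [ -e "$status_file" -a "$status_file" -nt "$TMPFILE" ]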
            then
                    # Buddy is online.
                    $JABBER_SCRIPT "$jabber_id" "$message"
                    retval=$?
                    echo "Return value: $retval : $JABBER_SCRIPT '$jabber_id' '$message'"
    
                    return $retval
            else
                    return 1
            fi
    }
    
    notify_via_sms() {
            local mobile_number=$1
            local message=$2
    
            $SMS_SCRIPT --number "$mobile_number" --message "$message"
            retval=$?
            echo "Return value: $retval : $SMS_SCRIPT --number '$mobile_number' --message '$message'"
    
            return $retval
    }
    
    cleanup() {
            trap - $SIGNALS
    
            if [ -n "$TMPFILE" ]
            then
                    rm -f "$TMPFILE"
            fi
    }
    
    # Check arguments.
    if [ $# -lt 3 ]
    then
            help
    fi
    
    jabber_id=$1
    mobile_number=$2
    message=$3
    
    SIGNALS="EXIT"
    trap cleanup $SIGNALS
    
    (
    
    notify_via_jabber "$jabber_id" "$message" || notify_via_sms "$mobile_number" "$message"
    
    ) 2>&1 | logger -s -t 'notify_via_jabber_or_sms' --
  • nagios command definition

    define command{
        command_name notify-by-jabber-or-sms
        command_line /usr/local/sbin/notify_via_jabber_or_sms $CONTACTNAME$@jabber.local $CONTACTPAGER$ "$NOTIFICATIONTYPE$ - $SERVICEDESC$ on $HOSTNAME$ is $SERVICESTATE$ : $SERVICEACKCOMMENT$ : $SERVICEOUTPUT$"
    }
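
The freshness test in notify_via_jabber_or_sms relies on a temporary reference file. One possible alternative, sketched here under the assumption that GNU stat is available, compares the status file's modification time with the clock directly; is_fresh is a hypothetical helper, not part of the script above.

```shell
# Sketch: succeed if a status file exists and is younger than max_age seconds.
# is_fresh is a hypothetical helper; stat -c %Y is the GNU coreutils form.
is_fresh() (
    file=$1
    max_age=${2:-120}
    [ -e "$file" ] || exit 1
    now=$(date +%s)
    mtime=$(stat -c %Y "$file") || exit 1
    [ $((now - mtime)) -lt "$max_age" ]
)
```

Used as `is_fresh "$status_file" && ...`, this avoids creating and cleaning up a temporary file, at the cost of depending on a GNU-specific stat option.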

Escalation paths

You could probably fill a small book with all the details about configuring escalations in Nagios and what you can do with them. While there's not enough room for that here, it's worth describing how Anchor uses escalations and why we think they're a big benefit. Some of the terminology here is Nagios-specific and won't mean anything if you're not already familiar with much of how Nagios works.

Before we had escalations set up, all staff would receive notifications when things broke. While this meant everyone knew what was happening, there was no clear responsibility. As a result, it was entirely possible for problems to go unattended while everyone assumed someone else would take care of it. This was fixed by having a roster of responsibility for non-office periods, but all staff still received notifications. Having your staff twitch every time their phone buzzes is not a healthy state of affairs.

Our solution to this is to maintain a dynamic contactgroup, and the easiest way (for us) to do this is to take advantage of Nagios' ability to import other config fragments (a monolithic config file is messy to manage and less flexible). We simply import the contactgroup definition appropriate for whoever's on-call.

  • nagios.cfg

    cfg_file=/etc/nagios2/contactgroup-oncall.cfg

The contactgroup for most notifications is called anchor, and changes at 6pm each day depending on who's on-call for the night. The sysadmin on-call uses their regular contact details, while everyone else gets Jabber messages via a separate contact definition which only uses the notify_via_jabber command. A typical contactgroup definition looks like this:

  • contactgroup-oncall.cfg

    define contactgroup {
        contactgroup_name anchor
        alias Tomoya on-call
        members tomoya,nagisa-jabber,kyou-jabber,ryou-jabber,kotomi-jabber,fuuko-jabber,tomoyo-jabber
    }

Exactly how you implement a system like this will depend on your own requirements. This happens to work well for us and we update the configuration files a couple of times a day with crontab entries. A different solution would be more appropriate if you change your contactgroups more or less frequently.
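
As one way of picturing the cron-driven update, the sketch below prints a contactgroup definition for a given on-call person. The function name oncall_contactgroup is made up for this example, the staff list is passed in as arguments rather than discovered, and the "-jabber" suffix simply follows the naming in the definition above.

```shell
# Sketch: print a contactgroup definition for the given on-call person.
# Usage: oncall_contactgroup ONCALL STAFF...
oncall_contactgroup() (
    oncall=$1
    shift
    members=$oncall
    for person in "$@"; do
        # Everyone except the on-call sysadmin gets the Jabber-only contact.
        [ "$person" = "$oncall" ] && continue
        members="$members,$person-jabber"
    done
    printf 'define contactgroup {\n'
    printf '    contactgroup_name anchor\n'
    printf '    alias %s on-call\n' "$oncall"
    printf '    members %s\n' "$members"
    printf '}\n'
)
```

A cron job could redirect this output to contactgroup-oncall.cfg, check the result with the Nagios binary's -v option, and then reload Nagios; the exact reload mechanism depends on your installation.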

