Monitoring VMware ESX System Health and Hardware using Nagios

Running a reliable hosting infrastructure requires tools and processes that catch hardware problems as soon as they appear. Sometimes it is even possible to catch them before they become a major problem.

This guide details how to configure the NRPE client on a VMware ESX host, along with a custom check script that monitors the RAID array health of direct-attached storage. The same approach can be used to enable additional NRPE checks on the VMware ESX server.

Installing the Nagios NRPE software

To monitor any system effectively with Nagios you want the NRPE daemon installed on that host.

To install NRPE on the COS (the Console Operating System, a custom version of Red Hat Enterprise Linux ES 3), you will need suitable Nagios RPMs. Once you have sourced the RPMs, install them on the COS by logging in to the VMware ESX server via SSH as the root user. A separate guide covers how to enable SSH root logins securely on the VMware ESX server.

  1. Download the latest nagios-nrpe and nagios-plugins packages for Red Hat Enterprise Linux 3 i386 from the DAG RPM Repository.

  2. Upload the packages to your VMware ESX server using SCP.
  3. Install the packages on your VMware ESX server using the rpm command, substituting the names of the files you downloaded.

    rpm -Uvh package1-name.rpm package2-name.rpm
  4. Enable the NRPE daemon to start after a COS restart.

    chkconfig nrpe on

Configuring a specific hardware check script

Now you need to configure the NRPE daemon with the custom check script. At Anchor we use LSI MegaRAID SAS controllers and Adaptec SCSI controllers. This guide details RAID health checks for Adaptec SCSI or SAS controllers.

To check the RAID volume health you will need a suitable hardware RAID controller check script from the Nagios Exchange, or you will have to write your own.
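
If you do write your own, a Nagios plugin only has to print a single line of status text and exit with 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). The following is a minimal sketch of that convention in Python. It is not Anchor's script: the function name, the component names and the rule that any state other than "Optimal" counts as critical are assumptions made for illustration, and it is written for a current Python 3, which may be newer than the interpreter on the COS.

```python
"""Illustrative Nagios plugin logic (hypothetical, not check-aacraid.py)."""

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate(statuses):
    """Map {component: state} to (exit_code, status_line).

    Assumption: a component is healthy only when its state is "Optimal".
    """
    if not statuses:
        # Nothing could be read from the controller.
        return UNKNOWN, "UNKNOWN: no status information found"
    line = ",".join("%s %s" % (name, state) for name, state in statuses.items())
    bad = [name for name, state in statuses.items() if state != "Optimal"]
    if bad:
        return CRITICAL, "CRITICAL: " + line
    return OK, "OK: " + line


def main():
    """Print the status line and return the exit code for sys.exit()."""
    # Hard-coded sample input; a real check would fill this in from the
    # RAID controller's command line utility.
    sample = {"Logical Device 0": "Optimal", "Controller": "Optimal"}
    code, text = evaluate(sample)
    print(text)
    return code
```

A script built on this would end with sys.exit(main()) so that Nagios sees the exit code as well as the status line.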

The following example uses Anchor's custom Adaptec RAID health check script, which calls the Adaptec utility arcconf. You will need to obtain arcconf from the Adaptec support website before the script can run.

  1. Follow the Adaptec controller manual for installing arcconf on the COS. This is provided in the software resource kit that came with the controller or can be downloaded from the specific controller page on the Adaptec support website.

  2. Ensure you can run arcconf from the console on the COS and interact with the RAID controller.

    arcconf GETCONFIG 1

    This displays the current configuration of controller 1.

  3. Download Anchor's custom Adaptec controller check script, check-aacraid.py, and upload it to

    /usr/local/sbin/
    on the VMware ESX server COS.
  4. Now check you can run the custom check script.

    /usr/local/sbin/check-aacraid.py

    This should return output similar to the line below, assuming your RAID array is in an optimal state.

    Logical Device 0 Optimal,Controller Optimal,Battery Status Optimal,Battery Capacity 100%,Battery Time 155hours
  5. Now you can configure NRPE to use this check script. Edit the configuration file /etc/nagios/nrpe.cfg.

  6. Append this line to the configuration.

    command[check_aacraid]=/usr/local/sbin/check-aacraid.py
  7. Also edit the sudoers configuration so that the nagios user, which the NRPE daemon runs as, can run the arcconf command used by the check script as root without a password. Run

    visudo

    and add this line

    nagios ALL=(root) NOPASSWD: /usr/StorMan/arcconf GETCONFIG 1 *

    Save and quit the sudoers file by typing :wq and pressing Enter.

  8. Now you can start the NRPE daemon.

    service nrpe start

At this point the VMware ESX server has NRPE running and the check-aacraid script available for use via NRPE. Please refer to the Nagios documentation and Anchor's other public wiki articles on Nagios, which cover securing the NRPE daemon in more detail.
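
Before moving on to the Nagios server, it can help to confirm that the daemon is reachable over the network. The sketch below, meant to be run from the Nagios server, simply attempts a TCP connection. It assumes NRPE is listening on its default port of 5666 (change the port if nrpe.cfg sets a different server_port), and the function name is made up for this example. A successful connection only shows that the port is open; it does not prove that NRPE will accept commands from your host, which also depends on allowed_hosts in nrpe.cfg.

```python
import socket

NRPE_PORT = 5666  # default NRPE port; adjust if nrpe.cfg sets server_port


def nrpe_port_open(host, port=NRPE_PORT, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        conn = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # Refused, timed out, unreachable or unresolvable host.
        return False
    conn.close()
    return True
```

For example, nrpe_port_open("YOUR_ESX_HOST") returning False usually points at the daemon not running or at a firewall between the two machines.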

Configuring the Nagios Server to use the new check

Now that you have the NRPE client configured, you need to enable the check on your Nagios server. This guide does not cover installing or configuring Nagios itself; that is left as an exercise for the reader.

Below are specific configuration snippets to make the job of configuring the Nagios server slightly easier.

  1. Create the per-machine service configuration. Your contact_groups and servicegroups will differ, as will the host_name.

    define service {
            use aacraid-service
            host_name YOUR_ESX_HOST
            contact_groups noc_staff
            servicegroups escalate_noc_staff
    }
  2. Now configure the generic template service that the per-machine service entries will include. Your entry will likely differ to meet your Nagios configuration requirements.

    define service {
            use low-service-level
            name aacraid-service
            service_description aacraid
            check_command check_aacraid
            register 0
            notification_interval 3600
    }
  3. Configure the check command that is called via NRPE.

    define command {
            command_name check_aacraid
            command_line $USER1$/check_nrpe -H $HOSTADDRESS$ -u -c check_aacraid -t 60
    }
  4. Add your VMware ESX host to Nagios and ensure the Nagios server can communicate with NRPE on it.
  5. You should now have a working Nagios check to monitor your Adaptec hardware RAID direct-attached storage volume.
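
When testing step 4 by hand, it helps to know the exact command Nagios will run. The sketch below shows, in simplified form, how the command_line from step 3 expands once the macros are filled in. Nagios itself handles many more macros than this; the helper name, the plugin directory and the host address are assumptions for illustration. $USER1$ is normally defined in resource.cfg and points at your plugin directory.

```python
import shlex


def expand_command_line(command_line, macros):
    """Replace $NAME$ macros and split the result into an argument list."""
    for name, value in macros.items():
        command_line = command_line.replace("$%s$" % name, value)
    return shlex.split(command_line)


# The command_line from the check_aacraid command definition.
COMMAND_LINE = "$USER1$/check_nrpe -H $HOSTADDRESS$ -u -c check_aacraid -t 60"

argv = expand_command_line(
    COMMAND_LINE,
    {
        "USER1": "/usr/lib/nagios/plugins",  # assumed plugin directory
        "HOSTADDRESS": "192.0.2.10",         # assumed ESX host address
    },
)
print(" ".join(argv))
```

Running the printed command from a shell on the Nagios server, with your own values substituted, should return the same status line that the check script prints on the COS.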

Expanding the above example to monitor other aspects of the system

It is possible to expand upon the above check to monitor many other aspects of your VMware ESX server. You can monitor your VMFS free space, system logs, the host based firewall for changes, logged in users, COS load, along with hardware sensors using lm_sensors.
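
As one example, a free space check reduces to comparing a percentage against two thresholds. The following is a rough sketch of that logic rather than a finished plugin: the function names and the 20% and 10% thresholds are assumptions, and shutil.disk_usage needs Python 3.3 or later, which may be newer than the interpreter on the COS. A datastore would typically be checked by passing a path under /vmfs/volumes.

```python
import shutil

# Standard Nagios plugin exit codes used here.
OK, WARNING, CRITICAL = 0, 1, 2


def free_space_state(total_bytes, free_bytes, warn_pct=20.0, crit_pct=10.0):
    """Return (exit_code, status_line) from the percentage of free space."""
    free_pct = 100.0 * free_bytes / total_bytes
    text = "%.1f%% free" % free_pct
    if free_pct < crit_pct:
        return CRITICAL, "CRITICAL: " + text
    if free_pct < warn_pct:
        return WARNING, "WARNING: " + text
    return OK, "OK: " + text


def check_path(path, warn_pct=20.0, crit_pct=10.0):
    """Check the filesystem that holds path."""
    usage = shutil.disk_usage(path)
    return free_space_state(usage.total, usage.free, warn_pct, crit_pct)
```

A script wrapping check_path would be registered in nrpe.cfg with its own command[...] line, in the same way as check_aacraid above.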

For system resource usage and trend analysis, I recommend using SNMP to pull data from the VMware ESX server.

