How to make your website monitorable

One of the topics we touched upon in a previous article about monitoring your systems was adding hooks to your application/website to make it easier to check that everything's working. It need not be complex, and when done correctly lets you identify problems instantly.

It's important to note that this relates only to identifying runtime problems, like database failure or a problem with a filesystem. It doesn't actually do the monitoring for you; for that you need something to go with it, like Nagios.

What are we monitoring?

For this article I'll cover a couple of fairly simple cases. Most websites nowadays are database driven; a MySQL backend with a frontend written in PHP is a common example ripe for the picking. Some very simple possibilities include:

  1. Database server down
  2. Database too overloaded to open a new connection
  3. Out of disk space and can't handle file uploads

In a more complex application you may need to test a variety of different functions. These will be specific to your own environment, but they're no harder to set up; you just need to know what your application does and how it may fail. For the purposes of this article we'll cover just the stuff your visitors will see (i.e. "is my site up?"). More complex things, like automated emails not being sent, are beyond our scope and can't readily be tested with these methods.

I'll present a couple of easy examples which can be readily extended. The first is a vanilla installation of the popular Wordpress blogging software. It's very naive and redundant (Wordpress already has great error reporting), but will establish the basics for what we're doing. The second is part of a site I've written in Python called QQC. Editors for translated game scripts can use it to report typos in whatever they're reading, and also to upload screenshots of the game misbehaving. It connects to an XML-RPC application server instead of directly to a database backend, but this is configurable, so we'll make some modifications to better show off what can be done.

Wordpress

This assumes you're running the newest version of Wordpress - I'm running 2.7.1 on my server. You normally set your database parameters in wp-config.php in the top level of your site. You could create this file next to it, called (imaginatively) isItWorking.php:

<?php

require "wp-load.php";
echo "Everything is OK!";

?>

Okay, that was a bit too simple, but it works. If you edit your wp-config.php and modify the database password so it's incorrect, Wordpress will throw a big error and you won't see the "OK" message.

So why bother doing this? It's pretty obvious when your Wordpress site is broken: you can't read it! The real beauty of this hack is using it with an automated monitoring system, like Nagios. Nagios comes with a plugin called check_http that allows you to check for expected page content at a given URL. With the appropriate notification setup, you can receive an SMS/email/carrier-pigeon the moment there's any problem with your site.

A more realistic example

Let's start by having a look at a typical invocation of the check_http plugin for Nagios (aika is a Debian server, the location of Nagios' plugins varies between platforms):

aika:/usr/lib64/nagios/plugins# ./check_http -H furinkan.meidokon.net -u http://furinkan.meidokon.net/ -s "make web pages like it's 1998"
HTTP OK HTTP/1.1 200 OK - 0.005 second response time |time=0.004610s;;;0.000000 size=3577B;;;0

That's pretty easy, we tell it what host (-H) we want, the URL (-u) to check, and the string (-s) to check for (the page title is "We're going to partymake web pages like it's 1998"). The plugin will wait for 10 seconds by default. If the page doesn't come back in that time we can assume there's a problem.
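
If you're curious what a check like this boils down to, here's a minimal sketch in Python 3 (the listings later in this article are Python 2). It is not how check_http is implemented; the function names are made up for illustration, and the status mapping (4xx is a WARNING, 5xx or a missing string is CRITICAL) is my assumption of the plugin's usual behaviour. The exit codes 0, 1 and 2 are the standard Nagios plugin return codes.

```python
import urllib.error
import urllib.request

# Standard Nagios plugin exit codes
OK, WARNING, CRITICAL = 0, 1, 2


def evaluate_response(status, body, expected):
    """Turn an HTTP status and page body into an (exit_code, message) pair."""
    if status >= 500:
        return CRITICAL, "HTTP CRITICAL: status %d" % status
    if status >= 400:
        return WARNING, "HTTP WARNING: status %d" % status
    if expected not in body:
        return CRITICAL, "HTTP CRITICAL: string not found"
    return OK, "HTTP OK: status %d" % status


def check_url(url, expected, timeout=10):
    """Fetch url and evaluate it; network errors and timeouts are CRITICAL."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", "replace")
            return evaluate_response(response.status, body, expected)
    except urllib.error.HTTPError as error:
        # The server answered, but with an error status such as 401 or 500
        return evaluate_response(error.code, "", expected)
    except OSError as error:
        # Covers connection failures and timeouts (URLError is an OSError)
        return CRITICAL, "HTTP CRITICAL: %s" % error
```

A real plugin would print the message and call sys.exit() with the code; keeping the decision in evaluate_response() makes it easy to try out without touching the network.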

Selecting things to test

This is something that only the application's developer can do well. It's easy to choose things to test, but it can be hard to choose things that matter and give a correct diagnostic of the site. I'll contrive an example so we can cover a range of possibilities, some of which should be applicable in your situation.

  • Testing HTTP authentication
  • Starting a new session (cookie backend)
  • Reading a file
  • Writing a file
  • Reading an included file with our database connection parameters
  • Connecting to the database
  • Selecting some records from the database
  • Updating records in the database
  • Connect to the XMLRPC backend
  • Call a function on the XMLRPC backend
  • XMLRPC calling a remote self-test!

We'll build these up as snippets of a Python script called status_check.py, then pull it all together at the end into something that will produce nice output suitable for humans and Nagios' plugin.

Testing HTTP authentication

While most people use "htaccess passwords" for access control, perhaps you've got some sort of database backend for user management that you use; LDAP is a classic example. It'd be good to know that this is working okay, so we can test for that. Incidentally, if you'd like to learn how to do it, we've got an article on password-protecting parts of your site using htaccess files.

The first step is to create a .htaccess file to let Apache know that we need to use authentication (this is only necessary if you don't already have the site covered by authentication). I don't have an LDAP server handy, but I do have mod_auth_pam installed, which will do nicely.

AuthName "Restricted access"
AuthType basic

AuthPAM_Enabled on
Require valid-user

Now we write a few lines of Python to read the HTTP username and print it to the page:

import os

# Apache sets REMOTE_USER once authentication has succeeded
USERNAME = os.environ.get('REMOTE_USER', 'NO_USERNAME')

print "Hello %s everything is great!" % USERNAME

Finally, we'll update our invocation of the check_http plugin and see that it works as expected. You'll note that we now check specifically for the username luka in the output. You could use testguy or nagios if you want something more obvious, just make sure it's not a privileged account.

aika:/usr/lib64/nagios/plugins# ./check_http -H furinkan.meidokon.net -u http://furinkan.meidokon.net/status_check.py -s "Hello luka everything is great" -a "luka:megurine"
HTTP OK HTTP/1.1 200 OK - 0.030 second response time |time=0.030117s;;;0.000000 size=178B;;;0

# If someone changes luka's password, or the backend fails for any reason we'll get a WARNING instead:
HTTP WARNING: HTTP/1.1 401 Authorization Required
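
For reference, the -a option sends the credentials using HTTP Basic authentication, which is just the string "username:password" base64-encoded into an Authorization header. A small Python 3 sketch of that encoding follows; the helper names are mine, not part of any library.

```python
import base64


def basic_auth_header(username, password):
    """Build the value of an HTTP Basic 'Authorization' header."""
    credentials = ("%s:%s" % (username, password)).encode("utf-8")
    return "Basic " + base64.b64encode(credentials).decode("ascii")


def decode_basic_auth(header_value):
    """Reverse of basic_auth_header: return (username, password)."""
    scheme, encoded = header_value.split(" ", 1)
    if scheme.lower() != "basic":
        raise ValueError("not a Basic authorization header")
    decoded = base64.b64decode(encoded).decode("utf-8")
    # Only the first colon separates the username from the password
    username, password = decoded.split(":", 1)
    return username, password
```

Since base64 is an encoding and not encryption, the test account's password travels in the clear unless the site is served over HTTPS, which is one more reason to use an unprivileged account.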

Session setup

Cookie-based sessions are a built-in feature of PHP; in most cases they'll create a file in /tmp on the webserver to hold the persistent information. In PHP it's as simple as calling session_start() before doing anything. Python doesn't support this directly, but there are simple bolt-on modules you can use to the same effect. One such example can be found here: http://webpython.codepoint.net/cgi_session_class

Assuming we use the session class provided, we can trivially check that new sessions can be created. Failure will cause an exception to be thrown, which we can catch in the finished script.

import session

# Creating the session raises an exception if the session store is unusable
sess = session.Session(expires='', cookie_path='/')
sess.data['lastvisit'] = repr("this is a string")
sess.close()
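
If you don't want to depend on a third-party session class, the same idea can be roughed out with the standard library. This is only a sketch in Python 3 of what a file-backed session does, not a replacement for the class above: the start_session() helper, the "sid" cookie name and the file layout are all invented for illustration.

```python
import os
import uuid
from http.cookies import SimpleCookie


def start_session(session_dir):
    """Create a session file and return (Set-Cookie header, file path).

    Raises OSError if the session directory is missing or not writable,
    which is exactly the failure we want the self-test to notice.
    """
    session_id = uuid.uuid4().hex
    path = os.path.join(session_dir, "sess_" + session_id)
    with open(path, "w") as handle:
        handle.write("lastvisit=this is a string\n")

    # Build the cookie that ties the browser to the session file
    cookie = SimpleCookie()
    cookie["sid"] = session_id
    cookie["sid"]["path"] = "/"
    return cookie.output(), path
```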

Reading a file

It so happens that my application parses XML files. Specifically, it parses user-uploaded XML files. Putting aside the dangers of blindly handling user-provided content (I know and trust my users), I'd like to test that I can read files as I expect to. For this part, we'll assume that files appear in a relative directory called uploads/.

Once again, any problems in this section will throw an exception, which the script will catch. Assume we leave test.xml lying around for the system's benefit.

xml_file = open('uploads/test.xml', 'rb')
xml_data = xml_file.read()
xml_file.close()
del xml_data  # free up the memory we just used

Writing a file

My application parses the XML files and writes a binary file to another directory, so we'll check that this also works. We'll delete the file once we're done, to keep things tidy. We can split the exception handling into two sections, to catch file-writing and file-deletion separately.

# Use a unique timestamp in the filename to avoid race conditions if the
# script is run multiple times simultaneously
import os
import time
timestamp = repr(time.time())

# File creation
testfile = open('processed_files/aTestFile'+timestamp, 'w')
testfile.write("this is a test file")
testfile.close()

# File deletion
os.unlink('processed_files/aTestFile'+timestamp)
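
To show what splitting the exception handling looks like, here's a self-contained sketch in Python 3. The check_write_and_delete() function and its PASS/FAIL strings are hypothetical; the point is only that a failed write and a failed delete are reported separately.

```python
import os
import time


def check_write_and_delete(directory):
    """Write a uniquely named test file, then delete it.

    Returns a list of result lines so the caller can tell which of the
    two steps failed.
    """
    results = []
    path = os.path.join(directory, "aTestFile" + repr(time.time()))

    # Step 1: file creation
    try:
        with open(path, "w") as handle:
            handle.write("this is a test file")
    except OSError as error:
        results.append("FAIL write: %s" % error)
        return results  # nothing to delete if the write failed
    results.append("PASS write")

    # Step 2: file deletion, handled on its own
    try:
        os.unlink(path)
    except OSError as error:
        results.append("FAIL delete: %s" % error)
        return results
    results.append("PASS delete")
    return results
```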

Database testing

Now we'll start our tests against the database. We assume you don't hardcode your database parameters, instead keeping them in a conveniently importable file. I'm using the pgdb Postgres connector here, but you can also use psycopg with no change.

Connection

This is about as simple as it gets, the code speaks for itself. As always, this is wrapped in a try-except block to catch any errors.

import pgdb
import db_params  # our own module holding the connection settings

connection = pgdb.connect(host=db_params.db_host, database=db_params.db_name, user=db_params.db_user, password=db_params.db_pass)
cursor = connection.cursor()

Selecting some records

My database has a table of usernames for people assigned to edit a script; the table is called proofreaders (because that's what they're doing). It's safe to assume that there's always at least one proofreader, so we'll check for that. You'll also notice that we're using an assert statement; this is a convenient way to have Python throw an exception based on runtime events. The database might well be functioning correctly, but having no registered proofreaders is something we consider to be an error.

cursor.execute('''SELECT "name" FROM "proofreaders"''')
names = [x[0] for x in cursor.fetchall()]
assert len(names) >= 1, "There are no proofreaders registered in the database"
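
If you want to play with this pattern without a Postgres server, the sketch below swaps in Python 3's built-in sqlite3 module and an in-memory database. That substitution is mine, purely so the example runs anywhere; the check_proofreaders() helper is hypothetical. It also shows how the assertion message ends up in the exception, which is what we'll report later.

```python
import sqlite3


def check_proofreaders(connection):
    """Return the proofreader names, or raise AssertionError if none exist."""
    cursor = connection.cursor()
    cursor.execute('SELECT "name" FROM "proofreaders"')
    names = [row[0] for row in cursor.fetchall()]
    assert len(names) >= 1, "There are no proofreaders registered in the database"
    return names


# Stand-in database: an empty proofreaders table held in memory
connection = sqlite3.connect(":memory:")
connection.execute('CREATE TABLE "proofreaders" ("name" TEXT)')

try:
    check_proofreaders(connection)
    message = "PASS"
except AssertionError as error:
    # The text after the comma in the assert becomes the exception message
    message = "FAIL: %s" % error
```

One caveat: running Python with the -O switch strips assert statements, so a self-test written this way shouldn't be run with optimisation turned on.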

Exactly what you choose to test for is up to you - the possibilities are limited only by your programming ability.

Inserting or Updating records

This might be something you choose to perform on a dummy table, as the possibility of updating live data may not be tempting for you. In my case, I'm happy to make idempotent updates, so it's no problem.

We're using a couple of features of the Python DB-API here, they should be self-explanatory.

cursor.execute('''UPDATE "proofreaders" SET "name"='USERNAME_ERROR' WHERE "name"='USERNAME_ERROR' ''')
connection.commit()
assert cursor.rowcount == 1, "Expected exactly 1 row to be affected"
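
The same trick can be tried out against a throwaway database. As before, this Python 3 sketch uses sqlite3 as a stand-in for Postgres (my substitution, not part of the real site), and check_update() is a hypothetical helper. Note that sqlite3 uses ? placeholders where pgdb uses %s.

```python
import sqlite3


def check_update(connection):
    """Rewrite a sentinel row with its own value and confirm one row matched."""
    cursor = connection.cursor()
    cursor.execute(
        'UPDATE "proofreaders" SET "name" = ? WHERE "name" = ?',
        ("USERNAME_ERROR", "USERNAME_ERROR"),
    )
    affected = cursor.rowcount  # rows matched by the WHERE clause
    connection.commit()
    assert affected == 1, "Expected exactly 1 row to be affected, got %d" % affected
    return affected


# Stand-in database with the sentinel row plus one real user
connection = sqlite3.connect(":memory:")
connection.execute('CREATE TABLE "proofreaders" ("name" TEXT)')
connection.executemany(
    'INSERT INTO "proofreaders" ("name") VALUES (?)',
    [("USERNAME_ERROR",), ("luka",)],
)
connection.commit()
```

Because the row is set to the value it already holds, the table is unchanged no matter how often the check runs, yet the write path (locks, permissions, disk) still gets exercised.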

XML-RPC server

Rather than connect to the database directly, you may have a middle tier for all your "business logic". This could really be anything, but for this example I'll be using an XML-RPC server. XML-RPC is a lightweight protocol for remote processing, and is very nice for exposing a public API of a system.

XML-RPC is dead easy in Python. We'll use the same logic as for database connection to make assertions about the results we get. One thing you'll notice is that there's no explicit "connect" phase when using XML-RPC. XML-RPC is a stateless protocol that uses HTTP as its underlying transport, so you need to issue a function call before you can detect failure.

In addition to this, we can make as many arbitrary assertions as we want. When it's all pulled together, any specific assertion error messages will be returned as part of the Nagios check.

import xmlrpclib
xmlrpc_server = xmlrpclib.ServerProxy("http://miku.meidokon.net:8000/RPC2")

# No connection is made until the first function call
vocaloids = xmlrpc_server.getVocaloids()
assert len(vocaloids) >= 1, "No Vocaloids found"
assert 'Hatsune Miku' in vocaloids, "Hatsune Miku not found in Vocaloids during XML-RPC call"

Chances are your XML-RPC server connects to a database backend. What'd be especially cool is having some sort of self-test function exposed through your XML-RPC server; you could use the earlier database checks to do this for you. Then testing your XML-RPC server would be as easy as this:

assert xmlrpc_server.selftest()
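
Here's roughly what the server side of such a self-test could look like, as a Python 3 sketch using the standard library's SimpleXMLRPCServer. Everything in it is an assumption for illustration: my real server doesn't look like this, the selftest() body is a placeholder for genuine backend checks, and the server binds to a random local port so the example can call itself.

```python
import threading
import xmlrpc.client
from xmlrpc.server import SimpleXMLRPCServer


def selftest():
    """Hypothetical self-test: run each backend check, report overall success."""
    checks = [
        lambda: 1 + 1 == 2,  # placeholder; put real database checks here
    ]
    return all(check() for check in checks)


def run_selftest_roundtrip():
    """Start a throwaway XML-RPC server, call selftest() on it, shut it down."""
    server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
    server.register_function(selftest, "selftest")
    thread = threading.Thread(target=server.serve_forever)
    thread.daemon = True
    thread.start()
    try:
        port = server.server_address[1]
        proxy = xmlrpc.client.ServerProxy("http://127.0.0.1:%d/RPC2" % port)
        return proxy.selftest()
    finally:
        server.shutdown()
        server.server_close()
```

Returning a plain boolean keeps the client-side assert trivial; if you want to know which backend check failed, the function could return a string or raise an exception, which XML-RPC delivers to the caller as a Fault.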

Pulling it all together

As promised, a fully-working self-test script. It makes extensive use of exceptions to correctly handle error cases and report them in a consistent manner. Below the source code listing is real output from the script (it should run as plain CGI on your webserver). I've invoked it with wget as it closely resembles how a monitoring system would get the output, but you can also access the page in a web browser and see the output for yourself at any time.

If you actually try this out, you might notice that a failure at any point will stop the testing dead in its tracks. I chose to do it this way as some later tests may rely on earlier ones passing (eg. there's no point trying to run a SELECT on the database if the connection fails). The ramification is that if there's more than one problem, you'll only hear about the first one; it "masks" the rest. If you'd rather run every test regardless of earlier failures, it's up to you to make sure that the "everything is great" message doesn't get printed if there's a failure.
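
One way to run every check and still keep the success message honest is to count failures as you go. The following is a sketch in Python 3 and not the approach used in the listing below; run_checks() and the shape of its input (a list of name/function pairs) are my own invention.

```python
def run_checks(checks):
    """Run every (name, function) pair and return the report lines.

    A check passes if its function returns without raising. The success
    message is only added when no check has failed.
    """
    lines = []
    failures = 0
    for name, check in checks:
        try:
            check()
        except Exception as error:
            failures += 1
            lines.append("* FAIL %s: %s" % (name, error))
        else:
            lines.append("* PASS %s" % name)

    lines.append("")
    if failures == 0:
        lines.append("everything is great!")
    else:
        lines.append("%d check(s) failed" % failures)
    return lines


def broken_check():
    """Example of a failing check."""
    raise RuntimeError("disk full")


report = run_checks([
    ("read a file", lambda: None),
    ("write a file", broken_check),
    ("XMLRPC call", lambda: None),
])
```

The trade-off is that dependent checks will now fail in a cascade (a dead database connection also fails the SELECT and the UPDATE), so the first FAIL line is usually the one to look at.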

selftest.py

#!/usr/bin/python

import sys
import os


class http_output(object):
    '''Build the output in a piecemeal fashion as tests pass/fail'''
    def __init__(self):
        self.output_buffer = ''

    def __str__(self):
        return "Content-Type: text/plain\r\n\r\n" + self.output_buffer

    def append(self, line):
        self.output_buffer += line+'\n'

    def append_and_finish(self, line):
        # Add the final line, send the whole page and stop the script
        self.output_buffer += line+'\n'
        print self
        sys.exit(0)

page_output = http_output()


# 1. check http auth
USERNAME = os.environ.get('REMOTE_USER', None)
if not USERNAME:
    page_output.append_and_finish("* FAIL HTTP auth")
page_output.append("* PASS HTTP auth for %s" % USERNAME)


# 2. session handling
import session
try:
    sess = session.Session(expires='', cookie_path='/')
    sess.data['lastvisit'] = repr("this is a string")
    sess.close()
except Exception, data:
    page_output.append_and_finish("* FAIL starting the session: %s" % str(data))
page_output.append("* PASS started a session")


# 3. read a file
test_filename = 'master.jpg'
try:
    f = open(test_filename, 'rb')
    data = f.read()
    f.close()
    del data  # free up the memory we just used
except Exception, data:
    page_output.append_and_finish("* FAIL couldn't read the testfile %s: %s" % (test_filename, str(data)))
page_output.append("* PASS read the testfile %s" % test_filename)


# 4. write a file
import time
timestamp = repr(time.time())
test_filename = 'aTestFile'+timestamp
try:
    f = open(test_filename, 'w')
    f.write("this is a test file")
    f.close()
except Exception, data:
    page_output.append_and_finish("* FAIL couldn't write the testfile %s: %s" % (test_filename, str(data)))
page_output.append("* PASS wrote the testfile %s" % test_filename)
# 4a. delete the file
try:
    os.unlink(test_filename)
except Exception, data:
    page_output.append_and_finish("* FAIL couldn't delete the testfile %s: %s" % (test_filename, str(data)))
page_output.append("* PASS deleted the testfile %s" % test_filename)


# 5. DB stuff
import pgdb
db_host = 'kaito.meidokon.net'
db_name = 'vocaloids'
db_user = 'hatsunemiku'
db_pass = 'black*rock7Shooter'

# 5a. connection
try:
    connection = pgdb.connect(host=db_host, database=db_name, user=db_user, password=db_pass)
    cursor = connection.cursor()
except Exception, data:
    page_output.append_and_finish("* FAIL connect to the database: %s" % str(data))
page_output.append("* PASS connected to database")

# 5b. selection
try:
    cursor.execute('''SELECT "name" FROM "proofreaders"''')
    names = [x[0] for x in cursor.fetchall()]
    assert len(names) >= 1, "There are no proofreaders registered in the database"
except Exception, data:
    page_output.append_and_finish("* FAIL SELECTing from the database: %s" % str(data))
page_output.append("* PASS SELECTed from database")

# 5c. update
try:
    cursor.execute('''UPDATE "proofreaders" SET "name"='USERNAME_ERROR' WHERE "name"='USERNAME_ERROR' ''')
    connection.commit()
    assert cursor.rowcount == 1, "Expected exactly 1 row to be affected"
except Exception, data:
    page_output.append_and_finish("* FAIL UPDATEing the database: %s" % str(data))
page_output.append("* PASS UPDATEd the database")


# 6. XML-RPC
try:
    import xmlrpclib
    xmlrpc_server = xmlrpclib.ServerProxy("http://touko.meidokon.net:8002/RPC2")
    taglist = xmlrpc_server.get_all_tags('')['data'].values()
except Exception, data:
    page_output.append_and_finish("* FAIL XMLRPC call: %s" % str(data))
page_output.append("* PASS XMLRPC call returned data (truncated sample): %s" % str(taglist)[:50])



page_output.append("")
page_output.append_and_finish("Hello %s everything is great!" % USERNAME)

Live output

luka@aika:~% wget -O - 'http://luka:megurine@meidokon.net/selftest.py'
--11:20:54--  http://luka:*password*@meidokon.net/selftest.py
           => `-'
Resolving meidokon.net... 202.4.232.68
Connecting to meidokon.net|202.4.232.68|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/plain]

* PASS HTTP auth for luka
* PASS started a session
* PASS read the testfile master.jpg
* PASS wrote the testfile aTestFile1236644454.8923271
* PASS deleted the testfile aTestFile1236644454.8923271
* PASS connected to database
* PASS SELECTed from database
* PASS UPDATEd the database
* PASS XMLRPC call returned data (truncated sample): [{'count': 13, 'name': 'Satsuki Yumizuka', 'parent

Hello luka everything is great!

11:20:56 (71.66 MB/s) - `-' saved [422]

Wrap up

We've covered a broad range of general examples here that should have some applicability to your own applications. While every application is different, they tend to follow similar patterns, so adapting them shouldn't be too difficult. Even if you don't use a monitoring system, a self-diagnosis feature can make life much simpler when it comes to fixing problems.

If you're one of our dedicated server or virtual private server customers and you'd like to know more about monitoring your site, just ask.

