ElasticSearch is a search service built on Lucene. One of its big claims to fame is that it’s distributed – you can run it over a cluster of servers for high performance and availability.
So you’ve got this cluster of ElasticSearch servers. Specifically, we have this cluster, and we’re managing it for a customer. That means monitoring it with Nagios to make sure it’s working properly, and collecting metrics every which way so we can see how it’s performing and how it’s used.
Clusters are, by nature, complex things. That means that when one breaks, any extra information is really useful for getting it fixed ASAP.
How to get this wrong
An ElasticSearch cluster can host multiple indices for searches. Each index is broken into shards, which are distributed over the cluster. Shards have replica copies scattered around the cluster for redundancy, in case a server stops working.
ElasticSearch is nifty in that it exposes a simple red/yellow/green health status for its shards. The overall health of the cluster is the worst of all its index states, each of which is in turn the worst of that index’s shard states.
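That roll-up logic can be sketched in a few lines. This is an illustration of the worst-of derivation, not ElasticSearch’s actual code; the index names and shard statuses are made up.

```python
# How red/yellow/green health rolls up: cluster health is the worst
# index health, and index health is the worst of that index's shards.
SEVERITY = {"green": 0, "yellow": 1, "red": 2}

def worst_status(statuses):
    """Return the worst of a collection of red/yellow/green statuses."""
    return max(statuses, key=lambda s: SEVERITY[s], default="green")

def cluster_health(index_shard_statuses):
    """Derive overall cluster health and per-index health from a
    mapping of index name -> list of shard statuses."""
    per_index = {
        index: worst_status(shards)
        for index, shards in index_shard_statuses.items()
    }
    return worst_status(per_index.values()), per_index

overall, per_index = cluster_health({
    "cat_pictures": ["green", "yellow"],  # one shard missing a replica
    "logs": ["green", "green"],
})
print(overall, per_index)  # yellow {'cat_pictures': 'yellow', 'logs': 'green'}
```

A single yellow shard is enough to turn the whole cluster yellow, which is exactly why the overall colour alone tells you so little.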
This maps very cleanly to Nagios’ idea of service health – red/yellow/green is a parallel to Critical/Warning/OK in Nagios. The problem is that this is where everyone else’s monitoring stops.
Our question is: If it’s 2am and your cluster health goes from green to red, what do you need to do to fix it?
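The check that stops at the overall colour looks something like this. The status-to-exit-code mapping uses the standard Nagios plugin conventions (OK=0, WARNING=1, CRITICAL=2, UNKNOWN=3); in a real plugin the status would come from `GET /_cluster/health`, but here it’s just a parameter.

```python
# Standard Nagios plugin exit codes.
NAGIOS_OK, NAGIOS_WARNING, NAGIOS_CRITICAL, NAGIOS_UNKNOWN = 0, 1, 2, 3

STATUS_TO_EXIT = {
    "green": NAGIOS_OK,
    "yellow": NAGIOS_WARNING,
    "red": NAGIOS_CRITICAL,
}

def naive_check(cluster_status):
    """The check everyone writes: map the overall colour straight to a
    Nagios state and say nothing else about what is actually broken."""
    print(f"Cluster health is {cluster_status}")
    return STATUS_TO_EXIT.get(cluster_status, NAGIOS_UNKNOWN)

# A real plugin would finish with: sys.exit(naive_check(status))
```

It’s correct as far as it goes, but the alert it produces carries no more information than the colour itself.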
Doing it right
ElasticSearch exposes a fair bit of extra information through its API, which can be inspected to assess the health of the cluster much more comprehensively. Having access to this data means that Nagios can give us a much more meaningful report when a problem is detected.
Which would you find more useful: this?

    WARNING: Index 'cat_pictures' health is yellow

Or this?

    WARNING: One or more indexes are missing replica shards:
    Index 'cat_pictures' replica down on shard 0
    Index 'cat_pictures' replica down on shard 1
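Producing the second message means asking for shard-level detail, e.g. `GET /_cluster/health?level=shards`, and walking the response. The sketch below assumes the general shape of that response (indices → shards → status); the sample JSON is illustrative, not captured from a real cluster, and the yellow-shard-with-active-primary heuristic for “replica down” is a simplification.

```python
import json

# Illustrative response from GET /_cluster/health?level=shards,
# trimmed to the fields this sketch uses.
sample = json.loads("""
{
  "status": "yellow",
  "indices": {
    "cat_pictures": {
      "status": "yellow",
      "shards": {
        "0": {"status": "yellow", "primary_active": true},
        "1": {"status": "yellow", "primary_active": true}
      }
    }
  }
}
""")

def replica_problems(health):
    """List every shard whose replicas are not all assigned: a yellow
    shard with an active primary means a replica is down."""
    problems = []
    for index, idx in health["indices"].items():
        for shard, info in idx["shards"].items():
            if info["status"] == "yellow" and info.get("primary_active"):
                problems.append(f"Index '{index}' replica down on shard {shard}")
    return problems

problems = replica_problems(sample)
if problems:
    print("WARNING: One or more indexes are missing replica shards: "
          + " ".join(problems))
```

The same walk can distinguish red shards (primary down, data unavailable) from yellow ones (replica down, redundancy lost), which is precisely the detail a sysadmin needs at 2am.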
That’s good, but we don’t stop there. All our monitoring is comprehensively documented. If a sysadmin needs to deal with this situation and they’ve never touched ElasticSearch before, they have guidance to tell them just what this means, where to look, and how to go about fixing it. If they get stuck, the documentation tells them who to call (our resident expert).
In this case, it probably means a server has gone down. You’ll be getting plenty of SMSes to alert you to that fact.
We’re not putting up all the support docs just yet, but if you’re interested in the code, it’s available: we’ve pushed it to a GitHub repo, so clone it and go for your life.
Any questions/bugs/requests? Let us know.