Monitoring your ElasticSearch cluster and getting it right

By November 28, 2012 Technical
Today’s post is all about ElasticSearch. We’re hiring at the moment, so you should get in touch if this sort of large scale infrastructure is your cup of tea.

ElasticSearch is a search service built on Lucene. One of its big claims to fame is that it’s distributed – you can run it over a cluster of servers for high performance and availability.

So you’ve got this cluster of ElasticSearch servers. Specifically, we have this cluster, and we’re managing it for the customer. That means we’re monitoring it with Nagios to make sure it’s working properly. We’re monitoring it every which way and collecting metrics so we can see how it’s performing and how it’s used.

Clusters are, by nature, complex things. This means that when one breaks, any extra information is really useful for getting it fixed ASAP.

How to get this wrong

An ElasticSearch cluster can host multiple indices for searches. Each index is broken into shards, which are distributed over the cluster. Shards have replica copies scattered around the cluster for redundancy, in case a server stops working.

ElasticSearch is nifty in that it has a simple red/yellow/green health status for its shards. The health of an index is the worst of its shard states, and the overall health of the cluster is the worst of all its index states.
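To make the "worst of" roll-up concrete, here is a minimal sketch in Python. It is only an illustration of the rule described above, not ElasticSearch's own code, and the names `SEVERITY`, `worst`, `index_health` and `cluster_health` are made up for this example.

```python
# Severity order: a higher number means a worse state.
SEVERITY = {"green": 0, "yellow": 1, "red": 2}


def worst(states):
    """Return the worst colour in an iterable of states (green if empty)."""
    return max(states, key=SEVERITY.__getitem__, default="green")


def index_health(shard_states):
    """An index is only as healthy as its worst shard."""
    return worst(shard_states)


def cluster_health(indices):
    """indices maps an index name to the list of its shard states."""
    return worst(index_health(shards) for shards in indices.values())


example = {
    "cat_pictures": ["green", "yellow"],
    "dog_pictures": ["green", "green"],
}
print(cluster_health(example))  # yellow
```

One yellow shard is enough to turn the whole cluster yellow, which is exactly why the single colour tells you so little about where the problem is.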

This maps very cleanly to Nagios’ idea of service health – red/yellow/green is a parallel to Critical/Warning/OK in Nagios. The problem is that this is where everyone else’s monitoring stops.
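As a rough illustration of that mapping, a Nagios plugin reports its state through its exit code (0 for OK, 1 for Warning, 2 for Critical, 3 for Unknown) plus a line of text. The sketch below assumes the cluster status has already been fetched as a string; the function name `nagios_result` is hypothetical.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

STATUS_TO_NAGIOS = {
    "green": (OK, "OK"),
    "yellow": (WARNING, "WARNING"),
    "red": (CRITICAL, "CRITICAL"),
}


def nagios_result(cluster_status):
    """Translate a cluster colour into (exit code, status line)."""
    code, label = STATUS_TO_NAGIOS.get(cluster_status, (UNKNOWN, "UNKNOWN"))
    return code, f"{label}: cluster health is {cluster_status}"


code, message = nagios_result("yellow")
print(code, message)  # 1 WARNING: cluster health is yellow
```

A real plugin would print the message and then call `sys.exit(code)` so that Nagios picks up the state. Anything that is not one of the three colours falls through to Unknown rather than being silently treated as healthy.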

Our question is: If it’s 2am and your cluster health goes from green to red, what do you need to do to fix it?

Doing it right

ElasticSearch exposes a fair bit of extra information through its API, which can be inspected to derive the health of the cluster much more comprehensively. Having access to this data means that Nagios can give us a much more meaningful report when a problem is detected.

Which would you find more useful: This?

WARNING: Index `cat_pictures` health is yellow

Or:

WARNING: One or more indexes are missing replica shards:
    Index 'cat_pictures' replica down on shard 0
    Index 'cat_pictures' replica down on shard 1
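A report like the second one could be assembled roughly as follows. This is a sketch and not the plugin from our repo. It assumes the cluster health API is queried with `level=shards`, and that the response nests `indices`, then `shards`, with `primary_active` and `unassigned_shards` fields per shard. Treat those field names, the `localhost:9200` address, and the function names `fetch_shard_health` and `replica_report` as assumptions to check against your own cluster.

```python
import json
from urllib.request import urlopen


def fetch_shard_health(base_url="http://localhost:9200"):
    """Fetch shard-level health as a dict (URL and port are assumptions)."""
    with urlopen(f"{base_url}/_cluster/health?level=shards") as response:
        return json.load(response)


def replica_report(health):
    """Build a Nagios-style message listing shards with missing replicas."""
    lines = []
    for index_name, index in sorted(health.get("indices", {}).items()):
        shards = index.get("shards", {})
        # Shard IDs arrive as strings, so sort them numerically.
        for shard_id in sorted(shards, key=int):
            shard = shards[shard_id]
            # Primary is up but at least one copy has nowhere to live.
            if shard.get("primary_active") and shard.get("unassigned_shards", 0) > 0:
                lines.append(
                    f"    Index '{index_name}' replica down on shard {shard_id}"
                )
    if not lines:
        return "OK: all replica shards are assigned"
    header = "WARNING: One or more indexes are missing replica shards:"
    return "\n".join([header] + lines)


# Hand-written sample in the assumed response shape, so no cluster is needed.
sample = {
    "indices": {
        "cat_pictures": {
            "status": "yellow",
            "shards": {
                "0": {"primary_active": True, "unassigned_shards": 1},
                "1": {"primary_active": True, "unassigned_shards": 1},
            },
        }
    }
}
print(replica_report(sample))
```

Against a live cluster you would call `replica_report(fetch_shard_health())` instead of using the sample. The sketch only covers the missing-replica case; a shard whose primary is down would need its own Critical branch.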

That’s good, but we don’t stop there. All our monitoring is comprehensively documented. If a sysadmin needs to deal with this situation and they’ve never touched ElasticSearch before, they have guidance to tell them just what this means, where to look, and how to go about fixing it. If they get stuck, the documentation tells them who to call (our resident expert).

In this case, it means a server has probably gone down. You’ll be getting plenty of SMSes to alert you to this fact. 🙂

The code

We’re not putting up all the support docs just yet, but if you’re interested in the code then it’s available. We’ve pushed it to a GitHub repo, so go clone it and go for your life.

Any questions/bugs/requests? Let us know.


This is Steve. One of the awesomely brilliant (and well-bearded) Anchorites.
