LCA day 3 – High Availability

January 20, 2012 Technical, General

Thursday was more of a “practical” day, with plenty of hands-on hacking. This is nothing new, but nowadays you’re more likely to talk about running a BitTorrent client on your Bluetooth headset than Linux on your toaster. There are some genuinely awesome, really cool hacks out there (Android and Arduino are where a lot of it’s at), but they’re unlikely to help us give you 99.8% uptime. 🙂

Instead, we’ll have a really quick rundown of the high availability (HA) and virtualisation talks, and why it’s a good thing we sent a sysadmin along to them.

Complexity is your biggest enemy when trying to build reliable systems. Complex systems tend to be flaky, and that means they’re unpredictable. Unpredictable systems are bloody hard to support and rely upon. You won’t read this in all the you-beaut cloud services literature, but highly available systems are complex. Really, really complex.

This is all manageable, but it means your staff need to be trained with an intimate understanding of everything, top to bottom. When you’re unfamiliar with it, the HA stack on Linux is like the bogeyman. It scares the living daylights out of you, and you try to pretend that if you close your eyes it’ll just go away. This is okay most of the time, but for a company like Anchor it would leave us dependent on a small team of HA gurus when things go wrong.

Thank $DEITY for the High Availability Sprint at LCA. Anchor can train you in The Way Of The Cluster if you so desire, but an enlightenment session from the Jedi grandmasters is immeasurably valuable. Knowledge breeds confidence, and both translate to a more effective sysadmin. If you’re an Anchor customer with an HA system, it means we can support you better, and respond faster when there’s a problem. Everyone wins!

To wrap up, a quick look at the presentation on Ganeti, software for managing a cluster of virtual machines.

We evaluated Ganeti for our needs a couple of years ago as a VM solution, and found that it wasn’t mature enough to really be usable. It’s clearly grown up since then, but I think it might be more interesting to discuss why it’s still no good for us.

Most people can probably look at the feature set and determine whether it’s what they need. Magical on-demand clouds of VMs are the “in thing” at the moment, so what aren’t they good for? Well, it turns out they’re not much good for web hosting.

This really became evident several months ago when we tasked a sysadmin with evaluating the various cloud management products on the market (free or otherwise). It’s kinda disappointing, but the truth is that we don’t need 100 instances of the same machine. We certainly don’t want them to be ephemeral. The other benefits touted by cloudy VMs, such as live migration and replication, are nice but ultimately not that useful for us.

In the end we developed a system that met our real needs, as plain as they are: really fast to deploy, fully automated, customisable, comprehensively supported and monitored.