LCA day 3 – High Availability

Published January 20th, 2012 by Barney Desmond

Thursday was more of a “practical” day, with plenty of hands-on hacking. This is nothing new, but nowadays you’re more likely to talk about running a bittorrent client on your bluetooth headset than linux on your toaster. There are some genuinely awesome, really cool hacks out there (Android and Arduino are where a lot of it’s at), but they’re unlikely to help us give you 99.8% uptime. :)

Instead, we’ll have a really quick rundown of the high availability (HA) and virtualisation talks, and why it’s a good thing we sent a sysadmin along to them.


Complexity is your biggest enemy when trying to build reliable systems. Complex systems tend to be flaky, and that means they’re unpredictable. Unpredictable systems are bloody hard to support and rely upon. You won’t read this in all the you-beaut cloud services literature, but highly available systems are complex. Really, really complex.

This is all manageable, but it means your staff need to be trained with an intimate understanding of everything, top to bottom. When you’re unfamiliar with it, the HA stack on linux is like the bogeyman. It scares the living daylights out of you, and you try to pretend that if you close your eyes it’ll just go away. This is okay most of the time, but for a company like Anchor it would leave you dependent on a small team of HA gurus when things go wrong.

Thank $DEITY for the High Availability Sprint at LCA. Anchor can train you in The Way Of The Cluster if you so desire, but an enlightenment session from the jedi grandmasters is immeasurably valuable. Knowledge breeds confidence, and these things translate to a more effective sysadmin. If you’re an Anchor customer with an HA system, it means we can support you better, and respond faster when there’s a problem. Everyone wins!


To wrap up, a quick look at the presentation on Ganeti, software for management of a cluster of virtual machines.

We evaluated Ganeti for our needs a couple of years ago as a VM solution, and found that it wasn’t mature enough to really be usable. It’s clearly grown up since then, but I think it might be more interesting to discuss why it’s still no good for us.

Most people can probably look at the featureset and determine whether it’s what they need. Magical on-demand clouds of VMs are the “in thing” at the moment, so what aren’t they good for? Well, it turns out they’re not much good for web-hosting.

This really became evident several months ago when we tasked a sysadmin with evaluating the various cloud management products on the market (free or otherwise). It’s kinda disappointing, but the truth is that we don’t need 100 instances of the same machine. We certainly don’t want them to be ephemeral. The other benefits touted by cloudy VMs, such as live migration and replication, are nice but ultimately not that useful for us.

In the end we developed a system that met our real needs, as plain as they are: really fast to deploy, fully automated, customisable, comprehensively supported and monitored.


Exciting news from LCA miniconfs

Published January 17th, 2012 by Barney Desmond

Florian Haas gave a talk yesterday at the HA miniconf to present Flashcache, a project that was spawned from Facebook and their desire to squeeze more performance out of their databases.

The basic concept is to use any SSD device as a cache in front of slower rotational media. This is similar to commercial products such as LSI’s Cachecade, but implemented as a linux device-mapper module (so you wouldn’t be able to boot from such a setup, but that’s unlikely to be a real concern).

One of the nice things about Flashcache is that it’s presented as a plain block device. As well as making for a robust and understandable system, a practical upshot of this is that you can also replicate your cache with DRBD. In large HA database setups, this would mitigate a lot of the cache warmup penalty that you suffer after a reboot or failover event.
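To make the idea concrete, here is a toy sketch of a small fast cache sitting in front of a slow device, and of what the warmup penalty looks like when that cache starts out empty. This is not Flashcache's code or its actual caching policy: the `SlowDisk` and `CachedDevice` classes, the LRU eviction and the write-through behaviour are all assumptions made purely for illustration.

```python
from collections import OrderedDict


class SlowDisk:
    """Hypothetical stand-in for rotational media; counts reads that reach it."""

    def __init__(self, blocks):
        self.blocks = dict(blocks)
        self.reads = 0

    def read(self, block_no):
        self.reads += 1
        return self.blocks[block_no]

    def write(self, block_no, data):
        self.blocks[block_no] = data


class CachedDevice:
    """Toy LRU cache in front of a backing device (assumed write-through)."""

    def __init__(self, backing, capacity):
        self.backing = backing
        self.capacity = capacity
        self.cache = OrderedDict()  # block number -> data, oldest first
        self.hits = 0
        self.misses = 0

    def read(self, block_no):
        if block_no in self.cache:
            self.cache.move_to_end(block_no)  # mark as most recently used
            self.hits += 1
            return self.cache[block_no]
        # Cache miss: go to the slow device, then remember the block.
        self.misses += 1
        data = self.backing.read(block_no)
        self._store(block_no, data)
        return data

    def write(self, block_no, data):
        # Write-through: the slow device is always updated first.
        self.backing.write(block_no, data)
        self._store(block_no, data)

    def _store(self, block_no, data):
        self.cache[block_no] = data
        self.cache.move_to_end(block_no)
        while len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict the least recently used block


def run_workload(device, block_numbers):
    """Read the given blocks in order and return their contents."""
    return [device.read(n) for n in block_numbers]
```

Running the same workload twice against one `CachedDevice` sends the second pass entirely to the cache. Creating a fresh `CachedDevice` over the same disk, which is roughly what a failover to a node with an empty cache looks like, sends every read back to the slow device again. That is the penalty a replicated cache would avoid.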

Flashcache is also fairly configurable, and exposes a lot of stuff through procfs rather than being a black box.

At the moment you have to build it as an out-of-tree module, so of course it’s not the kind of thing we’ll be rushing into production any time soon. Based on what we’ve seen in the past, I reckon there’s a good chance we’ll see Flashcache in mainline in a year or two if there’s a concerted push on development.


Just because you CAN, Doesn’t mean you SHOULD

Published September 25th, 2009 by matt

(Yeah, I’ve been really slack with the blog posts about Project Starbug, but unfortunately when the choice is between doing the cool stuff, and blogging about it, the blogging tends to lose. I am still planning on writing all about things when things die down. In the meantime…)

Remember when you were a kid, and every time you got a new toy you’d just have to play with it all the time? That mentality doesn’t go away as you grow up, it just gets a little more sophisticated. With new technologies, I’m still very much this way. I remember when I first learnt about flex and bison — for the next six months or so, every programming problem I encountered just had to be solved with a minilanguage implemented in flex/bison. I shudder to think that any of that code might still be out there…

Anyway, this week’s shiny new toy has been Heartbeat / Pacemaker. I’ve played with it a fair bit in the past, but just in two-node (Heartbeat v1) clusters. For Project Starbug, though, I’ve been taking it to new heights of awesome (multi-node, easily expandable HA VM clusters, for example). So, of course, anywhere that a bit of high-availability might be good, I’ve laid it on thick. With the Puppet manifests we’ve got for managing Pacemaker, it’s almost harder not to make something HA (seriously, our Pacemaker manifests are awesome).

Unfortunately, in a couple of places I kinda forgot that some services have their own ways of doing HA, and they’re generally superior to tying a service and an IP together and telling Pacemaker to go do its thing. The two services that I’ve just converted back away from Heartbeat are NTP and DNS. Yeah, that’s right: I set up Pacemaker resources for our NTP server and DNS server, because I suffer from occasional bouts of acute “shiny toy syndrome”. I’ve now recovered, having learnt my lesson (for now).
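For the curious, the service-native approach for both DNS and NTP is, broadly, that clients are given a list of several servers and simply move on to the next one when a server stops answering. A rough sketch of that client-side failover follows. The server names, the `ServerDown` exception and the fake `send` function are all made up for illustration; no real network calls or resolver APIs are involved.

```python
class ServerDown(Exception):
    """Raised when a (hypothetical) server does not answer."""


def query_with_failover(servers, query, send):
    """Try each server in order and return (server, answer) from the first that responds."""
    errors = []
    for server in servers:
        try:
            return server, send(server, query)
        except ServerDown as exc:
            errors.append((server, str(exc)))  # remember the failure, try the next one
    raise ServerDown(f"all servers failed: {errors}")


def make_fake_send(down):
    """Build a stand-in for a network call; servers named in `down` never answer."""

    def send(server, query):
        if server in down:
            raise ServerDown(f"{server} timed out")
        return f"answer to {query!r}"

    return send
```

With two servers configured and the first one down, the client still gets its answer from the second, with no floating IP or cluster manager involved.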
