Weighing the costs of High Availability
We’re rounding out our series on high availability with a discussion of the benefits of HA versus its inherent costs. If you’ve been keeping up with the previous articles, you’ve probably gotten the impression that it’s a lot of work and easy to get wrong; you’d be correct.
That said, HA definitely has its place, so it’s worth arming yourself with the knowledge to assess when it’s appropriate.
HA systems have obvious financial costs, but there’s a lot more to it than just money. We’ll talk about these first because we think it’s important to have these in mind when you assess the benefits of going down the HA path.
The pattern you’ll notice is that almost everything boils down to complexity. Complexity is your enemy when building reliable systems, but it’s a necessary evil for HA.
Right off the bat you’ll be paying more for your hosting. The exact costs depend on the complexity of your deployment, but we generally find you’re looking at doubling or tripling your costs. In the simplest case you’ll be moving from a single server to a pair of servers. A website will probably need to make use of a load balancer, which is an extra expense.
A highly available solution also takes extra time to set up, which can be a substantial cost for a large deployment.
HA systems tend not to be as agile as their Ordinary Availability brethren. Changes need to be carefully considered, and necessarily take longer when there are more systems to deal with.
As an example, consider a disk upgrade for a database server. On a non-HA server this can be done easily (and without downtime) if LVM is used and drives can be hot-added. A DRBD cluster requires the same steps, followed by a manual online extension and resync of the DRBD volume. It’s not terribly difficult, but it’s not simple enough to put into a script that can be blindly run in a matter of seconds.
HA systems are by no means maintenance-free, and many common operations need special attention when using HA. You need sysadmins available that are skilled in managing clusters safely, and this extends to out-of-hours periods as well in case something goes wrong.
Training and changes to internal processes
Anyone who regularly works on your website will have to modify their workflow to handle the HA deployment. This isn’t too onerous if you’re already using good development practices, but it requires a lot of discipline if you’re used to rolling out changes in an ad-hoc manner. Use of automation tools, such as Capistrano and Fabric, is practically mandatory to ensure smooth running.
The benefits of running a highly available environment are straightforward to describe, but actually quantifying them is another matter. This varies heavily on a case-by-case basis, so you’ll need to figure out the numbers for yourself and see where things land.
When done right, you can expect a substantial improvement in SLA terms for an HA system (though it is actually possible for an HA system to cause more downtime than it prevents, due to the added complexity).
It’s important to know the limits of your system, and just what sort of failure you’re trying to mitigate. In most cases you’re protecting against the failure of a single server chassis. Even assuming 100% uptime from a cluster of servers, your datacentre probably only guarantees 99.982% for power feeds (a Tier-3 datacentre). These are worst-case figures though, and perfect uptime on a month-to-month basis is quite normal.
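As a back-of-the-envelope check, that 99.982% figure translates into a surprising amount of allowable downtime. A quick sketch (the availability figure is the Tier-3 one above; everything else is just arithmetic):

```python
# Rough downtime budget implied by an availability percentage.
# 0.99982 is the Tier-3 datacentre power guarantee mentioned above.
MINUTES_PER_YEAR = 365.25 * 24 * 60

def downtime_minutes_per_year(availability: float) -> float:
    """Minutes of outage per year permitted by an availability fraction."""
    return (1 - availability) * MINUTES_PER_YEAR

print(f"{downtime_minutes_per_year(0.99982):.0f} minutes/year")
```

In other words, even a flawlessly engineered cluster inherits roughly an hour and a half of annual downtime budget from its power feeds alone.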
Time to recovery in the event of a failure
This is an easily overlooked benefit that’s really more important than straight uptime. A single (non-HA) server is likely to give many years of reliable service without any problems, and many issues can be avoided by using redundant components such as hotswap power supplies and hard drives, ECC memory, etc. When the server eventually fails, there will be a significant interruption to services as the chassis is replaced and configured.
This is where a highly available system really shines. In the same way that RAID gives you a safety net when a drive fails, an HA cluster will keep going when a server fails. There may be a brief interruption as services are migrated to a working node, but downtime is measured in seconds or minutes, instead of hours.
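One way to see why recovery time matters more than raw reliability is the classic formula: availability = MTBF / (MTBF + MTTR). The figures below (a failure every three years; an 8-hour chassis rebuild versus a 2-minute failover) are illustrative assumptions, not measurements:

```python
# availability = MTBF / (MTBF + MTTR), where MTBF is mean time between
# failures and MTTR is mean time to recovery.  All figures are
# hypothetical, chosen only to show the effect of shrinking MTTR.

def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the service is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

YEAR = 365.25 * 24  # hours

# Single server: fails once every three years, takes 8 hours to
# replace and reconfigure the chassis.
single = availability(3 * YEAR, 8)

# Same hardware in an HA pair: identical failure rate, but failover
# to the surviving node takes about 2 minutes.
ha_pair = availability(3 * YEAR, 2 / 60)

print(f"single server: {single:.5%}")
print(f"HA cluster:    {ha_pair:.5%}")
```

The failure rate is unchanged; only the recovery time shrinks, yet that alone moves you from roughly three nines to better than five.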
By virtue of the fact that HA clusters are designed to tolerate servers going offline, it’s possible to perform most maintenance operations without downtime, unlike regular systems. This is especially convenient when it comes to hardware changes, as each server can be taken offline and upgraded or replaced at leisure.
We suspect that in many cases, the question is a financial one. It’s an unenviable position to be in if you have to justify a lengthy outage by saying that lost sales or productivity is cheaper than having a highly available environment. That’s ultimately what it boils down to though, and you’ll need to find those hard numbers for yourself. We’re always happy to advise on the technical side of things though. 🙂
This completes our series of articles on high availability; we hope you’ve found them engaging and enlightening. We wouldn’t dare to presume that we’ve addressed everything, so if you’ve got any questions feel free to let us know.