Virtualisation: It’s a Technology, not a Religion

Published September 30th, 2009 by matt

It’s been interesting to look at the press coverage, blog posts, and tweets surrounding the move of Github to an Anchor-managed infrastructure — I’ve never worked on something so public before. I think the article about “Vampire Programmers” has been my favourite so far.

The ZDnet article on the Github move gave me a wry chuckle, though. It made it sound like the move signified some sort of rejection of the Church of the Hypervisor, as if virtualisation had been tested and found wanting. In actual fact, there are more virtual machines running in the Github infrastructure now than there were previously, providing a lot of essential services.

I really don’t think of myself as a virtualisation nay-sayer. I started using virtual machines with User-Mode Linux, back before anyone outside of Cambridge had ever heard of Xen, and I got on board with Xen back in the 2.0 days. I’ve introduced widespread virtualisation at two previous jobs, I was a big supporter of the use of virtualisation at my last job, and I’ve been working on Anchor’s High-Availability VM product recently. Virtualisation hater I ain’t.

Conversely, though, I don’t think VMs are the answer to all the world’s problems. They’re a fantastic opportunity for a lot of sites: everyone can be running on high-quality, server-grade hardware (redundant power, hardware RAID, fast buses, etc.) without the need to either purchase or maintain that hardware. Furthermore, each VM, by virtue of its isolation, is more easily managed and scaled independently of the other VMs. Need more memory? Allocate it. This box is getting a little overloaded? No problem, just move a VM to another piece of hardware.

The simple fact is that very, very few sites need a whole dedicated server — even an entry-level server is massive overkill for most sites. In this situation, you can either:

  • Spend the extra money, assuming that you’ll grow and recoup those costs;
  • Buy a cheaper machine, either a basic desktop machine or second-hand server, and take the hit in reliability;
  • Use shared hosting, where everyone’s on the same OS installation (which has tradeoffs in control and isolation); or
  • Use a virtual machine.

Unsurprisingly, I like the last option. It saves you money, and it avoids both the reliability headaches of cheaper hardware and the management headaches of shared hosting.

Management is the big ongoing cost of most sites. Virtualisation simplifies that by isolating different sites and services from each other, so that when it comes time to scale them, it’s not a big job. Most people who’ve worked as a developer or sysadmin will be able to recall the unpleasant feeling when that big ball of wax that everyone calls “the server” starts to run out of huff, and there’s no better hardware to put it on, and no more software optimisation to be done. The call goes out: “move some services to another server”. Damn.

See, when everything’s on the one machine, the services intertwine and become hard to separate. That little hack that Roger The Talented Intern put in to make mail processing run faster? That involved digging into the SMTP server queue and pulling out messages directly; if you separate the web server and the mail server, that’ll break, and I bet you won’t find that out until you move.

I hate doing archaeology on these sorts of machines, because it’s guaranteed that things will break, tempers will run hot, and sadness will result. The cost of doing the move (in IT staff time, downtime, customer and staff dissatisfaction, and so on) can easily equal or exceed that of the hardware itself — and yes, I’m still talking about good-quality, server-grade hardware here. People are expensive, and good people even more so.

Instead, if you run logically separate services in separate VMs, when the time comes to scale something, it really is a piece of cake to migrate a VM: shut it down, copy the disk image, boot it back up. Piece of cake. Sure, there’s some overhead in running those separate services in VMs, and yes, you’ll be looking to buy a second machine sooner than you would otherwise, but again, the savings made by not having to gently tease apart a dozen root-bound systems on a single machine will probably pay for that second machine. Let’s not even consider the costs of another separation in two years’ time when the services you put onto that other machine need to be separated again…
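To make the "shut down, copy, boot" sequence concrete, here is a minimal Python sketch. It is a toy under stated assumptions, not how any particular hypervisor does it: the `VM` class and `cold_migrate` function are hypothetical names, a "host" is just a directory, and "shutting down" and "booting" only flip a flag. A real migration would go through your hypervisor's own tools.

```python
import shutil
from dataclasses import dataclass
from pathlib import Path


@dataclass
class VM:
    """Toy stand-in for a virtual machine (hypothetical, for illustration only)."""
    name: str
    disk_image: Path
    running: bool = True


def cold_migrate(vm: VM, dest_dir: Path) -> VM:
    """Move a VM by shutting it down, copying its disk image, and booting it again.

    dest_dir stands in for the storage on the destination host.
    """
    # Step 1: shut down, so the disk image is not changing while it is copied.
    vm.running = False

    # Step 2: copy the disk image to the destination host's storage.
    dest_dir.mkdir(parents=True, exist_ok=True)
    new_image = dest_dir / vm.disk_image.name
    shutil.copy2(vm.disk_image, new_image)

    # Step 3: "boot" the VM from the copied image on the new host.
    return VM(name=vm.name, disk_image=new_image, running=True)
```

The point of the sketch is how little there is to it: because the service lives entirely inside one disk image, nothing on the old host has to be untangled first.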

This use of virtualisation is all well and dandy if you’re one of the vast majority of sites that don’t need to service 125,000 users and 2.5TB of filesystem data. Github, though, is one of the (un)lucky few. When you’re using a machine’s worth (or more) of processing power on a single service, there’s no benefit to virtualising that. In Github’s case, there are four physical machines running just the frontend services, each of which has the same specs as the machines that are running the VMs for the site. Sticking the frontend services into VMs in that case would have been a fruitless move. Similarly for the backend file storage, and the database. They’re all single services consuming a machine’s worth (or more) of resources, so we give them physical machines.

Down the track, as Github grows and individual VMs work harder and need more resources, we’ll first increase the size of those VMs, before making the decision to move a power-hungry VM off onto its own physical hardware. That’s an easy move: between the natural isolation provided by virtual machines and the strong configuration management policy we’ve adopted, transitioning from a VM to a physical machine will be painless. And painless systems management is, after all, the aim of the game.
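The rule of thumb running through this post (keep small services in VMs, grow a VM when it needs more, and give a service physical hardware once it wants a whole machine's worth) can be sketched as a tiny Python function. This is only an illustration: `next_step` is a hypothetical name, the three numbers are in whatever single unit you care about (CPU cores, say), and the exact cut-offs are my assumption rather than any policy we actually apply.

```python
def next_step(demand: float, vm_size: float, host_capacity: float) -> str:
    """Suggest what to do with a service that currently runs in a VM.

    demand        -- resources the service needs
    vm_size       -- resources currently allocated to its VM
    host_capacity -- resources of one whole physical machine
    All three are in the same (assumed) unit, e.g. CPU cores.
    """
    if demand <= vm_size:
        # The VM still has headroom; nothing to do.
        return "leave as is"
    if demand < host_capacity:
        # Needs more than it has, but less than a whole machine: resize the VM.
        return "grow the VM"
    # Consuming a machine's worth (or more): virtualising it buys nothing.
    return "move to physical hardware"
```

For example, with 16-core hosts, a service in a 4-core VM that needs 6 cores gets a bigger VM, while one that needs all 16 gets its own box.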

One comment

  1. Nicholas Orr says:

    great piece of writing there on vm’s.

    My limited experience with vm’s aligns with what you are saying and know that VM is not a magic bullet. Define the task and pick appropriate tools to complete said task :)

    I had a vm with one cpu core available and I kept running into I/O wait issues. As this vm ran on a 4 core physical machine I simply added another cpu to the vm and the wait issue went away.

    Likewise I had a windows vm that was performing badly as a vm, for whatever reason. Moving it to a spare desktop spec machine fixed all the issues. This was a test environment so didn’t really need to have it going 24/7/365 etc…

    Nick :)
