Bringing the Mountain to Mohamed

November 13th, 2009

I have never in my life been asked, “How do porcupines make love?”. However, I know the answer very well: “very carefully”. In the same vein, when migrating the mass of data that makes up Github, you take your time and you work very, very carefully. Since this sort of migration doesn’t happen every day, and it’s not something you want to be learning on the job, I thought I’d write down my experiences for posterity.

SCRIPT IT!

As a big fan of automation, there wasn’t much chance that this whole thing wasn’t going to be scripted up the wazoo. We just need to copy the filesystem data across, dump the database and load it into the new site… and we’re done. Right?

HA! Not likely. To give you an idea of the scale of this thing, it took close to 24 hours just to do an rsync scan of the repository filesystems, without actually copying any data. Then there’s the database — the events table alone contained approximately 81.5 million records, which took a great many hours to dump from the live database during pre-migration work. It doesn’t take a great mathematician to realise that copying all this data over the Internet while the site was down for business wasn’t going to fly.

Initially, we were going to rely on the bandwidth of a station wagon full of tapes (or a couple of USB drives in a FedEx jet, anyway) to do the initial copying of data. However, due to some technical problems at the old facility, the “average transfer rate” wasn’t very high (the copy to disk took several weeks to complete), and we ended up kicking off a network-based initial sync of the repository data that finished less than half an hour after the drives were plugged into the machines at the new data centre. While I’m still a fan of shipping disks around for large-scale transfers, I won’t discount using the Internet to transfer such a large data set around so quickly next time.

Incrementalism

Since a single real-time copy wasn’t practical, we’d have to look to incremental copying, where we pre-sync as much data as possible before the Big Cutover Day, and then only copy the latest changes while the site is down.

Thankfully, Github’s software design has pretty much all the hooks we needed to make this a straightforward task. For example, we didn’t have to dump the entire events table, because once a row is written it’s never changed — so we only need to dump events that were created since the last dump.
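
To make the events trick concrete, here is a minimal sketch of an incremental dump. It is not the script we ran: the host names, the database name and the auto-incrementing `id` column are all assumptions for illustration, and credentials are assumed to come from `~/.my.cnf`.

```shell
#!/bin/bash
# Sketch only: hosts, database name and the `id` column are hypothetical.
set -o pipefail

SRC_HOST=old-db.example.com
DST_HOST=new-db.example.com
DB=site_production

# Ask the destination table where it is up to (0 if it is empty).
last_synced_id() {
    mysql -h "$DST_HOST" -N -B \
        -e 'SELECT COALESCE(MAX(id), 0) FROM events' "$DB"
}

# Build the row filter for everything newer than the given id.
events_where_clause() {
    printf 'id > %d' "$1"
}

# Dump only the new rows (no CREATE TABLE) and load them at the destination.
sync_new_events() {
    local last
    last=$(last_synced_id) || return 1
    mysqldump -h "$SRC_HOST" --no-create-info \
        --where="$(events_where_clause "$last")" "$DB" events |
        mysql -h "$DST_HOST" "$DB"
}
```

Because rows are never changed once written, re-running `sync_new_events` is safe: each run picks up from whatever the destination already holds.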

The system also keeps track of the last time a repository was changed, which means that we can ask the database for a list of repositories that have changed since the last sync, which makes for a very simple (and quick!) incremental sync. For a smaller data set we would just use rsync directly, but due to the performance limitations of the previous hosting environment, this took far, far too long to do with just rsync.

So, we can script everything, and there’s the ability to do repeated incremental syncs. What do these scripts look like?

Well, first up, there’s a lot of them. It was best to write separate scripts to synchronise each data set — one for the repositories, one for the events table, one for the rest of the database, one for gists, and so on. This meant that it was fairly trivial to develop these scripts in parallel, and they could be tested and run independently of each other.

Also, each task that had to be performed for a given data set was in its own script, so each step could be tested independently. For example, the repo sync job consisted of one script to collect the list of repos that needed resyncing and write that list to disk, another script to sync a single repository, and a third script to loop over all the repos listed by the first script and invoke the second script for each of them.
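
As a rough sketch of that three-part split (shown here as functions in one file for brevity), the following assumes the first script has already written one repository path per line to a list file. The destination host and the file layout are hypothetical.

```shell
#!/bin/bash
# Sketch only: destination host and list-file format are assumptions.
DEST=new-fs.example.com

# Second script: sync a single repository directory.
sync_repo() {
    rsync -a --delete "$1/" "$DEST:$1/"
}

# Third script: invoke the per-repo command for every path in the list,
# writing failures to a second file so they can be retried later.
# Prints the number of failures.
sync_listed_repos() {
    local list=$1 failed=$2 sync_cmd=${3:-sync_repo} repo failures=0
    : > "$failed"
    while IFS= read -r repo; do
        [ -n "$repo" ] || continue
        # </dev/null stops rsync/ssh from swallowing the rest of the list.
        if ! "$sync_cmd" "$repo" < /dev/null; then
            echo "$repo" >> "$failed"
            failures=$((failures + 1))
        fi
    done < "$list"
    echo "$failures"
}
```

Keeping the per-repo step separate means a single problem repository can be re-synced by hand without touching the loop.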

The other important properties of these scripts were:

  • We relied heavily on multitasking to overcome bandwidth limitations from a single TCP stream. When you’re copying data over high capacity links, your available transfer rate is constrained more by the round-trip time between the endpoints than the available bandwidth — the longer it takes for an ACK to get back to the sender, the slower your data will flow. So, since we had eight filesystems to copy data from, we fired off eight parallel rsync processes as child processes of the individual scripts.
  • Each script kept track of what it was doing and what it had done, and tried to avoid doing the same work again. The repository syncs kept track of the repositories that had already been copied by means of a timestamp file — when we did a sync, we touched a file and then used the mtime of that file (stat -c %Y ftw!) to determine the start time of the next sync. The events table was straightforward — before each dump, we just asked the destination table where it was up to, and dumped from there. Even the “main” database, which we dumped in its entirety each time, was dumped to a file compressed with `gzip --rsyncable` before being rsync’d across, saving a good few minutes of network transfer time on each cycle.
  • If something went wrong during the sync, we knew about it immediately. We wired up a small SMS sending script to send us alerts if the script terminated improperly. This saved us a lot of waiting and watching, because we knew that we’d be told when we had to take notice of what’s going on.
  • Everything was logged. The stdout and stderr of all processes were captured, and the scripts wrote their own log entries to that stream as well as to a “summary” log, like this: echo $(date) processing repo $repo |tee -a $LOGFILE. Any errors were tagged with a unique string and written in a machine-parseable format, so we could re-run any failed components of the sync to ensure that nobody was missed.
  • While there were typically several scripts that had to be run in an appropriate order to make a sync happen, there was always a single script that did everything that needed doing — we never had to run more than one command to get a given sync done.
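
A couple of the properties above can be sketched together. The reason parallel streams help is that a single TCP stream can never go faster than its window size divided by the round-trip time, so several streams side by side get more out of a long, fat link. The sketch below runs one job per filesystem, captures each job's output, and tags failures in a summary log. The `SYNC-FAILED` tag, the log layout and the function names are assumptions, not what we actually used.

```shell
#!/bin/bash
# Sketch only: the SYNC-FAILED tag and log layout are hypothetical.
LOGDIR=${LOGDIR:-logs}
SUMMARY=${SUMMARY:-$LOGDIR/summary.log}

# Timestamped log line to stdout and to the summary log.
log() {
    echo "$(date '+%F %T') $*" | tee -a "$SUMMARY"
}

# Run one command per filesystem in parallel, then wait for all of them.
# Failed jobs are tagged so they can be grepped out and re-run.
# The return status is the number of failed jobs.
sync_in_parallel() {
    local cmd=$1 fs i failures=0
    local pids=() names=()
    shift
    mkdir -p "$LOGDIR"
    for fs in "$@"; do
        "$cmd" "$fs" > "$LOGDIR/$(basename "$fs").log" 2>&1 &
        pids+=("$!")
        names+=("$fs")
    done
    for i in "${!pids[@]}"; do
        if wait "${pids[$i]}"; then
            log "done fs=${names[$i]}"
        else
            log "SYNC-FAILED fs=${names[$i]}"
            failures=$((failures + 1))
        fi
    done
    return "$failures"
}
```

With a per-filesystem function such as a hypothetical `sync_fs`, a run over eight mount points would look like `sync_in_parallel sync_fs /data/fs{1..8}`, and `grep SYNC-FAILED logs/summary.log` gives the list of things to redo.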

Once all of these individual scripts had been written, tested, debugged, tested a few more times, and generally fretted over until our nails were chewed to the quick, it was time to assemble the master script. I’m not about to run a dozen scripts to migrate a site when one will suffice. This was particularly important in Github’s case because to minimise downtime we wanted to run several things in parallel, then wait until they’d all finished, then run the syncs that depended on the data we’d synced in the last lot, and so on. Our scripts looked a lot like this:

task1 >logs/task1.log 2>&1 &
task1_pid=${!}

task2 >logs/task2.log 2>&1 &
task2_pid=${!}

wait $task1_pid
wait $task2_pid

task3 >logs/task3.log 2>&1 &
task3_pid=${!}

task4 >logs/task4.log 2>&1 &
task4_pid=${!}

task5 >logs/task5.log 2>&1 &
task5_pid=${!}

wait $task3_pid
wait $task4_pid
wait $task5_pid

There was also a pile of “doing this, now doing this, now doing this” logging (with timestamps) that helped us to get a feel for how long the different parts would take, and where everything was up to.

When we actually performed the cutover, the “main” sync script was running for a total of 27 minutes. Given that we’d given ourselves an hour to get everything across, we were all quite pleased with this outcome.

Putting on the brakes

Whilst all these scripts ran really well, and the background processes made everything run really fast, I must say it was a right pain in the butt to stop things mid-flight when it was necessary. Hitting Ctrl-C only stopped the foreground (controller) script, and all of the children that had been started in the background kept flying along.

Doing this again, I’d make sure all my scripts had traps on SIGINT that killed off all the child processes that they had spawned. In retrospect, this is just a variant of “one script to start everything” — you should only need to do one thing (Ctrl-C) to stop it all, as well.
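
For what it's worth, here is a minimal sketch of what that trap could look like in a controller script. The `spawn` and `kill_children` helpers are names I've made up for illustration.

```shell
#!/bin/bash
# Sketch only: helper names are hypothetical.
pids=()

# Start a command in the background and remember its PID.
spawn() {
    "$@" &
    pids+=("$!")
}

# Send SIGTERM to every child we started, then reap them.
kill_children() {
    local pid
    for pid in "${pids[@]}"; do
        kill "$pid" 2>/dev/null
    done
    wait
}

# One Ctrl-C (or a plain kill) stops the controller and its children.
trap 'kill_children; exit 130' INT TERM
```

The controller then uses `spawn task1`, `spawn task2` and a `wait` instead of bare `&`. Note that this only reaches direct children, so each sub-script that backgrounds its own rsync processes needs the same trap.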

Also, the timestamp files weren’t handled real well. If you did kill things off mid-run (or, heaven forbid, a script crashed out) then the timestamp files would be wrong, because we just did a straight touch at the beginning of the script. What would have been better would be something like this:

touch stamp.new
do_all_the_work
mv stamp stamp.prev
mv stamp.new stamp

This would make sure that premature death would leave the stamp as-is, while still capturing the true start time of the job (which a simple touch at the end would fail to do).

When Databases Attack

Testing the new site before we let users at it, we found that creating gists wasn’t working right. It turned out that the database dumping script didn’t have the right set of options, and the schemas of the tables weren’t quite right (no autoincrements), and that was giving gist creation conniptions. Thankfully, the bug in the script was quickly spotted and the database dump was re-run. We even managed to get the second dump and load completed before our scheduled maintenance window was finished. If our scripts hadn’t been broken down by data set, this resyncing process would have been made a whole lot harder because we wouldn’t have been able to easily run just the parts that needed to be redone.

Once we opened the floodgates of the new site, everything ran happily for a minute or two, and then ground to a halt. The whaaaaa? Poke, prod… hmm, the database is running a bit hotter than I’d expect… whoa! 1500 queries active, all against the events table, with the disks working so hard the heads nearly came out the sides of the cases. What’s going on here?

As it turns out, schema insanity had struck again — this time, some of the indexes on the events table had failed to come across. While we know what happened with the main database dump, this one is still a mystery. How did some of the indexes fail to materialise? We’ve gone over the dumps and can’t find how they got lost. We’re putting it down to yet another case of MySQL doing dumb things without telling anyone.

Limiting the impact

As a final small improvement to the migration process, the site was able to go into a “read only” mode, so that users could still browse code and pull from repositories while we were migrating. This made the migration a lot less intrusive for users, because a lot of site functions still worked, especially those used by casual users (who would be less likely to know all about the time of the migration).

Lessons Learnt

Here are a few things I’ll definitely do differently next time:

  • Anywhere you’re depending on a third party to execute part of your migration, have a backup plan in case they can’t deliver — and know when you’ll have to execute your backup plan. In our case, knowing exactly how long it would have taken to copy all the data over the Internet and then calculating back, we would have known to start copying over the network a few days earlier than we did.
  • Make sure that synchronisation scripts are as easy to stop as they are to start.
  • Verify the database schemas completely on the destination DB server by manual inspection, as well as dumping them and comparing to what’s on the source DB server.
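
The dump-and-compare half of that last point could be sketched roughly as follows. Host and database names are placeholders, and this complements manual inspection rather than replacing it.

```shell
#!/bin/bash
# Sketch only: host and database names are hypothetical.

# Dump table definitions only (no rows) from one server.
dump_schema() {
    mysqldump -h "$1" --no-data --skip-comments "$2"
}

# Strip the per-table AUTO_INCREMENT counter, which legitimately differs
# between servers, while keeping the AUTO_INCREMENT column attribute.
normalise_schema() {
    sed -E 's/ AUTO_INCREMENT=[0-9]+//'
}

# Succeeds only if both servers report identical table definitions.
# Usage: compare_schemas SRC_HOST DST_HOST DATABASE
compare_schemas() {
    diff -u <(dump_schema "$1" "$3" | normalise_schema) \
            <(dump_schema "$2" "$3" | normalise_schema)
}
```

A missing autoincrement attribute or a missing index shows up as a line in the diff, which is exactly the class of problem that bit us twice.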

I wonder when we’ll get our next Github-scale migration…

Anchr 2.0

November 4th, 2009

We heartily endorse this event or product!

Anchr 2.0 makes you want to reach out and touch it; hold it; feel it. Your Anchr 2.0 pulsates with a reassuring rhythm, like that of a heart, but made of silicone instead of striated cardiac muscle.

Anchr 2.0 responds.. it is alive. If you listen carefully you can hear its machinations, at speeds beyond the limits of human ken. Don’t Panic – this is normal, but a helpful voice is always close by when you need it.

Anchr 2.0 is not made, but created. Observe its perfect finish and seamless form. The dull blue glow of security, punctuated by the cerise of backups. Anchr 2.0 fits snugly in the hands. Firm, but also yielding, you cannot discern the boundary; that is the sensation of redundancy. It is comforting.

Anchr 2.0 is communal, it is shared. But! A duality of nature: There is one, but there are also many. That is your Anchr 2.0; there are many like it, but that one is yours.

Anchr 2.0 is… everything you love about webhosting, with less crap

Load balancing at Github: Why ldirectord?

October 31st, 2009

Some comments on Github’s blog post “How We Made Github Fast” have been asking about why ldirectord was chosen as the load balancer for the new site. Since I made most of the architecture decisions for the Github project, it’s probably easiest if I answer that question directly here, rather than in a comment.

Why ldirectord rocks

The reasons for Github using ldirectord are fairly straightforward:

  • I have a lot of experience with ldirectord. Never underestimate the value of knowing where the bodies are buried. In ldirectord’s case, there aren’t many skeletons, but “better the devil you know” is a valid argument. If you’ve got strong experience in making something work (and you’ve managed to make it work), and you don’t have a lot of time for science experiments, then there’s a lot to be said for going with what you know.

    This goes beyond simply knowing what to do when things go wrong, of course. You’ll also know how to install and configure it already, how to monitor it, and so on.

    What’s more, in ldirectord’s case I had already proven that it worked in an architecture almost identical to Github’s, and with a similar load profile. At a previous job, I had ldirectord serving a sustained aggregate of 2500 TCP connections per second on a 128MB Xen VM, passing to a large set of backends in a manner almost identical to Github.

  • Anchor has a lot of experience with ldirectord. Whilst my experiences are one thing, there’s a lot more to building an infrastructure than just setting it up. I like to take holidays as much as anyone, and so there was no point in using something that nobody else in the company had any experience with, if there was something else that we did all know about.

    Thankfully, ldirectord lined up nicely, since it’s what we use for our other load balancing setups (not set up by me, either — these were already in place before I arrived). This meant that there was already a pile of documentation and knowledge amongst the sysadmin team about ldirectord and its quirks. Also, being automation junkies, we already had Puppet dialled in to install and configure ldirectord, and we knew exactly how to monitor it.

  • Ldirectord will do the job. With the prior experiences of myself and the rest of the Anchor team, we were confident that ldirectord would do the job, and at the end of the day that’s what really matters.

The Alternatives

It’s all well and good to say “we know it and it works”, but I’m not really expecting anyone to just read that and say “well, OK, I guess we’ll use ldirectord”. In fact, if you apply the above criteria to your own situation, there’s every possibility that you’ll come up with a different answer — and if you’ve never set up a load balancer at all, then you’ve got no experiences to use to guide you.

So, here are the other load balancing options I’ve dealt with, and what I think of them. This might give you a bit of food for thought when choosing your load balancer.

  • keepalived. This is the project closest to ldirectord in terms of functionality and operation. It actually uses the same load balancing “core” as ldirectord, IPVS, part of the Linux Virtual Server project. As such, it performs similarly to ldirectord when it comes to actually redirecting requests to backends, and is another excellent choice for load balancing.

    For Github, though, there wasn’t any benefit in using keepalived. Whilst I used keepalived extensively at my last job, nobody else at Anchor had had much to do with it. Also, keepalived has a built-in failover mechanism, which we didn’t need because we already use Heartbeat/Pacemaker for all our HA/failover requirements. I also feel that keepalived is more complicated when compared directly to ldirectord, largely because of its built-in failover capabilities. That’s not to say that combining Pacemaker and ldirectord is dirt simple, but if you’ve already got Pacemaker on hand anyway…

    If all you needed was an HA load balancer, and you had no experience with either ldirectord or keepalived, I’d probably recommend keepalived over ldirectord, as it’s one project and one piece of software to do everything you need.

  • Load-balancing appliances. Sometimes misleadingly referred to as “hardware” load balancers (they’re still chock full of software, kids — and unlike high-end routers, I don’t know of any true L4 load balancer that has its forwarding plane entirely in hardware).

    I loathe these things. They’re expensive, restrictive, slow, and generally cause you a lot more pain and suffering than they’re worth. At my last job, one of my projects was to convert most of one of our existing clusters from a load-balancing appliance to use keepalived. Why would we do this? Because the $100k worth of appliance wasn’t capable of doing the job that $15k worth of commodity hardware and an installation of keepalived were handling with ease — and with capacity to spare. That cluster was our smallest, too, with probably only 2/3 the capacity of the other clusters run by keepalived.

    At the job where I had ldirectord handling 2500 conn/sec, we had also previously used a load-balancing appliance, which was supplied and managed by the hosting provider. It was a management nightmare — we couldn’t get any useful statistics out of it at all, like the conn/sec coming in or going out, and we couldn’t usefully adjust the weightings of each backend (to tune how many connections were going to each different sort of machine) or manage the system in real-time. When we switched to using ldirectord, a small shell script (involving watch and ipvsadm, mostly) was all it took for the CTO to be able to watch exactly how the cluster was performing, in real time, throughout the day. He loved the visibility — and the fact that we were saving several hundred dollars a month didn’t hurt, either.

  • haproxy. While we use haproxy extensively within Github, I don’t think haproxy is the right solution as the front-end load balancer for a high volume website. Being a proxy, rather than a simple TCP connection redirector, it has much larger overheads in CPU and memory, and adds more latency to the connections. All of Github’s load balancing is being done out of one small VM, and it barely raises a sweat. The return traffic doesn’t even go back through the load balancer at Github, since we’re using a really neat mode of IPVS that allows the traffic to return to the client directly. While you can throw hardware at the load balancing problem, I still prefer to be efficient where possible.

    Since haproxy makes a second TCP connection, rather than just redirecting an existing one, it mangles the source IP address information — and while you can work around that in HTTP with custom headers, that doesn’t work for other protocols like SSH. I cringe at the thought of trying to defend against a DDoS attack when the most useful piece of diagnostic information (the source IP) can’t be correlated against the actions of an attacker on the site.

    If all you know is haproxy, and you’re running a low-volume site that only has to deal with HTTP(S), then haproxy will probably do the job — it’s certainly handling more connections inside Github than most sites will ever see. However, I’d recommend getting someone who does systems administration full-time (like us!) to install and manage a real load balancer like ldirectord rather than use haproxy, along with keeping your other basic infrastructure on track. Wouldn’t you rather be developing new features rather than dealing with this stuff?
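
The small watch-and-ipvsadm script mentioned in the appliance discussion above was nothing clever. I don't have the original to hand, so treat the following as a sketch of the idea only; the function name and the refresh interval are assumptions.

```shell
#!/bin/bash
# Sketch only: not the original script. Needs root and ipvsadm installed
# on the load balancer itself.

# Redraw the IPVS table every few seconds with per-second rates
# (connections, packets and bytes) for each virtual service and backend.
# Usage: watch_cluster [INTERVAL_SECONDS]
watch_cluster() {
    watch -n "${1:-2}" ipvsadm -L -n --rate
}
```

Swapping `--rate` for `--stats` shows cumulative totals rather than per-second rates.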

So, there’s one geek’s opinions on load balancing. Questions and comments appreciated, and if you’d like to know more about any part of the Github architecture (or any other aspect of systems administration), please let us know in the comments and I’ll whip up some more blog posts.

Envy our new Leviathan!

October 19th, 2009

Our current rdiff and amanda backup server, KRAKEN, is almost full, so it was time to order a new one. After much wrangling, we finally received LEVIATHAN this morning.

LEVIATHAN is, I assure you, teh hardk0rez - dual xeon 5500-series, 6gb RAM and 12TB usable storage in RAID-10

I was pushing for PHYREXIAN DREADNOUGHT personally, but LEVIATHAN is acceptable too; the upkeep effort of backup servers is pretty high after all.

New dedicated server upgrade offering

October 10th, 2009

This is, of course, a fantastic idea:
http://en.gentoo-wiki.com/wiki/Using_Graphics_Card_Memory_as_Swap

Anchor loves to stay abreast of the latest performance options. As such, we’re proud to announce a new range of upgrade options for our dedicated server customers that demand the absolute best in performance for their customers.

It makes sense, really. The best our current systems offer is puny DDR2 memory. Just think of what you could do with several gig of GDDR5. That’s right, FIVE! We’re now offering upgrade options with Geforce 320 and Geforce 340 cards. If you order one of our higher-specced (2RU) dedicated servers, you can have two of these puppies strapped together for insane amounts of swappiness.

Stay tuned for more news on how we’re rolling out ButterFS, phase-change cooling, overvolted Core2 Quad servers, and mass-scale SSD RAID-0 arrays for database optimisation.

Interesting failure modes, episode 2501

October 5th, 2009

I got woken up by an SMS for low diskspace the other night on one of our customers’ servers. Okay, so that’s a lie, I never sleep, but the SMS is real.

Oh great, they’re making whoopie on their mailing lists again and making some stupidly huge logfile.

Little did I know just how huge that file was. How about 735GB huge, in the space of 12 hours? This customer is already a bit of an oddball, what with 1.4TiB of usable space in their server. “Oh that’s nothing”, you say. Sure, I’ve got a few TiB of kitten pictures on my machine at home, just like you, but to put things in perspective: 300GiB of space would be “big” for most Anchor customers. SCSI disks cost about $1.70/GB, compared to about 10c/GB for SATA.

There was no mailout. No big processing job, and no flood of activity. With a little digging I was able to nail it down to an apache errorlog file. That was a surprise, except for the PHP errors all throughout – some things never change.

[Fri Oct 02 02:39:57 2009] [error] [client 63.82.71.139] PHP Warning: fgets(): supplied
argument is not a valid stream resource in /home/wright/public_html/script.php on
line 15, referer: XXX

Nice work there, guys. You need to learn to check your return values from failure-prone functions.

Strangely, there were no actual active connections, but the process list showed two apache processes going balls to the wall, writing the same error message to the log file ad infinitum. By my reckoning that was over 9000 lines per second – nothing a quick service-restart couldn’t fix, thankfully.

And to actually fix the problem? It’s tempting to dump the file, but we don’t like doing that; it’s just a bit too cowboy for us. I settled for a forced logrotate run, taking about 4hrs and squishing it down to just 4.3GiB – Crisis (and sleep) Averted.
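
For the curious, the two moving parts here can be sketched in a few lines: measuring how fast a file is growing, and forcing a rotation. The function names and the logrotate config path are assumptions, and `stat -c` is the GNU form.

```shell
#!/bin/bash
# Sketch only: function names and the default config path are hypothetical.

# Bytes per second being appended to a file, sampled over an interval.
# Uses the file size rather than counting lines, which would take far
# too long on a file this big.
growth_rate() {
    local file=$1 secs=${2:-5} before after
    before=$(stat -c %s "$file")
    sleep "$secs"
    after=$(stat -c %s "$file")
    echo $(( (after - before) / secs ))
}

# Rotate now (compressing if the config says so) instead of waiting
# for the next scheduled run.
force_rotate() {
    logrotate -f "${1:-/etc/logrotate.d/httpd}"
}
```

Running `growth_rate` against each of the big files in a log directory is a quick way to tell an old fat file from one that is actively ballooning.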

Ooh, bugger…

October 2nd, 2009

And this is why we co-locate in Globalswitch, a top-tier facility with floors that AREN’T MADE OF BALSA WOOD.

Racks are pretty heavy, sure, but they totally wtfpwned those tables there

Upping the maternal ante

October 2nd, 2009

We’re gonna need another drinks fridge.

The third refrigeration unit has been ordered, ETA next week

Performance tips – good reading for PHP/mysql devs

October 1st, 2009

I came across this a little while ago; it’s a good little presentation with some interesting points I’d not considered before.

http://www.slideshare.net/techdude/how-to-kill-mysql-performance

If you’re an Anchor customer, I should point out that the ARCHIVE storage engine isn’t available in Redhat’s version of MySQL, which is a damned nuisance. :(

Pyramid of Productivity pt.2

October 1st, 2009
************************************************************
PRESS RELEASE
FOR IMMEDIATE DISTRIBUTION
************************************************************

ANCHOR SYSTEMS SYSADMINS TO SEEK GUINNESS VERIFICATION
AS "MOST-WIRED MOFOS ON THA PLANET"

Maternal pyramid

It’s a good thing we got those stubby-holders, them mothas is ice cold!