Author Archive

SSH ControlMaster: The Good, The Bad, The Ugly

Wednesday, February 24th, 2010

Do you love SSH for the good it has done for mankind, but get annoyed by how long it takes to establish a session over a high-latency link? Perhaps you have a process that needs to make thousands of SSH connections, and you’d like a little extra speed from the whole thing. Either way, ControlMaster is your new best friend.

The concept is very simple — rather than each new SSH connection to a particular server opening up a new TCP connection, you instead multiplex all of your SSH connections down one TCP connection. The authentication only happens once, when the TCP connection is opened, and thereafter all your extra SSH sessions are sent down that connection.

If you’re SSHing between machines on the same LAN, or otherwise a short ping away, you probably wouldn’t notice the difference — the round-trip times are negligible. However, when you’re doing transcontinental SSHing (which we do often, when we’re managing customer machines in the US), it’s a godsend. On some trivial benchmarking I did when validating ControlMaster for our use, I found that we were saving nearly 2.5 seconds per connection — a drop from 3.3 seconds to 0.8. Mighty convenient.

It’s simple to use, too. If you just want to enable “opportunistic” multiplexing, you can do something as simple as this in your SSH config:

Host *
    ControlMaster auto
    ControlPath ~/.ssh/cm_socket/%r@%h:%p

Then mkdir ~/.ssh/cm_socket, and you’re away. Any time a connection to a remote server exists, it’ll be used as the master for any other connections. Perusal of the ssh_config(5) manpage should give you the necessary hints to set up more restrictive configurations. If you need to disable ControlMaster for a given connection (the reasons why this might be necessary will be covered shortly), you can pass -S none to ssh (or set ControlPath none).


Whilst this basic setup is undeniable, pure, distilled awesome, there are some limitations and caveats to beware of. The first, and most important, is that SSH session multiplexing isn’t particularly stable when you try to put a lot of data down it from a lot of connections at once. This came to light fairly early on in my testing, when I stress-tested things by doing about 25 concurrent rsync runs all at once. The result was a large number of rsync sessions going “aiee!” and falling over. So, don’t do that.

The second, semi-related problem, is a simple bandwidth issue. For a given connection latency and TCP configuration, there is a hard limit to how fast you can send data, due to the time it takes to acknowledge the packets being received. When you’re multiplexing multiple file transfers down the one TCP connection, therefore, your total transfer speed will be limited by this TCP speed limit. Once again, it’s unlikely that this will cause you problems on a LAN (where round-trip delays are negligible), but in the high-latency world where connection sharing does the most good from a connection setup perspective, the speed limits will cause much wailing and gnashing of teeth. So, the take home message is: if you’re doing a lot of heavy data transfer over SSH, ControlMaster probably isn’t the solution for your problems. Instead, run multiple concurrent SSH connections, as the TCP speed limits are per-connection, so you can still fill your high-latency gigabit pipe — you just need lots of concurrent connections to do it (see also: BitTorrent).
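To put rough numbers on that limit: a single TCP stream can have at most one window of data in flight per round trip, so its throughput tops out at roughly window size divided by round-trip time. The sketch below works that through in shell arithmetic. The 64 KiB window (the most you get without window scaling) and the 200 ms round trip are illustrative assumptions, not measurements from our links.

```shell
#!/bin/sh
# Rough ceiling for one TCP stream: window size / round-trip time.
# The window and RTT values used below are assumed, for illustration only.

# max_kbit WINDOW_BYTES RTT_MS -> approximate ceiling in kbit/s
max_kbit() {
    # bytes * 8 = bits per round trip; bits per millisecond == kbit/s
    echo $(( $1 * 8 / $2 ))
}

# streams_needed LINK_KBIT PER_STREAM_KBIT -> streams to fill the link (rounded up)
streams_needed() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

per_stream=$(max_kbit 65536 200)
echo "one stream: about ${per_stream} kbit/s"
echo "streams to fill 1 Gbit/s: $(streams_needed 1000000 "$per_stream")"
```

With those assumed figures one stream manages about 2.6 Mbit/s, which is why it takes hundreds of parallel connections, not one multiplexed one, to fill a fat high-latency pipe.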

Finally, there is something of an annoyance with ControlMaster, and it’ll probably confuse you mightily when you first come across it. Because all of your SSH sessions are multiplexed down a single TCP connection initiated by the first SSH session, that first session must stay alive until all of the other sessions are complete. This problem will manifest itself as an apparent “hang” when you log out of the remote session that is acting as the master: instead of getting your local prompt back, SSH will just sit there. If you Ctrl-C or otherwise kill this session, all of the other sessions you’ve got set up to that server will drop, so don’t do that. Instead, once you log out of all the other sessions, the master will return to the local prompt.

If you’re doing a high volume of SSH connections to a particular remote endpoint, consider setting up a dedicated master connection — that way it’ll always be available (and you don’t have to worry about master logout hangs). I use a simple daemontools service, that runs ssh -MNn user@server. Works an absolute treat.
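If you would rather drive a dedicated master by hand than through a service supervisor, something like the following is a reasonable starting point. It is only a sketch: it assumes the ControlPath from the config above is already in place, user@server is a placeholder, and the function names are made up. The -O check and -O exit control commands are standard OpenSSH.

```shell
#!/bin/sh
# Helpers for a long-lived ControlMaster connection.
# Assumes ControlPath is already set in ~/.ssh/config; the target is a placeholder.

# Start a master that runs no remote command (-N) and reads nothing from stdin (-n).
# This stays in the foreground, which is what a service supervisor wants.
cm_start() {
    ssh -MNn "$1"
}

# Ask the running master whether it is alive.
cm_check() {
    ssh -O check "$1"
}

# Ask the running master to exit.
cm_stop() {
    ssh -O exit "$1"
}

# Example (not run here): cm_start user@server &
```

Later OpenSSH releases also grew a ControlPersist option that keeps a master alive in the background on its own, which is worth a look in ssh_config(5) if your version has it.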

ERROR: SSH agent has too many keys

Wednesday, December 23rd, 2009

Unfortunately, SSH doesn’t produce this error, although it darn well should…

I just had a Github customer report that they couldn’t access their repos via SSH, despite it all working properly yesterday, and “not having changed anything”. A bit of debug logging and an inspired leap of intuition on the part of another sysadmin in the office, and the answer was quickly found.

First off, the symptoms:

  • Debug logging showed that the user was connecting successfully, presenting six SSH keys (none of which were the key of interest) before disconnecting;
  • The SSH key was in the user’s SSH agent (you can verify this with a quick ssh-add -l);
  • There were more than six keys in the SSH agent.

This last symptom is the key point. As an anti-brute-force measure (I assume), the SSH server won’t let a client present more than MaxAuthTries credentials (whether they be passwords or keys) in one connection before forcibly disconnecting it. The default value for this parameter (if you haven’t realised already) is six.

Whilst this makes a lot of sense for passwords (and a lesser, but still valid, measure for keys) it does mean that you effectively have a hard limit of six keys in your agent simultaneously (at least without using SSH configs to specify a single key to present to the server). Any more than six keys, and you run the very real risk that the key you need to give to a particular server will be number seven in your agent, and all your authentications will fail miserably.

Bumping the value of MaxAuthTries to a much larger value works fine for Github — password auth is disabled, and if you can manage to brute force a key you’re welcome to what you can get — but you certainly can’t rely on inflating MaxAuthTries everywhere to get you out of trouble, so: keep those SSH agents lean, or at least specify IdentityFile for all your servers.
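Since SSH won’t warn you itself, a quick check you can run (or drop into a shell startup file) is sketched below. It assumes key lines in ssh-add -l output start with the key size, and the limit of six simply mirrors the MaxAuthTries default; the function names are my own.

```shell
#!/bin/sh
# Warn when the agent holds more keys than a server is likely to accept.
# The default limit of 6 mirrors the MaxAuthTries default mentioned above.

# count_agent_keys: reads `ssh-add -l` output on stdin, prints the number of keys.
# Key lines start with the key size; the "no identities" message does not.
count_agent_keys() {
    grep -c '^[0-9]'
}

# check_agent [LIMIT]: returns 1 if the agent has more than LIMIT keys loaded.
check_agent() {
    limit=${1:-6}
    count=$(ssh-add -l 2>/dev/null | count_agent_keys) || true
    if [ "$count" -gt "$limit" ]; then
        echo "agent holds $count keys, more than $limit: some may never be offered" >&2
        return 1
    fi
}
```

For a per-host fix, pairing IdentityFile with IdentitiesOnly yes in your SSH config restricts ssh to offering just the configured key, regardless of what else is in the agent.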

Monitor your servers like it’s 1996

Thursday, December 3rd, 2009

Whilst I’m a fan of using percentages for my disk space checks, sometimes an explicit size is more appropriate. So, you’d expect the following to work nicely:

$USER1$/check_disk -w 5G -c 1G -p /data/foo

If you don’t actually test that this works (by artificially filling your disk and seeing what happens), you may be dismayed to find that you only get alerted when the disk has 5MB of free disk space. Why is this?

Because Nagios, despite the fact that nobody has sweated the megabytes for about a gazillion years, doesn’t support ‘G’ as a suffix for thresholds. Oh, it’ll make a good show of pretending — after all, the output formatting options have ‘GB’ as an option — but nope, for your thresholds it’s “5000M” all the way.
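If you would rather think in gigabytes and let a script do the conversion, a tiny helper will do. This is a sketch only: the helper name is made up, and it uses 1000 MB per GB to match the “5000M” figure above (swap in 1024 if you want binary units).

```shell
#!/bin/sh
# Convert whole gigabytes to the megabyte figure used for check_disk thresholds.
# Uses 1000 MB per GB to match the "5000M" above; swap in 1024 for binary units.
gb_to_mb() {
    echo $(( $1 * 1000 ))
}

warn=$(gb_to_mb 5)
crit=$(gb_to_mb 1)
# Print the command definition rather than run it: $USER1$ is a Nagios macro,
# not a shell variable.
printf '$USER1$/check_disk -w %s -c %s -p /data/foo\n' "$warn" "$crit"
```

Whatever you end up with, fill a test filesystem and confirm the alert fires where you expect.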

ROCK ON!

Industry Analysts: Putting the “arse” in Analyst

Tuesday, November 24th, 2009

I’ve never been a real fan of the output of big “industry analysis” firms, since their reports never seemed to really tell the whole story, and didn’t match up with my experiences “in the trenches”. Now I know why. A representative sample:

“I see. So, the companies in your magic quadrant, are they all paying clients of yours?”

“Well, yes they are,” He said, proudly.

“Well, if they are all paying clients, then what’s so ‘magic’ about being in the quadrant?”

“The companies are not all rated at the same level, some are rated much higher than others.”

“And should I be surprised to hear that the companies that pay you more so you can afford to have entire teams cover them full-time; you tend to know a lot about, and they tend to get better ratings?”

No answer.

“Maybe you should stop calling it the ‘Magic Quadrant’ and call it what it really is; perhaps ‘The Quadrant of Companies That Can Afford To Be In It’.”

Go read the whole article, though, it’s pure gold.

Securing your codez from the wily exploit injectors

Monday, November 23rd, 2009

Remember the good old days, when Melissa and ILOVEYOU were the major virus threats, spreading via e-mail and causing all sorts of embarrassing conversations at work? Or maybe even earlier than that, when the only way you could get a virus was by engaging in risky sex? (I mean Software EXchange, of course… get your mind out of the gutter)

These days, anti-virus protection for e-mail is fairly thorough, and nobody’s really swapping floppies full of 16 colour games at recess. Malware authors have moved on to new and more fertile ground — embedding their junk in web pages, and relying on browser exploits to gain access to computers. Of course, with this method, you can only get infected if you actually visit a page that has an infestation, so the malware authors have two options: either entice you to visit their sites, or modify existing websites that users will visit in the course of their day — legitimate sites that people know and trust, but with a little added infection.

Enticing people to a whole dodgy site is usually just a matter of providing something people love to look at and sticking it in search engines. Since the attacker has to have a stable, identifiable presence for the search engines to direct users to, that can also be used by anti-malware lists like stopbadware.org to protect web users, so this isn’t a particularly effective means of attack, and is waning somewhat in popularity. Far more effective is infecting a legitimate website with some form of malware. How does it happen, though? In our experience, there are four vectors for infection:

  1. Brute-force password guessing, where the attacker has a botnet they control to just repeatedly try lots and lots of usernames and passwords. They’re bound to get lucky sooner or later.
  2. Some sort of web-based exploit, typically a vulnerability in the web application that allows the attacker to run code of their choosing; this is then either used directly to edit files, or bootstrapped into sufficient access to edit files via another method.
  3. Password “scraping”, where the attacker gets direct access to the FTP password for your site. This can happen either through malware on the workstation of the web developer (or someone else related to the management of the website) that gets the password off the local machine (in a saved password file, or via a keylogger), or through the “lost password” functionality provided by the hosting provider. Once the attacker has the FTP password for the site, they are free to log in to the live site and make whatever changes they like.
  4. Direct modification of the website code on the client-side computer, relying on the developer not to notice it and then upload the compromised content to the live site. We recently had our first “confirmed” case of this (where the web developer found the malicious modifications in their local copy), and they swear blind they didn’t download the HTML from the live site (which would bring the “infection” onto the local machine from the infected live site — which we’ve seen before, and categorise under vectors 1 and/or 2).

The countermeasures required to combat all these vectors boil down to a few simple precautions.

  • Use strong passwords. (Protects against vector 1) Yes, they’re a pain to manage, but a weak password is just an open invitation to getting repeatedly and painfully owned. Of course, the strongest password is a keypair, which leads us to…
  • Don’t use FTP. (Protects against vectors 1 and 3) The list of reasons for this is long, but for securing your website, FTP is a pain because you can only use passwords[1]. Switch to using SFTP (the file transfer component of the SSH protocol) and you can use public keys, which are, for all practical purposes, unguessable. You should also encrypt your keys with a passphrase, which means that even if the attacker does get access to your workstation and copies the key, it’ll be useless to them — unless they keylog your passphrase, which brings us to…
  • Keep your workstation secure. (Vectors 3 and 4) It seems that attackers have realised that the weakest link in the website security chain is still the Windows desktop, and they’re increasingly hitting it as the first step in taking over websites (if you get the right workstation, you can get the credentials to hundreds or thousands of websites, because one web developer often works on many different sites). So, on any machine you connect to webservers from, you need to be doubly, triply sure that it’s rock solid — and that’s just a matter of following all the good advice out on the web. Antivirus, antispyware and firewall software, constantly running, well-configured, and kept up to date; keep up to date with your application patches, especially for your web browser, e-mail client, and core OS; don’t visit dodgy web sites; and so on.
  • Protect your e-mail. (Vector 3) If someone can get access to your e-mail, they can also get access to your website, by using the password recovery feature (or impersonating you to your hosting company). If they delete the e-mails that are coming in before you notice them, you’ll never know what’s going on, and all the password changes and workstation security in the world won’t help you.
  • Don’t use shared hosting. (Vectors 1 and 3) This might seem an odd thing to say, given that we sell shared hosting, but it is a legitimate way to reduce your vulnerability. If you use a dedicated server (including a VPS), you (or we) can configure it to only allow logins from certain IP addresses, rather than the entire Internet. This means that even if an attacker does get your password (or SSH key), via brute force or sniffing it off your workstation, they can’t log in from their own machine because it won’t have an authorized IP address. On shared hosting, this configuration is impractical, because hundreds of people have legitimate access to the server, from a great many different IP addresses.
  • Code responsibly. (Vector 2) It is said that “PHP is great, because its ease of use means that any idiot can produce a security hole — and most of them do”. Whilst this is a little derogatory to the many (several? few? one? PLEASE?) good PHP programmers out there, it is certainly fair to say that the capabilities of many people who write dynamic code aren’t up to the challenge of writing code that is exposed to the extremely hostile conditions that are the public Internet.

    Thus, if you are not familiar with the common security practices and problems with the language or environment that you are developing for, stop right now and go learn a little. There’s plenty of good information out there on the Internet from people who have learnt the lessons the hard way. Celebrate the benefits of literacy by learning from their mistakes rather than having to educate yourself by cleaning up an infected website. If you feel that isn’t something you can commit to, then please, for the sake of the Internet, find someone else to write the code.

  • Keep your web applications patched. (Vector 2) Whilst some sites do use custom-built web applications, many sites choose to use a standard CMS or other application to manage their website. This is great, because hopefully someone else is taking a bit of responsibility for the security of the software, but that doesn’t do you any good if you don’t keep it up-to-date. Far, far too many people install a CMS once, then forget about it. Almost all of these applications have a vulnerability at some point, and not keeping them up to date is absolutely fatal, because once a vulnerability is found in a piece of software, an attacker can typically use Google to find all of the publicly-available instances of the vulnerable software, and quickly attack them all.

    This means that you need to keep yourself well-informed of any security updates for your off-the-shelf web applications. Subscribe to a relevant security announcements mailing list, or ensure that your vendor sends them to you. (If your commercial CMS vendor doesn’t have this ability, find a new CMS vendor).

Websites get compromised all the time, by a variety of methods. You should reinforce your defences, lest you’re the next target.


1. The first person to mention Kerberos or other unused-in-practice authentication schemes in a comment gets a free laughing at. If you think SFTP and SCP aren’t supported in widely used web development programs, try finding something that supports GSSAPI…

I always knew webmin was arse, but this…

Wednesday, November 18th, 2009

This is the output of iptables -L on a webmin-managed box I just saw:

Chain INPUT (policy ACCEPT)
target     prot opt source               destination
ACCEPT     all  --  anywhere             anywhere
ACCEPT     all  --  anywhere             anywhere
ACCEPT     tcp  --  anywhere             anywhere            tcp flags:ACK/ACK
ACCEPT     all  --  anywhere             anywhere            state ESTABLISHED
ACCEPT     all  --  anywhere             anywhere            state RELATED
ACCEPT     udp  --  anywhere             anywhere            udp spt:domain dpts:1024:65535
ACCEPT     icmp --  anywhere             anywhere            icmp any
ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:ftp
ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:ssh
ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:smtp
ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:domain
ACCEPT     udp  --  anywhere             anywhere            udp dpt:domain
ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:http
ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:pop3
ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:imap
ACCEPT     udp  --  anywhere             anywhere            udp dpt:imap
ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:https
ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:mysql
ACCEPT     tcp  --  anywhere             anywhere            tcp spt:mysql
ACCEPT     tcp  --  anywhere             anywhere            tcp dpts:terabase:samsung-unidex
ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:ndmp
ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:dnp
LOG        all  --  anywhere             anywhere            LOG level debug prefix `DROPPED = '
ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:ftp-data 

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination         

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination
ACCEPT     tcp  --  anywhere             anywhere            tcp spt:ftp-data dpt:ftp-data
ACCEPT     tcp  --  anywhere             anywhere            tcp spt:ftp dpt:ftp

Lovely that it has all those ports and whatnot opened up, but what’s with the ACCEPT policies?

Webmin: Now with FAILWALL management!

I should have been in marketing.
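For anyone wanting to audit their own boxes for the same problem, here is a small sketch that picks out the chains whose default policy is ACCEPT. It reads the listing on stdin, so you can try it on saved output; the function name is made up.

```shell
#!/bin/sh
# List the chains in `iptables -L` output whose default policy is ACCEPT.
# Reads the listing on stdin so it can be tried on saved output.
accept_policy_chains() {
    awk '$1 == "Chain" && /\(policy ACCEPT/ { print $2 }'
}

# Example (needs root, not run here): iptables -L -n | accept_policy_chains
```

Run against the listing above it reports INPUT, FORWARD and OUTPUT. Changing a policy is a job for iptables -P (for example, iptables -P INPUT DROP), but only once you are sure the rules in the chain let your own session through.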

The Myth of Infinite Cloud Scalability

Tuesday, November 17th, 2009

Recently, I read an article from a fairly prominent “cloud computing” vendor, which contained a line that basically said “Let the cloud worry about your scalability and performance problems”. I nearly snorted my late-mid-morning can of mother out my nose when I read it. Here’s why.

“Let the cloud worry about your scalability” is nothing more than a thinly disguised version of “just throw more hardware at it”. This is a “solution” beloved of salespeople everywhere, because it’s plausible, real easy to say, and makes a whole pile more money for the company providing the hardware. However, while it can be an appropriate solution in the right circumstances, and with appropriate evidence of its effectiveness in those particular circumstances, it usually isn’t the only option, it often isn’t the best option, and sometimes it isn’t an effective option at all.

The dirty little secret of hosting is that your scaling ability is solely determined by your application — the technologies it uses and its internal architecture. Yes, you can probably get more performance or concurrent users out of throwing more hardware at it this time, but sooner or later more memory or faster CPUs isn’t going to do anything useful.

I suppose, in some perverse way, just telling developers that “the cloud will provide” could be construed as a kindness. In the same way that we give high school kids the simplified approximation of motion that is Newton’s Laws, rather than the complicated and fiddly reality of relativity, saying “let the cloud scale you to being the next Facebook” might be a useful approximation to let developers ignore extraneous details and focus on getting things “right enough”.

The vast majority of sites, even those who aspire to be the next Twitter, will never get to anywhere near that scale. Even if it is the goal (and plenty of sites manage to occupy a satisfying — and dare I say it, profitable — niche without needing a second datacentre full of equipment), a new site is only going to get that big by focusing on satisfying users and creating compelling applications.

Spending your time writing Yet Another Key-Value Store is an awesome way to spend a lazy weekend, but when you’re burning your rent cheque and credit rating trying to get your “next big thing” site off the ground, every minute spent not awesomising your user experience is putting you 35c closer to having to go back to working for The Man.

For whatever reason, though, it makes me uncomfortable to lie to people about things like this, even if I might think it’s in their interest. I know, first hand, the shock and pain that comes from finding out that your site, beloved by millions, is suddenly overloaded and unreliable — and, even worse, that throwing hardware at the problem won’t do a damned thing. It’s an awful feeling.

While you can’t be worrying about scaling to a million users when your site has a grand total of three users (one of whom is your mum), you have to prepare for it when things start to take off, and have a plan in place to deal with it.

Sooner or later, you’re going to have to sit down, find the pain points in your current architecture, and work out how to solve them. If you’re not comfortable doing that yourself, then you need smart systems people who know how. I can guarantee you that “the cloud” isn’t going to advise you on how to restructure your file storage so it will horizontally scale to a petabyte of data. Don’t rely on it to scale you out of trouble.

Bringing the Mountain to Mohamed

Friday, November 13th, 2009

I have never in my life been asked, “How do porcupines make love?”. However, I know the answer very well: “very carefully”. In the same vein, when migrating the mass of data that makes up Github, you take your time and you work very, very carefully. Since this sort of migration doesn’t happen every day, and it’s not something you want to be learning on the job, I thought I’d write down my experiences for posterity.

SCRIPT IT!

As a big fan of automation, there wasn’t much chance that this whole thing wasn’t going to be scripted up the wazoo. We just need to copy the filesystem data across, dump the database and load it into the new site… and we’re done. Right?

HA! Not likely. To give you an idea of the scale of this thing, it took close to 24 hours just to do an rsync scan of the repository filesystems, without actually copying any data. Then there’s the database — the events table alone contained approximately 81.5 million records, which took a great many hours to dump from the live database during pre-migration work. It doesn’t take a great mathematician to realise that copying all this data over the Internet while the site was down for business wasn’t going to fly.

Initially, we were going to rely on the bandwidth of a station wagon full of tapes (or a couple of USB drives in a FedEx jet, anyway) to do the initial copying of data. However, due to some technical problems at the old facility, the “average transfer rate” wasn’t very high (the copy to disk took several weeks to complete), and we ended up kicking off a network-based initial sync of the repository data that finished less than half an hour after the drives were plugged into the machines at the new data centre. While I’m still a fan of shipping disks around for large-scale transfers, I won’t discount using the Internet to transfer such a large data set around so quickly next time.

Incrementalism

Since a single real-time copy wasn’t practical, we’d have to look to incremental copying, where we pre-sync as much data as possible before the Big Cutover Day, and then only copy the latest changes while the site is down.

Thankfully, Github’s software design has pretty much all the hooks we needed to make this a straightforward task. For example, we didn’t have to dump the entire events table, because once a row is written it’s never changed — so we only need to dump events that were created since the last dump.
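In outline, an incremental dump of an append-only table can look like the sketch below. It is not the script we ran: the host names, the database name and the assumption that events has an ever-increasing id column are all placeholders for illustration. The mysql and mysqldump options used (-N, -B, -e, --no-create-info, --where) are standard.

```shell
#!/bin/sh
# Incremental dump of an append-only table: only rows newer than what the
# destination already has. Host, database, table and column names are placeholders.

# highest_event_id DEST_HOST DB -> largest id already loaded (0 if the table is empty)
highest_event_id() {
    mysql -h "$1" -N -B -e 'SELECT COALESCE(MAX(id), 0) FROM events' "$2"
}

# dump_new_events SRC_HOST DEST_HOST DB -> SQL for the missing rows on stdout
dump_new_events() {
    last_id=$(highest_event_id "$2" "$3") || return 1
    # --no-create-info: rows only, leave the destination schema alone
    mysqldump -h "$1" --no-create-info --where="id > $last_id" "$3" events
}

# Example (not run here): dump_new_events old-db new-db appdb | mysql -h new-db appdb
```

Asking the destination where it is up to, rather than remembering it locally, means a failed load simply gets picked up again on the next run.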

The system also keeps track of the last time a repository was changed, which means that we can ask the database for a list of repositories that have changed since the last sync, which makes for a very simple (and quick!) incremental sync. For a smaller data set we would just use rsync directly, but due to the performance limitations of the previous hosting environment, this took far, far too long to do with just rsync.

So, we can script everything, and there’s the ability to do repeated incremental syncs. What do these scripts look like?

Well, first up, there’s a lot of them. It was best to write separate scripts to synchronise each data set — one for the repositories, one for the events table, one for the rest of the database, one for gists, and so on. This meant that it was fairly trivial to develop these scripts in parallel, and they could be tested and run independently of each other.

Also, each task that had to be performed for a given data set was in its own script, so each step could be tested independently. For example, the repo sync job consisted of one script to collect the list of repos that needed resyncing and write that list to disk, another script to sync a single repository, and a third script to loop over all the repos listed by the first script and invoke the second script for each of them.
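The third of those scripts, the loop, is the simplest to sketch. The version below is illustrative rather than what we ran: the list file format (one repository per line) and the function name are assumptions, and the per-repository sync command is whatever you pass in.

```shell
#!/bin/sh
# Driver loop: run a per-repo sync command for every repo named in a list file.
# The list file has one repository path per line; the sync command is a stand-in
# for whatever script handles a single repository.

# sync_listed LIST_FILE COMMAND...: prints the name of every repo whose sync failed
sync_listed() {
    list=$1; shift
    while IFS= read -r repo; do
        [ -n "$repo" ] || continue      # skip blank lines
        # stdin from /dev/null so the command cannot eat the rest of the list;
        # its own output goes to stderr, leaving stdout for the failure list
        "$@" "$repo" </dev/null >&2 || echo "$repo"
    done < "$list"
}

# Example (not run here): sync_listed repos.txt ./sync-one-repo > failed.txt
```

Because the output is itself a list of repositories, it can be fed straight back in as the list file for a retry pass.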

The other important properties of these scripts were:

  • We relied heavily on multitasking to overcome bandwidth limitations from a single TCP stream. When you’re copying data over high capacity links, your available transfer rate is constrained more by the round-trip time between the endpoints than the available bandwidth — the longer it takes for an ACK to get back to the sender, the slower your data will flow. So, since we had eight filesystems to copy data from, we fired off eight parallel rsync processes as child processes of the individual scripts.
  • Each script kept track of what it was doing and what it had done, and tried to avoid doing the same work again. The repository syncs kept track of the repositories that had already been copied by means of a timestamp file: when we did a sync, we touched a file and then used the mtime of that file (stat -c %Y ftw!) to determine the start time of the next sync. The events table was straightforward: before each dump, we just asked the destination table where it was up to, and dumped from there. Even the “main” database, which we dumped in its entirety each time, was dumped to a file compressed with `gzip --rsyncable` before being rsync’d across, saving a good few minutes of network transfer time on each cycle.
  • If something went wrong during the sync, we knew about it immediately. We wired up a small SMS sending script to send us alerts if the script terminated improperly. This saved us a lot of waiting and watching, because we knew that we’d be told when we had to take notice of what’s going on.
  • Everything was logged. The stdout and stderr of all processes were captured, and the scripts wrote their own log entries to that stream as well as to a “summary” log, like this: echo $(date) processing repo $repo |tee -a $LOGFILE. Any errors were tagged with a unique string and written in a machine-parseable format, so we could re-run any failed components of the sync to ensure that nobody was missed.
  • While there were typically several scripts that had to be run in an appropriate order to make a sync happen, there was always a single script that did everything that needed doing — we never had to run more than one command to get a given sync done.
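The logging and alerting properties above can be wrapped up in a couple of small helpers. This is a sketch under assumptions, not our production code: the SYNC_FAILED tag, the default log path and the send_alert stand-in (which only logs, where ours sent an SMS) are all made up for illustration.

```shell
#!/bin/sh
# Logging and failure tagging for a sync step.
# LOGFILE, the SYNC_FAILED tag and the alert hook are illustrative choices.
LOGFILE=${LOGFILE:-./sync-summary.log}

# log MESSAGE...: timestamped line to stdout and appended to the summary log
log() {
    echo "$(date) $*" | tee -a "$LOGFILE"
}

# send_alert MESSAGE...: stand-in for an SMS-sending script; here it only logs
send_alert() {
    log "ALERT $*"
}

# run_step NAME COMMAND...: run a command, tag and report any failure
run_step() {
    name=$1; shift
    log "starting $name"
    if "$@"; then
        log "finished $name"
    else
        status=$?
        # one fixed, greppable tag plus key=value fields for later re-runs
        log "SYNC_FAILED step=$name status=$status"
        send_alert "$name failed with status $status"
        return "$status"
    fi
}
```

A retry pass then only needs grep SYNC_FAILED on the summary log to find out what to run again.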

Once all of these individual scripts had been written, tested, debugged, tested a few more times, and generally fretted over until our nails were chewed to the quick, it was time to assemble the master script. I’m not about to run a dozen scripts to migrate a site when one will suffice. This was particularly important in Github’s case because to minimise downtime we wanted to run several things in parallel, then wait until they’d all finished, then run the syncs that depended on the data we’d synced in the last lot, and so on. Our scripts looked a lot like this:

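# First batch: these tasks are independent, so run them in parallel.
task1 >logs/task1.log 2>&1 &
task1_pid=${!}

task2 >logs/task2.log 2>&1 &
task2_pid=${!}

# Block until the whole first batch has finished.
wait $task1_pid
wait $task2_pid

# Second batch: depends on the data synced by the first.
task3 >logs/task3.log 2>&1 &
task3_pid=${!}

task4 >logs/task4.log 2>&1 &
task4_pid=${!}

task5 >logs/task5.log 2>&1 &
task5_pid=${!}

wait $task3_pid
wait $task4_pid
wait $task5_pid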
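One thing the outline above glosses over is that wait returns the exit status of the process it waited for, so the controller can tell whether a batch actually succeeded. A sketch of a helper that uses this, with made-up names and a configurable log directory:

```shell
#!/bin/sh
# Run a batch of tasks in parallel and report which ones failed.
# Task names are stand-ins for the real sync scripts; logs land in $LOG_DIR.
LOG_DIR=${LOG_DIR:-logs}

# run_batch TASK...: start every task in the background, then wait for each.
# Returns the number of tasks that exited non-zero.
run_batch() {
    mkdir -p "$LOG_DIR" || return 1
    pids=""
    for task in "$@"; do
        "$task" >"$LOG_DIR/$task.log" 2>&1 &
        pids="$pids $!:$task"           # remember pid and name together
    done
    failures=0
    for entry in $pids; do
        # wait PID hands back that child's exit status
        if ! wait "${entry%%:*}"; then
            echo "task ${entry#*:} failed" >&2
            failures=$((failures + 1))
        fi
    done
    return "$failures"
}

# Example (not run here): run_batch task1 task2 && run_batch task3 task4 task5
```

Chaining batches with && means a dependent batch never starts on top of a failed one.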

There was also a pile of “doing this, now doing this, now doing this” logging (with timestamps) that helped us to get a feel for how long the different parts would take, and where everything was up to.

When we actually performed the cutover, the “main” sync script was running for a total of 27 minutes. Given that we’d given ourselves an hour to get everything across, we were all quite pleased with this outcome.

Putting on the brakes

Whilst all these scripts ran really well, and the background processes made everything run really fast, I must say it was a right pain in the butt to stop things mid-flight when it was necessary. Hitting Ctrl-C only stopped the foreground (controller) script, and all of the children that had been started in the background kept flying along.

Doing this again, I’d make sure all my scripts had traps on SIGINT that killed off all the child processes that they had spawned. In retrospect, this is just a variant of “one script to start everything” — you should only need to do one thing (Ctrl-C) to stop it all, as well.
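A minimal sketch of what that could look like, with made-up helper names. One detail worth knowing: a non-interactive shell starts background jobs with SIGINT ignored, which is exactly why Ctrl-C sailed past our children, so the handler sends them SIGTERM (the default for kill) instead.

```shell
#!/bin/sh
# Make Ctrl-C stop background children as well as the controlling script.
child_pids=""

# start_bg COMMAND...: run a command in the background and remember its pid
start_bg() {
    "$@" &
    child_pids="$child_pids $!"
}

# stop_children: terminate every remembered child and reap it
stop_children() {
    for pid in $child_pids; do
        # kill sends SIGTERM by default; the children ignore SIGINT
        kill "$pid" 2>/dev/null
    done
    wait
    child_pids=""
}

# On Ctrl-C (or a plain kill), clean up before exiting.
trap 'stop_children; exit 130' INT TERM

# Example (not run here):
#   start_bg rsync -a /data/fs1/ newhost:/data/fs1/
#   start_bg rsync -a /data/fs2/ newhost:/data/fs2/
#   wait
```

The rsync lines in the example are placeholders; anything launched through start_bg gets cleaned up the same way.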

Also, the timestamp files weren’t handled real well. If you did kill things off mid-run (or, heaven forbid, a script crashed out) then the timestamp files would be wrong, because we just did a straight touch at the beginning of the script. What would have been better would be something like this:

touch stamp.new
do_all_the_work
mv stamp stamp.prev
mv stamp.new stamp

This would make sure that premature death would leave the stamp as-is, while still capturing the true start time of the job (which a simple touch at the end would fail to do).
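One wrinkle with the snippet above is that on the very first run there is no stamp to move aside. A slightly fuller sketch, with a made-up function name, that copes with that and also throws away stamp.new when the work fails:

```shell
#!/bin/sh
# Only advance the timestamp file once the work has actually succeeded.

# sync_with_stamp STAMP_DIR COMMAND...: run the command, then rotate the stamps
sync_with_stamp() {
    stamp_dir=$1; shift
    # Record the start time now, but under a temporary name.
    touch "$stamp_dir/stamp.new" || return 1
    # If the work fails, discard the new stamp and leave the old one untouched.
    "$@" || { rm -f "$stamp_dir/stamp.new"; return 1; }
    # Keep the previous stamp around (there is none on the first run).
    if [ -e "$stamp_dir/stamp" ]; then
        mv "$stamp_dir/stamp" "$stamp_dir/stamp.prev"
    fi
    mv "$stamp_dir/stamp.new" "$stamp_dir/stamp"
}

# Example (not run here): sync_with_stamp /var/lib/sync ./sync-all-repos
```

A script killed outright mid-run will still leave a stray stamp.new behind, but that is harmless: the next run overwrites it, and stamp itself is never touched until the work is done.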

When Databases Attack

Testing the new site before we let users at it, we found that creating gists wasn’t working right. It turned out that the database dumping script didn’t have the right set of options, and the schemas of the tables weren’t quite right (no autoincrements), and that was giving gist creation conniptions. Thankfully, the bug in the script was quickly spotted and the database dump was re-run. We even managed to get the second dump and load completed before our scheduled maintenance window was finished. If our scripts hadn’t been broken down by data set, this resyncing process would have been made a whole lot harder because we wouldn’t have been able to easily run just the parts that needed to be redone.

Once we opened the floodgates of the new site, everything ran happily for a minute or two, and then ground to a halt. The whaaaaa? Poke, prod… hmm, the database is running a bit hotter than I’d expect… whoa! 1500 queries active, all against the events table, with the disks working so hard the heads nearly came out the sides of the cases. What’s going on here?

As it turns out, schema insanity had struck again — this time, some of the indexes on the events table had failed to come across. While we know what happened with the main database dump, this one is still a mystery. How did some of the indexes fail to materialise? We’ve gone over the dumps and can’t find how they got lost. We’re putting it down to yet another case of MySQL doing dumb things without telling anyone.

Limiting the impact

As a final small improvement to the migration process, the site was able to go into a “read only” mode, so that users could still browse code and pull from repositories while we were migrating. This made the migration a lot less intrusive for users, because a lot of site functions still worked, especially those used by casual users (who would be less likely to know all about the time of the migration).

Lessons Learnt

Here are a few things I’ll definitely do differently next time:

  • Anywhere you’re depending on a third party to execute part of your migration, have a backup plan in case they can’t deliver — and know when you’ll have to execute your backup plan. In our case, knowing exactly how long it would have taken to copy all the data over the Internet and then calculating back, we would have known to start copying over the network a few days earlier than we did.
  • Make sure that synchronisation scripts are as easy to stop as they are to start.
  • Verify the database schemas completely on the destination DB server by manual inspection, as well as dumping them and comparing to what’s on the source DB server.
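
For the dump-and-compare half of that last point, something along these lines would do as a starting point. It is a sketch only: the function names are invented, and it assumes schema-only dumps taken with mysqldump’s --no-data option on each server (host and database names below are placeholders):

```shell
# Compare two schema-only dumps, ignoring details that legitimately differ.
# The dumps are assumed to come from something like:
#   mysqldump --no-data --skip-comments -h SOURCE_HOST dbname > source.sql
#   mysqldump --no-data --skip-comments -h DEST_HOST dbname > dest.sql

normalise_schema() {
    # Drop the per-table AUTO_INCREMENT=<n> counter, which varies with the
    # data, but leave the AUTO_INCREMENT column attribute itself alone.
    sed -e 's/ AUTO_INCREMENT=[0-9]*//' "$1"
}

# Prints a unified diff and returns non-zero if the schemas differ.
compare_schemas() {
    src=$(mktemp) || return 2
    dst=$(mktemp) || return 2
    normalise_schema "$1" > "$src"
    normalise_schema "$2" > "$dst"
    diff -u "$src" "$dst" && status=0 || status=$?
    rm -f "$src" "$dst"
    return "$status"
}

# Example:
#   compare_schemas source.sql dest.sql || echo "schemas differ, go look"
```

A missing autoincrement attribute or a missing index shows up as a changed or deleted line in the diff, which is exactly the sort of thing that bit us.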

I wonder when we’ll get our next Github-scale migration…

Load balancing at Github: Why ldirectord?

Saturday, October 31st, 2009

Some comments on Github’s blog post “How We Made Github Fast” have been asking about why ldirectord was chosen as the load balancer for the new site. Since I made most of the architecture decisions for the Github project, it’s probably easiest if I answer that question directly here, rather than in a comment.

Why ldirectord rocks

The reasons for Github using ldirectord are fairly straightforward:

  • I have a lot of experience with ldirectord. Never underestimate the value of knowing where the bodies are buried. In ldirectord’s case, there aren’t many skeletons, but “better the devil you know” is a valid argument. If you’ve got strong experience in making something work (and you’ve managed to make it work), and you don’t have a lot of time for science experiments, then there’s a lot to be said for going with what you know.

    This goes beyond simply knowing what to do when things go wrong, of course. You’ll also know how to install and configure it already, how to monitor it, and so on.

    What’s more, in ldirectord’s case I had already proven that it worked in an architecture almost identical to Github’s, and with a similar load profile. At a previous job, I had ldirectord serving a sustained aggregate of 2500 TCP connections per second on a 128MB Xen VM, passing to a large set of backends in a manner almost identical to Github.

  • Anchor has a lot of experience with ldirectord. Whilst my experiences are one thing, there’s a lot more to building an infrastructure than just setting it up. I like to take holidays as much as anyone, and so there was no point in using something that nobody else in the company had any experience with, if there was something else that we did all know about.

    Thankfully, ldirectord lined up nicely, since it’s what we use for our other load balancing setups (not set up by me, either; these were already in place before I arrived). This meant that there was already a pile of documentation and knowledge amongst the sysadmin team about ldirectord and its quirks. Also, being automation junkies, we already had Puppet dialled in to install and configure ldirectord, and we knew exactly how to monitor it.

  • Ldirectord will do the job. With the prior experiences of myself and the rest of the Anchor team, we were confident that ldirectord would do the job, and at the end of the day that’s what really matters.

The Alternatives

It’s all well and good to say “we know it and it works”, but I’m not really expecting anyone to just read that and say “well, OK, I guess we’ll use ldirectord”. In fact, if you apply the above criteria to your own situation, there’s every possibility that you’ll come up with a different answer, and if you’ve never set up a load balancer at all, then you’ve got no experiences to use to guide you.

So, here are the other load balancing options I’ve dealt with, and what I think of them. This might give you a bit of food for thought when choosing your load balancer.

  • keepalived. This is the project closest to ldirectord in terms of functionality and operation. It actually uses the same load balancing “core” as ldirectord, IPVS, part of the Linux Virtual Server project. As such, it performs similarly to ldirectord when it comes to actually redirecting requests to backends, and is another excellent choice for load balancing.

    For Github, though, there wasn’t any benefit in using keepalived. Whilst I used keepalived extensively at my last job, nobody else at Anchor had had much to do with it. Also, keepalived has a built-in failover mechanism, which we didn’t need because we already use Heartbeat/Pacemaker for all our HA/failover requirements. I also feel that keepalived is more complicated when compared directly to ldirectord, largely because of its built-in failover capabilities. That’s not to say that combining Pacemaker and ldirectord is dirt simple, but if you’ve already got Pacemaker on hand anyway…

    If all you needed was a HA load balancer, and had no experience with either ldirectord or keepalived, I’d probably recommend keepalived over ldirectord, as it’s one project and one piece of software to do everything you need.

  • Load-balancing appliances. Sometimes misleadingly referred to as “hardware” load balancers (they’re still chock full of software, kids, and unlike high-end routers, I don’t know of any true L4 load balancer that has its forwarding plane entirely in hardware).

    I loathe these things. They’re expensive, restrictive, slow, and generally cause you a lot more pain and suffering than they’re worth. At my last job, one of my projects was to convert most of one of our existing clusters from a load-balancing appliance to use keepalived. Why would we do this? Because the $100k worth of appliance wasn’t capable of doing the job that $15k worth of commodity hardware and an installation of keepalived were handling with ease — and with capacity to spare. That cluster was our smallest, too, with probably only 2/3 the capacity of the other clusters run by keepalived.

    At the job where I had ldirectord handling 2500 conn/sec, we had also previously used a load-balancing appliance, which was supplied and managed by the hosting provider. It was a management nightmare — we couldn’t get any useful statistics out of it at all, like the conn/sec coming in or going out, and we couldn’t usefully adjust the weightings of each backend (to tune how many connections were going to each different sort of machine) or manage the system in real-time. When we switched to using ldirectord, a small shell script (involving watch and ipvsadm, mostly) was all it took for the CTO to be able to watch exactly how the cluster was performing, in real time, throughout the day. He loved the visibility — and the fact that we were saving several hundred dollars a month didn’t hurt, either.

  • haproxy. While we use haproxy extensively within Github, I don’t think haproxy is the right solution as the front-end load balancer for a high volume website. Being a proxy, rather than a simple TCP connection redirector, it has much larger overheads in CPU and memory, and adds more latency to the connections. All of Github’s load balancing is being done out of one small VM, and it barely raises a sweat. The return traffic doesn’t even go back through the load balancer at Github, since we’re using a really neat mode of IPVS that allows the traffic to return to the client directly. While you can throw hardware at the load balancing problem, I still prefer to be efficient where possible.

    Since haproxy makes a second TCP connection, rather than just redirecting an existing one, it mangles the source IP address information — and while you can work around that in HTTP with custom headers, that doesn’t work for other protocols like SSH. I cringe at the thought of trying to defend against a DDoS attack when the most useful piece of diagnostic information (the source IP) can’t be correlated against the actions of an attacker on the site.

    If all you know is haproxy, and you’re running a low-volume site that only has to deal with HTTP(S), then haproxy will probably do the job — it’s certainly handling more connections inside Github than most sites will ever see. However, I’d recommend getting someone who does systems administration full-time (like us!) to install and manage a real load balancer like ldirectord rather than use haproxy, along with keeping your other basic infrastructure on track. Wouldn’t you rather be developing new features rather than dealing with this stuff?
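
To give a feel for the IPVS side of things, here is a sketch that drives ipvsadm by hand. None of it is Github’s actual configuration: the addresses, weights and scheduler are placeholders, I’m assuming the “traffic returns directly” mode mentioned above is IPVS direct routing (the -g flag), and in a real setup ldirectord maintains this table for you from its own config file. The last function is a guess at the kind of watch-plus-ipvsadm script described earlier.

```shell
# Sketch: an IPVS virtual service with direct-routed backends, plus a live
# view of it. IPVSADM can be overridden (e.g. for a dry run).
IPVSADM=${IPVSADM:-ipvsadm}

# Usage: setup_virtual_service VIP:PORT BACKEND:PORT=WEIGHT...
setup_virtual_service() {
    vip=$1
    shift
    # -A: add a virtual service, -t: TCP, -s wlc: weighted least-connection
    "$IPVSADM" -A -t "$vip" -s wlc
    for backend in "$@"; do
        # -a: add a real server, -r: its address, -w: its weight,
        # -g: direct routing, so replies go from the backend straight to the
        #     client rather than back through the load balancer
        "$IPVSADM" -a -t "$vip" -r "${backend%=*}" -g -w "${backend#*=}"
    done
}

# Refresh the per-service and per-backend rates once a second.
watch_cluster() {
    watch -n 1 "$IPVSADM -L -n --rate"
}

# Example (needs root):
#   setup_virtual_service 192.0.2.10:80 10.0.0.11:80=5 10.0.0.12:80=10
#   watch_cluster
```

Direct routing also needs each backend to accept traffic for the virtual IP without answering ARP for it, which is outside the scope of this sketch.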

So, there’s one geek’s opinions on load balancing. Questions and comments appreciated, and if you’d like to know more about any part of the Github architecture (or any other aspect of systems administration), please let us know in the comments and I’ll whip up some more blog posts.

Virtualisation: It’s a Technology, not a Religion

Wednesday, September 30th, 2009

It’s been interesting to look at the press coverage, blog posts, and tweets surrounding the move of Github to an Anchor-managed infrastructure — I’ve never worked on something so public before. I think the article about “Vampire Programmers” has been my favourite so far.

The ZDnet article on the Github move gave me a wry chuckle, though. It made it sound like the move signified some sort of rejection of the Church of the Hypervisor — that virtualisation had been tested and found wanting. In actual fact, there’s more virtual machines running in the Github infrastructure now than there were previously, providing a lot of very essential services.

I really don’t think of myself as a virtualisation nay-sayer. I started using virtual machines with User-Mode Linux, back before anyone outside of Cambridge had ever heard of Xen, and I got on board with Xen back in the 2.0 days. I’ve introduced widespread virtualisation at two previous jobs, I was a big supporter of the use of virtualisation at my last job, and I’ve been working on Anchor’s High-Availability VM product recently. Virtualisation hater I ain’t.

Conversely, though, I don’t think VMs are the answer to all the world’s problems. They’re a fantastic opportunity for a lot of sites: everyone can be running on high-quality, server-grade hardware (redundant power, hardware RAID, fast buses, etc) without the need to either purchase or maintain that hardware. Furthermore, each VM, by virtue of its isolation, is more easily managed and scaled independently of the other VMs. Need more memory? Allocate it. This box is getting a little overloaded? No problem, just move a VM to another piece of hardware.

The simple fact is that very, very few sites need a whole dedicated server — even an entry-level server is massive overkill for most sites. In this situation, you can either:

  • Spend the extra money, assuming that you’ll grow and recoup those costs;
  • Buy a cheaper machine, either a basic desktop machine or second-hand server, and take the hit in reliability;
  • Use shared hosting, where everyone’s on the same OS installation (which has tradeoffs in control and isolation); or
  • Use a virtual machine.

Unsurprisingly, I like the last option. It saves you money, and it avoids both the reliability headaches of cheaper hardware and the management headaches of shared hosting.

Management is the big on-going cost of most sites. Virtualisation simplifies that by isolating different sites and services from each other, so that when it comes time to scale them, it’s not a big job. Most people who’ve been working as a developer or sysadmin will be able to recall the unpleasant feeling when that big-ball-of-wax that everyone calls “the server” starts to run out of huff, and there’s no better hardware to put it on, and no more software optimisation to be done. The call goes out, “move some services to another server”. Damn.

See, when everything’s on the one machine, they intertwine and become hard to separate. That little hack that Roger The Talented Intern put in to make mail processing run faster? That involved digging into the SMTP server queue and pulling out messages directly; if you separate the web server and the mail server, that’ll break — but I bet you don’t find that out until you move.

I hate doing archaeology on these sorts of machines, because it’s guaranteed that things will break, tempers will run hot, and sadness will result. The cost of doing the move (in IT staff time, downtime, customer and staff dissatisfaction, and so on) can easily equal or exceed that of the hardware itself — and yes, I’m still talking about good-quality, server-grade hardware here. People are expensive, and good people even more so.

Instead, if you run logically separate services in separate VMs, when the time comes to scale something, it really is a piece of cake to migrate a VM: shut down, copy the disk image, boot it back up. Piece of cake. Sure, there’s some overhead in running those separate services in VMs, and yes, you’ll be looking to buy a second machine sooner than you would otherwise, but again, the savings made by not having to gently tease apart a dozen root-bound systems on a single machine will probably pay for that second machine. Let’s not even consider the costs of another separation in two years’ time when the services you put onto that other machine need to be separated again…

This use of virtualisation is all well and dandy if you’re one of the vast majority of sites that don’t need to service 125,000 users and 2.5TB of filesystem data. Github, though — they’re one of the (un)lucky few. When you’re using a machine’s worth (or more) of processing power on a single service, there’s no benefit to virtualising that. In Github’s case, there’s four physical machines running just the frontend services — each of which has the same specs as the machines that are running the VMs for the site. Sticking the frontend services into VMs in that case would have been a fruitless move. Similarly for the backend file storage, and the database. They’re all single services consuming a machine’s worth (or more) of resources, so we give them physical machines.

Down the track, as Github grows and individual VMs work harder and need more resources, we’ll first increase the size of those VMs, before making the decision to move a power-hungry VM off onto its own physical hardware. That’s an easy move: between the natural isolation provided by virtual machines and the strong configuration management policy we’ve adopted, transitioning from a VM to a physical machine will be painless, and painless systems management is, after all, the aim of the game.
