vi gangstas!
January 29th, 2010
Sampling shamelessly stolen from four fine folk, for your amusement.
We love vi, everyone at Anchor uses it.
‘Cause it’s better than emacs, yo.
.
AWW SNAP, BRO! WE JUST IMPROVED YO GESTURE!!
.
In Brief: Come down to the James Squire Brewhouse (King Street Wharf, Sydney) for a chance to have a few drinks with guys (and gals) from your friendly hosting company. In addition to this, we’re lucky to have special guests Tom and Scott of Github fame dropping in and you’ll have the opportunity to have some beers, tell stories of 14′ kangaroos, and discuss all things Github.
Anchor will be putting on drinks from 6pm.
Details:
6pm, Monday, Jan 18th
James Squire Brewhouse
22 Promenade King St Wharf, Sydney
Unfortunately, SSH doesn’t produce this error, although it darn well should…
I just had a Github customer report that they couldn’t access their repos via SSH, despite it all working properly yesterday, and “not having changed anything”. A bit of debug logging and an inspired leap of intuition on the part of another sysadmin in the office, and the answer was quickly found.
First off, the symptoms:
This last symptom is the key point. As an anti-brute-force measure (I assume), SSH won’t allow a user to connect and present more than MaxAuthTries credentials (whether they be passwords or keys) before being forcibly disconnected. The default value for this parameter (if you haven’t realised already) is six.
Whilst this makes a lot of sense for passwords (and a lesser, but still valid, measure for keys) it does mean that you effectively have a hard limit of six keys in your agent simultaneously (at least without using SSH configs to specify a single key to present to the server). Any more than six keys, and you run the very real risk that the key you need to give to a particular server will be number seven in your agent, and all your authentications will fail miserably.
Bumping the value of MaxAuthTries to a much larger value works fine for Github — password auth is disabled, and if you can manage to brute force a key you’re welcome to what you can get — but you certainly can’t rely on inflating MaxAuthTries everywhere to get you out of trouble, so: keep those SSH agents lean, or at least specify IdentityFile for all your servers.
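A minimal sketch of the "specify IdentityFile" approach, assuming an OpenSSH client. The helper name ssh_host_stanza and the key path are made up for illustration. IdentitiesOnly is an addition beyond what the text above mentions: it tells the client to offer only the configured key rather than everything the agent holds.

```shell
# Hypothetical helper: print an ssh_config stanza that pins one key to one host.
ssh_host_stanza() {
    host=$1
    keyfile=$2
    printf 'Host %s\n' "$host"
    printf '    IdentityFile %s\n' "$keyfile"
    # Offer only the key named above, not every key in the agent.
    printf '    IdentitiesOnly yes\n'
}

# Example (the key path is an assumption):
#   ssh_host_stanza github.com ~/.ssh/id_rsa_github >> ~/.ssh/config
```

With a stanza like this per server, the number of keys in the agent should stop mattering for that host, since only one key is ever presented.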
Whilst I’m a fan of using percentages for my disk space checks, sometimes an explicit size is more appropriate. So, you’d expect the following to work nicely:
$USER1$/check_disk -w 5G -c 1G -p /data/foo
If you don’t actually test that this works (by artificially filling your disk and seeing what happens), you may be dismayed to find that you only get alerted when the disk has 5MB of free disk space. Why is this?
Because Nagios, despite the fact that nobody has sweated the megabytes for about a gazillion years, doesn’t support ‘G’ as a suffix for thresholds. Oh, it’ll make a good show of pretending — after all, the output formatting options have ‘GB’ as an option — but nope, for your thresholds it’s “5000M” all the way.
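If you would rather think in gigabytes and let a script do the arithmetic, something like the following could generate the thresholds. This is only a sketch: to_mb is a hypothetical helper, and it assumes the plugin reads thresholds as plain integers in its default unit of megabytes (1 GB = 1024 MB here).

```shell
# Hypothetical helper: convert a size like 5G or 500M into whole megabytes.
to_mb() {
    case $1 in
        *G) echo $(( ${1%G} * 1024 )) ;;
        *M) echo "${1%M}" ;;
        *)  echo "$1" ;;   # assume a bare number is already in MB
    esac
}

# Print the check command with thresholds in MB. $USER1$ is a Nagios macro,
# not a shell variable, so it is kept literal inside single quotes.
printf '$USER1$/check_disk -w %s -c %s -p /data/foo\n' "$(to_mb 5G)" "$(to_mb 1G)"
```

Whatever numbers you end up with, fill a test disk and confirm the alert fires where you expect.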
ROCK ON!
I’ve never been a real fan of the output of big “industry analysis” firms, since their reports never seemed to really tell the whole story, and didn’t match up with my experiences “in the trenches”. Now I know why. A representative sample:
“I see. So, the companies in your magic quadrant, are they all paying clients of yours?”
“Well, yes they are,” He said, proudly.
“Well, if they are all paying clients, then what’s so ‘magic’ about being in the quadrant?”
“The companies are not all rated at the same level, some are rated much higher than others.”
“And should I be surprised to hear that the companies that pay you more so you can afford to have entire teams cover them full-time; you tend to know a lot about, and they tend to get better ratings?”
No answer.
“Maybe you should stop calling it the ‘Magic Quadrant’ and call it what it really is; perhaps ‘The Quadrant of Companies That Can Afford To Be In It’.”
Go read the whole article, though, it’s pure gold.
Remember the good old days, when Melissa and ILOVEYOU were the major virus threats, spreading via e-mail and causing all sorts of embarrassing conversations at work? Or maybe even earlier than that, when the only way you could get a virus was by engaging in risky sex? (I mean Software EXchange, of course… get your mind out of the gutter)
These days, anti-virus protection for e-mail is fairly thorough, and nobody’s really swapping floppies full of 16 colour games at recess. Malware authors have moved on to new and more fertile ground — embedding their junk in web pages, and relying on browser exploits to gain access to computers. Of course, with this method, you can only get infected if you actually visit a page that has an infestation, so the malware authors have two options: either entice you to visit their sites, or modify existing websites that users will visit in the course of their day — legitimate sites that people know and trust, but with a little added infection.
Enticing people to a whole dodgy site is usually just a matter of providing something people love to look at and sticking it in search engines. Since the attacker has to have a stable, identifiable presence for the search engines to direct users to, that can also be used by anti-malware lists like stopbadware.org to protect web users, so this isn’t a particularly effective means of attack, and is waning somewhat in popularity. Far more effective is infecting a legitimate website with some form of malware. How does it happen, though? In our experience, there are four vectors for infection:
The countermeasures required to combat all these vectors boil down to a few simple precautions.
Thus, if you are not familiar with the common security practices and problems with the language or environment that you are developing for, stop right now and go learn a little. There’s plenty of good information out there on the Internet from people who have learnt the lessons the hard way. Celebrate the benefits of literacy by learning from their mistakes rather than having to educate yourself by cleaning up an infected website. If you feel that isn’t something you can commit to, then please, for the sake of the Internet, find someone else to write the code.
This means that you need to keep yourself well-informed of any security updates for your off-the-shelf web applications. Subscribe to a relevant security announcements mailing list, or ensure that your vendor sends them to you. (If your commercial CMS vendor doesn’t have this ability, find a new CMS vendor).
Websites get compromised all the time, by a variety of methods. You should reinforce your defences, lest you be the next target.
1. The first person to mention Kerberos or other unused-in-practice authentication schemes in a comment gets a free laughing at. If you think SFTP and SCP aren’t supported in widely used web development programs, try finding something that supports GSSAPI…
This is the output of iptables -L on a webmin-managed box I just saw:
Chain INPUT (policy ACCEPT)
target     prot opt source               destination
ACCEPT     all  --  anywhere             anywhere
ACCEPT     all  --  anywhere             anywhere
ACCEPT     tcp  --  anywhere             anywhere            tcp flags:ACK/ACK
ACCEPT     all  --  anywhere             anywhere            state ESTABLISHED
ACCEPT     all  --  anywhere             anywhere            state RELATED
ACCEPT     udp  --  anywhere             anywhere            udp spt:domain dpts:1024:65535
ACCEPT     icmp --  anywhere             anywhere            icmp any
ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:ftp
ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:ssh
ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:smtp
ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:domain
ACCEPT     udp  --  anywhere             anywhere            udp dpt:domain
ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:http
ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:pop3
ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:imap
ACCEPT     udp  --  anywhere             anywhere            udp dpt:imap
ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:https
ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:mysql
ACCEPT     tcp  --  anywhere             anywhere            tcp spt:mysql
ACCEPT     tcp  --  anywhere             anywhere            tcp dpts:terabase:samsung-unidex
ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:ndmp
ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:dnp
LOG        all  --  anywhere             anywhere            LOG level debug prefix `DROPPED = '
ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:ftp-data

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination
ACCEPT     tcp  --  anywhere             anywhere            tcp spt:ftp-data dpt:ftp-data
ACCEPT     tcp  --  anywhere             anywhere            tcp spt:ftp dpt:ftp
Lovely that it has all those ports and whatnot opened up, but what’s with the ACCEPT policies?
Webmin: Now with FAILWALL management!
I should have been in marketing.
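For contrast, here is a rough sketch of what a default-deny INPUT chain might look like. It prints the commands rather than running them, since they need root and can lock you out. The function name and the choice of ports are illustrative assumptions, not taken from the box above.

```shell
# Dry run: print the iptables commands instead of executing them.
print_default_drop_rules() {
    echo "iptables -A INPUT -i lo -j ACCEPT"
    echo "iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT"
    for port in "$@"; do
        echo "iptables -A INPUT -p tcp --dport $port -j ACCEPT"
    done
    # Set the policy last, so a session in progress is not cut off
    # before the ACCEPT rules exist.
    echo "iptables -P INPUT DROP"
}

print_default_drop_rules 22 80 443
```

With a DROP policy, anything not explicitly accepted is discarded, which is the point of listing ports in the first place.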
Recently, I read an article from a fairly prominent “cloud computing” vendor, which contained a line that basically said “Let the cloud worry about your scalability and performance problems”. I nearly snorted my late-mid-morning can of mother out my nose when I read it. Here’s why.
“Let the cloud worry about your scalability” is nothing more than a thinly disguised version of “just throw more hardware at it”. This is a “solution” beloved of salespeople everywhere, because it’s plausible, real easy to say, and makes a whole pile more money for the company providing the hardware. However, while it can be an appropriate solution in the right circumstances, and with appropriate evidence of its effectiveness in those particular circumstances, it usually isn’t the only option, it often isn’t the best option, and sometimes it isn’t an effective option at all.
The dirty little secret of hosting is that your scaling ability is solely determined by your application: the technologies it uses and its internal architecture. Yes, you can probably get more performance or concurrent users out of throwing more hardware at it this time, but sooner or later more memory or faster CPUs aren’t going to do anything useful.
I suppose, in some perverse way, just telling developers that “the cloud will provide” could be construed as a kindness. In the same way that we give high school kids the simplified approximation of motion that is Newton’s Laws, rather than the complicated and fiddly reality of relativity, saying “let the cloud scale you to being the next Facebook” might be a useful approximation to let developers ignore extraneous details and focus on getting things “right enough”.
The vast majority of sites, even those who aspire to be the next Twitter, will never get to anywhere near that scale. Even if it is the goal (and plenty of sites manage to occupy a satisfying — and dare I say it, profitable — niche without needing a second datacentre full of equipment), a new site is only going to get that big by focusing on satisfying users and creating compelling applications.
Spending your time writing Yet Another Key-Value Store is an awesome way to spend a lazy weekend, but when you’re burning your rent cheque and credit rating trying to get your “next big thing” site off the ground, every minute spent not awesomising your user experience is putting you 35c closer to having to go back to working for The Man.
For whatever reason, though, it makes me uncomfortable to lie to people about things like this, even if I might think it’s in their interest. I know, first hand, the shock and pain that comes from finding out that your site, beloved by millions, is suddenly overloaded and unreliable — and, even worse, that throwing hardware at the problem won’t do a damned thing. It’s an awful feeling.
While you can’t be worrying about scaling to a million users when your site has a grand total of three users (one of whom is your mum), you have to prepare for it when things start to take off, and have a plan in place to deal with it.
Sooner or later, you’re going to have to sit down, find the pain points in your current architecture, and work out how to solve them. If you’re not comfortable doing that yourself, then you need smart systems people who know how. I can guarantee you that “the cloud” isn’t going to advise you on how to restructure your file storage so it will horizontally scale to a petabyte of data. Don’t rely on it to scale you out of trouble.
I have never in my life been asked, “How do porcupines make love?”. However, I know the answer very well: “very carefully”. In the same vein, when migrating the mass of data that makes up Github, you take your time and you work very, very carefully. Since this sort of migration doesn’t happen every day, and it’s not something you want to be learning on the job, I thought I’d write down my experiences for posterity.
I’m a big fan of automation, so there wasn’t much chance that this whole thing wasn’t going to be scripted up the wazoo. We just need to copy the filesystem data across, dump the database and load it into the new site… and we’re done. Right?
HA! Not likely. To give you an idea of the scale of this thing, it took close to 24 hours just to do an rsync scan of the repository filesystems, without actually copying any data. Then there’s the database — the events table alone contained approximately 81.5 million records, which took a great many hours to dump from the live database during pre-migration work. It doesn’t take a great mathematician to realise that copying all this data over the Internet while the site was down for business wasn’t going to fly.
Initially, we were going to rely on the bandwidth of a station wagon full of tapes (or a couple of USB drives in a FedEx jet, anyway) to do the initial copying of data. However, due to some technical problems at the old facility, the “average transfer rate” wasn’t very high (the copy to disk took several weeks to complete), and we ended up kicking off a network-based initial sync of the repository data that finished less than half an hour after the drives were plugged into the machines at the new data centre. While I’m still a fan of shipping disks around for large-scale transfers, I won’t discount using the Internet to transfer such a large data set around so quickly next time.
Since a single real-time copy wasn’t practical, we’d have to look to incremental copying, where we pre-sync as much data as possible before the Big Cutover Day, and then only copy the latest changes while the site is down.
Thankfully, Github’s software design has pretty much all the hooks we needed to make this a straightforward task. For example, we didn’t have to dump the entire events table, because once a row is written it’s never changed — so we only need to dump events that were created since the last dump.
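To illustrate the append-only trick, here is one way an incremental dump command could be put together. Everything named here is an assumption: the database "app", the table name "events", an auto-increment "id" column, and the state file. The real scripts may well have keyed on a timestamp instead. The sketch prints the mysqldump command rather than running it.

```shell
# File holding the highest row id copied so far (hypothetical location).
last_id_file=${TMPDIR:-/tmp}/events.last_id

# Print a mysqldump command covering only rows newer than the last one copied.
events_dump_command() {
    last_id=0
    if [ -f "$last_id_file" ]; then
        last_id=$(cat "$last_id_file")
    fi
    # --no-create-info: data only, the table already exists on the new side.
    echo "mysqldump --no-create-info --where='id > $last_id' app events"
}

events_dump_command
```

After each successful load, the state file would be updated with the newest id seen, so the next run picks up from there.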
The system also keeps track of the last time a repository was changed, which means that we can ask the database for a list of repositories that have changed since the last sync, which makes for a very simple (and quick!) incremental sync. For a smaller data set we would just use rsync directly, but due to the performance limitations of the previous hosting environment, this took far, far too long to do with just rsync.
So, we can script everything, and there’s the ability to do repeated incremental syncs. What do these scripts look like?
Well, first up, there’s a lot of them. It was best to write separate scripts to synchronise each data set — one for the repositories, one for the events table, one for the rest of the database, one for gists, and so on. This meant that it was fairly trivial to develop these scripts in parallel, and they could be tested and run independently of each other.
Also, each task that had to be performed for a given data set was in its own script, so each step could be tested independently. For example, the repo sync job consisted of one script to collect the list of repos that needed resyncing and write that list to disk, another script to sync a single repository, and a third script to loop over all the repos listed by the first script and invoke the second script for each of them.
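The third script in that trio, the loop, might look something like this sketch. SYNC_ONE stands in for the single-repository sync script and defaults to echo, so running it as-is is a dry run. The function name and the list format (one repository path per line) are assumptions.

```shell
# Command used to sync one repository; echo makes this a dry run by default.
SYNC_ONE=${SYNC_ONE:-echo}

# Run $SYNC_ONE for every repository listed in the file given as $1.
# A failure is reported but does not stop the remaining repositories.
sync_all() {
    list=$1
    failed=0
    while IFS= read -r repo; do
        [ -n "$repo" ] || continue          # skip blank lines
        # </dev/null stops the sync command from eating the rest of the list
        if ! $SYNC_ONE "$repo" </dev/null; then
            echo "FAILED: $repo" >&2
            failed=$((failed + 1))
        fi
    done < "$list"
    [ "$failed" -eq 0 ]                     # non-zero exit if anything failed
}
```

Keeping the loop this dumb is what lets the list-building and single-repo scripts be tested on their own.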
The other important properties of these scripts were:
Once all of these individual scripts had been written, tested, debugged, tested a few more times, and generally fretted over until our nails were chewed to the quick, it was time to assemble the master script. I’m not about to run a dozen scripts to migrate a site when one will suffice. This was particularly important in Github’s case because to minimise downtime we wanted to run several things in parallel, then wait until they’d all finished, then run the syncs that depended on the data we’d synced in the last lot, and so on. Our scripts looked a lot like this:
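# First batch: start both tasks in the background, each logging to its own file
task1 >logs/task1.log 2>&1 &
task1_pid=${!}
task2 >logs/task2.log 2>&1 &
task2_pid=${!}
# Block until the whole first batch has finished
wait $task1_pid
wait $task2_pid
# Second batch: these depend on the data synced by the first
task3 >logs/task3.log 2>&1 &
task3_pid=${!}
task4 >logs/task4.log 2>&1 &
task4_pid=${!}
task5 >logs/task5.log 2>&1 &
task5_pid=${!}
wait $task3_pid
wait $task4_pid
wait $task5_pid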
There was also a pile of “doing this, now doing this, now doing this” logging (with timestamps) that helped us to get a feel for how long the different parts would take, and where everything was up to.
When we actually performed the cutover, the “main” sync script was running for a total of 27 minutes. Given that we’d given ourselves an hour to get everything across, we were all quite pleased with this outcome.
Whilst all these scripts ran really well, and the background processes made everything run really fast, I must say it was a right pain in the butt to stop things mid-flight when it was necessary. Hitting Ctrl-C only stopped the foreground (controller) script, and all of the children that had been started in the background kept flying along.
Doing this again, I’d make sure all my scripts had traps on SIGINT that killed off all the child processes that they had spawned. In retrospect, this is just a variant of “one script to start everything” — you should only need to do one thing (Ctrl-C) to stop it all, as well.
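A sketch of what that might look like, with hypothetical helper names start_task and stop_tasks. One detail worth knowing: a non-interactive shell starts background commands with SIGINT ignored, which is why Ctrl-C never reached the children. Plain kill sends SIGTERM instead, which they do receive.

```shell
pids=""

# Start a command in the background and remember its PID.
start_task() {
    "$@" &
    pids="$pids $!"
}

# Kill (SIGTERM) and reap every task started with start_task.
stop_tasks() {
    for pid in $pids; do
        kill "$pid" 2>/dev/null
    done
    wait
    pids=""
}

# On Ctrl-C (SIGINT) or SIGTERM, stop the children, then exit.
trap 'stop_tasks; exit 130' INT TERM

# Usage (task1 and task2 are placeholder names):
#   start_task task1
#   start_task task2
#   wait
```

This only reaches direct children. A task that spawns its own background work would need the same trap inside it.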
Also, the timestamp files weren’t handled real well. If you did kill things off mid-run (or, heaven forbid, a script crashed out) then the timestamp files would be wrong, because we just did a straight touch at the beginning of the script. What would have been better would be something like this:
touch stamp.new
do_all_the_work
mv stamp stamp.prev
mv stamp.new stamp
This would make sure that premature death would leave the stamp as-is, while still capturing the true start time of the job (which a simple touch at the end would fail to do).
Testing the new site before we let users at it, we found that creating gists wasn’t working right. It turned out that the database dumping script didn’t have the right set of options, and the schemas of the tables weren’t quite right (no autoincrements), and that was giving gist creation conniptions. Thankfully, the bug in the script was quickly spotted and the database dump was re-run. We even managed to get the second dump and load completed before our scheduled maintenance window was finished. If our scripts hadn’t been broken down by data set, this resyncing process would have been made a whole lot harder because we wouldn’t have been able to easily run just the parts that needed to be redone.
Once we opened the floodgates of the new site, everything ran happily for a minute or two, and then ground to a halt. The whaaaaa? Poke, prod… hmm, the database is running a bit hotter than I’d expect… whoa! 1500 queries active, all against the events table, with the disks working so hard the heads nearly came out the sides of the cases. What’s going on here?
As it turns out, schema insanity had struck again — this time, some of the indexes on the events table had failed to come across. While we know what happened with the main database dump, this one is still a mystery. How did some of the indexes fail to materialise? We’ve gone over the dumps and can’t find how they got lost. We’re putting it down to yet another case of MySQL doing dumb things without telling anyone.
As a final small improvement to the migration process, the site was able to go into a “read only” mode, so that users could still browse code and pull from repositories while we were migrating. This made the migration a lot less intrusive for users, because a lot of site functions still worked, especially those used by casual users (who would be less likely to know all about the time of the migration).
Here are a few things I’ll definitely do differently next time:
I wonder when we’ll get our next Github-scale migration…
Anchr 2.0 makes you want to reach out and touch it; hold it; feel it. Your Anchr 2.0 pulsates with a reassuring rhythm, like that of a heart, but made of silicone instead of striated cardiac muscle.
Anchr 2.0 responds.. it is alive. If you listen carefully you can hear its machinations, at speeds beyond the limits of human ken. Don’t Panic – this is normal, but a helpful voice is always close by when you need it.
Anchr 2.0 is not made, but created. Observe its perfect finish and seamless form. The dull blue glow of security, punctuated by the cerise of backups. Anchr 2.0 fits snugly in the hands. Firm, but also yielding, you cannot discern the boundary; that is the sensation of redundancy. It is comforting.
Anchr 2.0 is communal, it is shared. But! A duality of nature: There is one, but there are also many. That is your Anchr 2.0; there are many like it, but that one is yours.
Anchr 2.0 is… everything you love about webhosting, with less crap