<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Anchor Web Hosting Blog &#187; migration</title>
	<atom:link href="http://www.anchor.com.au/blog/tag/migration/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.anchor.com.au/blog</link>
	<description>A view into the Anchor Engineroom</description>
	<lastBuildDate>Thu, 29 Jul 2010 06:35:57 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0</generator>
		<item>
		<title>Bringing the Mountain to Mohamed</title>
		<link>http://www.anchor.com.au/blog/2009/11/bringing-the-mountain-to-mohamed/</link>
		<comments>http://www.anchor.com.au/blog/2009/11/bringing-the-mountain-to-mohamed/#comments</comments>
		<pubDate>Thu, 12 Nov 2009 15:12:59 +0000</pubDate>
		<dc:creator>matt</dc:creator>
				<category><![CDATA[FTW]]></category>
		<category><![CDATA[automation]]></category>
		<category><![CDATA[github]]></category>
		<category><![CDATA[migration]]></category>
		<category><![CDATA[project starbug]]></category>
		<category><![CDATA[rsync]]></category>

		<guid isPermaLink="false">http://www.anchor.com.au/blog/?p=1341</guid>
		<description><![CDATA[I have never in my life been asked, &#8220;How do porcupines make love?&#8221;. However, I know the answer very well: &#8220;very carefully&#8221;. In the same vein, when migrating the mass of data that makes up Github, you take your time and you work very, very carefully. Since this sort of migration doesn&#8217;t happen every day, [...]]]></description>
			<content:encoded><![CDATA[<p>I have never in my life been asked, &#8220;How do porcupines make love?&#8221;. However, I know the answer very well: &#8220;very carefully&#8221;.  In the same vein, when migrating the mass of data that makes up Github, you take your time and you work very, <em>very</em> carefully.  Since this sort of migration doesn&#8217;t happen every day, and it&#8217;s not something you <em>want</em> to be learning on the job, I thought I&#8217;d write down my experiences for posterity.</p>
<h3>SCRIPT IT!</h3>
<p>As a big fan of automation, there wasn&#8217;t much chance that this whole thing wasn&#8217;t going to be scripted up the wazoo.  We just need to copy the filesystem data across, dump the database and load it into the new site&#8230; and we&#8217;re done.  Right?</p>
<p>HA!  Not likely.  To give you an idea of the scale of this thing, it took close to 24 hours just to do an rsync scan of the repository filesystems, <em>without</em> actually copying any data.  Then there&#8217;s the database &#8212; the events table alone contained approximately 81.5 million records, which took a great many hours to dump from the live database during pre-migration work.  It doesn&#8217;t take a great mathematician to realise that copying all this data over the Internet while the site was down for business wasn&#8217;t going to fly.</p>
<p>Initially, we were going to rely on the bandwidth of a station wagon full of tapes (or a couple of USB drives in a FedEx jet, anyway) to do the initial copying of data.  However, due to some technical problems at the old facility, the &#8220;average transfer rate&#8221; wasn&#8217;t very high (the copy to disk took several <em>weeks</em> to complete), and we ended up kicking off a network-based initial sync of the repository data that finished less than half an hour after the drives were plugged into the machines at the new data centre.  While I&#8217;m still a fan of shipping disks around for large-scale transfers, I won&#8217;t discount using the Internet to transfer such a large data set around so quickly next time.</p>
<h4>Incrementalism</h4>
<p>Since a single real-time copy wasn&#8217;t practical, we&#8217;d have to look to incremental copying, where we pre-sync as much data as possible before the Big Cutover Day, and then only copy the latest changes while the site is down.</p>
<p>Thankfully, Github&#8217;s software design has pretty much all the hooks we needed to make this a straightforward task.  For example, we didn&#8217;t have to dump the entire events table, because once a row is written it&#8217;s never changed &#8212; so we only need to dump events that were created since the last dump.</p>
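<p>As a rough sketch of that incremental dump (not the actual script): assuming the events table has an auto-incrementing <tt>id</tt> column, and with the <tt>sync_events</tt> function, host names and database name all invented for illustration, it could look something like this:</p>
<pre>
# Copy only the events that the destination does not have yet.
# Arguments: source host, destination host, database name (all hypothetical).
sync_events() {
    local src=$1 dst=$2 db=$3 last_id
    # Ask the destination where it is up to (0 if the table is still empty).
    last_id=$(mysql -h "$dst" -N -B -e 'SELECT COALESCE(MAX(id), 0) FROM events' "$db")
    # Dump rows past that point, without CREATE TABLE statements, and load them.
    mysqldump -h "$src" --single-transaction --no-create-info \
        --where="id &gt; $last_id" "$db" events | mysql -h "$dst" "$db"
}
</pre>
<p>Calling <tt>sync_events old-db.example.com new-db.example.com github</tt> repeatedly is then cheap, because each run only moves the rows added since the previous one.</p>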
<p>The system also keeps track of the last time a repository was changed, which means that we can ask the database for a list of repositories that have changed since the last sync, which makes for a very simple (and quick!) incremental sync.  For a smaller data set we would just use rsync directly, but due to the performance limitations of the previous hosting environment, this took far, <em>far</em> too long to do with just rsync.</p>
<p>So, we can script everything, and there&#8217;s the ability to do repeated incremental syncs.  What do these scripts look like?</p>
<p>Well, first up, there&#8217;s a lot of them.  It was best to write separate scripts to synchronise each data set &#8212; one for the repositories, one for the events table, one for the rest of the database, one for gists, and so on.  This meant that it was fairly trivial to develop these scripts in parallel, and they could be tested and run independently of each other.  </p>
<p>Also, each task that had to be performed for a given data set was in its own script, so each step could be tested independently.  For example, the repo sync job consisted of one script to collect the list of repos that needed resyncing and write that list to disk, another script to sync a single repository, and a third script to loop over all the repos listed by the first script and invoke the second script for each of them.</p>
<p>The other important properties of these scripts were:</p>
<ul>
<li>We relied heavily on multitasking to overcome bandwidth limitations from a single TCP stream.  When you&#8217;re copying data over high capacity links, your available transfer rate is constrained more by the round-trip time between the endpoints than the available bandwidth &#8212; the longer it takes for an ACK to get back to the sender, the slower your data will flow.  So, since we had eight filesystems to copy data from, we fired off eight parallel rsync processes as child processes of the individual scripts.</li>
<li>Each script kept track of what it was doing and what it had done, and tried to avoid doing the same work again.  The repository syncs kept track of the repositories that had already been copied by means of a timestamp file: when we did a sync, we touched a file and then used the mtime of that file (<tt>stat -c %Y</tt> ftw!) to determine the start time of the next sync. The events table was straightforward: before each dump, we just asked the destination table where it was up to, and dumped from there.  Even the &#8220;main&#8221; database, which we dumped in its entirety each time, was dumped to a file compressed with <tt>gzip --rsyncable</tt> before being rsync&#8217;d across, saving a good few minutes of network transfer time on each cycle.</li>
<li>If something went wrong during the sync, we knew about it immediately. We wired up a small SMS sending script to send us alerts if the script terminated improperly.  This saved us a lot of waiting and watching, because we knew that we&#8217;d be told when we had to take notice of what&#8217;s going on.</li>
<li>Everything was logged.  The stdout and stderr of all processes were captured, and the scripts wrote their own log entries to that stream as well as to a &#8220;summary&#8221; log, like this: <tt>echo $(date) processing repo $repo |tee -a $LOGFILE</tt>.  Any errors were tagged with a unique string and written in a machine-parseable format, so we could re-run any failed components of the sync to ensure that nothing was missed.</li>
<li>While there were typically several scripts that had to be run in an appropriate order to make a sync happen, there was always a single script that did everything that needed doing &#8212; we never had to run more than one command to get a given sync done.</li>
</ul>
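<p>To illustrate the first point in that list, here is a minimal sketch of running one rsync per filesystem in parallel and noticing if any of them fails.  The <tt>sync_all</tt> name, the destination host and the paths are made up for the example, and the real scripts did considerably more than this:</p>
<pre>
# Run one rsync per source filesystem in parallel, then wait for all of them.
# Returns non-zero if any of the copies failed.
sync_all() {
    local dest=$1; shift
    local fs pid pids=() failed=0
    mkdir -p logs
    for fs in "$@"; do
        # -a preserves permissions and times; -H keeps hard links intact
        rsync -aH "$fs/" "$dest:$fs/" &gt;"logs/rsync-$(basename "$fs").log" 2&gt;&amp;1 &amp;
        pids+=("$!")
    done
    for pid in "${pids[@]}"; do
        wait "$pid" || failed=1
    done
    return "$failed"
}
</pre>
<p>Something like <tt>sync_all new-fs.example.com /data/repos1 /data/repos2</tt> would then start one stream per filesystem, each with its own log file, and only return once every copy has finished.</p>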
<p>Once all of these individual scripts had been written, tested, debugged, tested a few more times, and generally fretted over until our nails were chewed to the quick, it was time to assemble the master script.  I&#8217;m not about to run a dozen scripts to migrate a site when one will suffice.  This was particularly important in Github&#8217;s case because to minimise downtime we wanted to run several things in parallel, then wait until they&#8217;d all finished, then run the syncs that depended on the data we&#8217;d synced in the last lot, and so on.  Our scripts looked a lot like this:</p>
<pre>
task1 &gt;logs/task1.log 2&gt;&amp;1 &amp;
task1_pid=${!}

task2 &gt;logs/task2.log 2&gt;&amp;1 &amp;
task2_pid=${!}

wait $task1_pid
wait $task2_pid

task3 &gt;logs/task3.log 2&gt;&amp;1 &amp;
task3_pid=${!}

task4 &gt;logs/task4.log 2&gt;&amp;1 &amp;
task4_pid=${!}

task5 &gt;logs/task5.log 2&gt;&amp;1 &amp;
task5_pid=${!}

wait $task3_pid
wait $task4_pid
wait $task5_pid
</pre>
<p>There was also a pile of &#8220;doing this, now doing this, now doing this&#8221; logging (with timestamps) that helped us to get a feel for how long the different parts would take, and where everything was up to.</p>
<p>When we actually performed the cutover, the &#8220;main&#8221; sync script was running for a total of 27 minutes. Given that we&#8217;d given ourselves an hour to get everything across, we were all quite pleased with this outcome.</p>
<h4>Putting on the brakes</h4>
<p>Whilst all these scripts ran really well, and the background processes made everything run really fast, I must say it was a right pain in the butt to stop things mid-flight when it was necessary.  Hitting Ctrl-C only stopped the foreground (controller) script, and all of the children that had been started in the background kept flying along.</p>
<p>Doing this again, I&#8217;d make sure all my scripts had traps on SIGINT that killed off all the child processes that they had spawned.  In retrospect, this is just a variant of &#8220;one script to start everything&#8221; &#8212; you should only need to do one thing (Ctrl-C) to stop it all, as well.</p>
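<p>A sketch of what such a trap might look like in bash follows; the <tt>spawn</tt> and <tt>cleanup</tt> helpers are names invented for this example.  As far as I understand it, a non-interactive bash script starts its background commands with SIGINT ignored, which would explain why Ctrl-C never reached the children, so the handler sends them the default SIGTERM instead:</p>
<pre>
pids=()

# Start a command in the background and remember its PID.
spawn() {
    "$@" &amp;
    pids+=("$!")
}

# Kill every child started via spawn, reap them, then exit.
cleanup() {
    trap - INT TERM
    kill "${pids[@]}" 2&gt;/dev/null    # default signal is SIGTERM
    wait
    exit 130
}
trap cleanup INT TERM
</pre>
<p>The controller script then becomes a series of lines like <tt>spawn task1 &gt;logs/task1.log 2&gt;&amp;1</tt> followed by <tt>wait "${pids[@]}"</tt>.  This only reaches direct children, so each child script that starts background jobs of its own needs the same treatment.</p>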
<p>Also, the timestamp files weren&#8217;t handled real well.  If you did kill things off mid-run (or, heaven forbid, a script crashed out) then the timestamp files would be wrong, because we just did a straight touch at the beginning of the script.  What would have been better would be something like this:</p>
<pre>
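touch stamp.new                  # records the true start time of this run
do_all_the_work || exit 1        # on failure, leave the old stamp untouched
[ -e stamp ] &amp;&amp; mv stamp stamp.prev    # no previous stamp on the very first run
mv stamp.new stamp               # promote the new stamp only after success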
</pre>
<p>This would make sure that premature death would leave the stamp as-is, while still capturing the true start time of the job (which a simple touch at the end would fail to do).</p>
<h3>When Databases Attack</h3>
<p>Testing the new site before we let users at it, we found that creating gists wasn&#8217;t working right.  It turned out that the database dumping script didn&#8217;t have the right set of options, and the schemas of the tables weren&#8217;t quite right (no autoincrements), and that was giving gist creation conniptions.  Thankfully, the bug in the script was quickly spotted and the database dump was re-run.  We even managed to get the second dump and load completed before our scheduled maintenance window was finished.  If our scripts hadn&#8217;t been broken down by data set, this resyncing process would have been made a whole lot harder because we wouldn&#8217;t have been able to easily run <em>just</em> the parts that needed to be redone.</p>
<p>Once we opened the floodgates of the new site, everything ran happily for a minute or two, and then ground to a halt.  The whaaaaa?  Poke, prod&#8230; hmm, the database is running a bit hotter than I&#8217;d expect&#8230; whoa!  1500 queries active, all against the events table, with the disks working so hard the heads nearly came out the sides of the cases.  What&#8217;s going on here?</p>
<p>As it turns out, schema insanity had struck again &#8212; this time, <em>some</em> of the indexes on the events table had failed to come across.  While we know what happened with the main database dump, this one is still a mystery.  How did <em>some</em> of the indexes fail to materialise?  We&#8217;ve gone over the dumps and can&#8217;t find how they got lost.  We&#8217;re putting it down to yet another case of MySQL doing dumb things without telling anyone.</p>
<h3>Limiting the impact</h3>
<p>As a final small improvement to the migration process, the site was able to go into a &#8220;read only&#8221; mode, so that users could still browse code and pull from repositories while we were migrating.  This made the migration a lot less intrusive for users, because a lot of site functions still worked, especially those used by casual users (who would be less likely to know about the timing of the migration).</p>
<h3>Lessons Learnt</h3>
<p>Here are a few things I&#8217;ll definitely do differently next time:</p>
<ul>
<li>Anywhere you&#8217;re depending on a third party to execute part of your migration, have a backup plan in case they can&#8217;t deliver &#8212; and know when you&#8217;ll have to execute your backup plan.  In our case, knowing exactly how long it would have taken to copy all the data over the Internet and then calculating back, we would have known to start copying over the network a few days earlier than we did. </li>
<li>Make sure that synchronisation scripts are as easy to stop as they are to start.</li>
<li>Verify the database schemas completely on the destination DB server by manual inspection, as well as dumping them and comparing to what&#8217;s on the source DB server.</li>
</ul>
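<p>For that last point, the dump-and-compare half could be sketched along these lines.  The function names and arguments are illustrative only; the <tt>sed</tt> step strips the per-table <tt>AUTO_INCREMENT=N</tt> counter, which legitimately differs between servers, while leaving the <tt>AUTO_INCREMENT</tt> column attribute in place so that a missing one still shows up:</p>
<pre>
# Dump only the table definitions (no rows) from one server.
dump_schema() {
    mysqldump -h "$1" --no-data --skip-comments "$2" |
        sed 's/ AUTO_INCREMENT=[0-9]*//'
}

# Print any differences between the source and destination schemas.
# diff exits non-zero when they do not match.
compare_schemas() {
    local src=$1 dst=$2 db=$3
    diff -u &lt;(dump_schema "$src" "$db") &lt;(dump_schema "$dst" "$db")
}
</pre>
<p>An empty result from <tt>compare_schemas old-db.example.com new-db.example.com github</tt> is a good sign, but it complements eyeballing the tables and indexes rather than replacing it.</p>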
<p>I wonder when we&#8217;ll get our next Github-scale migration&#8230;</p>
]]></content:encoded>
			<wfw:commentRss>http://www.anchor.com.au/blog/2009/11/bringing-the-mountain-to-mohamed/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Pain-free server migration</title>
		<link>http://www.anchor.com.au/blog/2009/04/pain-free-server-migration/</link>
		<comments>http://www.anchor.com.au/blog/2009/04/pain-free-server-migration/#comments</comments>
		<pubDate>Thu, 09 Apr 2009 04:40:50 +0000</pubDate>
		<dc:creator>oliver</dc:creator>
				<category><![CDATA[FTW]]></category>
		<category><![CDATA[datacentre]]></category>
		<category><![CDATA[documentation]]></category>
		<category><![CDATA[migration]]></category>
		<category><![CDATA[painfree]]></category>
		<category><![CDATA[preparation]]></category>
		<category><![CDATA[review]]></category>
		<category><![CDATA[server]]></category>

		<guid isPermaLink="false">http://www.anchor.com.au/blog/?p=786</guid>
		<description><![CDATA[Being the veteran of a datacentre migration and several whole server migrations I feel like I&#8217;m getting the process down to a reasonably fine art. I had to perform another migration last night from another datacentre to ours at Global Switch and the process went very smoothly so I thought I&#8217;d share some of the [...]]]></description>
			<content:encoded><![CDATA[<p>Being the veteran of a datacentre migration and several whole server migrations I feel like I&#8217;m getting the process down to a reasonably fine art. I had to perform another migration last night from another datacentre to ours at Global Switch and the process went very smoothly so I thought I&#8217;d share some of the techniques I&#8217;ve built up over time so you might benefit if you&#8217;re in the same situation.</p>
<h3>Preparation</h3>
<p>This should go without saying. The more time you have to prepare for the migration, the better. You <strong>do not</strong> want to leave it until the last minute. My philosophy when approaching a migration is always to leave the least amount of work possible to do at the time of the actual migration. Clients will generally want to schedule any server downtime for late at night, when you are not going to be operating at your best (no matter how many coffees or energy drinks you may have consumed). If you can log in to the machine and run a prepared script which takes care of everything and completes the migration for you, you will end up with a happy client and be happy yourself. You will be in the datacentre for less time and get to bed earlier, both of which are good things.</p>
<h3>Make good use of scripting</h3>
<p>Following on from my last point, I strongly encourage you to script as much as possible. The migration I just performed entailed moving a server from one datacentre and network provider to another which meant a change in address space. Thus, firewalls, IP address configuration files, Apache vhosts, ACLs and more had to change. Ahead of time I determined which files would need to be modified and created a script which took a backup of each of these files before overwriting them with corrected versions. Any failure would cause the script to stop and print the problem which could be easily diagnosed manually.</p>
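<p>A stripped-down sketch of that backup-then-overwrite step might look like the following.  The <tt>replace_file</tt> helper and the <tt>.pre-migration</tt> suffix are names chosen for this example, not what the real script used:</p>
<pre>
set -eu    # stop at the first failure or unset variable

# Replace a live config file with a prepared version, keeping a backup.
replace_file() {
    local new=$1 live=$2
    cp -p "$live" "$live.pre-migration"    # back up first, keeping mode and times
    cp "$new" "$live"
    echo "replaced $live"
}
</pre>
<p>The body of the script is then one <tt>replace_file prepared/httpd.conf /etc/httpd/conf/httpd.conf</tt> line per file.  Because of <tt>set -e</tt>, a failed copy stops the run straight away with the error from <tt>cp</tt> still on screen, and the backups give you something to roll back to.</p>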
<p>The more automation and failsafes you can build into your script, the better. Since you will be creating it with plenty of time up your sleeve and your brain operating at full capacity, you can build up the script with your full arsenal of tricks. At 3am in a cold datacentre with noisy airconditioning you can hardly expect to have your full faculties with you, so make life easier for yourself by leaving as little actual work as possible to do at this point.</p>
<h3>Fully acquaint yourself with the server</h3>
<p>You will only know what needs to be changed on the server if you are familiar with it. Of course, you should have plenty of good documentation already on it but if not, log in and get the lie of the land. Have a plan for how you will find out facts about the system &#8211; make use of grep and well structured regexes for finding out configuration details, slocate (if there is a locate database present) for finding critical files, and your usual toolkit of sysadmin techniques.</p>
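<p>As one small example of that kind of fact-finding, this sketch lists the files that still mention an address which is about to change.  The function name is made up; <tt>-F</tt> matches the address as a fixed string so the dots are not treated as regex wildcards, and <tt>-w</tt> stops <tt>192.0.2.10</tt> from also matching <tt>192.0.2.100</tt>:</p>
<pre>
# List every file under a directory that mentions the given address.
# -r recurses, -l prints file names only, -w matches whole words,
# -F treats the pattern as a fixed string rather than a regex.
find_old_address() {
    grep -rlwF -- "$1" "$2"
}
</pre>
<p>Running <tt>find_old_address 192.0.2.10 /etc</tt> (with your own old address) gives a first cut of the files your migration script will need to rewrite.</p>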
<h3>Document as you go</h3>
<p>At Anchor, documentation is critically important. We have an internal wiki system in which we make detailed notes on every server and a great number of technical articles (a lot of which we have shared with you in our <a href="http://www.anchor.com.au/hosting/">public wiki</a>). Every migration plan is carefully documented from start to finish. In more complicated scenarios a full change proposal is created and officially ratified, but at the very least you should create a checklist:</p>
<ul>
<li>people involved (and their contact details, if necessary)</li>
<li>time frame</li>
<li>a detailed list of items that need to be prepared or information that needs to be acquired before the migration takes place</li>
<li>actions that will be undertaken just before the migration starts</li>
<li>the list of actual migration steps, including details of what any scripts will be doing</li>
<li>post-migration actions which need to be done immediately after the migration &#8211; e.g. checking that all your monitoring is showing OK for all hosts and services</li>
<li>a list of &#8220;cleanup&#8221; items which can be completed after the migration but are not time critical, e.g. removing stale references to servers from your internal documentation</li>
</ul>
<p>Have as many people check over your documentation as possible, preferably those who have knowledge of the systems so that they can find anything you have missed. The more eyes on your documentation, and heads thinking about it, the better the chances that you will have a plan that covers all aspects.</p>
<p>One of the most important things from my point of view with documentation is to forward a copy to the client, and keep them involved in the process. Not only does it give them confidence in your abilities to conduct the migration successfully, but it gives them an idea of the work that you have had to put in, gives transparency to the process and gives you another point of view on the migration &#8211; there may be other steps important to them which you may have missed, for example lowering TTLs on domains that are solely client-controlled.</p>
<h3>Keep the client &#8220;in the loop&#8221;</h3>
<p>Following on from my previous point, as well as giving the client a copy of your migration documentation, it is important to let them know what is going on. Send them a courtesy email or give them a call every day or two, or whatever you deem appropriate, to let them know how you are going with preparations and what information you still need from them.</p>
<p>On the day of the migration, double-check everything with them &#8211; times, contact details, the migration plan, and so on. Make sure they are still happy to go ahead and that they are happy with your plans. Give them a courtesy call or message when you are about to start the migration, when you are finishing, but most importantly whenever you have any unexpected problems. Nothing upsets clients more than having things go pear-shaped and not being informed about it. Even if you don&#8217;t know what the problem is, let them know that you are diligently working on it and will keep them up to date with developments.</p>
<h3>Plan for when things go wrong</h3>
<p>In a perfect world, you would prepare adequately and everything would go flawlessly (as it did for me last night, luckily). However every slightly obsessive-compulsive systems administrator knows that things can and will go wrong every now and then despite your best efforts.</p>
<p>Make an escape plan for every point where things can go wrong during the migration. Given you won&#8217;t have infinite time available, prepare most for the most likely failure scenarios. Make a rollback plan which will abort the migration, and decide how many failures will cause you to take this rollback plan on the night. Confirm this with the client.</p>
<p>Make sure that every change you make can be reverted (which most times will necessitate backups). There is nothing worse than discovering you have irrevocably destroyed data in the process of making a critical change.</p>
<h3>Approach everything with an obsessive-compulsive attitude</h3>
<p>The best plans will have considered everything and left no detail to chance. It can be tiresome to be painstakingly thorough in your plans, but ultimately it will pay off. At the same time though, you don&#8217;t have to do everything in one sitting &#8211; make notes in your migration plan on what you still need to do and follow it up later. Don&#8217;t foolishly believe you will remember everything on the migration day, or even an hour from now &#8211; WRITE IT DOWN!</p>
<p>Remember, even though the preparation may be slightly tiresome, you are just making life easier for yourself at migration time. Hopefully if you follow these general tips I&#8217;ve prepared, they will make your next migration a lot easier.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.anchor.com.au/blog/2009/04/pain-free-server-migration/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
