<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Anchor Web Hosting Blog &#187; project starbug</title>
	<atom:link href="http://www.anchor.com.au/blog/tag/project-starbug/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.anchor.com.au/blog</link>
	<description>A view into the Anchor Engineroom</description>
	<lastBuildDate>Wed, 08 Feb 2012 00:51:36 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Bringing the Mountain to Mohamed</title>
		<link>http://www.anchor.com.au/blog/2009/11/bringing-the-mountain-to-mohamed/</link>
		<comments>http://www.anchor.com.au/blog/2009/11/bringing-the-mountain-to-mohamed/#comments</comments>
		<pubDate>Thu, 12 Nov 2009 15:12:59 +0000</pubDate>
		<dc:creator>matt</dc:creator>
				<category><![CDATA[FTW]]></category>
		<category><![CDATA[automation]]></category>
		<category><![CDATA[github]]></category>
		<category><![CDATA[migration]]></category>
		<category><![CDATA[project starbug]]></category>
		<category><![CDATA[rsync]]></category>

		<guid isPermaLink="false">http://www.anchor.com.au/blog/?p=1341</guid>
		<description><![CDATA[I have never in my life been asked, &#8220;How do porcupines make love?&#8221;. However, I know the answer very well: &#8220;very carefully&#8221;. In the same vein, when migrating the mass of data that makes up Github, you take your time and you work very, very carefully. Since this sort of migration doesn&#8217;t happen every day, [...]]]></description>
			<content:encoded><![CDATA[<p>I have never in my life been asked, &#8220;How do porcupines make love?&#8221;. However, I know the answer very well: &#8220;very carefully&#8221;.  In the same vein, when migrating the mass of data that makes up Github, you take your time and you work very, <em>very</em> carefully.  Since this sort of migration doesn&#8217;t happen every day, and it&#8217;s not something you <em>want</em> to be learning on the job, I thought I&#8217;d write down my experiences for posterity.</p>
<h3>SCRIPT IT!</h3>
<p>As a big fan of automation, there wasn&#8217;t much chance that this whole thing wasn&#8217;t going to be scripted up the wazoo.  We just need to copy the filesystem data across, dump the database and load it into the new site&#8230; and we&#8217;re done.  Right?</p>
<p>HA!  Not likely.  To give you an idea of the scale of this thing, it took close to 24 hours just to do an rsync scan of the repository filesystems, <em>without</em> actually copying any data.  Then there&#8217;s the database &#8212; the events table alone contained approximately 81.5 million records, which took a great many hours to dump from the live database during pre-migration work.  It doesn&#8217;t take a great mathematician to realise that copying all this data over the Internet while the site was down for business wasn&#8217;t going to fly.</p>
<p>Initially, we were going to rely on the bandwidth of a station wagon full of tapes (or a couple of USB drives in a FedEx jet, anyway) to do the initial copying of data.  However, due to some technical problems at the old facility, the &#8220;average transfer rate&#8221; wasn&#8217;t very high (the copy to disk took several <em>weeks</em> to complete), and we ended up kicking off a network-based initial sync of the repository data that finished less than half an hour after the drives were plugged into the machines at the new data centre.  While I&#8217;m still a fan of shipping disks around for large-scale transfers, I won&#8217;t discount using the Internet to transfer such a large data set around so quickly next time.</p>
<h4>Incrementalism</h4>
<p>Since a single real-time copy wasn&#8217;t practical, we&#8217;d have to look to incremental copying, where we pre-sync as much data as possible before the Big Cutover Day, and then only copy the latest changes while the site is down.</p>
<p>Thankfully, Github&#8217;s software design has pretty much all the hooks we needed to make this a straightforward task.  For example, we didn&#8217;t have to dump the entire events table, because once a row is written it&#8217;s never changed &#8212; so we only need to dump events that were created since the last dump.</p>
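<p>As a rough illustration of that incremental events dump, here is a minimal sketch. It assumes the events table has an auto-increment primary key called <tt>id</tt>, which the migration scripts may or may not have relied on; the host names and the database name are placeholders as well.</p>

```shell
# Sketch: dump only the events rows that the destination does not have yet.
# Assumptions (not from the original scripts): an auto-increment "id" column,
# and the placeholder host and database names below.
SRC_DB_HOST=${SRC_DB_HOST:-old-db.example.com}
DEST_DB_HOST=${DEST_DB_HOST:-new-db.example.com}
DB_NAME=${DB_NAME:-github}

last_synced_event_id() {
    # Highest event id already loaded at the destination (0 if the table is empty).
    # -N drops the column header, -B gives plain batch output.
    mysql --host="$DEST_DB_HOST" -N -B \
        -e 'SELECT COALESCE(MAX(id), 0) FROM events' "$DB_NAME"
}

dump_new_events() {
    last_id=$(last_synced_event_id) || return 1
    # --no-create-info omits the CREATE TABLE statements;
    # --where restricts the dump to rows newer than last_id.
    mysqldump --host="$SRC_DB_HOST" --no-create-info \
        --where="id > $last_id" "$DB_NAME" events
}
```

<p>Piping the output of <tt>dump_new_events</tt> into <tt>mysql</tt> on the destination server would then load just the new rows.</p>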
<p>The system also keeps track of the last time a repository was changed, which means that we can ask the database for a list of repositories that have changed since the last sync, which makes for a very simple (and quick!) incremental sync.  For a smaller data set we would just use rsync directly, but due to the performance limitations of the previous hosting environment, this took far, <em>far</em> too long to do with just rsync.</p>
<p>So, we can script everything, and there&#8217;s the ability to do repeated incremental syncs.  What do these scripts look like?</p>
<p>Well, first up, there&#8217;s a lot of them.  It was best to write separate scripts to synchronise each data set &#8212; one for the repositories, one for the events table, one for the rest of the database, one for gists, and so on.  This meant that it was fairly trivial to develop these scripts in parallel, and they could be tested and run independently of each other.  </p>
<p>Also, each task that had to be performed for a given data set was in its own script, so each step could be tested independently.  For example, the repo sync job consisted of one script to collect the list of repos that needed resyncing and write that list to disk, another script to sync a single repository, and a third script to loop over all the repos listed by the first script and invoke the second script for each of them.</p>
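<p>The three-step split for the repo sync could be sketched roughly as follows, written here as three shell functions rather than three separate scripts. The table and column names, <tt>SRC_HOST</tt> and <tt>DEST_ROOT</tt> are all hypothetical stand-ins, not the names used in the real migration.</p>

```shell
# Step 1: write the list of repositories changed since the last sync.
# The database, table and column names are hypothetical.
list_changed_repos() {
    since=$1        # start time of the previous sync, in epoch seconds
    listfile=$2
    mysql -N -B \
        -e "SELECT path FROM repositories WHERE updated_at >= FROM_UNIXTIME($since)" \
        github > "$listfile"
}

# Step 2: sync a single repository.
# SRC_HOST and DEST_ROOT are assumptions; --relative recreates the source
# path underneath DEST_ROOT.
sync_repo() {
    rsync -a --delete --relative "$SRC_HOST:$1/" "$DEST_ROOT/"
}

# Step 3: run the per-repo command for every listed repo, tagging failures
# so that they can be found and re-run later.
sync_listed_repos() {
    listfile=$1
    sync_cmd=${2:-sync_repo}
    failures=0
    while IFS= read -r repo; do
        [ -n "$repo" ] || continue
        # stdin comes from /dev/null so the command cannot swallow the list.
        if ! "$sync_cmd" "$repo" < /dev/null; then
            echo "SYNC-FAILED repo=$repo" >&2
            failures=$((failures + 1))
        fi
    done < "$listfile"
    [ "$failures" -eq 0 ]
}
```

<p>Keeping the loop separate from the per-repo command means each piece can be exercised on its own, for example by passing a harmless stand-in command as the second argument to <tt>sync_listed_repos</tt>.</p>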
<p>The other important properties of these scripts were:</p>
<ul>
<li>We relied heavily on multitasking to overcome bandwidth limitations from a single TCP stream.  When you&#8217;re copying data over high capacity links, your available transfer rate is constrained more by the round-trip time between the endpoints than the available bandwidth &#8212; the longer it takes for an ACK to get back to the sender, the slower your data will flow.  So, since we had eight filesystems to copy data from, we fired off eight parallel rsync processes as child processes of the individual scripts.</li>
<li>Each script kept track of what it was doing and what it had done, and tried to avoid doing the same work again.  The repository syncs kept track of the repositories that had already been copied by means of a timestamp file &#8212; when we did a sync, we touched a file and then used the mtime of that file (<tt>stat -c %Y</tt> ftw!) to determine the start time of the next sync. The events table was straightforward &#8212; before each dump, we just asked the destination table where it was up to, and dumped from there.  Even the &#8220;main&#8221; database, which we dumped in its entirety each time, was dumped to a file compressed with <tt>gzip --rsyncable</tt> before being rsync&#8217;d across, saving a good few minutes of network transfer time on each cycle.</li>
<li>If something went wrong during the sync, we knew about it immediately. We wired up a small SMS sending script to send us alerts if the script terminated improperly.  This saved us a lot of waiting and watching, because we knew that we&#8217;d be told when we had to take notice of what&#8217;s going on.</li>
<li>Everything was logged.  The stdout and stderr of all processes were captured, and the scripts wrote their own log entries to that stream as well as to a &#8220;summary&#8221; log, like this: <tt>echo $(date) processing repo $repo |tee -a $LOGFILE</tt>.  Any errors were tagged with a unique string and written in a machine-parseable format, so we could re-run any failed components of the sync to ensure that nothing was missed.</li>
<li>While there were typically several scripts that had to be run in an appropriate order to make a sync happen, there was always a single script that did everything that needed doing &#8212; we never had to run more than one command to get a given sync done.</li>
</ul>
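<p>The parallelism, logging and error tagging from the list above might fit together along these lines. This is only a sketch: the <tt>SYNC-FAILED</tt> tag, the log file name and the function names are made up for illustration, and filesystem names are assumed to contain no whitespace.</p>

```shell
# Sketch: one background sync job per filesystem, with a summary log and
# machine-parseable failure lines. All names here are illustrative.
LOGFILE=${LOGFILE:-sync-summary.log}

log() {
    # Timestamped entry to stdout and to the summary log.
    echo "$(date) $*" | tee -a "$LOGFILE"
}

sync_all() {
    # Usage: sync_all SYNC_CMD fs1 fs2 ...
    sync_cmd=$1; shift
    jobs_list=""
    failed=0

    # Start every job in the background, remembering "pid:name" pairs.
    for fs in "$@"; do
        log "starting sync of $fs"
        "$sync_cmd" "$fs" &
        jobs_list="$jobs_list $!:$fs"
    done

    # Collect each job's exit status and tag any failure.
    for job in $jobs_list; do
        pid=${job%%:*}
        fs=${job#*:}
        if wait "$pid"; then
            log "finished sync of $fs"
        else
            log "SYNC-FAILED fs=$fs"
            failed=1
        fi
    done
    return "$failed"
}
```

<p>With a per-filesystem command in hand (an rsync wrapper, say), a call such as <tt>sync_all my_rsync_wrapper fs1 fs2 fs3</tt> runs all the copies at once, and a later <tt>grep SYNC-FAILED</tt> over the summary log lists whatever needs re-running.</p>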
<p>Once all of these individual scripts had been written, tested, debugged, tested a few more times, and generally fretted over until our nails were chewed to the quick, it was time to assemble the master script.  I&#8217;m not about to run a dozen scripts to migrate a site when one will suffice.  This was particularly important in Github&#8217;s case because to minimise downtime we wanted to run several things in parallel, then wait until they&#8217;d all finished, then run the syncs that depended on the data we&#8217;d synced in the last lot, and so on.  Our scripts looked a lot like this:</p>
<pre>
task1 &gt;logs/task1.log 2&gt;&amp;1 &amp;
task1_pid=${!}

task2 &gt;logs/task2.log 2&gt;&amp;1 &amp;
task2_pid=${!}

wait $task1_pid
wait $task2_pid

task3 &gt;logs/task3.log 2&gt;&amp;1 &amp;
task3_pid=${!}

task4 &gt;logs/task4.log 2&gt;&amp;1 &amp;
task4_pid=${!}

task5 &gt;logs/task5.log 2&gt;&amp;1 &amp;
task5_pid=${!}

wait $task3_pid
wait $task4_pid
wait $task5_pid
</pre>
<p>There was also a pile of &#8220;doing this, now doing this, now doing this&#8221; logging (with timestamps) that helped us to get a feel for how long the different parts would take, and where everything was up to.</p>
<p>When we actually performed the cutover, the &#8220;main&#8221; sync script was running for a total of 27 minutes. Given that we&#8217;d given ourselves an hour to get everything across, we were all quite pleased with this outcome.</p>
<h4>Putting on the brakes</h4>
<p>Whilst all these scripts ran really well, and the background processes made everything run really fast, I must say it was a right pain in the butt to stop things mid-flight when it was necessary.  Hitting Ctrl-C only stopped the foreground (controller) script, and all of the children that had been started in the background kept flying along.</p>
<p>Doing this again, I&#8217;d make sure all my scripts had traps on SIGINT that killed off all the child processes that they had spawned.  In retrospect, this is just a variant of &#8220;one script to start everything&#8221; &#8212; you should only need to do one thing (Ctrl-C) to stop it all, as well.</p>
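<p>A minimal sketch of that kind of trap is below; the function and variable names are invented for the example. One detail worth knowing: when job control is off (as it is in a script), commands started with <tt>&amp;</tt> ignore SIGINT, which is why Ctrl-C alone never reached the children. The sketch therefore sends the default SIGTERM instead.</p>

```shell
# Sketch: a controller script that takes its background children down with it.
# CHILD_PIDS, start_task and stop_children are illustrative names.
CHILD_PIDS=""

start_task() {
    # Run the given command in the background and remember its PID.
    "$@" &
    CHILD_PIDS="$CHILD_PIDS $!"
}

stop_children() {
    # Ask every recorded child to terminate (SIGTERM), then reap them.
    for pid in $CHILD_PIDS; do
        kill "$pid" 2>/dev/null || true
    done
    for pid in $CHILD_PIDS; do
        wait "$pid" 2>/dev/null || true
    done
    CHILD_PIDS=""
}

# On Ctrl-C (or a plain kill), stop the children before exiting.
trap 'stop_children; exit 130' INT TERM
```

<p>This only signals the direct children. If those children start background processes of their own, they would need the same sort of trap (on SIGTERM) to pass the message down the tree.</p>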
<p>Also, the timestamp files weren&#8217;t handled real well.  If you did kill things off mid-run (or, heaven forbid, a script crashed out) then the timestamp files would be wrong, because we just did a straight touch at the beginning of the script.  What would have been better would be something like this:</p>
<pre>
touch stamp.new
do_all_the_work
mv stamp stamp.prev
mv stamp.new stamp
</pre>
<p>This would make sure that premature death would leave the stamp as-is, while still capturing the true start time of the job (which a simple touch at the end would fail to do).</p>
<h3>When Databases Attack</h3>
<p>Testing the new site before we let users at it, we found that creating gists wasn&#8217;t working right.  It turned out that the database dumping script didn&#8217;t have the right set of options, so the schemas of the tables weren&#8217;t quite right (no autoincrements), and that was giving gist creation conniptions.  Thankfully, the bug in the script was quickly spotted and the database dump was re-run.  We even managed to get the second dump and load completed before our scheduled maintenance window was finished.  If our scripts hadn&#8217;t been broken down by data set, this resyncing process would have been made a whole lot harder because we wouldn&#8217;t have been able to easily run <em>just</em> the parts that needed to be redone.</p>
<p>Once we opened the floodgates of the new site, everything ran happily for a minute or two, and then ground to a halt.  The whaaaaa?  Poke, prod&#8230; hmm, the database is running a bit hotter than I&#8217;d expect&#8230; whoa!  1500 queries active, all against the events table, with the disks working so hard the heads nearly came out the sides of the cases.  What&#8217;s going on here?</p>
<p>As it turns out, schema insanity had struck again &#8212; this time, <em>some</em> of the indexes on the events table had failed to come across.  While we know what happened with the main database dump, this one is still a mystery.  How did <em>some</em> of the indexes fail to materialise?  We&#8217;ve gone over the dumps and can&#8217;t find how they got lost.  We&#8217;re putting it down to yet another case of MySQL doing dumb things without telling anyone.</p>
<h3>Limiting the impact</h3>
<p>As a final small improvement to the migration process, the site was able to go into a &#8220;read only&#8221; mode, so that users could still browse code and pull from repositories while we were migrating.  This made the migration a lot less intrusive for users, because a lot of site functions still worked, especially those used by casual users (who would be less likely to know all about the time of the migration).</p>
<h3>Lessons Learnt</h3>
<p>Here are a few things I&#8217;ll definitely do differently next time:</p>
<ul>
<li>Anywhere you&#8217;re depending on a third party to execute part of your migration, have a backup plan in case they can&#8217;t deliver &#8212; and know when you&#8217;ll have to execute your backup plan.  In our case, knowing exactly how long it would have taken to copy all the data over the Internet and then calculating back, we would have known to start copying over the network a few days earlier than we did. </li>
<li>Make sure that synchronisation scripts are as easy to stop as they are to start.</li>
<li>Verify the database schemas completely on the destination DB server by manual inspection, as well as dumping them and comparing to what&#8217;s on the source DB server.</li>
</ul>
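<p>For the schema check in the last point, something like the following sketch could do the dump-and-compare half of the job. It assumes <tt>mysqldump</tt> can reach both servers; the function names are invented here. The table-level <tt>AUTO_INCREMENT=N</tt> counter is stripped before comparing, since it differs whenever the row counts differ, while a missing <tt>AUTO_INCREMENT</tt> column attribute or a missing index still shows up in the diff.</p>

```shell
# dump_schema HOST DB OUTFILE: table definitions only, no row data.
dump_schema() {
    mysqldump --no-data --skip-comments --host="$1" "$2" > "$3"
}

# schemas_match SOURCE_DUMP DEST_DUMP: succeeds only if the two schema dumps
# are the same once the per-table AUTO_INCREMENT counters are ignored.
schemas_match() {
    src_norm=$(mktemp)
    dst_norm=$(mktemp)
    # Drop "AUTO_INCREMENT=123" table options; column attributes are kept.
    sed 's/ AUTO_INCREMENT=[0-9]*//' "$1" > "$src_norm"
    sed 's/ AUTO_INCREMENT=[0-9]*//' "$2" > "$dst_norm"
    # Capture diff's exit status without tripping "set -e" in a caller.
    status=0
    diff -u "$src_norm" "$dst_norm" || status=$?
    rm -f "$src_norm" "$dst_norm"
    return "$status"
}
```

<p>Running <tt>dump_schema</tt> against each server and then <tt>schemas_match</tt> on the two files prints a unified diff of anything that failed to come across, which makes the manual inspection a good deal shorter.</p>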
<p>I wonder when we&#8217;ll get our next Github-scale migration&#8230;</p>
]]></content:encoded>
			<wfw:commentRss>http://www.anchor.com.au/blog/2009/11/bringing-the-mountain-to-mohamed/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Load balancing at Github: Why ldirectord?</title>
		<link>http://www.anchor.com.au/blog/2009/10/load-balancing-at-github-why-ldirectord/</link>
		<comments>http://www.anchor.com.au/blog/2009/10/load-balancing-at-github-why-ldirectord/#comments</comments>
		<pubDate>Sat, 31 Oct 2009 04:18:47 +0000</pubDate>
		<dc:creator>matt</dc:creator>
				<category><![CDATA[FTW]]></category>
		<category><![CDATA[github]]></category>
		<category><![CDATA[haproxy]]></category>
		<category><![CDATA[keepalived]]></category>
		<category><![CDATA[ldirector]]></category>
		<category><![CDATA[load balance]]></category>
		<category><![CDATA[project starbug]]></category>

		<guid isPermaLink="false">http://www.anchor.com.au/blog/?p=1349</guid>
		<description><![CDATA[Some comments on Github&#8217;s blog post &#8220;How We Made Github Fast&#8221; have been asking about why ldirectord was chosen as the load balancer for the new site. Since I made most of the architecture decisions for the Github project, it&#8217;s probably easiest if I answer that question directly here, rather than in a comment. Why [...]]]></description>
			<content:encoded><![CDATA[<p>Some comments on Github&#8217;s blog post &#8220;<a href="http://github.com/blog/530-how-we-made-github-fast">How We Made Github Fast</a>&#8221; have been asking about why ldirectord was chosen as the load balancer for the new site.  Since I made most of the architecture decisions for the Github project, it&#8217;s probably easiest if I answer that question directly here, rather than in a comment.</p>
<h3>Why ldirectord rocks</h3>
<p>The reasons for Github using ldirectord are fairly straightforward:</p>
<ul>
<li><b>I have a lot of experience with ldirectord</b>.  Never underestimate the value of knowing where the bodies are buried.  In ldirectord&#8217;s case, there aren&#8217;t many skeletons, but &#8220;better the devil you know&#8221; is a valid argument.  If you&#8217;ve got strong experience in making something work (and you&#8217;ve managed to make it work), and you don&#8217;t have a lot of time for science experiments, then there&#8217;s a lot to be said for going with what you know.
<p>This goes beyond simply knowing what to do when things go wrong, of course.  You&#8217;ll also know how to install and configure it already, how to monitor it, and so on.</p>
<p>What&#8217;s more, in ldirectord&#8217;s case I had already proven that it worked in an architecture almost identical to Github&#8217;s, and with a similar load profile.  At a previous job, I had ldirectord serving a sustained aggregate of 2500 TCP connections per second on a 128MB Xen VM, passing to a large set of backends in a manner almost identical to Github.
</li>
<li><b>Anchor has a lot of experience with ldirectord</b>.  Whilst my experiences are one thing, there&#8217;s a lot more to building an infrastructure than just setting it up.  I like to take holidays as much as anyone, and so there was no point in using something that nobody else in the company had any experience with, if there was something else that we did all know about.
<p>Thankfully, ldirectord lined up nicely, since it&#8217;s what we use for our other load balancing setups (not set up by me, either &#8212; these were already in place before I arrived).  This meant that there was already a pile of documentation and knowledge amongst the sysadmin team about ldirectord and its quirks.  Also, being automation junkies, we already had Puppet dialled in to install and configure ldirectord, and we knew exactly how to monitor it.
</li>
<li><b>Ldirectord will do the job</b>.  With the prior experiences of myself and the rest of the Anchor team, we were confident that ldirectord would do the job, and at the end of the day that&#8217;s what really matters.
</li>
</ul>
<h3>The Alternatives</h3>
<p>It&#8217;s all well and good to say &#8220;we know it and it works&#8221;, but I&#8217;m not really expecting anyone to just read that and say &#8220;well, OK, I guess we&#8217;ll use ldirectord&#8221;.  In fact, if you apply the above criteria to your own situation, there&#8217;s every possibility that you&#8217;ll come up with a different answer &#8212; and if you&#8217;ve never set up a load balancer at all, then you&#8217;ve got no experiences to use to guide you.</p>
<p>So, here are the other load balancing options I&#8217;ve dealt with, and what I think of them.  This might give you a bit of food for thought when choosing your load balancer.</p>
<ul>
<li><b>keepalived</b>.  This is the project closest to ldirectord in terms of functionality and operation.  It actually uses the same load balancing &#8220;core&#8221; as ldirectord, <a href="http://www.linuxvirtualserver.org/software/ipvs.html">IPVS</a>, part of the <a href="http://www.linuxvirtualserver.org/">Linux Virtual Server</a> project.  As such, it performs similarly to ldirectord when it comes to actually redirecting requests to backends, and is another excellent choice for load balancing.
<p>For Github, though, there wasn&#8217;t any benefit in using keepalived.  Whilst I used keepalived extensively at my last job, nobody else at Anchor had had much to do with it.  Also, keepalived has a built-in failover mechanism, which we didn&#8217;t need because we already use Heartbeat/Pacemaker for all our HA/failover requirements.  I also feel that keepalived is more complicated when compared directly to ldirectord, largely because of its built-in failover capabilities.  That&#8217;s not to say that combining Pacemaker and ldirectord is dirt simple, but if you&#8217;ve already got Pacemaker on hand anyway&#8230;</p>
<p>If <em>all</em> you needed was a HA load balancer, and had no experience with either ldirectord or keepalived, I&#8217;d probably recommend keepalived over ldirectord, as it&#8217;s one project and one piece of software to do everything you need.
</li>
<li><b>Load-balancing appliances</b>.  Sometimes misleadingly referred to as &#8220;hardware&#8221; load balancers (they&#8217;re still chock full of software, kids &#8212; and unlike high-end routers, I don&#8217;t know of any true L4 load balancer that has its forwarding plane entirely in hardware).
<p>I loathe these things.  They&#8217;re expensive, restrictive, slow, and generally cause you a lot more pain and suffering than they&#8217;re worth.  At my last job, one of my projects was to convert most of one of our existing clusters from a load-balancing appliance to use keepalived.  Why would we do this?  Because the $100k worth of appliance wasn&#8217;t capable of doing the job that $15k worth of commodity hardware and an installation of keepalived were handling with ease &#8212; and with capacity to spare.  That cluster was our smallest, too, with probably only 2/3 the capacity of the other clusters run by keepalived.</p>
<p>At the job where I had ldirectord handling 2500 conn/sec, we had also previously used a load-balancing appliance, which was supplied and managed by the hosting provider.  It was a management nightmare &#8212; we couldn&#8217;t get any useful statistics out of it at all, like the conn/sec coming in or going out, and we couldn&#8217;t usefully adjust the weightings of each backend (to tune how many connections were going to each different sort of machine) or manage the system in real-time.  When we switched to using ldirectord, a small shell script (involving <tt>watch</tt> and <tt>ipvsadm</tt>, mostly) was all it took for the CTO to be able to watch exactly how the cluster was performing, in real time, throughout the day.  He loved the visibility &#8212; and the fact that we were saving several hundred dollars a month didn&#8217;t hurt, either.
</li>
<li><b>haproxy</b>.  While we use haproxy extensively within Github, I don&#8217;t think haproxy is the right solution as the front-end load balancer for a high volume website.  Being a proxy, rather than a simple TCP connection redirector, it has much larger overheads in CPU and memory, and adds more latency to the connections.  All of Github&#8217;s load balancing is being done out of one small VM, and it barely raises a sweat.  The return traffic doesn&#8217;t even go back through the load balancer at Github, since we&#8217;re using a really neat mode of IPVS that allows the traffic to return to the client directly.  While you <em>can</em> throw hardware at the load balancing problem, I still prefer to be efficient where possible.
<p>Since haproxy makes a second TCP connection, rather than just redirecting an existing one, it mangles the source IP address information &#8212; and while you can work around that in HTTP with custom headers, that doesn&#8217;t work for other protocols like SSH.  I cringe at the thought of trying to defend against a DDoS attack when the most useful piece of diagnostic information (the source IP) can&#8217;t be correlated against the actions of an attacker on the site.</p>
<p>If all you know is haproxy, and you&#8217;re running a low-volume site that only has to deal with HTTP(S), then haproxy will probably do the job &#8212; it&#8217;s certainly handling more connections inside Github than most sites will ever see.  However, I&#8217;d recommend getting someone who does systems administration full-time (like us!) to install and manage a real load balancer like ldirectord rather than use haproxy, along with keeping your other basic infrastructure on track.  Wouldn&#8217;t you rather be developing new features rather than dealing with this stuff?
</li>
</ul>
<p>So, there&#8217;s one geek&#8217;s opinions on load balancing.  Questions and comments appreciated, and if you&#8217;d like to know more about any part of the Github architecture (or any other aspect of systems administration), please let us know in the comments and I&#8217;ll whip up some more blog posts.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.anchor.com.au/blog/2009/10/load-balancing-at-github-why-ldirectord/feed/</wfw:commentRss>
		<slash:comments>12</slash:comments>
		</item>
		<item>
		<title>Virtualisation: It&#8217;s a Technology, not a Religion</title>
		<link>http://www.anchor.com.au/blog/2009/09/virtualisation-its-a-technology-not-a-religion/</link>
		<comments>http://www.anchor.com.au/blog/2009/09/virtualisation-its-a-technology-not-a-religion/#comments</comments>
		<pubDate>Tue, 29 Sep 2009 23:00:51 +0000</pubDate>
		<dc:creator>matt</dc:creator>
				<category><![CDATA[FTW]]></category>
		<category><![CDATA[github]]></category>
		<category><![CDATA[project starbug]]></category>
		<category><![CDATA[religion]]></category>
		<category><![CDATA[virtualisation]]></category>

		<guid isPermaLink="false">http://www.anchor.com.au/blog/?p=1201</guid>
		<description><![CDATA[It&#8217;s been interesting to look at the press coverage, blog posts, and tweets surrounding the move of Github to an Anchor-managed infrastructure &#8212; I&#8217;ve never worked on something so public before. I think the article about &#8220;Vampire Programmers&#8221; has been my favourite so far. The ZDnet article on the Github move gave me a wry [...]]]></description>
			<content:encoded><![CDATA[<p>It&#8217;s been interesting to look at the press coverage, blog posts, and tweets surrounding the move of Github to an Anchor-managed infrastructure &#8212; I&#8217;ve never worked on something so <em>public</em> before.  I think <a href="http://anthillonline.com/forget-india-silicon-valley-tech-firm-outsources-to-australia-thanks-to-vampire-programmers/">the article about &#8220;Vampire Programmers&#8221;</a> has been my favourite so far.</p>
<p>The <a href="http://www.zdnet.com.au/news/hardware/soa/GitHub-picks-Sydney-sysadmins/0,130061702,339298783,00.htm">ZDnet article on the Github move</a> gave me a wry chuckle, though.  It made it sound like the move signified some sort of rejection of the Church of the Hypervisor &#8212; that virtualisation had been tested and found wanting.  In actual fact, there&#8217;s <em>more</em> virtual machines running in the Github infrastructure now than there were previously, providing a lot of very essential services.</p>
<p>I really don&#8217;t think of myself as a virtualisation nay-sayer.  I started using virtual machines with User-Mode Linux, back before anyone outside of Cambridge had ever heard of Xen, and I got on board with Xen back in the 2.0 days.  I&#8217;ve introduced widespread virtualisation at two previous jobs, I was a big supporter of the use of virtualisation at my last job, and I&#8217;ve been working on Anchor&#8217;s High-Availability VM product recently.  Virtualisation hater I ain&#8217;t.</p>
<p>Conversely, though, I don&#8217;t think VMs are the answer to all the world&#8217;s problems.  They&#8217;re a fantastic opportunity for a lot of sites: everyone can be running on high-quality, server-grade hardware (redundant power, hardware RAID, fast busses, etc) without the need to either purchase or maintain that hardware.  Furthermore, each VM, by virtue of its isolation, is more easily managed and scaled independently of the other VMs.  Need more memory?  Allocate it.  This box is getting a little overloaded?  No problem, just move a VM to another piece of hardware.</p>
<p>The simple fact is that very, very few sites <em>need</em> a whole <a href="http://www.anchor.com.au/dedicated-hosting/dedicated-servers.py">dedicated server</a> &#8212; even an entry-level server is massive overkill for most sites.  In this situation, you can either:</p>
<ul>
<li>Spend the extra money, assuming that you&#8217;ll grow and recoup those costs;</li>
<li>Buy a cheaper machine, either a basic desktop machine or second-hand server, and take the hit in reliability;</li>
<li>Use <a href="http://www.anchor.com.au/web-hosting/website-hosting.py">shared hosting</a>, where everyone&#8217;s on the same OS installation (which has tradeoffs in control and isolation); or</li>
<li>Use a virtual machine.</li>
</ul>
<p>Unsurprisingly, I like the last option.  It saves you money, and avoids both the reliability headaches of cheaper hardware and the management headaches of shared hosting.</p>
<p>Management is <em>the</em> big on-going cost of most sites.  Virtualisation simplifies that by isolating different sites and services from each other, so that when it comes time to scale them, it&#8217;s not a big job.  Most people who&#8217;ve been working as a developer or sysadmin will be able to recall the unpleasant feeling when that big-ball-of-wax that everyone calls &#8220;the server&#8221; starts to run out of huff, and there&#8217;s no better hardware to put it on, and no more software optimisation to be done.  The call goes out, &#8220;move some services to another server&#8221;.  Damn.</p>
<p>See, when everything&#8217;s on the one machine, they intertwine and become hard to separate.  That little hack that Roger The Talented Intern put in to make mail processing run faster?  That involved digging into the SMTP server queue and pulling out messages directly; if you separate the web server and the mail server, that&#8217;ll break &#8212; but I bet you don&#8217;t find that out until you move.</p>
<p>I hate doing archaeology on these sorts of machines, because it&#8217;s guaranteed that things will break, tempers will run hot, and sadness will result.  The cost of doing the move (in IT staff time, downtime, customer and staff dissatisfaction, and so on) can easily equal or exceed that of the hardware itself &#8212; and yes, I&#8217;m still talking about good-quality, server-grade hardware here.  People are expensive, and good people even more so.</p>
<p>Instead, if you run logically separate services in separate VMs, when the time comes to scale something, it really is a piece of cake to migrate a VM &#8212; shutdown, copy the disk image, boot it back up.  Piece of cake.  Sure, there&#8217;s some overhead in running those separate services in VMs, and yes, you&#8217;ll be looking to buy a second machine sooner than you would otherwise, but again, the savings made by <em>not</em> having to gently tease apart a dozen root-bound systems on a single machine will probably pay for that second machine.  Let&#8217;s not even consider the costs of another separation in two years&#8217; time when the services you put onto that other machine need to be separated again&#8230;</p>
<p>This use of virtualisation is all well and dandy if you&#8217;re one of the vast majority of sites that don&#8217;t need to service 125,000 users and 2.5TB of filesystem data.  Github, though &#8212; they&#8217;re one of the (un)lucky few.  When you&#8217;re using a machine&#8217;s worth (or more) of processing power on a single service, there&#8217;s no benefit to virtualising that.  In Github&#8217;s case, there are four physical machines running just the frontend services &#8212; each of which has the same specs as the machines that are running the VMs for the site.  Sticking the frontend services into VMs in that case would have been a fruitless move.  Similarly for the backend file storage, and the database.  They&#8217;re all single services consuming a machine&#8217;s worth (or more) of resources, so we give them physical machines.</p>
<p>Down the track, as Github grows and individual VMs work harder and need more resources, we&#8217;ll first increase the size of those VMs, before making the decision to move a power-hungry VM off onto its own physical hardware.  That&#8217;s an easy move &#8212; between the natural isolation provided by virtual machines and the strong configuration management policy we&#8217;ve adopted, transitioning from a VM to a physical machine will be painless &#8212; and painless systems management is, after all, the aim of the game.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.anchor.com.au/blog/2009/09/virtualisation-its-a-technology-not-a-religion/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>GitHub: Speed matters</title>
		<link>http://www.anchor.com.au/blog/2009/09/github-speed-matters/</link>
		<comments>http://www.anchor.com.au/blog/2009/09/github-speed-matters/#comments</comments>
		<pubDate>Tue, 29 Sep 2009 06:39:27 +0000</pubDate>
		<dc:creator>bsmith</dc:creator>
				<category><![CDATA[FTW]]></category>
		<category><![CDATA[drbd]]></category>
		<category><![CDATA[github]]></category>
		<category><![CDATA[moving]]></category>
		<category><![CDATA[project starbug]]></category>
		<category><![CDATA[site migration]]></category>
		<category><![CDATA[speed]]></category>
		<category><![CDATA[starbug]]></category>

		<guid isPermaLink="false">http://www.anchor.com.au/blog/?p=1161</guid>
		<description><![CDATA[Impressions from the first article (in its first day) and the first 24 hours of the GitHub migration, have caused us at Anchor to believe that; GitHub is just as popular as we thought, The migration was worth it, as things are running much faster (just check your twitter feeds, or better yet, check your [...]]]></description>
			<content:encoded><![CDATA[<p><em>Impressions from the first article (in its first day) and the first 24 hours of the GitHub migration, have caused us at Anchor to believe that; </em></p>
<ol>
<li><em>GitHub is just as popular as we thought, </em></li>
<li><em>The migration was worth it, as things are running much faster (just check your twitter feeds, or better yet, check your GitHub source tree for no reason <img src='http://www.anchor.com.au/blog/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' />  ); and,<strong> </strong></em></li>
<li><em>People are interested in what has gone under the hood of the new GitHub (insert your favorite fast car here; otherwise let&#8217;s say a roadster). </em></li>
</ol>
<p><em>Taking these three things into account, this installment will discuss why things are so much faster post migration compared to prior.</em></p>
<p>I said &#8216;faster&#8217; and not &#8216;fast&#8217;, because GitHub is now as fast as any website should be. So in comparison, yes, GitHub is fast now, however it is akin to riding your bicycle with half-inflated tires: when fully inflated, suddenly your old bike is blazing fast. Now this is not to be critical of the former architecture, which held its merits when GitHub was founded. GitHub had simply moved to a stage where an infrastructure architecture refresh was logical.</p>
<p>The main thing, in the large, that made this new architecture fast was that we were given a blank slate and large amounts of freedom to make an architecture that would do the job well.  This is an incredibly rare thing, and it no doubt took a lot of courage on Github&#8217;s part.  For that, we have to say &#8220;thankyou&#8221; to the Github team for letting us have that freedom.  I like to think that we&#8217;ve repaid that trust with a pretty awesome architecture that will serve them well for some time to come.</p>
<p><strong>SCALE: </strong>When looking at the new architecture as a whole, the increased scale is immediately evident. GitHub now consumes far more hardware than ever before:</p>
<p><em>Old Infrastructure:</em></p>
<ul style="margin-top: 0px;margin-right: 0px;margin-bottom: 0px;margin-left: 1.25em;line-height: 1.4em;padding: 0px">
<li>10 VMs</li>
<li>39 VCPUs</li>
<li>54GB <span style="line-height: 1.4em;padding: 0px;margin: 0px">RAM</span></li>
</ul>
<p style="margin-top: 1em;margin-right: 0px;margin-bottom: 1em;margin-left: 0px;line-height: 1.4em;padding: 0px"><em>New Infrastructure:</em></p>
<ul style="margin-top: 0px;margin-right: 0px;margin-bottom: 0px;margin-left: 1.25em;line-height: 1.4em;padding: 0px">
<li>16 physical machines</li>
<li>128 physical cores</li>
<li>288GB <span style="line-height: 1.4em;padding: 0px;margin: 0px">RAM</span></li>
</ul>
<p>Or for those who enjoy visual cues:</p>
<p><img class="aligncenter size-full wp-image-1179" src="http://www.anchor.com.au/blog/wp-content/uploads/2009/09/Memory_Compare1.png" alt="Resource comparison old to new infrastructure" width="375" height="436" /></p>
<p>It is a credit to the old infrastructure and GitHub&#8217;s code that it ran so well on so little (in comparison). The first credit for increased performance is <strong>increased scale</strong>.</p>
<p>An important note regarding the hardware is that there is nothing special (or industry secretive) regarding it. The solution in its entirety is run from commodity hardware. No special black boxes doing scary things with packets and routes. No appliance servers. The solution architecture developed by Anchor can be used with any hardware vendor (insert: Dell, HP, IBM, SuperMicro, etc). Vendor neutrality provides GitHub with no encumbrance with either scaling up or out, a key issue when considering growth and future flexibility.</p>
<p><em>Note: The architecture&#8217;s flexibility allows for the user repository storage to be expanded with a mix of vendor hardware (should GitHub ever change hardware vendor). Furthermore, any component can be exchanged for another vendor&#8217;s hardware with no change to GitHub&#8217;s architecture or software.</em></p>
<p>In a nutshell, the increased scale provides:</p>
<ul>
<li>More GitHub front-end servers to service your requests;</li>
<li>More storage; and</li>
<li>More I/O bandwidth when working with your repository data</li>
</ul>
<p><strong>HARDWARE PERFORMANCE:</strong> The speed specifications of the underlying components are important, as is how that hardware is utilised.</p>
<p><em>Storage I/O: </em>A common factor in poor performance with any solution is an <a href="http://www.anchor.com.au/hosting/development/HuntingThePerformanceWumpus#head-8f4521847d24e2119a421aa8d89a89d7e8372fdc">I/O bottleneck at the storage level</a>.  This pain was GitHub&#8217;s. To alleviate this, not only is the storage now distributed across several servers (distributing the I/O), but it is now running on direct-attached 15,000 RPM SAS disks on battery-backed hardware RAID. Therefore, the second credit for increased performance is <strong>faster storage</strong>.</p>
<p><em>Direct access to hardware: </em>Virtualisation is great. What isn&#8217;t great is when virtualisation is used as a universal solution. At Anchor we believe there is a place for virtualisation, and systems with massive I/O or CPU requirements are not that place. By moving resource-heavy systems onto dedicated hardware, any contention for resources between individual VMs is removed. The third credit goes to <strong>less overhead</strong>.</p>
<p><strong>ARCHITECTURE:</strong> Throwing hardware at a scaling problem is an easy solution, but without the right division of resources and the right software to properly use it, it&#8217;s not going to run real fast.</p>
<p>For GitHub, this was their innovative Git command proxying systems, which do an excellent job of taking requests from the frontends (where users connect with their web browser, git client, or SSH client) and shipping them to the fileservers.  The database structure, filesystem layout, and code efficiency also contribute to this.</p>
<p>Given that the software isn&#8217;t our speciality, there&#8217;s not a lot for us to say about this, but Github are planning a series of posts on <a href="http://github.com/blog">their blog</a>, and I&#8217;m quite sure it&#8217;ll be enlightening.</p>
<p><strong>TO REVIEW</strong>: The factors involved in GitHub&#8217;s faster response on the new infrastructure include (but are not limited to):</p>
<ul>
<li>Increased Infrastructure (Scale)</li>
<li>Faster Hardware (Storage)</li>
<li>No resource contention (More resources per server)</li>
<li>Solid, scalable architecture (Awesomeness)</li>
</ul>
<p><em>Keep an eye on this space, as we delve into technology-specific posts regarding what kinds of 11 herbs and spices Anchor used to realise the new GitHub architecture.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://www.anchor.com.au/blog/2009/09/github-speed-matters/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>GitHub: Designing Success</title>
		<link>http://www.anchor.com.au/blog/2009/09/github-designing-success/</link>
		<comments>http://www.anchor.com.au/blog/2009/09/github-designing-success/#comments</comments>
		<pubDate>Mon, 28 Sep 2009 02:43:55 +0000</pubDate>
		<dc:creator>Davy Jones</dc:creator>
				<category><![CDATA[FTW]]></category>
		<category><![CDATA[github]]></category>
		<category><![CDATA[moving]]></category>
		<category><![CDATA[project starbug]]></category>
		<category><![CDATA[site migration]]></category>

		<guid isPermaLink="false">http://www.anchor.com.au/blog/?p=1125</guid>
		<description><![CDATA[At Anchor we do not believe in black box solutions.  Sharing is caring and we like to share. In this post we specifically want to share our triumph with Project StarBug, better known to the wider world as GitHub. For the uninitiated, GitHub is ‘Social Networking meets Source Code management’, or in GitHubs own words [...]]]></description>
			<content:encoded><![CDATA[<p>At Anchor we do not believe in black box solutions.  Sharing is caring and we like to share. In this post we specifically want to share our triumph with Project StarBug, better known to the wider world as GitHub. For the uninitiated, GitHub is ‘Social Networking meets Source Code management’, or in GitHub’s own words ‘<em>Git is a fast, efficient, distributed version control system ideal for the collaborative development of software. GitHub is the easiest (and prettiest) way to participate in that collaboration: fork projects, send pull requests, monitor development, all with ease.</em>’.</p>
<p>Some readers may protest this point, stating that GitHub is hosted in the USA while Anchor is located in Australia. How, then, has Anchor architected and implemented GitHub’s infrastructure, and how will it manage that infrastructure going forward, with such a geographical encumbrance?</p>
<p>All will be revealed in a blog entry <span style="text-decoration: line-through;">in</span> <span style="text-decoration: line-through;">three</span> of many parts.</p>
<p><strong>Part 1: (This Post)</strong> Designing for success (Otherwise known as: Making GitHub&#8217;s dream a reality and nightmares a thing of the past)</p>
<p><strong>Part 2: </strong>Speed matters</p>
<p><strong>Part N:</strong> (To be announced)</p>
<p>For obvious reasons, we cannot expose GitHub&#8217;s architecture in full, however we are sharing some of the more interesting technologies/architecture we have implemented, and the rationale for doing so. Essentially what we have done to make GitHub&#8217;s dreams a reality.</p>
<p><strong>Geographical encumbrance</strong></p>
<p>It is a credit to GitHub’s management that they were willing to look the world over for the right team to support them. While they do not want to be harried by anything outside the GitHub application (i.e. Hardware, O/S, Management, etc), they still needed to ensure that the right company was employed to look after these components.</p>
<p><em>Why Anchor?</em> Anchor’s flexibility to manage a solution on third-party hosted hardware (anywhere in the world) and versatility in developing an architecture to suit this scenario were part of the rationale. Anchor’s reputation for needing to know how technology works (again, no black boxes) and then working out how to improve it was a major contribution.</p>
<p>Enough fluff, now to the meat;</p>
<p>One can imagine that the architecture required to support GitHub is a complex mix. We won’t lie; there are many moving parts. Some of the key criteria for designing the solution included:</p>
<p><strong>Scalability</strong></p>
<p>GitHub states its growth as “<em>400 new users and 1000 new repositories every day</em>”. Post migration GitHub will be running on infrastructure spread across 15+ physical hosts/servers. It is essential that the infrastructure can grow with the user base, from tens to hundreds of servers, without the need to re-architect everything. Without a doubt, growing without the associated pain is a major objective for GitHub as it moves forward.</p>
<p><strong><em>Interesting Note: </em></strong><em>GitHub&#8217;s new physical infrastructure (at migration) consists of:</em></p>
<ul>
<li><em>15+ physical servers</em></li>
<li><em>10+ virtual servers</em></li>
<li><em>128 physical processor cores</em></li>
<li><em>Over 288GB of RAM</em></li>
<li><em>1TB+ of storage</em></li>
</ul>
<p>GitHub&#8217;s software architecture is modular by nature and scalability friendly. Components outside the core software, however, were not as readily scalable. Scalability for those components has been achieved with the following improvements:</p>
<ul>
<li><em>Distributed Storage Architecture (with real-time slaves).</em> Distribution of GitHub’s source code repos across multiple partitions and multiple nodes (including redundant slaves) provided improvements in performance, scalability and reliability. By removing the limitation of using a single filesystem volume for storage, the issue of dealing with large scale storage has been avoided. New partitions can be rapidly added on demand with little to no fuss.</li>
</ul>
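<p>To make the partition idea a little more concrete, here is a minimal sketch in Python. It is not GitHub&#8217;s actual routing code; it simply assumes that each repository is assigned to a named partition, and that each partition has one active node and one slave. The names <tt>Partition</tt>, <tt>StorageRouter</tt>, <tt>fs1a</tt> and friends are all made up for illustration.</p>

```python
from dataclasses import dataclass


@dataclass
class Partition:
    """One storage partition: an active node plus its real-time slave (hypothetical model)."""
    name: str
    active_node: str
    slave_node: str


class StorageRouter:
    """Maps repositories to partitions so that no single volume has to hold everything."""

    def __init__(self):
        self.partitions = {}         # partition name -> Partition
        self.repo_to_partition = {}  # repository name -> partition name

    def add_partition(self, partition):
        # Adding a partition never moves existing repositories.
        self.partitions[partition.name] = partition

    def assign_repo(self, repo, partition_name):
        if partition_name not in self.partitions:
            raise KeyError(f"unknown partition: {partition_name}")
        self.repo_to_partition[repo] = partition_name

    def locate(self, repo):
        """Return (partition name, active node) for a repository."""
        partition_name = self.repo_to_partition[repo]
        partition = self.partitions[partition_name]
        return partition.name, partition.active_node


router = StorageRouter()
router.add_partition(Partition("p1", "fs1a", "fs1b"))
router.assign_repo("alice/project", "p1")

# Growing the storage pool is just one more partition; p1 is untouched.
router.add_partition(Partition("p2", "fs2a", "fs2b"))
router.assign_repo("bob/tool", "p2")

print(router.locate("alice/project"))
print(router.locate("bob/tool"))
```

<p>The point of the explicit lookup table in this sketch is that adding capacity does not reshuffle existing repositories, which is one plausible way to get the &#8220;little to no fuss&#8221; property described above.</p>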
<p>The graphic below illustrates a simplified request to the distributed file storage repo:</p>
<div id="attachment_1142" class="wp-caption aligncenter" style="width: 560px"><a href="http://www.anchor.com.au/blog/wp-content/uploads/2009/09/GitHubStorageDist_Small.png"><img class="size-full wp-image-1142" title="GitHub Storage Distribution (Small)" src="http://www.anchor.com.au/blog/wp-content/uploads/2009/09/GitHubStorageDist_Small.png" alt="GitHub Repo Storage Distribution Illustration" width="550" height="446" /></a><p class="wp-caption-text">GitHub Distributed Repo Storage</p></div>
<ul>
<li><em> (Sensible) Virtualisation</em>. Previously, GitHub&#8217;s infrastructure was entirely virtualised. While virtualisation has its merits, there are reasons to avoid it. Services that aren&#8217;t I/O-heavy can be virtualised, while components with high I/O requirements are run on dedicated (“bare metal”) servers. For GitHub, this means file storage and databases are <strong>not</strong> virtualised. Otherwise, virtualisation is used to provide a mix of server consolidation, rapid deployment and service redundancy/HA.</li>
<li><em>Horizontal scalability (on-demand, via automated build infrastructure</em>). The ability to add additional components to the infrastructure in an automated fashion reduces scale-out time and removes user error from builds/configuration. In addition, this also turns the server build/deployment procedure into a measurable deliverable. Over time this can be reviewed and improved (Thank you <a style="text-decoration: none; color: #002bb8; background-image: none; background-repeat: initial; background-attachment: initial; -webkit-background-clip: initial; -webkit-background-origin: initial; background-color: initial; background-position: initial initial;" title="W. Edwards Deming" href="http://en.wikipedia.org/wiki/W._Edwards_Deming">W. Edwards Deming</a>).</li>
</ul>
<p><strong> </strong></p>
<p><strong>Reliability</strong></p>
<p>As with most businesses, High Availability (or <em>business continuance</em>) is essential to success. To achieve this, a combination of DRBD, virtualisation, heartbeat and load balancing has been employed.</p>
<ul>
<li><em>Mirroring Data; DRBD is utilised for several purposes. </em></li>
</ul>
<ol>
<li>It is used to ensure the redundant (read: slave) storage partitions and nodes are in sync with the active counterparts.</li>
<li>DRBD is also key in providing HA functionality across the virtualised environment.</li>
</ol>
<p>Several Xen hosts are deployed with the following scenario: Server 1 runs VM A (active), VM B (active), VM C (offline, DRBD mirrored) and VM D (offline, DRBD mirrored), while Server 2 runs VM A (offline, DRBD mirrored), VM B (offline, DRBD mirrored), VM C (active) and VM D (active). This provides active failover if either of the virtualisation hosts fails.</p>
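<p>As a rough illustration of that pairing, the following Python sketch models two hosts that each run some VMs actively and hold an offline, mirrored copy of the other host&#8217;s VMs. It is a toy model only: the real work is done by DRBD and the cluster software, and the host names, VM names and the symmetrical four-VM layout are assumptions made for the example.</p>

```python
def failover(placement, failed_host):
    """Return the placement after failed_host dies.

    placement maps host -> {vm: "active" or "standby"}, where "standby"
    means an offline copy whose disk is kept in sync by mirroring.
    The input is left unmodified.
    """
    survivors = {
        host: dict(vms) for host, vms in placement.items() if host != failed_host
    }
    for vm, state in placement[failed_host].items():
        if state != "active":
            continue  # a lost standby copy needs no action right now
        # Promote the mirrored copy on whichever surviving host holds one.
        for vms in survivors.values():
            if vms.get(vm) == "standby":
                vms[vm] = "active"
                break
        else:
            raise RuntimeError(f"no standby copy of VM {vm} survives")
    return survivors


# Hypothetical symmetrical layout: each host backs up the other.
placement = {
    "server1": {"A": "active", "B": "active", "C": "standby", "D": "standby"},
    "server2": {"A": "standby", "B": "standby", "C": "active", "D": "active"},
}

after = failover(placement, "server1")
print(after)
```

<p>One consequence worth noting from the model: after a failure the surviving host runs every VM, so each host needs enough headroom to carry its partner&#8217;s load.</p>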
<p>The graphic below illustrates the replicated, highly-available storage architecture:</p>
<div id="attachment_1132" class="wp-caption aligncenter" style="width: 560px"><a href="http://www.anchor.com.au/blog/wp-content/uploads/2009/09/GitHubStorage_Small.png"><img class="size-full wp-image-1132" title="GitHub Storage Simplified Example (Small)" src="http://www.anchor.com.au/blog/wp-content/uploads/2009/09/GitHubStorage_Small.png" alt="GitHub Storage HA/Replication" width="550" height="446" /></a><p class="wp-caption-text">GitHub Storage HA/Replication</p></div>
<ul>
<li><em>Consistency;</em><strong> </strong>via automated builds and configuration management. With any horizontally-scaled solution, consistency amongst similar components is essential. One of the most notable achievements across the entire architecture is the complete integration of automated build infrastructure. A new/additional component of the solution can be rapidly built and added to the overall system regardless of the architecture (physical or virtual).</li>
<li><em>Redundancy; </em>A simple way to ensure greater uptime and lower the risk of service interruption is to introduce as much redundancy as possible. GitHub is a great example of this practice. Data links, Ethernet/switching, server and components all have a redundant twin ready to swing into action should the primary fail.</li>
</ul>
<p><strong>Conclusions</strong></p>
<p>The implementation of any new architecture for an already mature product is never easy. Anchor engineers have been working tirelessly with GitHub staff to ensure that any growing pains are transparent to the users. In the next entry, we will be sharing some of our insights in regard to migrating GitHub from their existing host and infrastructure to the new Anchor-developed model. Until then, we hope you enjoy the new faster GitHub, more of the time (well, all/any of the time) than ever before.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.anchor.com.au/blog/2009/09/github-designing-success/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Just because you CAN, Doesn&#8217;t mean you SHOULD</title>
		<link>http://www.anchor.com.au/blog/2009/09/just-because-you-can-doesnt-mean-you-should/</link>
		<comments>http://www.anchor.com.au/blog/2009/09/just-because-you-can-doesnt-mean-you-should/#comments</comments>
		<pubDate>Thu, 24 Sep 2009 22:49:19 +0000</pubDate>
		<dc:creator>matt</dc:creator>
				<category><![CDATA[FTW]]></category>
		<category><![CDATA[hammer and nail syndrome]]></category>
		<category><![CDATA[heartbeat]]></category>
		<category><![CDATA[high availability]]></category>
		<category><![CDATA[pacemaker]]></category>
		<category><![CDATA[project starbug]]></category>

		<guid isPermaLink="false">http://www.anchor.com.au/blog/?p=1119</guid>
		<description><![CDATA[(Yeah, I&#8217;ve been really slack with the blog posts about Project Starbug, but unfortunately when the choice is between doing the cool stuff, and blogging about it, the blogging tends to lose. I am still planning on writing all about things when things die down. In the meantime&#8230;) Remember when you were a kid, and [...]]]></description>
			<content:encoded><![CDATA[<p><em>(Yeah, I&#8217;ve been really slack with the blog posts about Project Starbug, but unfortunately when the choice is between doing the cool stuff, and blogging about it, the blogging tends to lose.  I am still planning on writing all about things when things die down.  In the meantime&#8230;)</em></p>
<p>Remember when you were a kid, and every time you got a new toy you&#8217;d just have to play with it all the time?  That mentality doesn&#8217;t go away as you grow up, it just gets a little more sophisticated.  With new technologies, I&#8217;m still very much this way.  I remember when I first learnt about flex and bison &#8212; for the next six months or so, every programming problem I encountered just <em>had</em> to be solved with a minilanguage implemented in flex/bison.  I shudder to think that any of that code might still be out there&#8230;</p>
<p>Anyway, this week&#8217;s shiny new toy has been Heartbeat / Pacemaker.  I&#8217;ve played with it a fair bit in the past, but just in two-node (Heartbeat v1) clusters.  For Project Starbug, though, I&#8217;ve been taking it to new heights of awesome (multi-node, easily expandable HA VM clusters, for example).  So, of course, anywhere that a bit of high-availability might be good, I&#8217;ve laid it on thick.  With the Puppet manifests we&#8217;ve got for managing Pacemaker, it&#8217;s almost harder <em>not</em> to make something HA (seriously, our Pacemaker manifests are <em>awesome</em>).</p>
<p>Unfortunately, in a couple of places I kinda forgot that some services have their own ways of doing HA, and they&#8217;re generally superior to tying a service and an IP together and telling Pacemaker to go do its thing.  The two services that I&#8217;ve just converted back <em>away</em> from Heartbeat are NTP and DNS.  Yeah, that&#8217;s right &#8212; I set up Pacemaker resources for our NTP server and DNS server, because I suffer from occasional bouts of acute &#8220;shiny toy syndrome&#8221;.  I&#8217;ve now recovered, having learnt my lesson (for now).</p>
]]></content:encoded>
			<wfw:commentRss>http://www.anchor.com.au/blog/2009/09/just-because-you-can-doesnt-mean-you-should/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The Zen of Documentation Maintenance</title>
		<link>http://www.anchor.com.au/blog/2009/08/the-zen-of-documentation-maintenance/</link>
		<comments>http://www.anchor.com.au/blog/2009/08/the-zen-of-documentation-maintenance/#comments</comments>
		<pubDate>Thu, 06 Aug 2009 01:56:35 +0000</pubDate>
		<dc:creator>matt</dc:creator>
				<category><![CDATA[FTW]]></category>
		<category><![CDATA[documentation]]></category>
		<category><![CDATA[project starbug]]></category>
		<category><![CDATA[system-maintenance]]></category>
		<category><![CDATA[wiki]]></category>

		<guid isPermaLink="false">http://www.anchor.com.au/blog/?p=1058</guid>
		<description><![CDATA[Given that you&#8217;ve been suddenly and completely convinced of the need for documentation in my previous post, the question still remains: how does one make documentation appear on a consistent and ongoing basis? If you&#8217;re really, really lucky, you&#8217;ve been spared the painful experience of putting up a wiki somewhere (or, worse, forked out a [...]]]></description>
			<content:encoded><![CDATA[<p>Given that you&#8217;ve been suddenly and completely convinced of the need for documentation in <a href="/blog/?p=1052">my previous post</a>, the question still remains: how does one make documentation appear on a consistent and ongoing basis?</p>
<p>If you&#8217;re really, really lucky, you&#8217;ve been spared the painful experience of putting up a wiki somewhere (or, worse, forked out a pile of cash for a &#8220;knowledge management system&#8221;), sticking some info into it at random, and then&#8230; nothing.  You planted the seeds of a documentation tree, why isn&#8217;t it growing, and flowering, and solving all of your problems for now and forever?</p>
<p>For Project Starbug, we&#8217;re creating a whole new infrastructure, more-or-less from scratch.  This is the easiest possible environment to make work, because you&#8217;re not constrained by what is already in place (and that you can&#8217;t afford to get rid of), and the whole thing isn&#8217;t in production so there&#8217;s no need to get freaked out by the thought of taking a major site off the Internet due to making an ill-advised change &#8212; and, most relevantly to this discussion, there&#8217;s no giant mass of undocumented&#8230; <em>stuff</em> that needs to be picked apart and documented.  There&#8217;s nothing more deadly to motivation than the idea that when you&#8217;ve got <em>this</em> bit documented, there&#8217;s only 350,000 other bits to go.</p>
<p>So, if I didn&#8217;t want to end up with a shiny, new, incomprehensible and undocumented system, we needed to start focusing on documentation right off the bat and build the documentation alongside the rest of the system.  This, in turn, meant that we needed to have something easy to work with, well structured, and above all <em>ready to go</em> before anything else could really kick off.</p>
<p>What to use was a no-brainer.  Wikis are straightforward to access and edit, and there&#8217;s very little downside to them.  We use moin internally for our documentation extensively, so it wasn&#8217;t a hard sell to spin up another copy of the wiki software to contain all of the documentation for this project. Most widely-used wiki engines these days are on much the same level, though, and it&#8217;s really just a matter of preference which one to use &#8212; mostly based around the language you&#8217;re most comfortable using (Python == Moin, PHP == MediaWiki, Perl == twiki, Ruby == instiki, Java == something useless and enterprisey), because you <em>really</em> want to be able to write plugins and extensions.  One day I&#8217;d love to try ikiwiki, because that means I can edit wiki pages without even needing to open my web browser, which will be a particularly special kind of bliss.</p>
<p>Why did we use a <em>separate</em> wiki, though, and not an extension of our existing one?  We want to communicate with the customer as well as we possibly can, and the content of the wiki is like a big, persistent communications nexus, and giving the customer (especially <em>this</em> customer, who really knows their stuff) direct access to be able to read all the internal procedures and technical information relating to the management of their infrastructure is a massive boon to communication.  Who knows when they might see something we&#8217;ve written and say, &#8220;Hey, that&#8217;s not right!&#8221; and fix it?  We&#8217;re the system administration experts, not the experts in their application, so it makes perfect sense to have them as tightly integrated as possible into the management of the whole infrastructure.</p>
<p>Though we may have made it over &#8220;Documentation Hurdle #1&#8221;, the race had barely even begun.  Plenty of well-intentioned doc projects have gotten something started, and then withered on the vine.  The key is to make sure that the documentation stays maintained, and keeps up with the growth of the infrastructure and its constant changes.  The most important way to do this is to identify the reasons why people <em>don&#8217;t</em> keep a reasonably usable documentation repository maintained, and remove those reasons, leaving no possible excuse not to write docs.  It needs to be <em>easier</em> to write docs than to not write them, otherwise they&#8217;ll get forgotten in the pressure of the moment, and playing catch-up is painful and annoying.</p>
<p>In the next article, we&#8217;ll examine why people don&#8217;t write docs as often as they know they should, and how to create a &#8220;documentation culture&#8221; in your team.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.anchor.com.au/blog/2009/08/the-zen-of-documentation-maintenance/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Know Thy Enemy</title>
		<link>http://www.anchor.com.au/blog/2009/08/know-thy-enemy/</link>
		<comments>http://www.anchor.com.au/blog/2009/08/know-thy-enemy/#comments</comments>
		<pubDate>Mon, 03 Aug 2009 22:07:49 +0000</pubDate>
		<dc:creator>matt</dc:creator>
				<category><![CDATA[FTW]]></category>
		<category><![CDATA[capacity planning]]></category>
		<category><![CDATA[infrastructure development]]></category>
		<category><![CDATA[project starbug]]></category>
		<category><![CDATA[server deployment]]></category>

		<guid isPermaLink="false">http://www.anchor.com.au/blog/?p=1042</guid>
		<description><![CDATA[Long before any code gets written or any servers deployed, a quiet yet crucial job is being performed. The poor tech who is doing this work won&#8217;t get much credit, and almost certainly none of the glory, but if this job isn&#8217;t done properly, then none of what gets done later will be of much [...]]]></description>
			<content:encoded><![CDATA[<p>Long before any code gets written or any servers deployed, a quiet yet crucial job is being performed.  The poor tech who is doing this work won&#8217;t get much credit, and almost certainly none of the glory, but if this job isn&#8217;t done properly, then none of what gets done later will be of much use.</p>
<p>I am, of course, talking about&#8230; requirements gathering (bom bom bommmmmm).</p>
<p>In the case of Project Starbug, the requirements gathering work sits at the &#8220;fairly straightforward&#8221; end of the spectrum &#8212; but it&#8217;s by no means easy.  What makes the job easier than average is that the site is currently operational, and our primary job is making sure that the new server farm that we&#8217;re building will (a) match what is currently running (in terms of system setup), and (b) have enough capacity for future growth.</p>
<p>The system configuration to support the customer&#8217;s application is fairly easy to achieve for this project &#8212; the customer knows their software and what it requires really, really well, and we&#8217;ve got the existing setup to examine to try and work out how the pieces fit together if it isn&#8217;t immediately obvious.  The main requirement here is to make sure that all the requirements are documented thoroughly.  Yeah, writing docs isn&#8217;t the glamorous end of the job, but it is important, and is something that pays dividends down the line.  More on that in another article, though.</p>
<p>The capacity issue is a lot trickier.  The new architecture we&#8217;re building is completely different to the current architecture (which isn&#8217;t scaling well, hence why it&#8217;s being left behind), so it&#8217;s hard to draw any direct performance metrics by just looking at what hardware is already in use (especially since the current setup uses virtualisation a little too heavily, which makes comparisons based on hardware spec even harder).</p>
<p>Based on a cursory examination of <a href="http://www.anchor.com.au/hosting/development/HuntingThePerformanceWumpus">the bottlenecks in the existing system</a>, along with previous knowledge of the system behaviour, I decided that the primary bottleneck of the system is disk I/O.  This site isn&#8217;t your typical large-scale website; it does a lot more file management than is typical.  As a result, the key thing we need to ensure in our new hardware setup is that there is sufficient disk I/O capacity.</p>
<p>Memory constraints (large app servers, mostly) take a close second in the &#8220;what is going to kill us here&#8221; stakes, as the current infrastructure is using somewhere north of 100GB of RAM (spread across all the various servers that are being used).  We want to provision this plus some extra, as moah RAMs == moah disk caching, and moah disk caching == better effective disk I/O.  Win all round!</p>
<p>CPU, on the other hand, is practically never an issue.  The servers run a lot of separate processes, but they&#8217;re almost always waiting on stuff coming from the disk, so with the current state of the art in server CPUs being quad core, we really shouldn&#8217;t have a CPU bottleneck.</p>
<p>Although I said earlier that memory ran a close second to disk I/O for the title of &#8220;biggest performance bottleneck&#8221;, there was really no competition for which one of these was going to keep me up at night.  Solving the memory problem is easy &#8212; modern chassis can easily accommodate 32GB (or more) of RAM.  There was never any doubt that we&#8217;d be using at least a half dozen machines, so stocking them all with 32GB of RAM should be plenty.</p>
<p>No, the worry was always going to be the file I/O capacity, and making sure that we had both the speed and the storage capacity we needed.  While the site doesn&#8217;t need petabytes of storage, it does need a decent amount of space, and it all needs to be pretty quick.  What&#8217;s annoying (but understandable) about storage systems is that you can either have a lot of capacity (1.5TB SATA drives are common as MCSEs) or you can have a lot of speed (15k SAS drives max out at 300GB).  We could get the storage space we needed with 300GB drives, but would it be quick enough?</p>
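<p>To make the capacity-versus-speed trade-off concrete, here&#8217;s a rough sketch of how you might count drives against both targets at once.  The drive capacities are the ones mentioned above; the per-drive IOPS figures are rule-of-thumb assumptions on my part, the required capacity and IOPS are entirely hypothetical, and RAID overhead is ignored.</p>

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class DriveType:
    name: str
    capacity_gb: int
    iops: int  # assumed random IOPS per drive (rule of thumb, not measured)


def drives_needed(drive, required_gb, required_iops):
    """Return (total, for_capacity, for_iops) drive counts, ignoring RAID overhead."""
    for_capacity = math.ceil(required_gb / drive.capacity_gb)
    for_iops = math.ceil(required_iops / drive.iops)
    return max(for_capacity, for_iops), for_capacity, for_iops


SATA = DriveType("1.5TB SATA", 1500, 80)      # IOPS figure is an assumption
SAS = DriveType("300GB 15k SAS", 300, 180)    # IOPS figure is an assumption

REQUIRED_GB = 6000     # hypothetical target, not from the real project
REQUIRED_IOPS = 3000   # hypothetical target, not from the real project

for drive in (SATA, SAS):
    total, cap, iops = drives_needed(drive, REQUIRED_GB, REQUIRED_IOPS)
    limit = "capacity" if total == cap else "IOPS"
    print(f"{drive.name}: {total} drives, limited by {limit}")
```

<p>With these made-up targets the SATA option needs far more spindles than its capacity alone would suggest, because speed is what runs out first, while the SAS option is bounded by capacity instead.  Which side you land on depends entirely on the measured I/O figure.</p>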
<p>To try and make <em>some</em> sort of an apples-to-apples comparison, I needed to have a number that represented how much I/O was being done at present, and which could be compared to what our new hardware infrastructure is capable of.</p>
<p>In the end, I ran the <tt>sar</tt> tool on a number of the existing machines to get an idea of how much disk I/O they were requesting.  There are a number of things that might make this comparison inaccurate, but I decided that there wasn&#8217;t really any better metric.</p>
<p>The key thing was to try and get the statistics at the same &#8220;layer&#8221; of the stack in both cases &#8212; here, the point where the kernel passes the I/O request off to the disk (or RAID controller).  The benefits of this are that it&#8217;s a single statistic to compare, and it&#8217;s not ridiculously hard to synthesise a load at this level for benchmarking purposes (obviously, running the live site on the new infrastructure to benchmark the new hardware isn&#8217;t a real winning strategy).  When all&#8217;s said and done, though, these benchmarks are an estimate, and are unlikely to be completely accurate.  That needs to be kept in mind when doing the hardware estimations later on.</p>
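<p>For illustration, here&#8217;s a minimal sketch of boiling that sort of per-device data down to a single number.  The sample text is a simplified, hypothetical report in the general shape of per-device <tt>sar</tt> output (real column sets vary between sysstat versions), and the function name is my own invention, not part of any tool.</p>

```python
from collections import defaultdict

# Hypothetical, simplified sample; real sar reports carry more columns.
SAMPLE = """\
00:10:01  DEV      tps     rd_sec/s  wr_sec/s
00:10:01  dev8-0   120.00  1800.00   2600.00
00:10:01  dev8-16  40.00   300.00    900.00
00:20:01  dev8-0   180.00  2500.00   4100.00
00:20:01  dev8-16  60.00   500.00    1500.00
"""


def peak_and_mean_tps(text):
    """Sum tps across devices per timestamp; return (peak, mean) of those sums."""
    columns = None
    per_interval = defaultdict(float)
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if "DEV" in fields and "tps" in fields:
            columns = fields  # remember where each column sits
            continue
        if columns is None or len(fields) != len(columns):
            continue  # skip banners, averages and anything oddly shaped
        row = dict(zip(columns, fields))
        timestamp = fields[0]
        per_interval[timestamp] += float(row["tps"])
    totals = list(per_interval.values())
    if not totals:
        raise ValueError("no per-device rows found")
    return max(totals), sum(totals) / len(totals)


peak, mean = peak_and_mean_tps(SAMPLE)
print(f"peak {peak:.0f} tps, mean {mean:.0f} tps")
```

<p>Sizing against the peak rather than the mean is the conservative choice, given that the whole comparison is an estimate to begin with.</p>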
<p>All of this information gathering and benchmarking takes a pile of effort, but without it there&#8217;s no chance whatsoever that any sizeable infrastructure will be correct for the job it needs to do.  I was surprised in this case at how little hardware we ended up needing.  On a system I sized previously, however, the initial guesstimates turned out to be an order of magnitude too low (the system ended up with some thirty-odd servers instead of the five initially ordered).  Without a comprehensive analysis of the reality of the situation, you&#8217;re either going to end up with a poorly performing site, or else a pile of unused hardware.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.anchor.com.au/blog/2009/08/know-thy-enemy/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Infrastructure development as performance art</title>
		<link>http://www.anchor.com.au/blog/2009/08/infrastructure-development-as-performance-art/</link>
		<comments>http://www.anchor.com.au/blog/2009/08/infrastructure-development-as-performance-art/#comments</comments>
		<pubDate>Mon, 03 Aug 2009 03:24:09 +0000</pubDate>
		<dc:creator>matt</dc:creator>
				<category><![CDATA[FTW]]></category>
		<category><![CDATA[infrastructure development]]></category>
		<category><![CDATA[project starbug]]></category>

		<guid isPermaLink="false">http://www.anchor.com.au/blog/?p=1037</guid>
		<description><![CDATA[Anchor recently signed a new customer. This is not normally news, but then again, this is not a normal customer. They&#8217;re fairly sizeable, and need a large scale dedicated infrastructure to handle their request volume. Because of the scale of the development, and some of the novel approaches we&#8217;re going with, we&#8217;ve decided to blog [...]]]></description>
			<content:encoded><![CDATA[<p>Anchor recently signed a new customer.  This is not normally news, but then again, this is not a normal customer.  They&#8217;re fairly sizeable, and need a large scale dedicated infrastructure to handle their request volume.</p>
<p>Because of the scale of the development, and some of the novel approaches we&#8217;re going with, we&#8217;ve decided to blog about the experience of setting it all up.  In effect, we&#8217;ll be doing the development of this infrastructure in public.  Over the next couple of months, as everything comes together, I&#8217;ll be regularly writing up what we&#8217;re doing, how we&#8217;re doing it, and the good, the bad, and the ugly.  Some details will need to be obscured, for customer confidentiality reasons, but we&#8217;ll make as much information public as we possibly can.  If you&#8217;ve never been involved in a big infrastructure project, hopefully you&#8217;ll be able to get a feel for what goes into something like this.</p>
<p>We&#8217;re code-naming this sizeable effort &#8220;Project Starbug&#8221;, and all the blog posts in this series will be tagged with &#8220;project starbug&#8221;, for ease of identification.  Follow along, and watch the adventure unfold&#8230;</p>
]]></content:encoded>
			<wfw:commentRss>http://www.anchor.com.au/blog/2009/08/infrastructure-development-as-performance-art/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

