<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Anchor Web Hosting Blog &#187; fail</title>
	<atom:link href="http://www.anchor.com.au/blog/tag/fail/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.anchor.com.au/blog</link>
	<description>A view into the Anchor Engineroom</description>
	<lastBuildDate>Thu, 29 Jul 2010 06:35:57 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0</generator>
		<item>
		<title>Why software developers don&#8217;t make good system administrators</title>
		<link>http://www.anchor.com.au/blog/2010/05/why-software-developers-dont-make-good-system-administrators/</link>
		<comments>http://www.anchor.com.au/blog/2010/05/why-software-developers-dont-make-good-system-administrators/#comments</comments>
		<pubDate>Wed, 19 May 2010 00:55:55 +0000</pubDate>
		<dc:creator>Keiran Holloway</dc:creator>
				<category><![CDATA[Newsletter]]></category>
		<category><![CDATA[WTF]]></category>
		<category><![CDATA[best practice]]></category>
		<category><![CDATA[dedicated server]]></category>
		<category><![CDATA[documentation]]></category>
		<category><![CDATA[fail]]></category>
		<category><![CDATA[system administration]]></category>

		<guid isPermaLink="false">http://www.anchor.com.au/blog/?p=1283</guid>
		<description><![CDATA[Straight off the bat I would make something clear:  I have a lot of respect for software and web developers.  Being able to write clean, intelligent and efficient code is certainly one of the more difficult aspects within this industry. With this in mind, I think that anyone who is able to write a consistently [...]]]></description>
			<content:encoded><![CDATA[<p>Straight off the bat I would like to make something clear:  I have a lot of respect for software and web developers.  Being able to write clean, intelligent and efficient code is certainly one of the more difficult aspects of this industry. With this in mind, I think that anyone who can consistently write high-quality code based on often sketchy requirements, and deliver it within the usual time pressures of business, should be awarded some kind of medal.</p>
<p>That said, I can say with some confidence that we have the pleasure of working with some of the very best software and web developers, both locally here in Australia and abroad.</p>
<p>Further to this, I can also add quite unreservedly that software developers really don&#8217;t make good system administrators. And can you really blame them?</p>
<p>Allow me to elaborate a little: as you may have already guessed from the above few paragraphs, software development is tough.  Being a good software developer is even tougher. Under the pretty exterior of most websites there is an awful lot of work that goes into making the sites work.  Pulling this together requires a fair amount of consideration throughout all aspects of the software development process, from gathering requirements and designing the application through to writing the code, testing, debugging and forever trying to squash that final elusive bug.  It takes someone with a fairly specific skill-set to be able to do all this and to do it well.</p>
<p>Something that I&#8217;ve noticed, however, is that software developers are sometimes expected to take on the role of server management and look after the ongoing running and maintenance of the machine.  Whilst I can appreciate there&#8217;s a similarity between what a software developer and a system administrator do (&#8220;hey, they both do &#8216;computer stuff&#8217;&#8221;), the tasks completed by each role are worlds apart.  A software developer really only cares about getting his or her application working within a specific environment the quickest way possible.  This can sometimes mean rather drastic changes to the machine configuration with little consideration of the potentially negative implications. This is pretty understandable: as far as they&#8217;re concerned, once they get the environment working with their application they can just continue hacking away on their code.  They are probably under other tight deadlines, or would simply prefer to get on with what they&#8217;re actually being paid to do, without much consideration for the longevity and maintainability of the operating system environment.</p>
<p>This is something we see a lot of, from developers downloading source tarballs then compiling and installing software system-wide, to running bleeding-edge versions of software which just aren&#8217;t suited to production.</p>
<p>To give an example, a recent incident prompted this post: we had a client call up complaining that they couldn&#8217;t get their postgresql database to start. Whilst this machine was not on our <a href="http://www.anchor.com.au/dedicated-hosting/dedicated-support.py#complete">fully managed service</a>, we are always willing to help out our clients on a professional consulting basis.  Upon logging in we attempted to start postgresql and witnessed it failing without too many clues as to what&#8217;s going on.  Further investigation revealed the following in the postgresql startup logs:</p>
<blockquote><p>FATAL:  database files are incompatible with server<br />
DETAIL:  The database cluster was initialized with CATALOG_VERSION_NO 200812281, but the server was compiled with CATALOG_VERSION_NO 200904091.</p></blockquote>
<p>Further digging revealed that postgresql had recently been updated: 14 hours earlier, to be precise. Subsequent to this the database engine had been stopped and then failed to start again.  The client in question actually uses this machine as a mail exchange for his clients and uses a postgresql back-end to manage the mail tables.  This means that for the duration of the outage, no email was working for any of the clients on the machine.  Yes, for 14 hours.  Ouch.</p>
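<p>As a rough illustration, the two catalog version numbers in that log message can be pulled out and compared with a few lines of Python. This is only a sketch: the function names are made up for this example, and it assumes the leading eight digits of a catalog version number encode a date (YYYYMMDD) followed by a per-day counter.</p>

```python
import re
from datetime import date

# Matches the DETAIL line quoted in the log excerpt above.
CATALOG_RE = re.compile(
    r"initialized with CATALOG_VERSION_NO (\d+), "
    r"but the server was compiled with CATALOG_VERSION_NO (\d+)"
)


def parse_catalog_mismatch(log_text):
    """Return (cluster_version, server_version) as ints, or None if no mismatch is logged."""
    match = CATALOG_RE.search(log_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def catalog_date(version):
    """Decode the leading YYYYMMDD digits of a catalog version number (assumed layout)."""
    digits = str(version)
    return date(int(digits[0:4]), int(digits[4:6]), int(digits[6:8]))


LOG = (
    "FATAL:  database files are incompatible with server\n"
    "DETAIL:  The database cluster was initialized with CATALOG_VERSION_NO 200812281, "
    "but the server was compiled with CATALOG_VERSION_NO 200904091."
)

cluster, server = parse_catalog_mismatch(LOG)
print("data directory:", cluster, catalog_date(cluster))
print("server binary: ", server, catalog_date(server))
```

<p>A check like this could run before any package update is allowed to restart the database, so that a mismatch is caught while the old binaries are still around.</p>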
<p>Once we had found the problem, all we needed to do was roll back to the previous version, start up postgres and everything would be hunky-dory, right? Well, easier said than done.</p>
<p>In this case, the software developer had installed what appears to be a development version of postgresql which (as the catalog version number in the error message suggests) dates from late December 2008.  That&#8217;s ok, we should just be able to reinstall the previous version from the RPM on the machine, right?  Wrong. It didn&#8217;t exist.</p>
<p>At this point we did a quick Google search and checked the postgresql website to see if they perhaps, just maybe, had a copy of this daily development release somewhere.  No joy there&#8230;</p>
<p>I know! We take backups for any clients who choose to use our managed backup solution, and this client had opted for this service!  As part of our managed backups we roll out an automated process to take a dump of all the databases and store it locally on disk!  Given this happens at midnight each night and the database stopped running at 8pm, we&#8217;ll just be able to restore from the database dumps, right?  Wrong.  We didn&#8217;t install postgresql on this machine, so there was no dump process in place for it.</p>
<p>So at this point, the dataset was still there but effectively useless, and mail services were still down.  Fortunately, we were able to save the day by restoring all the binary files for this specific version of postgresql from backups, and thus restore services for the client.  The motivation behind using this specific version is unknown: the software developer has since moved on and there is zero documentation.  This situation really shouldn&#8217;t have happened in the first place.</p>
<p>This type of problem is actually something that we see more often than you would imagine.  We often have developers requesting specific versions of software to use in a production environment.  Obviously, we would strongly, strongly discourage the use of development versions within production (they&#8217;re called DEVELOPMENT versions for a reason: they simply haven&#8217;t been around long enough to be considered stable, reliable software).  However, from time to time a specific feature or bug fix within a specific development version dictates that we must install such a version.  This is something we can certainly get working&#8230;  And, most importantly, we can keep the machine in a maintainable state! This means having supporting documentation as to the decisions made as well as making sure that routine maintenance tasks will not break the existing, carefully crafted configuration.</p>
<p>I also have another fond memory of a web developer who was having some niggling problems with tomcat and permissions and figured that the best way to solve the problem was using:</p>
<blockquote><p>chown tomcat / -R</p></blockquote>
<p>So, it got the web application working, but broke virtually every other service on the machine.  Can anyone say hosed file system permissions?</p>
<p>&#8230;Or how about the Windows machine which has 4, yes, 4 separate instances of MSSQL installed on it&#8230;  I digress.</p>
<p>Without wanting to turn this into a big marketing spiel, it is important to keep in mind that, like software development, system administration can be a tough game too.   Obviously, in the above examples, with hindsight we can easily identify the problems in what was done previously on the machines.  That said, at Anchor we are a team of system administrators who have been running complex systems for a long time now and have the experience to make sure that all the appropriate precautions are taken so that we don&#8217;t end up in the situations above.</p>
<p>Further to this, we have numerous systems in place to pro-actively check services, including database servers, 24/7. In the event of failure both audible and visual alerts are generated, with notifications outside of hours being sent via SMS.  Had this happened on a fully managed machine, it would never have resulted in 14 hours of downtime.  All said, I am not just trying to blow our own horn about how fantastically brilliant we are (ok, maybe, just a little), but what I am trying to get across is that system administration really requires an all-or-nothing attitude. If your website or associated hosting infrastructure is critical to your business&#8217; success then making sure the commitment to system management is commensurate is absolutely imperative, either through outsourcing via our fully managed support pack or by hiring a dedicated system administrator.   There really is no place for laissez-faire, and utilising a software developer part-time for this role is only likely to cost more in the longer term.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.anchor.com.au/blog/2010/05/why-software-developers-dont-make-good-system-administrators/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Monitor your servers like it&#8217;s 1996</title>
		<link>http://www.anchor.com.au/blog/2009/12/monitor-your-servers-like-its-1996/</link>
		<comments>http://www.anchor.com.au/blog/2009/12/monitor-your-servers-like-its-1996/#comments</comments>
		<pubDate>Thu, 03 Dec 2009 00:43:23 +0000</pubDate>
		<dc:creator>matt</dc:creator>
				<category><![CDATA[WTF]]></category>
		<category><![CDATA[fail]]></category>
		<category><![CDATA[monitoring]]></category>
		<category><![CDATA[nagios]]></category>
		<category><![CDATA[plugins]]></category>
		<category><![CDATA[thresholds]]></category>

		<guid isPermaLink="false">http://www.anchor.com.au/blog/?p=1398</guid>
		<description><![CDATA[Whilst I&#8217;m a fan of using percentages for my disk space checks, sometimes an explicit size is more appropriate. So, you&#8217;d expect the following to work nicely: $USER1$/check_disk -w 5G -c 1G -p /data/foo If you don&#8217;t actually test that this works (by artificially filling your disk and seeing what happens), you may be dismayed [...]]]></description>
			<content:encoded><![CDATA[<p>Whilst I&#8217;m a fan of using percentages for my disk space checks, sometimes an explicit size is more appropriate.  So, you&#8217;d expect the following to work nicely:</p>
<pre>
$USER1$/check_disk -w 5G -c 1G -p /data/foo
</pre>
<p>If you don&#8217;t actually test that this works (by artificially filling your disk and seeing what happens), you may be dismayed to find that you only get alerted when the disk has 5MB of free disk space.  Why is this?</p>
<p>Because Nagios, despite the fact that nobody has sweated the megabytes for about a gazillion years, doesn&#8217;t support &#8216;G&#8217; as a suffix for thresholds.  Oh, it&#8217;ll make a good show of pretending &#8212; after all, the output formatting options have &#8216;GB&#8217; as an option &#8212; but nope, for your thresholds it&#8217;s &#8220;5000M&#8221; all the way.</p>
<p>ROCK ON!</p>
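<p>If you would rather keep writing thresholds as &#8220;5G&#8221; in your own tooling, one option is to convert them to megabytes before they ever reach the plugin. The following is a minimal sketch, not anything Nagios ships: the helper names are invented here, and the decimal multipliers are an assumption chosen to match the &#8220;5000M&#8221; figure above.</p>

```python
import re

# Assumption: decimal multipliers, so that "5G" becomes the "5000M" quoted in the post.
MULTIPLIERS = {"M": 1, "G": 1000, "T": 1000 * 1000}


def to_megabytes(threshold):
    """Convert a size such as '5G' or '500M' to a whole number of megabytes."""
    match = re.fullmatch(r"(\d+)([MGT])", threshold.strip().upper())
    if match is None:
        raise ValueError("unsupported threshold: %r" % threshold)
    return int(match.group(1)) * MULTIPLIERS[match.group(2)]


def check_disk_args(warn, crit, path):
    """Build the check_disk argument list with both thresholds expressed in megabytes."""
    return [
        "$USER1$/check_disk",
        "-w", "%dM" % to_megabytes(warn),
        "-c", "%dM" % to_megabytes(crit),
        "-p", path,
    ]


print(" ".join(check_disk_args("5G", "1G", "/data/foo")))
```

<p>Whatever you generate, the advice above still stands: fill a disk on purpose and confirm the alert fires at the size you expect.</p>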
]]></content:encoded>
			<wfw:commentRss>http://www.anchor.com.au/blog/2009/12/monitor-your-servers-like-its-1996/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>I always knew webmin was arse, but this&#8230;</title>
		<link>http://www.anchor.com.au/blog/2009/11/i-always-knew-webmin-was-arse-but-this/</link>
		<comments>http://www.anchor.com.au/blog/2009/11/i-always-knew-webmin-was-arse-but-this/#comments</comments>
		<pubDate>Tue, 17 Nov 2009 15:58:05 +0000</pubDate>
		<dc:creator>matt</dc:creator>
				<category><![CDATA[WTF]]></category>
		<category><![CDATA[fail]]></category>
		<category><![CDATA[failwall]]></category>
		<category><![CDATA[webmin]]></category>

		<guid isPermaLink="false">http://www.anchor.com.au/blog/?p=1376</guid>
		<description><![CDATA[This is the output of iptables -L on a webmin-managed box I just saw: Chain INPUT (policy ACCEPT) target prot opt source destination ACCEPT all -- anywhere anywhere ACCEPT all -- anywhere anywhere ACCEPT tcp -- anywhere anywhere tcp flags:ACK/ACK ACCEPT all -- anywhere anywhere state ESTABLISHED ACCEPT all -- anywhere anywhere state RELATED ACCEPT [...]]]></description>
			<content:encoded><![CDATA[<p>This is the output of <tt>iptables -L</tt> on a webmin-managed box I just saw:</p>
<pre>
Chain INPUT (policy ACCEPT)
target     prot opt source               destination
ACCEPT     all  --  anywhere             anywhere
ACCEPT     all  --  anywhere             anywhere
ACCEPT     tcp  --  anywhere             anywhere            tcp flags:ACK/ACK
ACCEPT     all  --  anywhere             anywhere            state ESTABLISHED
ACCEPT     all  --  anywhere             anywhere            state RELATED
ACCEPT     udp  --  anywhere             anywhere            udp spt:domain dpts:1024:65535
ACCEPT     icmp --  anywhere             anywhere            icmp any
ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:ftp
ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:ssh
ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:smtp
ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:domain
ACCEPT     udp  --  anywhere             anywhere            udp dpt:domain
ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:http
ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:pop3
ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:imap
ACCEPT     udp  --  anywhere             anywhere            udp dpt:imap
ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:https
ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:mysql
ACCEPT     tcp  --  anywhere             anywhere            tcp spt:mysql
ACCEPT     tcp  --  anywhere             anywhere            tcp dpts:terabase:samsung-unidex
ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:ndmp
ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:dnp
LOG        all  --  anywhere             anywhere            LOG level debug prefix `DROPPED = '
ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:ftp-data 

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination         

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination
ACCEPT     tcp  --  anywhere             anywhere            tcp spt:ftp-data dpt:ftp-data
ACCEPT     tcp  --  anywhere             anywhere            tcp spt:ftp dpt:ftp
</pre>
<p>Lovely that it has all those ports and whatnot opened up, but what&#8217;s with the <tt>ACCEPT</tt> policies?</p>
<p><b>Webmin: Now with FAILWALL management!</b></p>
<p>I should have been in marketing.</p>
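<p>For anyone auditing more than one box, the default policies can be picked out of the <tt>iptables -L</tt> output mechanically. This is a small sketch under the assumption that the chain headers look like the listing above; the function name is hypothetical, and it only looks at policies, not at the rules themselves.</p>

```python
import re

# Chain header lines look like: "Chain INPUT (policy ACCEPT)"
POLICY_RE = re.compile(r"^Chain (\S+) \(policy (\S+)\)", re.MULTILINE)


def permissive_chains(iptables_output):
    """Return the names of built-in chains whose default policy is ACCEPT."""
    return [
        name
        for name, policy in POLICY_RE.findall(iptables_output)
        if policy == "ACCEPT"
    ]


SAMPLE = """Chain INPUT (policy ACCEPT)
target     prot opt source               destination
ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:ssh

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination
"""

print(permissive_chains(SAMPLE))
```

<p>An empty list for INPUT and FORWARD is what you would normally hope to see on a firewalled host.</p>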
]]></content:encoded>
			<wfw:commentRss>http://www.anchor.com.au/blog/2009/11/i-always-knew-webmin-was-arse-but-this/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Ooh, bugger&#8230;</title>
		<link>http://www.anchor.com.au/blog/2009/10/ooh-bugger/</link>
		<comments>http://www.anchor.com.au/blog/2009/10/ooh-bugger/#comments</comments>
		<pubDate>Fri, 02 Oct 2009 04:07:52 +0000</pubDate>
		<dc:creator>Barney Desmond</dc:creator>
				<category><![CDATA[WTF]]></category>
		<category><![CDATA[colocation]]></category>
		<category><![CDATA[datacentre]]></category>
		<category><![CDATA[fail]]></category>
		<category><![CDATA[pwned]]></category>
		<category><![CDATA[racks]]></category>
		<category><![CDATA[sun]]></category>

		<guid isPermaLink="false">http://www.anchor.com.au/blog/?p=1246</guid>
		<description><![CDATA[And this is why we co-locate in Globalswitch, a top-tier facility with floors that AREN&#8217;T MADE OF BALSA WOOD.]]></description>
			<content:encoded><![CDATA[<p>And this is why we co-locate in Globalswitch, a top-tier facility with floors that <em>AREN&#8217;T MADE OF BALSA WOOD</em>.</p>
<div id="attachment_1245" class="wp-caption alignnone" style="width: 310px"><a href="http://www.anchor.com.au/blog/wp-content/uploads/2009/10/sun_racks.jpg"><img class="size-medium wp-image-1245 " src="http://www.anchor.com.au/blog/wp-content/uploads/2009/10/sun_racks-300x261.jpg" alt="Racks are pretty heavy, sure, but they totally wtfpwned those tables there" width="300" height="261" /></a><p class="wp-caption-text">Racks are pretty heavy, sure, but they totally wtfpwned those tables there</p></div>
]]></content:encoded>
			<wfw:commentRss>http://www.anchor.com.au/blog/2009/10/ooh-bugger/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>When HA won&#8217;t play the way you want it to</title>
		<link>http://www.anchor.com.au/blog/2009/09/when-ha-wont-play-the-way-you-want-it-to/</link>
		<comments>http://www.anchor.com.au/blog/2009/09/when-ha-wont-play-the-way-you-want-it-to/#comments</comments>
		<pubDate>Tue, 08 Sep 2009 03:26:29 +0000</pubDate>
		<dc:creator>oliver</dc:creator>
				<category><![CDATA[FTW]]></category>
		<category><![CDATA[drbd]]></category>
		<category><![CDATA[fail]]></category>
		<category><![CDATA[ha]]></category>
		<category><![CDATA[heartbeat]]></category>
		<category><![CDATA[oracle]]></category>
		<category><![CDATA[shoehorn]]></category>
		<category><![CDATA[win]]></category>
		<category><![CDATA[windows]]></category>

		<guid isPermaLink="false">http://www.anchor.com.au/blog/?p=1116</guid>
		<description><![CDATA[In an ideal world every service would support High Availability and Load Balancing, would scale up easily and cleanly and all of us systems administrators would be paid bucketloads to play golf all day while the computers did all the hard work. To quote Dylan Moran of Black Books fame, &#8220;Don&#8217;t make me laugh&#8230;bitterly&#8221;. I&#8217;ll [...]]]></description>
			<content:encoded><![CDATA[<p>In an ideal world every service would support High Availability and Load Balancing, would scale up easily and cleanly and all of us systems administrators would be paid bucketloads to play golf all day while the computers did all the hard work. To quote Dylan Moran of Black Books fame, &#8220;Don&#8217;t make me laugh&#8230;bitterly&#8221;.</p>
<p>I&#8217;ll cut to the chase &#8211; sometimes you have to really shoehorn technologies to do what you want. Fortunately I love doing this, and the technologies of today&#8217;s article are virtualised Windows 2008 on Xen, and Oracle XE 10g. Neither likes to play ball, for a few reasons:</p>
<ul>
<li>Generally speaking, when you virtualise an OS you want to have para-virtualisation drivers enhancing the hardware support. Open Source Xen has PV drivers, but they are not signed with a legitimate certificate. Windows 2008 does not play nicely with unsigned or test-cert-signed drivers.</li>
<li>Oracle is just a messy, messy, nasty thing. Yes, paid versions undoubtedly support all manner of loadbalancing and HA options, but the free one does not.</li>
</ul>
<h2>Adding HA to Windows 2008 on Xen</h2>
<p>The basic procedure was as follows:</p>
<ul>
<li>Install the telnet server within Windows (making sure to lock it down in the firewall to only be accessible by the host machines)</li>
<li>Create a special admin account and password used for triggering a shutdown</li>
<li>Create an Expect script which logs into the VM via telnet, and issues the shutdown command</li>
<li>Create a modified version of the Heartbeat Xen resource agent which calls the expect script to shut down the VM (and wait a safe period of time) before &#8220;xm shutdown&#8221; is called. Without this, &#8220;xm shutdown&#8221; will simply power off the VM (in absence of working PV drivers).</li>
</ul>
<p>The VM was already running on a DRBD volume between the two HA Xen servers, so I was able to just create a standard set of Heartbeat resources to control DRBD primary/secondary mode and the startup/shutdown of the HA Windows VM. For your benefit (if you want to recreate it) here is the expect script:</p>
<pre>#!/usr/bin/expect -f
#
# Script which "automates" shutting down a Windows VM

# Don't log telnet output and commands to stdout, and set a reasonable timeout.
log_user 0
set timeout 3

# Log in via telnet and issue commands. Fairly straightforward.
spawn -noecho /usr/bin/telnet 192.168.1.1
sleep 0.5

# login as the "shutdown" user
expect {
 -re "login: $" {send "shutdown\r"}
 timeout exit
}
sleep 0.5
expect {
 -re "password: $" {send "mysecretpassword\r"}
 timeout exit
}
sleep 0.5
expect {
 -re "&gt;$" {send "shutdown /s /t 0\r"}
 timeout exit
}
sleep 0.1
expect {
 -re "&gt;$" {send "exit\r"}
 timeout exit
}
exit</pre>
<p>The rest is fairly self-explanatory if you understand Heartbeat.</p>
<h2>Oracle XE 10g</h2>
<p>This was more of a learning process, since usually you just install Oracle and leave it the hell alone. Not so for me.</p>
<ul>
<li>Install Oracle on both nodes using (fortunately) the RPMs they provide</li>
<li>Configure Oracle on both nodes including creating the databases, using the same password for SYSDBA</li>
<li>Shutdown both instances of Oracle</li>
<li>Create the DRBD resource, and mount it on the primary node</li>
<li>On the primary node, move the contents of /usr/lib/oracle/xe/oradata and /usr/lib/oracle/xe/app/oracle/flash_recovery_area onto the mounted DRBD</li>
<li>On the secondary node, delete the aforementioned paths</li>
<li>Bind mount the oradata and flash recovery area from the mounted DRBD volume into the correct places in the directory tree.</li>
<li>Start Oracle</li>
</ul>
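<p>The bind-mount step can be sketched as follows. This is an illustration only: the DRBD mount point <tt>/mnt/drbd</tt> and the function name are hypothetical, it assumes the two directories were moved to the top level of the DRBD volume under their original names, and it assumes empty target directories exist to mount onto. It only builds the commands; running them is left to you (or to Heartbeat).</p>

```python
import posixpath

# The two Oracle XE paths named in the steps above.
ORACLE_DIRS = [
    "/usr/lib/oracle/xe/oradata",
    "/usr/lib/oracle/xe/app/oracle/flash_recovery_area",
]


def bind_mount_commands(drbd_mount):
    """Build one 'mount --bind' command per Oracle directory.

    Assumes each directory lives directly under drbd_mount with its original name.
    """
    commands = []
    for target in ORACLE_DIRS:
        source = posixpath.join(drbd_mount, posixpath.basename(target))
        commands.append(["mount", "--bind", source, target])
    return commands


# /mnt/drbd is a made-up mount point for this example.
for command in bind_mount_commands("/mnt/drbd"):
    print(" ".join(command))
```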
<p>After I had created a Heartbeat resource group which contained the DRBD resource, the DRBD filesystem mount, the aforementioned bind mounts and the Oracle service itself I was quite pleased to see that Oracle plays quite nicely with our shoehorned HA setup. You&#8217;ll want to make sure you have a <a href="http://www.anchor.com.au/blog/2009/07/oracle-why-dost-thou-sucketh-so-prodigiously/">properly fixed</a> Oracle init script though, as the supplied one is fairly bad.</p>
<p>After making Oracle and Windows 2008 work nicely in HA, I&#8217;m almost certain any service, no matter how bad, can be shoehorned in a similar way to give you decent availability even when it wasn&#8217;t originally designed for it.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.anchor.com.au/blog/2009/09/when-ha-wont-play-the-way-you-want-it-to/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>This just in, from the Department of the Bleedin&#8217; Obvious</title>
		<link>http://www.anchor.com.au/blog/2009/09/this-just-in-from-the-department-of-the-bleedin-obvious/</link>
		<comments>http://www.anchor.com.au/blog/2009/09/this-just-in-from-the-department-of-the-bleedin-obvious/#comments</comments>
		<pubDate>Tue, 08 Sep 2009 02:20:01 +0000</pubDate>
		<dc:creator>Barney Desmond</dc:creator>
				<category><![CDATA[WTF]]></category>
		<category><![CDATA[fail]]></category>
		<category><![CDATA[linux]]></category>
		<category><![CDATA[marketing]]></category>
		<category><![CDATA[spam]]></category>
		<category><![CDATA[windows]]></category>

		<guid isPermaLink="false">http://www.anchor.com.au/blog/?p=1113</guid>
		<description><![CDATA[I kid you not, we just received this in a piece of marketing guff from our favourite enterprise vendor. &#8220;Industry analysts predict that Linux and Windows will soon dominate the operating system space. How you respond to this is critical.&#8221; Meanwhile, industry analysts predict that more than 98% of the population will be consuming oxygen [...]]]></description>
			<content:encoded><![CDATA[<p>I kid you not, we just received this in a piece of marketing guff from our favourite enterprise vendor.</p>
<blockquote><p>&#8220;Industry analysts predict that Linux and Windows will soon dominate the operating system space. How you respond to this is critical.&#8221;</p></blockquote>
<p>Meanwhile, industry analysts predict that more than 98% of the population will be consuming oxygen by 2010.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.anchor.com.au/blog/2009/09/this-just-in-from-the-department-of-the-bleedin-obvious/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Large filesystem &#8220;support&#8221;</title>
		<link>http://www.anchor.com.au/blog/2009/04/large-filesystem-support/</link>
		<comments>http://www.anchor.com.au/blog/2009/04/large-filesystem-support/#comments</comments>
		<pubDate>Fri, 24 Apr 2009 06:22:41 +0000</pubDate>
		<dc:creator>oliver</dc:creator>
				<category><![CDATA[WTF]]></category>
		<category><![CDATA[8TB]]></category>
		<category><![CDATA[anaconda]]></category>
		<category><![CDATA[centos]]></category>
		<category><![CDATA[fail]]></category>
		<category><![CDATA[gpt]]></category>
		<category><![CDATA[kickstart]]></category>
		<category><![CDATA[lvm]]></category>
		<category><![CDATA[parted]]></category>

		<guid isPermaLink="false">http://www.anchor.com.au/blog/?p=801</guid>
		<description><![CDATA[I&#8217;ve written recently on how to handle systems with very large storage subsystems. One would think that as we make our way through 2009 that the supporting tools for such large filesystems are at the top of their game, but as I&#8217;ve been playing with 24TB of storage I&#8217;ve realised that this is hardly the [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve written <a href="http://www.anchor.com.au/hosting/dedicated/HandlingMassiveFileSystems">recently</a> on how to handle systems with very large storage subsystems. One would think that, as we make our way through 2009, the supporting tools for such large filesystems would be at the top of their game, but as I&#8217;ve been playing with 24TB of storage I&#8217;ve realised that this is hardly the case:</p>
<ul>
<li>The most commonly used bootloader for Linux systems, GRUB, doesn&#8217;t yet have capabilities to boot from GPT partitions (at least not in the stable release)</li>
<li>The most commonly used partitioner, fdisk, doesn&#8217;t support GPT-partitioned disks (and hence no disk larger than 2TB)</li>
<li>GNU parted, which <em>does </em>support GPT, insists on performing all partition resize operations itself (including resizing the contained filesystem). Since it doesn&#8217;t yet understand LVM, it can&#8217;t resize any partition that contains an LVM PV.</li>
</ul>
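<p>The 2TB ceiling comes from the traditional MBR partition table, which stores sector counts in 32-bit fields. A quick worked example, assuming the usual 512-byte sectors (the function names are just for illustration):</p>

```python
SECTOR_BYTES = 512      # assumed sector size
MAX_SECTORS = 2 ** 32   # MBR partition entries hold 32-bit sector counts
TIB = 2 ** 40


def mbr_limit_bytes(sector_bytes=SECTOR_BYTES):
    """Largest size addressable through an MBR partition table."""
    return MAX_SECTORS * sector_bytes


def needs_gpt(disk_bytes, sector_bytes=SECTOR_BYTES):
    """True if the disk is too large to be fully addressed via MBR."""
    return disk_bytes > mbr_limit_bytes(sector_bytes)


print(mbr_limit_bytes() // TIB, "TiB addressable via MBR")
print("12 TiB volume needs GPT:", needs_gpt(12 * TIB))
```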
<p>Today I ran into what appears to be a bug in the CentOS 5.3 installation partitioner, which left my 12TB RAID volume only partitioned to 8TB when I had supplied the --grow parameter in the Kickstart script. Since parted can&#8217;t resize LVM partitions, and there don&#8217;t appear to be any other tools out there at the moment for GPT partitioning on Linux, I&#8217;m left in a less than ideal position.</p>
<p>GNU parted can&#8217;t resize the partition because it can&#8217;t understand LVM. Fortunately I can just use it to create another partition with the remaining space and add it to the existing LVM volume group but this is really just a hack, and one that disturbs my obsessive-compulsive sysadmin nature. Were it not for the flexibility of LVM, we would be in a bit of a mess.</p>
<p>Sadly, it seems the large filesystem support that will soon become essential for everyone is largely lacking in adequate support.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.anchor.com.au/blog/2009/04/large-filesystem-support/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Standards? Who needs standards?</title>
		<link>http://www.anchor.com.au/blog/2009/04/standards-who-needs-standards/</link>
		<comments>http://www.anchor.com.au/blog/2009/04/standards-who-needs-standards/#comments</comments>
		<pubDate>Mon, 06 Apr 2009 08:06:10 +0000</pubDate>
		<dc:creator>oliver</dc:creator>
				<category><![CDATA[WTF]]></category>
		<category><![CDATA[allocations]]></category>
		<category><![CDATA[apc]]></category>
		<category><![CDATA[fail]]></category>
		<category><![CDATA[iana]]></category>
		<category><![CDATA[multicast]]></category>
		<category><![CDATA[pdu]]></category>
		<category><![CDATA[standards]]></category>

		<guid isPermaLink="false">http://www.anchor.com.au/blog/?p=752</guid>
		<description><![CDATA[Anyone in the sysadmin or developer worlds will know many examples of flagrant violations of standards in the IT world. Some are perpetrated by our coworkers, but a surprisingly high amount are perpetrated by vendors. Not all of them are by Microsoft, either! One big win for systems administration at Anchor is our use of [...]]]></description>
			<content:encoded><![CDATA[<p>Anyone in the sysadmin or developer worlds will know many examples of flagrant violations of standards in the IT world. Some are perpetrated by our coworkers, but a surprisingly high number are perpetrated by vendors. Not all of them are by Microsoft, either!</p>
<p>One big win for systems administration at Anchor is our use of APC Rack Power Distribution Units. These have been documented elsewhere in our <a href="http://www.anchor.com.au/blog/tag/datacentre/">blog</a> and <a href="http://www.anchor.com.au/hosting/planning/Server_Rack_Equipment_Layout_and_Cable_Organisation">wiki</a> but suffice it to say that having remote control over your power ports is a Very Good Thing. Situations where you have servers or other devices with multiple power supply units complicate things slightly, but <em>not that much, </em>especially with the aforementioned Rack PDUs in place.</p>
<p>APCs in particular allow you to configure what are called Multicast Groups. Essentially you tell a couple of the Rack PDUs to talk to each other and share information, and WHAMMO you can turn off and turn on a bunch of ports on separate Rack PDUs <strong>simultaneously!</strong> So rather than turning off the power to one PSU then rebooting the other, you can power-cycle both PSUs with a single command.</p>
<p>The confusion comes during the configuration of the Multicast Group option. Multicast is a very under-utilised feature of IPv4 (which has now partially been rectified in IPv6); in fact, a large chunk of the IPv4 address space is allocated to multicast (technically called the Class D space). As with all other portions of IP address space, this has been carefully divided into sections and allocated to various purposes. You can see the full list here:</p>
<p><a href="http://www.iana.org/assignments/multicast-addresses/">http://www.iana.org/assignments/multicast-addresses/</a></p>
<p>Being a good sysadmin I consider standards to be of paramount importance, so naturally I wanted to configure our Rack PDUs with multicast addresses suitable for the purpose. There are <a href="http://tools.ietf.org/html/rfc2770">many</a> <a href="http://tools.ietf.org/html/rfc2365">existing</a> <a href="http://www.29west.com/docs/THPM/multicast-address-assignment.html">references</a> on the Internet for how to pick sane and standards-obeying addresses from the multicast range. However, when attempting to follow standards and good reason, I was confronted with this error message:</p>
<pre>Multicast IP Address is out of range. Valid values are 224.0.0.3 - 224.0.0.254.</pre>
<p>Uh, what? I was under the impression that the range 224.0.0.0/24 was already heavily allocated to entities and purposes <strong>other than APC Rack PDUs!</strong> So much for following the standard, APC.</p>
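<p>To make the complaint concrete, here is a minimal sketch using the Python standard library <code>ipaddress</code> module. It checks which well-known block a multicast address falls into. The two blocks come from the IANA registry and RFC 2365 linked above; the <code>classify</code> helper and its labels are made up for illustration and cover only those two blocks.</p>

```python
import ipaddress

# 224.0.0.0/24 is reserved for link-local protocol traffic (routing protocols and friends).
LOCAL_NETWORK_CONTROL = ipaddress.ip_network("224.0.0.0/24")
# 239.0.0.0/8 is the administratively scoped range described in RFC 2365.
ADMIN_SCOPED = ipaddress.ip_network("239.0.0.0/8")


def classify(address):
    """Return a rough label for a multicast group address (hypothetical helper)."""
    ip = ipaddress.ip_address(address)
    if not ip.is_multicast:
        return "not multicast"
    if ip in LOCAL_NETWORK_CONTROL:
        return "local network control block"
    if ip in ADMIN_SCOPED:
        return "administratively scoped"
    return "other multicast"


# The first two are the limits from the APC error message; the third is an
# example of a privately usable address from the administratively scoped range.
for candidate in ("224.0.0.3", "224.0.0.254", "239.192.0.1"):
    print(candidate, "is", classify(candidate))
```

<p>Both ends of the range the PDU insists on land in the local network control block, while the kind of address the standards suggest for private use sits in the administratively scoped range.</p>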
]]></content:encoded>
			<wfw:commentRss>http://www.anchor.com.au/blog/2009/04/standards-who-needs-standards/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Filebucketing to the MAXXXXX</title>
		<link>http://www.anchor.com.au/blog/2009/03/filebucketing-to-the-maxxxxx/</link>
		<comments>http://www.anchor.com.au/blog/2009/03/filebucketing-to-the-maxxxxx/#comments</comments>
		<pubDate>Thu, 12 Mar 2009 02:41:08 +0000</pubDate>
		<dc:creator>oliver</dc:creator>
				<category><![CDATA[WTF]]></category>
		<category><![CDATA[fail]]></category>
		<category><![CDATA[filebucket]]></category>
		<category><![CDATA[state]]></category>
		<category><![CDATA[weblogic]]></category>
		<category><![CDATA[windows]]></category>

		<guid isPermaLink="false">http://www.anchor.com.au/blog/?p=549</guid>
		<description><![CDATA[Every now and then we see an example of application failure so astounding it literally brings tears to our eyes. We have a client whose legacy application is unfortunately still running on an ancient version of Oracle Weblogic, and it must be maintained until the new, flashy .NET version of their site is complete. We [...]]]></description>
			<content:encoded><![CDATA[<p>Every now and then we see an example of application failure so astounding it literally brings tears to our eyes. We have a client whose legacy application is unfortunately still running on an ancient version of Oracle Weblogic, and it must be maintained until the new, flashy .NET version of their site is complete.</p>
<p>We were alerted this morning to a problem with some of the Weblogic content &#8211; the pages were timing out. Diagnostics were fairly fruitless &#8211; packet captures showed nothing useful, and the logging from Weblogic left much to be desired. We started considering more outlandish possibilities such as I/O load causing issues, recently applied updates and so on. Even rebooting was considered (given it is running on Windows).</p>
<p>The first clue of note was the open file list from the Weblogic processes &#8211; one such example stood out:</p>
<pre><span style="font-size: small;">C:\weblogic\state\Sa0V\b1gR\O1Ok\WqYN\9kiv\IQT2\SHGx\C3ri\aE1z\L1YH\X5QW\
gdkB\B2PB\pPPw\uHDK\p1a7\I0l5\94sU\kQ43\+533\5517\5738\7484\6253\_-10\
6273\1519\_6_8\888_\8888\_700\2_702_8\888_</span></pre>
<p><span style="font-size: small;">For the sake of your screen, I have manually wrapped this Godzilla-like filename.</span></p>
<p>Perhaps you are familiar with <a href="http://www.anchor.com.au/hosting/dedicated/efficient_file_storage">file bucketing</a> already. If not: typically the directory structure used will have a relatively sane scheme for locating files and only extend a few levels deep. What we saw in this instance was a completely new breed of monster. Admittedly the absolute path of this file is less than 200 characters out of a limit of more than 32,000, but the naming strategy and depth of the structure have us flummoxed.</p>
<p>But this was only the tip of the proverbial iceberg. When we asked Windows to show us the properties of this state folder it took over an hour to completely calculate the file and folder totals, and the result is impressive:</p>
<div id="attachment_551" class="wp-caption aligncenter" style="width: 380px"><img class="size-full wp-image-551" src="http://www.anchor.com.au/blog/wp-content/uploads/2009/03/weblogicfail.png" alt="Web logic makes efficient use of the filesystem" width="370" height="476" /><p class="wp-caption-text">Web logic makes efficient use of the filesystem</p></div>
<p>Yes, you read that right &#8211; over 10 million nested directories. By this stage we had already moved the state directory out of the way, created a new one, and restarted Weblogic. It seemed happy and quite responsive after that. My suspicion is that someone developing this application at some point ran into a limitation with their filebucketing algorithm, and resolved to solve the problem <strong>once and for all, </strong>evidently by making it possible to efficiently filebucket every file in the known universe.</p>
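<p>For contrast, a conventional bucketing scheme can be sketched in a few lines of Python. This is only an illustration of the general idea, not what Weblogic does: the <code>bucket_path</code> helper, the SHA-1 hash, the two-level layout and the <code>/srv/state</code> root are all assumptions chosen for the example.</p>

```python
import hashlib
from pathlib import PurePosixPath


def bucket_path(root, name, levels=2, width=2):
    """Map a file name to root/xx/yy/name using a hash prefix (hypothetical layout)."""
    # Hash the name so files spread evenly across buckets regardless of naming patterns.
    digest = hashlib.sha1(name.encode("utf-8")).hexdigest()
    # Take `levels` consecutive chunks of `width` hex characters as directory names.
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return PurePosixPath(root, *parts, name)


# Two levels of two hex characters give at most 256 * 256 = 65,536 buckets,
# and every file sits exactly two directories below the root.
print(bucket_path("/srv/state", "session-12345.dat"))
```

<p>With a fixed number of levels the tree never grows deeper as more files arrive; only the number of files per bucket does.</p>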
]]></content:encoded>
			<wfw:commentRss>http://www.anchor.com.au/blog/2009/03/filebucketing-to-the-maxxxxx/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>A tale of two drives</title>
		<link>http://www.anchor.com.au/blog/2008/10/a-tale-of-two-drives/</link>
		<comments>http://www.anchor.com.au/blog/2008/10/a-tale-of-two-drives/#comments</comments>
		<pubDate>Thu, 09 Oct 2008 00:09:06 +0000</pubDate>
		<dc:creator>Barney Desmond</dc:creator>
				<category><![CDATA[WTF]]></category>
		<category><![CDATA[crash]]></category>
		<category><![CDATA[fail]]></category>
		<category><![CDATA[hard disk]]></category>
		<category><![CDATA[raid]]></category>
		<category><![CDATA[windows]]></category>

		<guid isPermaLink="false">http://www.anchor.com.au/blog/?p=17</guid>
		<description><![CDATA[It&#8217;s no secret that we&#8217;d rather be working on Linux than Windows here at Anchor. Not only is Windows, by and large, much more annoying to actually get anything done on, it also just breaks in opaque and unexplained ways. O Windowes, let me count the ways in which you are broken! This is one such problem [...]]]></description>
			<content:encoded><![CDATA[<p>It&#8217;s no secret that we&#8217;d rather be working on Linux than Windows here at Anchor. Not only is Windows, by and large, much more annoying to actually get anything done on, it also just breaks in opaque and unexplained ways. <em>O Windowes, let me count the ways in which you are broken!</em> This is one such problem we ran into yesterday.</p>
<p>Hard drive failure is a fact of life when you run servers, by sheer virtue of the fact that you have hundreds of them. To mitigate the risk and reduce unscheduled downtime, we use Windows&#8217; built-in software RAID feature. It&#8217;s not an enterprise solution, but it gets the job done. What&#8217;s important is staying online and not losing data.</p>
<p>Did I mention that trying to monitor a Windows box is a nightmare? A colleague of mine wrote <a href="http://www.anchor.com.au/hosting/dedicated/monitoring_windows_software_raid" target="_blank">a script to allow us to keep a watchful eye on Windows RAID volumes</a>, and it&#8217;s a lifesaver. A recently-deployed machine got a broken mirror, which we were able to act on immediately. We removed the dodgy mirror and prepared a replacement (we always have plenty of spares, of course). Allow me now to re-enact this scene&#8230;</p>
<blockquote><p>Windows (sounding almost efficient): The driver has detected that device \Device\Harddisk1\DR9 has predicted that it will fail</p>
<p>Sysadmin: Thanks, Windows, I&#8217;ll get right on that. You didn&#8217;t say whether that was SMART, or just voodoo, but whatever, it&#8217;s good to know.</p>
<p><em>The bad drive is removed and a replacement installed in the hotswap drive bay</em></p>
<p>Sysadmin: Okay, Windows, do your stuff. &#8220;Scan for new hardware&#8221;, please.</p>
<p><em>A pause.</em></p>
<p>Sysadmin: Ahem, Windows, &#8220;Scan for new hardware&#8221; and find my drive.</p>
<p>Windows: &#8216;Ey there, chaps. Do what now, you say? AIEEEEGRH!!</p>
<p><em>The server stops responding entirely, necessitating a touch of the reset button</em></p></blockquote>
<p>Needless to say, we&#8217;re rather unimpressed, and have to call the customer to let them know why their server has just dropped offline.</p>
<p>A quick check of the logs is in order. It&#8217;s also frustrating that there&#8217;s no sane way to scroll through log entries in Windows with something like a text editor, or to &#8220;tail&#8221; a log as it&#8217;s updated in realtime.</p>
<blockquote><p>09:36 &#8211; The previous system shutdown at 9:21:23 AM on 8/10/2008 was unexpected.</p>
<p><em>Okay, it went down at 09:21, which is correct. Now if we look back in time a little&#8230;</em></p>
<p>09:21 &#8211; dmio: Harddisk1 write error at block 1953524618 due to disk removal</p></blockquote>
<p>*sigh* And this is after the disk was removed cleanly&#8230;</p>
]]></content:encoded>
			<wfw:commentRss>http://www.anchor.com.au/blog/2008/10/a-tale-of-two-drives/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
