Archive for the ‘WTF’ Category

Why software developers don’t make good system administrators

Wednesday, May 19th, 2010

Straight off the bat I would like to make something clear: I have a lot of respect for software and web developers. Being able to write clean, intelligent and efficient code is certainly one of the more difficult aspects of this industry. With this in mind, I think that anyone who is able to write consistently high-quality code based on often sketchy requirements, and deliver it within the usual time pressures of business, should be awarded some kind of medal.

That said, I can say with some confidence that we have the pleasure of working with some of the very best software and web developers both locally here in Australia as well as abroad.

Further to this, I can also add quite unreservedly that software developers really don’t make good system administrators. And can you really blame them?

Allow me to elaborate a little here. As you may have already guessed from the above few paragraphs, software development is tough. Being a good software developer is even tougher. Under the pretty exterior of most websites there is an awful lot of work that goes into making the sites work. Pulling this together requires a fair amount of consideration throughout all aspects of the software development process, from gathering requirements and designing the application through to writing the code, testing, debugging and forever trying to squash that final elusive bug. It takes someone with a fairly specific skill-set to be able to do all this and to do it well.

Something that I’ve noticed, however, is that software developers are sometimes expected to take on the role of server management and look after the on-going running and maintenance of the machine. Whilst I can appreciate there’s a similarity between what a software developer and a system administrator do (“hey, they both do ‘computer stuff’”), the tasks completed by each role are worlds apart. A software developer really only cares about getting his or her application working within a specific environment the quickest way possible. This can sometimes mean some rather drastic changes to the machine configuration with little consideration of the potentially negative implications. This is pretty understandable: as far as they’re concerned, once they get the environment working with their application they can just continue hacking away on their code. They are probably under other tight deadlines, or would simply prefer to get on with what they’re actually being paid to do, so the longevity and maintainability of the operating system environment doesn’t get much consideration.

This is something we see a lot of; from developers downloading source tarballs then compiling and installing software system-wide to running bleeding edge versions of software which just aren’t suited to being in production.

To give an example of a recent incident which prompted this post, we had a client call up complaining that they couldn’t get their postgresql database to start. Whilst this was not on our fully managed service, we are always willing to help out our clients on a professional consulting basis. Upon logging in we attempted to start postgresql and witnessed it failing without too many clues as to what was going on. Further investigation revealed the following in the postgresql startup logs:

FATAL:  database files are incompatible with server
DETAIL:  The database cluster was initialized with CATALOG_VERSION_NO 200812281, but the server was compiled with CATALOG_VERSION_NO 200904091.
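The two numbers in that DETAIL line are worth a closer look, because PostgreSQL's catalog version numbers follow a yyyymmddN pattern and so carry a date. Below is a minimal sketch in Python for pulling them out of the log line; the function names parse_catalog_mismatch and catalog_date are made up for illustration and are not part of any postgresql tooling.

```python
import re
from datetime import date

# Matches the DETAIL line that postgresql logs on a catalog version mismatch.
CATALOG_RE = re.compile(
    r"initialized with CATALOG_VERSION_NO (\d+), "
    r"but the server was compiled with CATALOG_VERSION_NO (\d+)"
)


def parse_catalog_mismatch(detail):
    """Return (cluster_version, server_version) as ints, or None if no match."""
    match = CATALOG_RE.search(detail)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def catalog_date(catversion):
    """Decode the date part of a yyyymmddN catalog version number."""
    digits = str(catversion)
    return date(int(digits[0:4]), int(digits[4:6]), int(digits[6:8]))


if __name__ == "__main__":
    line = (
        "DETAIL:  The database cluster was initialized with "
        "CATALOG_VERSION_NO 200812281, but the server was compiled with "
        "CATALOG_VERSION_NO 200904091."
    )
    cluster, server = parse_catalog_mismatch(line)
    print("data directory catalog:", catalog_date(cluster))
    print("installed server catalog:", catalog_date(server))
```

Read this way, the data directory was created by a build whose catalog format dates from 28 December 2008, while the freshly installed server expects the format from 9 April 2009.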

Further digging revealed that postgresql had recently been updated: 14 hours ago, to be precise. Subsequent to this the database engine had been stopped and then failed to start again. The client in question actually uses this machine as a mail exchange for his clients and uses a postgresql back-end to manage the mail tables. This means that for the duration of the outage, no email was working for any of the clients on the machine. Yes, for 14 hours. Ouch.

Once we had found the problem, all we needed to do was roll back to the previous version, start up postgres and everything would be hunky-dory, right? Well... easier said than done.

In this case, the software developer had installed what appears to be a development version of postgresql which (going by the catalog version in the error message) dates from around December 2008 or January 2009. That’s ok, we should just be able to reinstall the previous version from the RPM on the machine, right? Wrong. It didn’t exist.

At this point we did a quick google and checked the postgresql website to see if they perhaps, just maybe, had a copy of this daily development release somewhere. No joy there…

I know! We take backups for any clients who choose to use our managed backup solution, and this client has opted for this service! As part of our managed backups we roll out an automated process to take a dump of all the databases and store it locally on the disk! Given this happens at midnight each night and the database stopped running at 8pm, we’ll just be able to restore from the database dumps, right? Wrong. We didn’t install postgresql, so there was no process in place to dump it.

So at this point, the dataset was still there but effectively useless and mail services were still down. Fortunately, we were able to save the day by restoring all the binary files for this specific version of postgresql from backups and thus restore services for the client. The motivation behind using this specific version is unknown: the software developer has since moved on and there is zero documentation.

This situation really shouldn’t have happened in the first place, yet this type of problem is something that we see more often than you would imagine. We often have developers requesting specific versions of software to use in a production environment. Obviously, we would strongly, strongly discourage the use of development versions within production (they’re called DEVELOPMENT versions for a reason: they simply haven’t been around long enough to be considered stable, reliable software). However, from time to time a specific feature or bug fix within a specific development version dictates that we must install such a version.

This is something we can certainly get working… And, most importantly, we can keep the machine in a maintainable state! This means having supporting documentation as to the decisions made, as well as making sure that routine maintenance tasks will not break the existing, carefully crafted configuration.

I also have another fond memory of a web developer who was having some niggling problems with tomcat and permissions and figured that the best way to solve the problem was using:

chown tomcat / -R

So, it got the web application working, but broke virtually every other service on the machine. Can anyone say hosed file system permissions?
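For what it's worth, the less destructive route is to confine the ownership change to the application's own directory and refuse anything outside it. Here is a minimal sketch in Python, under the assumption that the web application lives under a single directory; chown_targets and the notion of an "allowed root" are hypothetical names for illustration, not anything tomcat provides.

```python
import os


def chown_targets(path, allowed_root):
    """List the paths a recursive chown of `path` would touch.

    Raises ValueError if `path` is not inside `allowed_root`, or if the
    allowed root is the filesystem root itself.
    """
    real = os.path.realpath(path)
    root = os.path.realpath(allowed_root)
    if root == os.sep:
        raise ValueError("refusing to treat / as the application root")
    if os.path.commonpath([real, root]) != root:
        raise ValueError(f"{real} is outside {root}")

    targets = [real]
    for dirpath, dirnames, filenames in os.walk(real):
        for name in dirnames + filenames:
            targets.append(os.path.join(dirpath, name))
    return targets
```

The returned list can be reviewed as a dry run first and only then handed to something like shutil.chown(p, user="tomcat") for each path.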

…Or how about the Windows machine which has 4, yes, 4 separate instances of MSSQL installed on it..  I digress.

Without wanting to turn this into a big marketing spiel, it is important to keep in mind that, like software development, system administration can be a tough game too. Obviously, with hindsight we can easily identify the problems in what was done previously on the machines in the above examples. That said, at Anchor we are a team of system administrators who have been running complex systems for a long time now, and we have the experience to make sure that all the appropriate precautions are taken so we don’t end up in situations like these.

Further to this, we have numerous systems in place to proactively check services, including database servers, 24/7. In the event of failure both audible and visual alerts are generated, with notifications outside of hours being sent via SMS. Had this happened on a fully managed machine it would never have resulted in 14 hours of downtime. All said, I am not just trying to blow our own horn about how fantastically brilliant we are (ok, maybe just a little); what I am trying to get across is that system administration really requires an all-or-nothing attitude. If your website or associated hosting infrastructure is critical to your business’ success, then making sure the commitment to system management is commensurate is absolutely imperative, either through outsourcing via our fully managed support pack or by hiring a dedicated system administrator. There really is no place for laissez-faire, and utilising a software developer part-time for this role is only likely to cost more in the longer term.

Automated server updates

Wednesday, March 10th, 2010

This is going to be a contentious one, but here at Anchor we think automatically applying updates to servers is a Good Thing. It’s definitely not for everyone, but in an environment like ours with hundreds of managed servers it’s the only way you’re going to get things done and get any sleep at night.

Sysadmin of note Tom Limoncelli advocates testing updates first and then rolling them out to progressively more machines to limit the scope of potential problems (it’s called “one, some, many”). It’s certainly a good strategy for a large number of homogeneous computers, but what we’re talking about here is a bit smaller-scale.
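To make the “one, some, many” idea concrete, here is a rough sketch in Python of splitting a host list into the three stages. The function name, the hostnames and the size of the “some” group are all assumptions for illustration; this is not a description of Limoncelli's or anyone else's actual tooling.

```python
def one_some_many(hosts, some=5):
    """Split hosts into rollout stages: one machine, a few more, then the rest.

    `some` is the (arbitrary) size of the middle stage. Empty stages are
    dropped, so a short host list simply yields fewer stages.
    """
    hosts = list(hosts)
    stages = [hosts[:1], hosts[1:1 + some], hosts[1 + some:]]
    return [stage for stage in stages if stage]


if __name__ == "__main__":
    # Hypothetical hostnames, purely for illustration.
    fleet = [f"web{n:02d}" for n in range(1, 13)]
    for number, stage in enumerate(one_some_many(fleet, some=3), start=1):
        print(f"stage {number}: {', '.join(stage)}")
```

The point of the staging is that an update only moves to the next, larger group once the previous group has been checked and found healthy.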

Specifically, we have customers with servers that we never touch; we call this Anchor Monitor. These customers often have particular environments that they’re better off supporting themselves, so we monitor the machine to ensure it’s still on the network, and leave it at that. Unfortunately these machines aren’t always kept up to date, so one of the more recent improvements to our process has been to enable automatic updating by default; it’s up to the customer if they want to change this once it’s handed over to them.

We’ve written this into a short procedure if you’re interested. It applies directly to Debian and Redhat distributions, but it’s easily portable to other systems. If you run Windows, it’ll already be hassling you every 20min for updates. :)

ERROR: SSH agent has too many keys

Wednesday, December 23rd, 2009

Unfortunately, SSH doesn’t produce this error, although it darn well should…

I just had a Github customer report that they couldn’t access their repos via SSH, despite it all working properly yesterday, and “not having changed anything”. A bit of debug logging and an inspired leap of intuition on the part of another sysadmin in the office, and the answer was quickly found.

First off, the symptoms:

  • Debug logging showed that the user was connecting successfully, presenting six SSH keys (none of which were the key of interest) before disconnecting;
  • The SSH key was in the user’s SSH agent (you can verify this with a quick ssh-add -l);
  • There were more than six keys in the SSH agent

This last symptom is the key point. As an anti-brute-force measure (I assume), SSH won’t allow a user to connect and present more than MaxAuthTries credentials (whether they be passwords or keys) before being forcibly disconnected. The default value for this parameter (if you haven’t realised already) is six.

Whilst this makes a lot of sense for passwords (and a lesser, but still valid, measure for keys) it does mean that you effectively have a hard limit of six keys in your agent simultaneously (at least without using SSH configs to specify a single key to present to the server). Any more than six keys, and you run the very real risk that the key you need to give to a particular server will be number seven in your agent, and all your authentications will fail miserably.

Bumping the value of MaxAuthTries to a much larger value works fine for Github — password auth is disabled, and if you can manage to brute force a key you’re welcome to what you can get — but you certainly can’t rely on inflating MaxAuthTries everywhere to get you out of trouble, so: keep those SSH agents lean, or at least specify IdentityFile for all your servers.
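The failure mode can be captured in a very small model. This Python sketch assumes, as the debug logging above suggested, that the client offers agent keys in order and the server gives up after MaxAuthTries attempts; key_gets_offered and the key names are hypothetical and exist only for illustration.

```python
DEFAULT_MAX_AUTH_TRIES = 6  # the default mentioned above


def key_gets_offered(agent_keys, wanted_key, max_auth_tries=DEFAULT_MAX_AUTH_TRIES):
    """Return True if wanted_key is among the keys tried before disconnection.

    Simplified model: keys are offered in agent order and only the first
    `max_auth_tries` of them ever reach the server.
    """
    return wanted_key in agent_keys[:max_auth_tries]


if __name__ == "__main__":
    # Seven hypothetical keys loaded in the agent; the one we need is last.
    agent = [f"key{n}" for n in range(1, 8)]
    print(key_gets_offered(agent, "key6"))  # within the limit
    print(key_gets_offered(agent, "key7"))  # never gets presented
```

In this model the seventh key never reaches the server, which is exactly the "it worked yesterday" symptom once one more key is added to the agent.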

Monitor your servers like it’s 1996

Thursday, December 3rd, 2009

Whilst I’m a fan of using percentages for my disk space checks, sometimes an explicit size is more appropriate. So, you’d expect the following to work nicely:

$USER1$/check_disk -w 5G -c 1G -p /data/foo

If you don’t actually test that this works (by artificially filling your disk and seeing what happens), you may be dismayed to find that you only get alerted when the disk has 5MB of free disk space. Why is this?

Because Nagios, despite the fact that nobody has sweated the megabytes for about a gazillion years, doesn’t support ‘G’ as a suffix for thresholds. Oh, it’ll make a good show of pretending — after all, the output formatting options have ‘GB’ as an option — but nope, for your thresholds it’s “5000M” all the way.

ROCK ON!
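One way to keep writing thresholds in sensible units is to convert them before they reach the check. Below is a minimal sketch in Python; threshold_in_mb and check_disk_args are hypothetical helpers, and the conversion assumes 1G = 1000M to match the "5000M" figure above (use 1024 if you prefer binary multiples).

```python
# Multipliers to megabytes. Assumption: decimal steps, to match "5000M" above.
UNIT_TO_MB = {"M": 1, "G": 1000, "T": 1000 * 1000}


def threshold_in_mb(size):
    """Convert a size such as '5G' into a whole number of megabytes."""
    size = size.strip().upper()
    number, unit = size[:-1], size[-1:]
    if unit not in UNIT_TO_MB:
        raise ValueError(f"unsupported size suffix in {size!r}")
    return int(float(number) * UNIT_TO_MB[unit])


def check_disk_args(warn, crit, path):
    """Build check_disk arguments with thresholds given as plain megabytes."""
    return [
        "-w", str(threshold_in_mb(warn)),
        "-c", str(threshold_in_mb(crit)),
        "-p", path,
    ]


if __name__ == "__main__":
    print(" ".join(["check_disk"] + check_disk_args("5G", "1G", "/data/foo")))
```

Whatever you generate, it is still worth filling a test disk and confirming the alert fires where you expect.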

Industry Analysts: Putting the “arse” in Analyst

Tuesday, November 24th, 2009

I’ve never been a real fan of the output of big “industry analysis” firms, since their reports never seemed to really tell the whole story, and didn’t match up with my experiences “in the trenches”. Now I know why. A representative sample:

“I see. So, the companies in your magic quadrant, are they all paying clients of yours?”

“Well, yes they are,” He said, proudly.

“Well, if they are all paying clients, then what’s so ‘magic’ about being in the quadrant?”

“The companies are not all rated at the same level, some are rated much higher than others.”

“And should I be surprised to hear that the companies that pay you more so you can afford to have entire teams cover them full-time; you tend to know a lot about, and they tend to get better ratings?”

No answer.

“Maybe you should stop calling it the ‘Magic Quadrant’ and call it what it really is; perhaps ‘The Quadrant of Companies That Can Afford To Be In It’.”

Go read the whole article, though, it’s pure gold.

I always knew webmin was arse, but this…

Wednesday, November 18th, 2009

This is the output of iptables -L on a webmin-managed box I just saw:

Chain INPUT (policy ACCEPT)
target     prot opt source               destination
ACCEPT     all  --  anywhere             anywhere
ACCEPT     all  --  anywhere             anywhere
ACCEPT     tcp  --  anywhere             anywhere            tcp flags:ACK/ACK
ACCEPT     all  --  anywhere             anywhere            state ESTABLISHED
ACCEPT     all  --  anywhere             anywhere            state RELATED
ACCEPT     udp  --  anywhere             anywhere            udp spt:domain dpts:1024:65535
ACCEPT     icmp --  anywhere             anywhere            icmp any
ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:ftp
ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:ssh
ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:smtp
ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:domain
ACCEPT     udp  --  anywhere             anywhere            udp dpt:domain
ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:http
ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:pop3
ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:imap
ACCEPT     udp  --  anywhere             anywhere            udp dpt:imap
ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:https
ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:mysql
ACCEPT     tcp  --  anywhere             anywhere            tcp spt:mysql
ACCEPT     tcp  --  anywhere             anywhere            tcp dpts:terabase:samsung-unidex
ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:ndmp
ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:dnp
LOG        all  --  anywhere             anywhere            LOG level debug prefix `DROPPED = '
ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:ftp-data 

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination         

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination
ACCEPT     tcp  --  anywhere             anywhere            tcp spt:ftp-data dpt:ftp-data
ACCEPT     tcp  --  anywhere             anywhere            tcp spt:ftp dpt:ftp

Lovely that it has all those ports and whatnot opened up, but what’s with the ACCEPT policies?

Webmin: Now with FAILWALL management!

I should have been in marketing.
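Spotting this sort of thing doesn't have to rely on eyeballing the listing. Here is a minimal sketch in Python that reads iptables -L output and reports any built-in chain whose policy is ACCEPT; chain_policies and permissive_chains are illustrative names, and the parsing assumes the "Chain NAME (policy X)" header format shown above.

```python
import re

# Header lines look like: "Chain INPUT (policy ACCEPT)"
CHAIN_RE = re.compile(r"^Chain (\S+) \(policy (\w+)")


def chain_policies(listing):
    """Map each built-in chain name to its default policy."""
    policies = {}
    for line in listing.splitlines():
        match = CHAIN_RE.match(line)
        if match:
            policies[match.group(1)] = match.group(2)
    return policies


def permissive_chains(listing):
    """Return the sorted names of chains whose default policy is ACCEPT."""
    return sorted(
        chain
        for chain, policy in chain_policies(listing).items()
        if policy == "ACCEPT"
    )


if __name__ == "__main__":
    sample = (
        "Chain INPUT (policy ACCEPT)\n"
        "target     prot opt source               destination\n"
        "ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:ssh\n"
        "\n"
        "Chain FORWARD (policy ACCEPT)\n"
        "target     prot opt source               destination\n"
    )
    print(permissive_chains(sample))
```

With a default policy of ACCEPT, every rule in the list above is decoration: anything that matches no rule gets let in anyway.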

Anchr 2.0

Wednesday, November 4th, 2009
We heartily endorse this event or product!


Anchr 2.0 makes you want to reach out and touch it; hold it; feel it. Your Anchr 2.0 pulsates with a reassuring rhythm, like that of a heart, but made of silicone instead of striated cardiac muscle.

Anchr 2.0 responds.. it is alive. If you listen carefully you can hear its machinations, at speeds beyond the limits of human ken. Don’t Panic – this is normal, but a helpful voice is always close by when you need it.

Anchr 2.0 is not made, but created. Observe its perfect finish and seamless form. The dull blue glow of security, punctuated by the cerise of backups. Anchr 2.0 fits snugly in the hands. Firm, but also yielding, you cannot discern the boundary; that is the sensation of redundancy. It is comforting.

Anchr 2.0 is communal, it is shared. But! A duality of nature: There is one, but there are also many. That is your Anchr 2.0; there are many like it, but that one is yours.

Anchr 2.0 is… everything you love about webhosting, with less crap

Envy our new Leviathan!

Monday, October 19th, 2009

Our current rdiff and amanda backup server, KRAKEN, is almost full, so it was time to order a new one. After much wrangling, we finally received LEVIATHAN this morning.

LEVIATHAN is, I assure you, teh hardk0rez - dual xeon 5500-series, 6gb RAM and 12TB usable storage in RAID-10


I was pushing for PHYREXIAN DREADNOUGHT personally, but LEVIATHAN is acceptable too; the upkeep effort of backup servers is pretty high after all.

New dedicated server upgrade offering

Saturday, October 10th, 2009

This is, of course, a fantastic idea:
http://en.gentoo-wiki.com/wiki/Using_Graphics_Card_Memory_as_Swap

Anchor loves to stay abreast of the latest performance options. As such, we’re proud to announce a new range of upgrade options for our dedicated server customers that demand the absolute best in performance for their customers.

It makes sense, really. The best our current systems offer is puny DDR2 memory. Just think of what you could do with several gig of GDDR5. That’s right, FIVE! We’re now offering upgrade options with Geforce 320 and Geforce 340 cards. If you order one of our higher-specced (2RU) dedicated servers, you can have two of these puppies strapped together for insane amounts of swappiness.

Stay tuned for more news on how we’re rolling out ButterFS, phase-change cooling, overvolted Core2 Quad servers, and mass-scale SSD RAID-0 arrays for database optimisation.

Interesting failure modes, episode 2501

Monday, October 5th, 2009

I got woken up by an SMS for low diskspace the other night on one of our customer’s servers. Okay, so that’s a lie, I never sleep, but the SMS is real.

Oh great, they’re making whoopie on their mailing lists again and making some stupidly huge logfile.

Little did I know just how huge that file was. How about 735GB huge, in the space of 12 hours? This customer is already a bit of an oddball, what with 1.4TiB of usable space in their server. “Oh that’s nothing”, you say. Sure, I’ve got a few TiB of kitten pictures on my machine at home, just like you, but to put things in perspective: 300GiB of space would be “big” for most Anchor customers. SCSI disks cost about $1.70/GB, compared to about 10c/GB for SATA.

There was no mailout. No big processing job, and no flood of activity. With a little digging I was able to nail it down to an apache errorlog file. That was a surprise, except for the PHP errors all throughout – some things never change.

[Fri Oct 02 02:39:57 2009] [error] [client 63.82.71.139] PHP Warning: fgets(): supplied argument is not a valid stream resource in /home/wright/public_html/script.php on line 15, referer: XXX

Nice work there, guys. You need to learn to check your return values from failure-prone functions.

Strangely, there were no actual active connections, but the process list showed two apache processes going balls to the wall, writing the same error message to the log file ad infinitum. By my reckoning that was over 9000 lines per second – nothing a quick service-restart couldn’t fix, thankfully.
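For the curious, a back-of-the-envelope check of that rate is easy enough. This Python sketch takes the 735GB and 12 hours from above as decimal gigabytes, which works out to roughly 17 MB/s of sustained writing; the 200 bytes per log line is purely an assumption (the referer in the sample line is redacted, so the real length is unknown).

```python
def growth_rate_bytes_per_s(size_bytes, hours):
    """Average write rate needed to produce size_bytes in the given hours."""
    return size_bytes / (hours * 3600)


def lines_per_second(size_bytes, hours, bytes_per_line):
    """Average number of log lines per second, for an assumed line length."""
    return growth_rate_bytes_per_s(size_bytes, hours) / bytes_per_line


if __name__ == "__main__":
    log_size = 735 * 10**9       # 735GB, read as decimal gigabytes
    window_hours = 12
    assumed_line_bytes = 200     # assumption: real line length is unknown

    rate = growth_rate_bytes_per_s(log_size, window_hours)
    print(f"{rate / 10**6:.1f} MB/s")
    print(f"{lines_per_second(log_size, window_hours, assumed_line_bytes):.0f} lines/s")
```

Under those assumptions the figure comes out in the tens of thousands of lines per second, so "over 9000" is, if anything, being generous to the script.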

And to actually fix the problem? It’s tempting to dump the file, but we don’t like doing that; it’s just a bit too cowboy for us. I settled for a forced logrotate run, taking about 4hrs and squishing it down to just 4.3GiB – Crisis (and sleep) Averted.
