Exciting news from LCA miniconfs

Published January 17th, 2012 by Barney Desmond

Florian Haas gave a talk yesterday at the HA miniconf to present Flashcache, a project that was spawned from Facebook and their desire to squeeze more performance out of their databases.

The basic concept is to use any SSD device as a cache in front of slower rotational media. This is similar to commercial products such as LSI’s CacheCade, but implemented as a Linux device-mapper module (so you wouldn’t be able to boot from such a setup, but that’s unlikely to be a real concern).

One of the nice things about Flashcache is that it’s presented as a plain block device. As well as making for a robust and understandable system, a practical upshot of this is that you can also replicate your cache with DRBD. In large HA database setups, this would mitigate a lot of the cache warmup penalty that you suffer after a reboot or failover event.

Flashcache is also fairly configurable, and exposes a lot of stuff through procfs rather than being a black box.

At the moment you have to build it as an out-of-tree module, so of course it’s not the kind of thing we’ll be rushing into production any time soon. Based on what we’ve seen in the past, I reckon there’s a good chance we’ll see Flashcache in mainline in a year or two if there’s a concerted push on development.


It came from beneath the raised floor

Published January 17th, 2012 by Barney Desmond

Yes, it’s another post about datacentre horrors. I know what you’re thinking: “Yeah yeah, I’ve seen the one about the cabling”.

Yeah well I used to be a datacentre technician like you, then I took a PCI-slot shiv in the knee.

(Edit: Hrm, it looks like the owner nuked the gallery but the files still exist. You can try the Google cache, or a copy that was nabbed.)


Ya gotta admire the chutzpah…

Published December 17th, 2011 by Davy Jones

It’s no secret that here at Anchor, we’re not huge fans of the level of support you get from most commercial software vendors. But a recent incident with a certain vendor of crappy hosting management control panels really took the cake…

It all began, as these things do, on a sunny spring morn. The ticket came in, saying “the control panel says our licence is invalid or expired, even though we paid for a new licence a couple of months ago”. As this tends to cause customer-facing outages, it was a fairly important problem that needed fixing.

(Sidenote: Is it really such a clever idea to run a piece of software that has a feature that is deliberately designed and intended to stop the software from working at the deranged whim of the monkeys who sold it to you? I think not.)

Digging into the problem, we could find no obvious cause of the fault — firewall open, packets flowing, manual renewal of the licence via the little button in the web UI seemed to work… all very strange.

As the problem adversely affected the customer’s ability to continue to provide money to the vendor, we thought the vendor might be somewhat keen to help rectify the problem, so as to ensure the ongoing supply of said money. So, we contacted their support department.

“You don’t have an extended super-duper-bend-over-and-take-it support plan; please pay us $90”, replied the support department, with nary a “how do you do” to soften the blow.

“But wait, we’re trying to ensure the customer can continue to pay you money!”, we replied, on the assumption that the support drone on the other end of the e-mail program was just functionally illiterate (isn’t it great, the level of service you get for your money?).

“We know. We don’t care. Pay up.” was the curt reply.

Well, doesn’t that just obtain the baked goods. In order to get assistance with paying them money, the vendor wants us to pay yet more money. The logic defies all attempts at analysis or explanation.

Posted in WTF

Shining some sun on the cloud

Published November 5th, 2011 by matt

Being the cynical, hard-bitten sysadmins that we are, we’re a bit skeptical about some of the more grandiose claims about cloud computing: 100% uptime, never having to worry about scalability, and all those other things that people who don’t understand reality seem to get terribly excited about.

It’s good to see every now and then that someone else has an experience that matches our own, such as Mixpanel’s decision to move off Rackspace’s cloud and onto dedicated servers. I’d love to know how to negotiate 50%-75% off a vendor’s list price, though…

Posted in WTF

cPanel University?! You’re doing it all wrong!

Published October 21st, 2011 by Keiran Holloway

Today I was minding my own business at my desk when I stumbled upon university.cpanel.net, a site which allows you to obtain “industry certification” for the cPanel Web Hosting Manager.

The first thing I did was check the date: it’s not April 1st. So I sat there stunned for a minute or two, wondering if I should laugh or cry.

Upon further inspection, it actually seemed to be true. You can now go and do an online course and become a certified cPanel technician!

For anyone who has done business with us in the past, we don’t make much of a secret of the fact that we don’t think much of control panels such as cPanel or Plesk. In fact, we’ve quite openly published our thoughts on this in the past.

That said, trying to think about this a little, I’ve got to ask myself the question: “If you’re building a web-based interface which is designed to allow end-users to control their web-hosting service, then surely expecting certification is doing it all wrong?”

Digging further, the actual value of this certification turned out to be somewhat questionable:

- The first level testing consists of a total of 18 questions, takes 15 minutes and you need to get 15 of the questions right.
- You can continue to re-take exams if you fail
- They can’t actually supply any technical theory or text books
- The advanced levels of training require you to be proficient in Perl. Surely if you need to use a programming language to configure your “easy-to-use” control panel, you’ve pretty much missed the point.

As we’ve discussed in the past, cPanel drastically lowers the barrier to entry for becoming a hosting provider. It allows people who would otherwise be neither capable nor qualified to run a fully fledged hosting company and hide behind the pretty exterior of the cPanel user interface. This is scary. Why? Some of the approaches and methods used by cPanel are highly questionable.

Some of the observations which we’ve made include:

- Installing cPanel is like a Unix security evolutionary throwback. A newly built machine had an extra 12 processes running as the root user.
- The security history is so poor that it has a “Scan for Trojan Horses” dialog page.
- There is no inbuilt firewall management utility, yet it is quite keen to change handcrafted firewall rules.
- MySQL is compiled without SSL support
- The update dialog page makes people choose between 4 different update sources, instead of just one which works.
- http is run as the nobody user
- It entirely ignores the Filesystem Hierarchy Standard and stores most files under /usr/local/x
- If you want to add an SSL certificate for a subdomain that isn’t configured, when you paste the certificate file in, cpanel will successfully parse the cert, extract the correct CN, and map it to the correct user. But when you then paste the key and submit, it’ll bomb saying the CN doesn’t exist. If it doesn’t exist, how did you manage to find a user???
- It actually comes with /scripts/fix_common_problems

Having courses which explicitly train people up to this level and little further is, to my mind, a grave mistake. It suggests that anyone can spend some coin on an online test and become proficient enough to run an entire web hosting company. Speaking as someone with 7 years’ experience in this industry, providing web hosting services is more complex than simply doing a handful of online tests and installing some random piece of software; doing it well requires the backing of an intelligent, experienced and knowledgeable team of system administrators. Thinking that any piece of software can replace this is not only naive, but a school of thought which risks bringing the web-hosting industry, as a whole, into disrepute.

Posted in WTF

Happy International SUIT UP Day!

Published October 13th, 2011 by Keiran Holloway

For one day of the year, all the Anchorites put away the board shorts and flip-flops, to celebrate in style with Cheap Suits and Expensive Scotch!

In honour of Barney Stinson from How I Met Your Mother, a whole bunch of sysadmins at Anchor went to great lengths to wear a suit to work today! (A certain employee was caught off-guard yesterday and had to purchase a new one)

We know it’s real, because we saw it on the internet

The Anchor team all schmicked up!

Posted in FTW, WTF

Why PostgreSQL?

Published October 2nd, 2011 by matt

In our never-ending pursuit to bring the Joy of PostgreSQL to the entire world, I’d like to recommend this blog post which summarises a lot of PostgreSQL’s useful features. Consider it some nice, light Sunday reading.

Posted in WTF

“dmesg is like a standup comedy routine”

Published August 3rd, 2011 by Barney Desmond

It happens to the best of us. Like the coyote running headlong off the cliff into open space, you blink a few times, feeling around for a handhold. You’ve just run what should’ve been a pretty safe command, but something’s just not quite right…

And so it was the case for me the other day. In prep for migration of an old desktop-chassis server to a KVM guest, I was zeroing out some blocks for the filesystem on the destination.

dd bs=1M count=1 if=/dev/zero of=/dev/nemo/root

I should note that this really isn’t a good idea when your big shiny VM server is named “nemo”. Barely seconds later, the ever-vigilant nagios was reporting that nemo’s disk was full, and that there were only “-” inodes left. nemo was more sure of itself, with all 64Z of the diskspace used. At least there was apparently 8.4 millibytes left, room to swing a lolcat.
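With hindsight, a small guard in front of that dd would have refused to touch a device that was already in use. The following is only a sketch: `is_mounted` and `guarded_zero` are made-up helper names, and the check only catches devices that show up as a mount source in /proc/mounts (it knows nothing about swap, LVM physical volumes, or old kernels that list the root filesystem as /dev/root).

```shell
#!/bin/sh
# Sketch only: is_mounted and guarded_zero are hypothetical helper names.

# is_mounted DEVICE [MOUNTS_FILE]
# Succeeds if DEVICE, after resolving symlinks, is the source of any entry
# in the mounts table (defaults to /proc/mounts).
is_mounted() {
    target=$(readlink -f "$1") || return 2
    mounts=${2:-/proc/mounts}
    while read -r src rest; do
        case $src in
            /*) resolved=$(readlink -f "$src" 2>/dev/null) || resolved=$src ;;
            *)  continue ;;   # proc, sysfs, tmpfs and friends are not device paths
        esac
        if [ "$resolved" = "$target" ]; then
            return 0
        fi
    done < "$mounts"
    return 1
}

# guarded_zero DEVICE [MOUNTS_FILE]
# Zero the first MiB of DEVICE, but only if nothing has it mounted.
guarded_zero() {
    status=0
    is_mounted "$1" "${2:-/proc/mounts}" || status=$?
    case $status in
        0) echo "refusing to write to $1: it is mounted" >&2; return 1 ;;
        1) ;;   # not mounted, carry on
        *) echo "cannot resolve $1" >&2; return 2 ;;
    esac
    dd bs=1M count=1 if=/dev/zero of="$1"
}
```

Both sides of the comparison are resolved with `readlink -f` because LVM volumes are usually reachable under more than one name (/dev/nemo/root and /dev/mapper/nemo-root, for instance).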

I panicked a bit and had to consciously unclench my jaw muscles. Then I panicked some more. Linux, the tough bastard that it is, was running fine for the most part and colleagues wasted no time SSHing in.

“omg wtf”

“dude what did you do??”

“lol! dmesg is like a standup comedy routine”

The VM guests were ticking over just fine, but nemo was rather unimpressed with this incursion and wouldn’t stand for it one bit.

EXT3-fs error (device dm-0) in ext3_reserve_inode_write: IO failure; Aborting journal on device dm-0.
EXT3-fs error (device dm-0) in ext3_dirty_inode: IO failure; ext3_abort called.
EXT3-fs error (device dm-0): ext3_journal_start_sb: Detected aborted journal; Remounting filesystem read-only
EXT3-fs error (device dm-0): ext3_readdir: bad entry in directory #2457831: inode out of bounds – offset=0, inode=2457831, rec_len=12, name_len=1
EXT3-fs error (device dm-0): ext3_readdir: bad entry in directory #2457635: inode out of bounds – offset=0, inode=2457635, rec_len=12, name_len=1

It was undeniably bad, but not beyond repair. Just how bad could it be..? The guests are okay, but we need this fixed soon. We thought quickly. Solutions came, but they were all long shots. It’d be S-rank ninjawork if you could pull it off, but it’d be ugly.

  • Attempt to create an identical filesystem, with the same parameters, and steal the first meg? You’ll miss some data, but a fsck might get it consistent again?
    No, too unlikely to work and dirty as all foo.
  • Ooh, mkfs.ext3 is deterministic. You can calculate where the backup superblocks are on the disk and copy one of them out. You’ll probably lose a block group but it’s something to work with, then fsck to tidy up?
    That sounds pretty awful, and will fail if the volume has been resized.
  • If you could just dig the superblock out of memory…
    No!
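For the curious, the "calculate where the backup superblocks are" idea can be sketched in a few lines of shell. This assumes a filesystem made with the usual defaults: the sparse_super feature on (backups live in group 1 and in groups that are powers of 3, 5 and 7), 8 × block-size blocks per group, and no resize since creation. The function names are made up for this example, and the real mke2fs may drop a very small final group, so treat the output as a starting point. Running `mke2fs -n` with the original parameters should report the same list without writing anything.

```shell
#!/bin/sh
# Sketch only: hypothetical helpers, assuming default mke2fs geometry.

# is_power_of N BASE: succeeds if N == BASE^k for some k >= 0 (N must be >= 1).
is_power_of() {
    n=$1
    while [ $((n % $2)) -eq 0 ]; do
        n=$((n / $2))
    done
    [ "$n" -eq 1 ]
}

# backup_superblocks TOTAL_BLOCKS [BLOCK_SIZE]
# Print the block number of each backup superblock, one per line.
backup_superblocks() {
    total=$1
    bs=${2:-4096}
    bpg=$((bs * 8))                 # default blocks per group
    if [ "$bs" -eq 1024 ]; then     # 1k-block filesystems start at block 1
        fdb=1
    else
        fdb=0
    fi
    groups=$(( (total - fdb + bpg - 1) / bpg ))
    g=1
    while [ "$g" -lt "$groups" ]; do
        if is_power_of "$g" 3 || is_power_of "$g" 5 || is_power_of "$g" 7; then
            echo $((g * bpg + fdb))
        fi
        g=$((g + 1))
    done
}

# A 6 GiB volume with 4k blocks has 1572864 blocks:
backup_superblocks 1572864
```

Any of the resulting block numbers can be handed to `e2fsck -b` to check the filesystem against a backup superblock instead of the primary one.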

We eventually settled on a restore from backup of the root filesystem, while everything was live. It goes a little something like this:

  1. The machine’s got 64gb of RAM, might as well use it. Luckily our tools are okay and bits of the filesystem structure are cached. Start by making /tmp pretty huge. We could’ve tried to carve out a new LV, but the LVM tools were crapping out because the filesystem was now read-only.
    mount -o remount,size=12000m /tmp
  2. We need a filesystem to restore to, there’s only a couple gig of data. A loopback-mounted file will do the job. Have to use dd to create it though >_<
    dd bs=1G count=0 seek=10 if=/dev/null of=/tmp/recovery_fs
    mkfs.ext3 /tmp/recovery_fs  # this bit is super fast :)
    mkdir /tmp/recovery_target
    mount /tmp/recovery_fs /tmp/recovery_target
  3. Now we go ahead with the restore. You do take backups of all your data, don’t you?
    One point to note here is that we skip mountpoints like /proc, /sys, /dev, /tmp, etc. from our backups. After the restore, you have to create them on the target, otherwise you might have a rough time booting up again.
  4. We’re just about done, but the filesystem is too big for the destination volume. We’re going to shrink it down, which necessitates a little more work.
    umount /tmp/recovery_target
    e2fsck -C0 -f /tmp/recovery_fs  # the filesystem *must* be fsck'd before shrinking
    resize2fs /tmp/recovery_fs 5G   # the destination is 6gb, so we're being very generous on sizes
  5. And now the master-stroke, splat this new, known-good filesystem straight over the one I touched inappropriately earlier.
    dd bs=1G if=/tmp/recovery_fs of=/dev/nemo/root
    5+0 records in
    5+0 records out
    5368709120 bytes (5.4 GB) copied, 3.97025 seconds, 1.4 GB/s
  6. I wish we could get disk performance like that all the time. Finally, remount the filesystem and see what happens…
    mount -o remount,ro /dev/nemo/root
  7. Now the system was well and truly confused, which we’d anticipated. Not content to unceremoniously damage the root filesystem, we’d just completely pulled the rug out from underneath it (and replaced it with a fine Persian carpet!)
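The "create them on the target" part of step 3 is easy to forget, so here is a rough sketch of it. `create_mountpoints` is a made-up name, and the list only covers the four mountpoints named above; whatever else your backups skip would need adding. It has to run while the recovery filesystem is still mounted, i.e. before the umount in step 4.

```shell
#!/bin/sh
# Sketch only: create_mountpoints is a hypothetical helper.

# create_mountpoints TARGET_ROOT
# Recreate the empty directories that the backups deliberately skipped.
create_mountpoints() {
    root=$1
    for d in proc sys dev tmp; do
        mkdir -p "$root/$d" || return 1
        chmod 0755 "$root/$d" || return 1
    done
    # /tmp is world-writable with the sticky bit set
    chmod 1777 "$root/tmp"
}
```

In the procedure above this would be called as `create_mountpoints /tmp/recovery_target`.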

Well, the VM guests weren’t quite in production yet… Really, a reboot was going to be the cleanest way of getting all this sorted, and it only takes like five minutes.

With a well-timed kick in the still-blissfully-unaware VMs my colleagues brought them to a quiescent state as I frobbed the virtual power switch to trigger a reboot (remote server management cards are the greatest thing since Netscape’s fishcam). As awesomely as that turned out, I’m not looking forward to doing it again.


Stay-at-home servers

Published November 2nd, 2010 by Barney Desmond

So this is a bit old, but it’s been kicking around my bag of links for a little while. Who said Microsoft didn’t know how to have a little fun?

You can even get a deadtree copy from Amazon. If that doesn’t do it for you, there are apparently PDFs to be had as well.


Why web developers don’t make good system administrators

Published October 15th, 2010 by Keiran Holloway

Straight off the bat I would like to make something clear: I have a lot of respect for software and web developers. Being able to write clean, intelligent and efficient code is certainly one of the more difficult aspects of this industry. With this in mind, I think that anyone who is able to consistently write high-quality code based on often sketchy requirements, and deliver it within the usual time pressures of business, should be awarded some kind of medal.

That said, I can say with some confidence that we have the pleasure of working with some of the very best software and web developers both locally here in Australia as well as abroad.

Further to this, I can also add quite unreservedly that software developers really don’t make good system administrators. And can you really blame them?

Allow me to elaborate a little bit here. As you may have already guessed from the above few paragraphs, software development is tough. Being a good software developer is even tougher. Under the pretty exterior of most websites there is an awful lot of work that goes into making the sites work. Pulling this together requires a fair amount of consideration throughout all aspects of the software development process, from getting requirements and designing the application through to writing the code, testing, debugging and forever trying to squash that final elusive bug. It takes someone with a fairly specific skill-set to be able to do all this and to do it well.

Something that I’ve noticed, however, is that software developers are sometimes expected to take on the role of server management and look after the ongoing running and maintenance of the machine. Whilst I can appreciate there’s a similarity between what a software developer and a system administrator do (“hey, they both do ‘computer stuff’”), the tasks completed by each role are worlds apart. A software developer really only cares about getting his or her application working within a specific environment the quickest way possible. This can sometimes mean that there are some rather drastic changes to the machine configuration with little consideration of the potentially negative implications. This is pretty understandable: as far as they’re concerned, once they get the environment working with their application they can just continue hacking away on their code. They are probably under other tight deadlines, or would simply prefer to get on with what they’re actually being paid to do, so the longevity and maintainability of the operating system environment gets little consideration.

This is something we see a lot of; from developers downloading source tarballs then compiling and installing software system-wide to running bleeding edge versions of software which just aren’t suited to being in production.

To give an example of a recent incident which prompted this post, we had a client call up complaining that they couldn’t get their PostgreSQL database to start. Whilst this was not on our fully managed service, we are always willing to help out our clients on a professional consulting basis. Upon logging in we attempted to start PostgreSQL and witnessed it failing without too many clues as to what was going on. Further investigation revealed the following in the PostgreSQL startup logs:

FATAL:  database files are incompatible with server
DETAIL:  The database cluster was initialized with CATALOG_VERSION_NO 200812281, but the server was compiled with CATALOG_VERSION_NO 200904091.

Further digging revealed that PostgreSQL had recently been updated: 14 hours earlier, to be precise. Subsequent to this the database engine had been stopped and then failed to start again. The client in question actually uses this machine as a mail exchange for his clients and uses a PostgreSQL back-end to manage the mail tables. This means that for the duration of the outage, no email was working for any of the clients on the machine. Yes, for 14 hours. Ouch.
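As an aside, the two numbers in that error message are not random. PostgreSQL’s CATALOG_VERSION_NO is written as yyyymmddN (a date plus a one-digit revision), so the message itself tells you roughly how old each side is. Here is a small sketch for pulling the numbers out of a log and decoding them; both function names are made up for this example.

```shell
#!/bin/sh
# Sketch only: both function names are hypothetical.

# catversions_from_log: read log text on stdin, print each catalog version found.
catversions_from_log() {
    grep -o 'CATALOG_VERSION_NO [0-9]*' | awk '{ print $2 }'
}

# decode_catversion NUMBER: split a yyyymmddN value into a date and a revision.
decode_catversion() {
    case $1 in
        [0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]) ;;
        *) echo "unexpected catalog version format: $1" >&2; return 1 ;;
    esac
    echo "$1" | sed 's/^\(....\)\(..\)\(..\)\(.\)$/\1-\2-\3 rev \4/'
}
```

Fed the DETAIL line above, this gives 2008-12-28 rev 1 for the data directory and 2009-04-09 rev 1 for the server binaries. For a data directory that isn’t failing to start, `pg_controldata` also prints a "Catalog version number" line, which is worth noting down before anyone upgrades packages.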

Once we had found the problem, all we needed to do was roll back to the previous version, start up PostgreSQL, and everything would be hunky-dory, right? Well... easier said than done.

In this case, the software developer had installed what appears to be a development version of PostgreSQL. Going by the date-stamped catalog version in the error message (200812281), it was built from sources dated 28 December 2008 or shortly after. That’s ok, we should just be able to reinstall the previous version from the RPM on the machine, right? Wrong. Didn’t exist.

At this point we did a quick Google search and checked the PostgreSQL website to see if they perhaps, just maybe, had a copy of this daily development release somewhere. No joy there…

I know! We take backups for any clients who choose to use our managed backup solution, and this client has opted for this service! As part of our managed backups we roll out an automated process to take a dump of all the databases and store it locally on the disk! Given this happens at midnight each night and the database stopped running at 8pm, we’ll just be able to restore from the database dumps, right? Wrong. We didn’t install PostgreSQL and there was no process in place to do this.

So at this point, the dataset was still there but effectively useless and mail services were still down. Fortunately, we were able to save the day by restoring all the binary files for this specific version of PostgreSQL from backups and thus restore services for the client. The motivation behind using this specific version is unknown: the software developer has since moved on and there is zero documentation. This situation really shouldn’t have happened in the first place.

This type of problem is actually something that we see more often than you would imagine. We often have developers requesting specific versions of software to use in a production environment. Obviously, we would strongly, strongly discourage the use of development versions within production (they’re called DEVELOPMENT versions for a reason: they simply haven’t been around long enough to be considered stable, reliable software). However, from time to time a specific feature or bug fix within a specific development version dictates that we must install such a version. This is something we can certainly get working… and, most importantly, keep the machine in a maintainable state! This means having supporting documentation as to the decisions made as well as making sure that routine maintenance tasks will not break the existing, carefully crafted configuration.

I also have another fond memory of a web developer who was having some niggling problems with tomcat and permissions and figured that the best way to solve the problem was using:

chown tomcat / -R

So, it got the web application working, but broke virtually every other service on the machine. Can anyone say hosed file system permissions?
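If a recursive chown really is the fix, it pays to wrap it in something that refuses obviously catastrophic targets. A minimal sketch follows; `chown_tree` is a made-up name, and the list of protected directories is just a guess at what you would never want re-owned wholesale. GNU chown also has a `--preserve-root` option that guards against the "/" case specifically.

```shell
#!/bin/sh
# Sketch only: chown_tree is a hypothetical wrapper, not a standard tool.

# chown_tree OWNER DIR
# Recursive chown that refuses the filesystem root and top-level system dirs.
chown_tree() {
    owner=$1
    dir=$(readlink -f "$2") || return 2
    case $dir in
        /|/bin|/boot|/dev|/etc|/home|/lib|/lib64|/opt|/proc|/root|/sbin|/sys|/usr|/var)
            echo "refusing to chown -R $dir" >&2
            return 1
            ;;
    esac
    chown -R "$owner" "$dir"
}
```

The developer in question would then have run something like `chown_tree tomcat /var/lib/tomcat/webapps/myapp` (a made-up path), touching only the application’s own files.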

…Or how about the Windows machine which has 4, yes, 4 separate instances of MSSQL installed on it… I digress.

Without wanting to turn this into a big marketing spiel, it is important to keep in mind that, like software development, system administration can be a tough game too. Obviously, in the above examples, hindsight makes it easy to identify the problems in what was done previously on the machines. That said, at Anchor we are a team of system administrators who have been running complex systems for a long time now, and have the experience to take all the appropriate precautions so that we don’t end up in the situations above.

Further to this, we have numerous systems in place to proactively check services, including database servers, 24/7. In the event of failure both audible and visual alerts are generated, with notifications outside of hours being sent via SMS. Had this happened on a fully managed machine, it would never have resulted in 14 hours of downtime. All said, I am not just trying to blow our own horn about how fantastically brilliant we are (ok, maybe, just a little); what I am trying to get across is that system administration really requires an all-or-nothing attitude. If your website or associated hosting infrastructure is critical to your business’ success, then making sure the commitment to system management is commensurate is absolutely imperative, either through outsourcing via our fully managed support pack or by hiring a dedicated system administrator. There really is no place for laissez-faire, and utilising a software developer part-time for this role is only likely to cost more in the longer term.
