AUTOMATE ALL THE THINGS!

Published August 30th, 2011 by David Basden

Illustration © Allie Brosh of Hyperbole and a Half

You can’t walk two metres down the street without someone going on about how cool and hip “The Cloud” is these days, being able to spin up hundreds of identical Linux VMs easily. Tools to build and configure lots of identical systems or VMs are plentiful.

But what if there is no “standard build” or even anything close, with different hardware, networking, software, distro, services, firewalls etc. every time, but you don’t want to spend all your time doing custom server builds and configuration?

Being a provider that specialises in customised hosting solutions, not only do most of our server builds have custom requirements, but we also have to configure lots of our own internal systems to deal with the eccentricities of each new server that we bring up. We have systems for monitoring, notifications, backups, fire-walling, accounting, scheduling of system updates, and that’s only a sampling of them.

We heavily use tools like puppet to automate things, but with almost every server being different, even configuring the tools designed to automate configuration starts taking serious time. For us, building a custom system and customising the systems that support it can take one of our sysadmins the best part of a day.

To solve this problem we’ve written some software that will take vague hand-waving from sales people[0] and turn it into a deterministic set of build steps that can be automated, without a sysadmin being involved at all. We now have heterogeneous server builds that are fully automated, and spun up in minutes rather than days. Rather than yet another bland VM copy, you get a beautiful and unique snowflake of awesome.

Even more fun, we’ve released them all as free software under the modified BSD license.

The first tool is called make-magic. Its job is to turn the high level requirements (I want a webserver and I want it blue) into a detailed set of steps that can be automated (Find an online-paint shop. Buy some blue paint. Get it shipped to the data centre. etc.).

The core of make-magic can be guaranteed to always output a valid, correct list of steps for any set of input requirements. It even keeps track of which ones are done, which steps require others to be done first, and can let you know which ones can be done all at the same time.
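To make the idea concrete, here is a minimal sketch of dependency-ordered steps using Python’s standard `graphlib` module (Python 3.9+). It is not make-magic’s implementation; the step names and the `parallel_batches` helper are made up for illustration, loosely following the blue-paint example above.

```python
from graphlib import TopologicalSorter

# Hypothetical build steps, each mapped to the steps that must finish first.
# These names are illustrative only; they are not real make-magic items.
steps = {
    "find_paint_shop": set(),
    "buy_blue_paint": {"find_paint_shop"},
    "ship_to_datacentre": {"buy_blue_paint"},
    "rack_server": set(),
    "paint_server": {"ship_to_datacentre", "rack_server"},
}


def parallel_batches(dependencies):
    """Group steps into batches; every step in a batch can run at the same time."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the steps depend on each other in a loop
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # steps whose prerequisites are all done
        batches.append(ready)
        sorter.done(*ready)  # mark them complete so their dependents become ready
    return batches


batches = parallel_batches(steps)
for number, batch in enumerate(batches, start=1):
    print(number, ", ".join(batch))
```

Each printed line is one batch of steps whose prerequisites are already complete, which is the "can be done all at the same time" property described above.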

There are some really interesting underlying problems that come up when you have to automatically generate a really specific, and provably correct set of steps from hundreds of thousands of permutations based on some high level requirements. We’ve beaten our heads against those brick walls, put it in a box, and posted it on github. You can find it at https://github.com/anchor/make-magic

Fun as it is, we still don’t want to have to go through and do those steps ourselves. This is where the second tool, mudpuppy, comes in.

mudpuppy is a Python based automation agent/client for make-magic. It allows you to write independent ‘modules’ in Python code, to automate items in a make-magic task. When run, mudpuppy will poll the make-magic server for tasks that need to be done, and will automate any items which it has a module for. Once the module has run successfully, mudpuppy will tell make-magic that the item has been completed, and will look for more work to do.
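The poll, automate, report cycle can be sketched roughly as follows. This is only an illustration of the pattern, not mudpuppy’s real module interface: the `module` decorator, the `FakeServer` class, the item fields and the state names are all assumptions, and the in-memory server stands in for make-magic so the example runs on its own.

```python
# Hypothetical registry of automation modules, keyed by item name.
MODULES = {}


def module(item_name):
    """Register a function as the automation for one kind of item."""
    def register(func):
        MODULES[item_name] = func
        return func
    return register


@module("create_dns_record")
def create_dns_record(item):
    # A real module would call out to DNS tooling here.
    return {"hostname": item["hostname"], "created": True}


class FakeServer:
    """In-memory stand-in for the task server; NOT the make-magic API."""

    def __init__(self, items):
        self.items = items

    def ready_items(self):
        return [item for item in self.items if item["state"] == "INCOMPLETE"]

    def mark_complete(self, item, result):
        item["state"] = "COMPLETE"
        item["result"] = result


def poll_once(server):
    """Automate every ready item we have a module for; return how many ran."""
    completed = 0
    for item in server.ready_items():
        handler = MODULES.get(item["name"])
        if handler is None:
            continue  # no module for this item: leave it for a human or another agent
        server.mark_complete(item, handler(item))
        completed += 1
    return completed


server = FakeServer([
    {"name": "create_dns_record", "hostname": "web1.example.com", "state": "INCOMPLETE"},
    {"name": "rack_server", "state": "INCOMPLETE"},
])
ran = poll_once(server)
print(ran, [item["state"] for item in server.items])
```

A long-running agent would call something like `poll_once` in a loop with a sleep between polls; items without a matching module are simply left alone.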

You can find mudpuppy on github at https://github.com/anchor/mudpuppy

mudpuppy isn’t the only tool that can be controlled by make-magic. make-magic has a simple, well defined JSON based HTTP API that mudpuppy uses, and which should be easy to talk to with most languages.

Earlier this month, Chris wrote about Orchestra, a set of tools for fast, reliable remote job execution on remote servers. mudpuppy is able to talk directly to Orchestra and get it to spin off jobs, get back the results, and use it to update information in make-magic. No extra parts required: They talk to each other out of the box.

As mentioned in Chris’s post, you can find Orchestra on github at https://github.com/anchor/orchestra

The combination of make-magic, mudpuppy and Orchestra (plus puppet, whatever else you might have lying around, or anything you’ve always wanted to try) gives us the ability to build custom servers faster, more reliably, and with ludicrously more flexibility than any other server build automation we’ve come across. Hopefully you’ll find them useful too.

If you’re interested in learning some more, Chris and I will be talking at the Sydney DevOps meetup on the 15th of September. There are more details at http://devops.meetup.com/cities/au/sydney/

[0] Actually our “sales” people are pretty technical themselves, so they tend to wave their hands in the right direction.

Supermicro hardware up for grabs

Published August 19th, 2011 by Barney Desmond

In the last year or so Anchor has been making a strong push towards energy-efficient Dell hardware and VPSes. As a result, we find ourselves with a collection of older Supermicro servers on our hands that still perform well, but aren’t viable to sell to customers any more.

Rather than scrapping them and foisting more e-waste on the environment, we’d prefer to send them to a good home where a geek or hacker can make use of them. If this sounds like you, we’d love you to have a gander at our ebay listings.

To whet your appetite, here’s a sample of what you can expect.

Supermicro Nyanserver 6015P-8R

Supermicro Highroller 6014H-T

Feel free to get in touch if you have any questions about the hardware.

N.B. Nyanserver may not be as nyan as shown, photograph is for demonstrative purposes only.


Devs and SysAdmins – It is possible to live harmoniously!

Published August 17th, 2011 by Keiran Holloway

Historically, the battle lines were drawn and everyone was bracing themselves, ready for combat..

On one side, you have the Devs wanting to make a change to a production server immediately to get their new shiny feature working and the SysOps on the other side, desperate not to give an inch for fear of upsetting the uptime gods.

Each party thinking to themselves: “does it really need to be this hard?” .. “Why doesn’t the other get it?”

The Times They Are a-Changin’ and with them comes a new-fangled word:

DEVOPS

wait! … wtf??! Devops??! Who?! What?

Well, depending on who you ask, it means any number of things. My view is that it comes down to a combination of attitude, cultural and process changes that need to be applied to system administrators and developers alike, forming the super-creature: a “DevOp”. This combination allows people working on either side to operate in a smarter, more collaborative fashion and achieve the outcomes that are best for the business.

At Anchor we’re a group of professional sysadmins who manage public-facing websites, so factors like reputation and uptime are everything, and we feel these pressures on a day-to-day basis. That said, we do have the pleasure of working with some of the best developers around and we’re awfully keen to embrace any changes which allow us to support these people better.

With this in mind, we’re actively involved in the local DevOps meetups here in Sydney. Coincidentally, there’s a meetup on tomorrow night, held at the Orient in the historic Rocks area of Sydney. We’d encourage anyone who is a dev, an op or a devop to drop in to one of these meetups and chat with like-minded folk.

On a slightly different, but still related note, Anchor’s Benjamin Smith will be presenting at the annual PyCon conference, being held in Sydney this weekend, at 10:20am Sunday. The topic of the talk is “Sysadmins vs Developers, a take from the other side of the fence”. Whilst registration for the conference has closed, the talk will be recorded and published on Blip.TV (URL will be provided once available), and the slides will be released via our website.


Greening our office

Published August 16th, 2011 by Davy Jones

Earlier this year we moved into a larger office to give everyone a bit more space. Knowing we’d be here for at least the next 3 years we decided to see what we could do to reduce the impact of our activities.

Energy Assessment

This was our first step, to know where power was being used:

The section titled “Office” represents usage from power outlets, predominantly office workstations and internal servers. We still haven’t found where our shop is, but according to our energy auditor it only uses 0.4% of power so we didn’t look too hard for it.

In NSW, the Energy Efficiency for Small Business Program can assist with the cost of energy assessments, as well as contributing up to $5000 towards the cost of changes that you make to reduce your energy consumption.

100% Green Power

Our daily power consumption when we first moved in was over 200kWh/day; that’s nearly 10 times that of the typical Australian home. Whilst we took steps to reduce this, we couldn’t eliminate it, so we purchased 100% Green Power.

Lighting

The office started with 140 fluorescent tubes. Of these, we disabled around 50 in areas that received adequate natural light, weren’t being used, or where the people sitting below them were creatures of the dark. In many fittings we found that one of the two tubes provided an adequate, and sometimes more comfortable, level of light for the area being lit.

The remaining 90 or so we replaced with Philips LED tubes, reducing power consumption by 50%.

Turning off workstations

Everyone now either turns off their office workstation or puts it into sleep mode (most of the time). For enforcement, we built a script that reports on which workstations have been left on.

No bottled water

Having someone deliver water on the back of a diesel powered truck when we had a tap in the office seemed quite counter-intuitive. We realised most of what we enjoyed about the water dispenser was that it was cold and filtered. We replaced the truck-delivered water with a plumbed dispenser fitted with a water filter.

Re-usable take away coffee cups


 
We bought a few hundred of these from Keep Cup, gave most of them away to our customers and kept the rest for office use.

Indoor plants

Our initial attempts at office plants were somewhat of a failure; most of the plants died due to lack of water. So we automated the plant supply and watering, and 6 months on our new batch of plants are all still alive.

Other changes we’re considering:

  • Gradual shift to increased use of low power workstations or laptops over traditional workstations
  • Installation of a power monitoring device such as a Wattson or SEGmeter
  • Talk to our landlords about reducing water consumption in bathroom amenities


Resonance Cachecade

Published August 15th, 2011 by Barney Desmond

We don’t normally post about hardware wankery, but this little piece of shininess appeared for free in some of the newer Dell servers we’ve been ordering, and it actually sounds like it’s not an awful hack.

Cachecade is an LSI technology (Dell PERC cards are rebranded LSI gear) that adds a read-cache tier to the RAID logic, in the form of solid-state disks. While SSDs are still too expensive for mass-scale primary storage, they’re cheap enough that you can burn a few hundred bucks and get 50GB worth of faster reads.

The real benefit of this style of read-cache should be for random block reads, where SSDs proverbially drop excrement over rotational media from a great height. The jury is still out for us – we’ve just started using cachecade on a couple of VM hypervisors and a customer DB server, but we’re hoping to see some noticeable impact even on a qualitative basis.

In truth, the performance improvements will be difficult for us to quantify on our own workloads. You can apparently get this functionality if you purchase the new LSI® MegaRAID® CacheCade™ Pro 2.0, but I’d bet that it’s not exposed through something sane (like SNMP) and you’ll be forced to use the perennially-awful MegaCLI tool to get at the data.


Awesome but often unknown Linux commands and tools

Published August 10th, 2011 by Keiran Holloway

I’ve been working in this industry for a while now and naturally spend a lot of time using Linux on a daily basis. This gives lots of exposure to various Linux commands and tools. That said, I am sometimes surprised when I see system administrators, often very experienced ones, using somewhat convoluted commands to do something that is relatively simple with a different tool.

This is my opportunity to share some of these experiences:

1. pgrep and pkill – The first command ‘pgrep’ will return the process id based on a name or other attributes. pkill will signal a process with a matching name or attribute. Want to kill all processes being run by a given user? Issue a pkill -U USERNAME; sure beats the hell out of:

ps -U USERNAME | awk 'NR>1 {print $1}' | xargs kill

2. htop – Much like your regular ‘top’ command, on steroids. Gives an interactive view of your machine right now, but with an ASCII visual representation of your CPU, memory and swap utilisation. Often not installed by default, but is packaged under both Red Hat Enterprise Linux and Debian and can be trivially installed on most dedicated and virtual private servers.

3. mytop – Similar to top, but designed specifically for MySQL. Gives you an interactive display of what is happening with your MySQL database, in real time. Information such as which queries are being run, the amount of data flowing in and out of the database, the number of queries being run per second and the number of slow queries. This application is once again packaged for most of the common Linux distributions.

4. lsof – This cool little command comes with most Linux distros by default and allows you to display any files which are currently open on the system. Especially handy for finding files which have been deleted (and no longer appear in the file system) but are still resident because a running process is holding them open.

5. iptraf – Want to know where all your network traffic is going to and coming from? iptraf is a really cool tool which can be used to track this information and let you know what is happening on your server.

I hope this information helps. Got some which I’ve missed? Please leave a comment and make this article more useful!

Anchor is a world-wide leader in providing comprehensive support on dedicated servers and virtual private servers both in Australia and around the world. Speak with our team of system administrators to find out how we can help you: support@anchor.com.au or just read about how we built Github


The Automation Waltz

Published August 5th, 2011 by Chris Collins

When you have a bunch of machines involved in a process, you need to ensure that various stages in this process have executed. If the target host is unavailable, you want a guarantee that the job will execute when the host becomes available again.  This is well beyond the capabilities of ssh in a for-loop.

In trying to solve this problem, we had assessed tools like mcollective, but came to the conclusion that they were inappropriate for our environment.  mcollective in particular was removed from consideration as it was designed for a more homogeneous environment than the one here at Anchor.

When we realised we needed a different solution, a few of us gathered around a whiteboard and started enumerating our requirements.  The result was Orchestra, which we’re releasing today under the BSD License.

Orchestra is a suite for managing the reliable execution of tasks over a number of hosts. It is intended to be simple and secure, leaving more complicated coordination and tasks to tools better suited for those jobs.

Because everybody loves this web development stuff, it was developed to provide an interface that operates asynchronously in relation to the execution of the work being done, allowing for cheap and easy polling of job state.
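For readers unfamiliar with the pattern, here is a small sketch of an asynchronous job interface: submitting returns an identifier straight away, and the caller polls for state later. It uses Python’s standard `concurrent.futures` purely for illustration and says nothing about how Orchestra itself is built (Orchestra is written in Go); the `JobQueue` class and the state names are assumptions.

```python
import itertools
from concurrent.futures import ThreadPoolExecutor, wait


class JobQueue:
    """Toy asynchronous job runner: submit now, poll for the outcome later."""

    def __init__(self, workers=2):
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._jobs = {}
        self._ids = itertools.count(1)

    def submit(self, func, *args):
        """Queue the work and return a job id without waiting for it to run."""
        job_id = next(self._ids)
        self._jobs[job_id] = self._pool.submit(func, *args)
        return job_id

    def state(self, job_id):
        """Cheap, non-blocking status check (state names are hypothetical)."""
        future = self._jobs[job_id]
        if not future.done():
            return "pending"
        return "failed" if future.exception() else "ok"

    def wait_for(self, job_id):
        """Block until a job finishes; only used here to keep the demo deterministic."""
        wait([self._jobs[job_id]])


def broken_job():
    raise RuntimeError("host unreachable")


queue = JobQueue()
good = queue.submit(sum, [1, 2, 3])
bad = queue.submit(broken_job)
queue.wait_for(good)
queue.wait_for(bad)
print(queue.state(good), queue.state(bad))
```

The point of the design is that `submit` never blocks on the work itself, so a web front end can hand off a job and check back on it whenever it likes.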

And last, but not least, because having critical system services depend on potentially fragile language interpreters or their libraries is generally a really bad idea for reliability, it was completely implemented in Go - a type-safe static language that compiles to native code and includes many features derived from dynamic languages. Despite this, the work units it executes for you can be written in any language you like.

Orchestra itself is far from finished, but it’s working reliably enough that we’re already using it in production in a very limited capacity, and have been slowly extending its reach into new areas as appropriate.

We’d love you to take a look and see if it works for you; we’re open to suggestions and contributions for improvement. If you’re after a quick overview, doc/orchestra.tex is a good place to start, and the samples/ directory contains commented config files for the various daemons.

We don’t believe in duplicating functionality, so it’s assumed that your config, SSL certificates and scores are distributed by another automation tool – we use puppet. Utilities like daemontools or god do a great job of keeping your daemons running.


“dmesg is like a standup comedy routine”

Published August 3rd, 2011 by Barney Desmond

It happens to the best of us. Like the coyote running headlong off the cliff into open space, you blink a few times, feeling around for a handhold. You’ve just run what should’ve been a pretty safe command, but something’s just not quite right…

And so it was the case for me the other day. In prep for migration of an old desktop-chassis server to a KVM guest, I was zeroing out some blocks for the filesystem on the destination.

dd bs=1M count=1 if=/dev/zero of=/dev/nemo/root

I should note that this really isn’t a good idea when your big shiny VM server is named “nemo”. Barely seconds later, the ever-vigilant nagios was reporting that nemo’s disk was full, and that there were only “-” inodes left. nemo was more sure of itself, with all 64Z of the diskspace used. At least there was apparently 8.4 millibytes left, room to swing a lolcat.

I panicked a bit and had to consciously unclench my jaw muscles. Then I panicked some more. Linux, the tough bastard that it is, was running fine for the most part and colleagues wasted no time SSHing in.

“omg wtf”

“dude what did you do??”

“lol! dmesg is like a standup comedy routine”

The VM guests were ticking over just fine, but nemo was rather unimpressed with this incursion and wouldn’t stand for it one bit.

EXT3-fs error (device dm-0) in ext3_reserve_inode_write: IO failure; Aborting journal on device dm-0.
EXT3-fs error (device dm-0) in ext3_dirty_inode: IO failure; ext3_abort called.
EXT3-fs error (device dm-0): ext3_journal_start_sb: Detected aborted journal; Remounting filesystem read-only
EXT3-fs error (device dm-0): ext3_readdir: bad entry in directory #2457831: inode out of bounds – offset=0, inode=2457831, rec_len=12, name_len=1
EXT3-fs error (device dm-0): ext3_readdir: bad entry in directory #2457635: inode out of bounds – offset=0, inode=2457635, rec_len=12, name_len=1

It was undeniably bad, but not beyond repair. Just how bad could it be..? The guests are okay, but we need this fixed soon. We thought quickly. Solutions came, but they were all long shots. It’d be S-rank ninjawork if you could pull it off, but it’d be ugly.

  • Attempt to create an identical filesystem, with the same parameters, and steal the first meg? You’ll miss some data, but a fsck might get it consistent again?
    No, too unlikely to work and dirty as all foo.
  • Ooh, mkfs.ext3 is deterministic. You can calculate where the backup superblocks are on the disk and copy one of them out. You’ll probably lose a block group but it’s something to work with, then fsck to tidy up?
    That sounds pretty awful, and will fail if the volume has been resized.
  • If you could just dig the superblock out of memory…
    No!

We eventually settled on a restore from backup of the root filesystem, while everything was live. It goes a little something like this:

  1. The machine’s got 64GB of RAM, might as well use it. Luckily our tools are okay and bits of the filesystem structure are cached. Start by making /tmp pretty huge. We could’ve tried to carve out a new LV, but the LVM tools were crapping out because the filesystem was now read-only.
    mount -o remount,size=12000m /tmp
  2. We need a filesystem to restore to, there’s only a couple gig of data. A loopback-mounted file will do the job. Have to use dd to create it though >_<
    dd bs=1G count=0 seek=10 if=/dev/null of=/tmp/recovery_fs
    mkfs.ext3 /tmp/recovery_fs  # this bit is super fast :)
    mkdir /tmp/recovery_target
    mount /tmp/recovery_fs /tmp/recovery_target
  3. Now we go ahead with the restore. You do take backups of all your data, don’t you?
    One point to note here is that we skip mountpoints like /proc, /sys, /dev, /tmp, etc. from our backups. After the restore, you have to create them on the target, otherwise you might have a rough time booting up again.
  4. We’re just about done, but the filesystem is too big for the destination volume. We’re going to shrink it down, which necessitates a little more work.
    umount /tmp/recovery_target
    e2fsck -C0 -f /tmp/recovery_fs  # the filesystem *must* be fsck'd before shrinking
    resize2fs /tmp/recovery_fs 5G   # the destination is 6gb, so we're being very generous on sizes
  5. And now the master-stroke, splat this new, known-good filesystem straight over the one I touched inappropriately earlier.
    dd bs=1G if=/tmp/recovery_fs of=/dev/nemo/root
    5+0 records in
    5+0 records out
    5368709120 bytes (5.4 GB) copied, 3.97025 seconds, 1.4 GB/s
  6. I wish we could get disk performance like that all the time. Finally, remount the filesystem and see what happens…
    mount -o remount,ro /dev/nemo/root
  7. Now the system was well and truly confused, which we’d anticipated. Not content to unceremoniously damage the root filesystem, we’d just completely pulled the rug out from underneath it (and replaced it with a fine Persian carpet!)
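A side note on step 2: the dd invocation with count=0 and a seek creates a sparse file, one with a large apparent size but almost no blocks allocated, which is part of why the mkfs that follows is so quick. The same trick can be sketched in Python as below. The `make_sparse_file` helper and the 1GiB size are just for this example, and `st_blocks` is only available on POSIX systems.

```python
import os
import tempfile


def make_sparse_file(path, size_bytes):
    """Create a file with the given apparent size without writing any data."""
    with open(path, "wb") as handle:
        handle.truncate(size_bytes)  # extends the file; no data blocks are written
    info = os.stat(path)
    # st_blocks counts the 512-byte units actually allocated on disk.
    return info.st_size, info.st_blocks * 512


with tempfile.TemporaryDirectory() as workdir:
    apparent, allocated = make_sparse_file(
        os.path.join(workdir, "recovery_fs"), 1024 ** 3
    )
    print(f"apparent size: {apparent} bytes, allocated on disk: {allocated} bytes")
```

On filesystems that support sparse files, the allocated figure should come out far smaller than the apparent size; blocks are only allocated as data is written.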

Well, the VM guests weren’t quite in production yet… Really, a reboot was going to be the cleanest way of getting all this sorted, and it only takes like five minutes.

With a well-timed kick in the still-blissfully-unaware VMs my colleagues brought them to a quiescent state as I frobbed the virtual power switch to trigger a reboot (remote server management cards are the greatest thing since Netscape’s fishcam). As awesomely as that turned out, I’m not looking forward to doing it again.
