Collaborative hydration session tonight

Published February 8th, 2012 by Barney Desmond

Our pals over at GitHub and Heroku are having a drinkup this evening at King St. Wharf. It’s not strictly our gig, but a few of us will be along to shoot the breeze on sysadminly topics and have a yarn.

More details are on GitHub’s page; come along if you like hanging around hardcore technical people (and beer taps).

Posted in FTW


100% FAT-free

Published February 6th, 2012 by Barney Desmond

I wrote some documentation for our sysadmins last week detailing how one should deal with a critical disk space notification at some ungodly hour of the morning. On the specifics of checking filesystems with the df tool:

“Astute readers will notice that we don’t query btrfs filesystems here; this is because btrfs uses extents, and inodes are a non-issue.”

Well, I wasn’t entirely wrong, but I wasn’t entirely right either.


btrfs is a modern filesystem with lots of shiny new features. It’s definitely not production-ready yet, but like magpies drawn to shiny things, a couple of us use btrfs on our own machines (it’s what backups are for, right?).

Some time ago I wrote about how an ext filesystem can run out of free inodes and bite you. That happened to me last Thursday, only this time it was btrfs under the hood.


I first noticed the problem when puppet wouldn’t run, saying there was already another instance running. puppet is dumber than a bag of rocks so I pressed on, trying to run aptitude update instead.

root@misaka:~# aptitude update
E: Write error - write (28: No space left on device)

O rly? df disagreed about that. I immediately thought of inode exhaustion, but btrfs isn’t meant to suffer from this problem! To prove it, I touched a few files, successfully wrote some bits, deleted them again – all good.

Their curiosity piqued, my fellow sysadmins cracked open the strace and confirmed what we knew: ENOSPC from the write() call. We were at a loss until someone serendipitously spotted some errors in the syslog:

Feb 2 19:09:31 misaka kernel: [683642.593034] no space left, need 4096, 10694656 delalloc bytes, 696373248 bytes_used, 0 bytes_reserved, 0 bytes_pinned, 0 bytes_readonly, 0 may use 707067904 total
Feb 2 19:09:55 misaka kernel: [683666.684247] no space left, need 4096, 6905856 delalloc bytes, 700162048 bytes_used, 0 bytes_reserved, 0 bytes_pinned, 0 bytes_readonly, 0 may use 707067904 total
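Those counters reward a quick sanity check. Here's a small Python sketch (not something we ran at the time) that pulls the numbers out of the first log line and subtracts bytes_used from total. The regexes and the helper name `parse_enospc_line` are my own assumptions about the message layout, based on nothing more than the two lines above.

```python
import re

# First of the two kernel messages quoted above, verbatim.
LOG_LINE = (
    "Feb 2 19:09:31 misaka kernel: [683642.593034] no space left, need 4096, "
    "10694656 delalloc bytes, 696373248 bytes_used, 0 bytes_reserved, "
    "0 bytes_pinned, 0 bytes_readonly, 0 may use 707067904 total"
)


def parse_enospc_line(line):
    """Return selected counters from a 'no space left' kernel message.

    Hypothetical helper: the field layout is inferred from the sample lines only.
    """
    fields = {}
    # "need 4096" puts the label first; the other counters are "<number> <label>".
    need = re.search(r"need (\d+)", line)
    if need:
        fields["need"] = int(need.group(1))
    for value, label in re.findall(r"(\d+) (delalloc bytes|bytes_used|total)", line):
        fields[label] = int(value)
    return fields


counters = parse_enospc_line(LOG_LINE)
headroom = counters["total"] - counters["bytes_used"]
print(f"total - bytes_used = {headroom} bytes ({headroom / 2**20:.2f} MiB)")
```

On this line the gap between total and bytes_used works out to 10,694,656 bytes, which is the same figure reported as delalloc bytes, so there was only about 10MB left to play with.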

A little googling produced a promising bug ticket on Red Hat’s Bugzilla, “[btrfs] hopeless ENOSPC handling and excessive administration costs”.

The short version for our specific scenario is: df doesn’t expose some exhaustion issues because btrfs doesn’t work like a classic filesystem.
This is where you can start moaning about how btrfs is FitH if you’re so inclined, but I like playing with my shiny toys, thank you.


btrfs has its own version of df for inspecting the filesystem:

root@misaka:~# btrfs filesystem df /var

Metadata, DUP: total=95.12MB, used=15.16MB
System, DUP: total=8.00MB, used=4.00KB
Data: total=674.31MB, used=665.52MB       <-- Under 10MB free!!
Metadata: total=8.00MB, used=0.00
System: total=4.00MB, used=0.00

This would explain why I could create files myself, but stuff like aptitude was failing when it tried to write more than several MB. You'll also notice that there's a lot of allocated-but-unused metadata space in the first line of output.
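If you'd rather not eyeball that output at 3am, something like the following Python sketch could do the subtraction for you. It parses the text format shown above and reports free space per chunk type. The function names are made up for illustration, and I'm assuming the units are binary (MB meaning 2^20 bytes) and that the output format is exactly as printed here; other versions of the btrfs tools may format it differently.

```python
import re

# Output of `btrfs filesystem df /var` as shown above, including the annotation.
SAMPLE = """\
Metadata, DUP: total=95.12MB, used=15.16MB
System, DUP: total=8.00MB, used=4.00KB
Data: total=674.31MB, used=665.52MB       <-- Under 10MB free!!
Metadata: total=8.00MB, used=0.00
System: total=4.00MB, used=0.00
"""

# Assumption: binary units. A bare number (e.g. "0.00") is taken as bytes.
UNITS = {"": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3, "TB": 1024**4}


def to_bytes(text):
    """Convert a size such as '674.31MB' or '0.00' to a number of bytes."""
    match = re.fullmatch(r"([\d.]+)(KB|MB|GB|TB)?", text)
    if match is None:
        raise ValueError(f"unrecognised size: {text!r}")
    return float(match.group(1)) * UNITS[match.group(2) or ""]


def parse_btrfs_df(output):
    """Return one dict per chunk type with total, used and free bytes."""
    rows = []
    for line in output.splitlines():
        match = re.match(r"(.+): total=([^,\s]+), used=([^,\s]+)", line.strip())
        if match is None:
            continue  # skip blank lines or anything we don't recognise
        label, total, used = match.groups()
        total_b, used_b = to_bytes(total), to_bytes(used)
        rows.append({"label": label, "total": total_b, "used": used_b,
                     "free": total_b - used_b})
    return rows


for row in parse_btrfs_df(SAMPLE):
    print(f"{row['label']:<14} free: {row['free'] / 1024**2:8.2f} MB")
```

The interesting row is Data, where total minus used leaves under 9MB, which lines up with aptitude falling over as soon as it wanted to write a few MB.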

We have a tool to fix this, and unlike btrfsck it's actually usable. We can rebalance the filesystem to adjust the proportion reserved for data. Some commenters on the bugzilla ticket noted that it caused a kernel panic when they ran it, but that was two years ago. It's probably fixed by now...

root@misaka:~# btrfs filesystem balance /var

# Now when we run `btrfs filesystem df /var` again...
Metadata, DUP: total=47.56MB, used=15.20MB  <--- Much less allocated
System, DUP: total=8.00MB, used=4.00KB
Data: total=745.38MB, used=665.52MB         <--- Plenty of free space
Metadata: total=0.00, used=0.00
System: total=4.00MB, used=0.00

Mission Accomplished!

aptitude and puppet run fine now, so all is well. As a note, the rebalancing is (subjectively) not fast: it took 7-8 seconds on that 1GB filesystem.


To wrap things up, I thought I might extend that filesystem a bit, as some more breathing room would be good. The btrfs volume is on an LVM logical volume, so this is a pretty easy task.

  1. Extend the LVM LV by 512MiB
    lvextend -L +512M /dev/misaka/var
    
  2. Grow the btrfs filesystem to fill the newly-enlarged block device
    btrfs filesystem resize max /var
    
  3. Rebalance the btrfs filesystem (optional?)
    btrfs filesystem balance /var
    

Now, I'm not sure whether the final rebalance is strictly necessary. The system's df tool acknowledges the extra size after the resize operation, but btrfs-df shows no change in its output until the rebalance is done. A little testing would be in order, but I'd rather do it on a dedicated testing machine.
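For the script-inclined, the three steps above can be sketched as a dry-run planner in Python. It only builds and prints the command lines from the list; nothing is executed. The function name `plan_grow` and its parameters are hypothetical, and the rebalance is left as a switch since I haven't established whether it's needed.

```python
import shlex


def plan_grow(lv_path, mountpoint, extra_mib, rebalance=True):
    """Build the argv lists for growing a btrfs filesystem that sits on LVM.

    Hypothetical helper: it mirrors the numbered steps above and runs nothing.
    """
    steps = [
        # 1. Extend the logical volume by the requested number of MiB
        ["lvextend", "-L", f"+{extra_mib}M", lv_path],
        # 2. Grow btrfs to fill the newly-enlarged block device
        ["btrfs", "filesystem", "resize", "max", mountpoint],
    ]
    if rebalance:
        # 3. Optional rebalance (necessity untested, as noted above)
        steps.append(["btrfs", "filesystem", "balance", mountpoint])
    return steps


for argv in plan_grow("/dev/misaka/var", "/var", 512):
    print(shlex.join(argv))
```

To actually run the plan you could hand each list to `subprocess.run(argv, check=True)` as root, preferably on a machine you don't mind losing.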

Any other cowboys out there using btrfs? Your data may or may not be intact when the sun rises tomorrow, but boy it's exciting!

Posted in FTW


Channelling your rage

Published February 3rd, 2012 by Barney Desmond

Getting notifications when servers break is always annoying. At Anchor we use Nagios, a very popular solution. “Friggen nagios!” is a pretty common cry.

If you get a lot of notifications in quick succession, your Rage meter starts to build up. When it hits 100% you unleash a special attack and reboot the server.

Rachel's gauge is at 100%, circled in blue crayon. She can now reboot the server with her Static Iris

That’s pretty cool, but it turns out that customers don’t like reboots as much as we do, so we looked at ways to reduce the rage. One great way to do this is with better documentation; we call it Ragewiki.


Making use of the notes_url parameter, we provide a link to our wiki documentation directly from Nagios’ web interface. There’s one page for each service, with precise instructions on how to diagnose and fix common problems, as well as a brief description of what the service actually does.

So now when you get that SMS at 3am (PROBLEM – ntype on fundle is CRITICAL), you don’t spend 20 minutes flailing through A Brief History of Time, as told by H.P. Serverbox.


To sweeten the deal a bit, we also allow for host-specific instances of a service, which might need extra-special instructions. We also have a page full of terse legacy documentation that we’d like to fall back on in case the new docs haven’t been written yet. We think it’s a cute little hack so we’d like to share it with you.

The possibilities are limited only by your own imagination; we just went for the most straightforward option. You could always link to a big red button that reboots the server straight away. :)

  1. Give every service a URL in the Ragewiki, using the notes_url argument. We attach this to the generic service template so that every single service automatically gets a link.
    # RageWiki ftw
    notes_url /ragewiki/$HOSTNAME$/$SERVICEDESC$

    You’ll notice that we’ve parameterised the URL so that each host-service pair is unique.

  2. Prepare a rewrite map to check for existence of docs
    This URL will refer to the Apache instance on the nagios server itself. It captures the request starting with /ragewiki/, extracts the hostname and servicename, then builds a suitable redirect.

    Because we want to support per-host pages that may exist, we use a RewriteCond and a smart RewriteMap to check whether the page exists, then redirect accordingly. We use moin as our documentation wiki, with HTTP access control in front of that.

    RewriteLock /var/lock/rewrite.lock
    RewriteMap RageWiki "prg:/usr/bin/xargs -n1 -d '\\\\n' /usr/bin/HEAD -sd -H 'Authorization: Basic EncodedUsernameAndPassword'"

    You may want to read up on Apache’s RewriteMap functionality to make sense of this. The short version: it contacts the wiki and returns the HTTP status line for the suggested page. A 200- or 300-series status code is considered a success – the page exists and should be used.

  3. Finally, use the RewriteMap and generate a suitable redirect
    This is a basic set of cascading rewrites; the first success will terminate further processing.

    # Server-specific docs: /servers/$HOSTNAME/$SERVICENAME
    RewriteCond ${RageWiki:https://magic.ponies.anchor.net.au/servers/$1/$2} ^[23]\d\d
    RewriteRule ^/ragewiki/([^/]+)/(.+)$ https://magic.ponies.anchor.net.au/servers/$1/$2 [R,L]
    
    # Whole lotta BGP goin' on (with variable check names, a variant of generic docs)
    RewriteCond ${RageWiki:https://magic.ponies.anchor.net.au/Nagios/Services/bgp} ^[23]\d\d
    RewriteRule ^/ragewiki/[^/]+/bgp[_-].+$ https://magic.ponies.anchor.net.au/Nagios/Services/bgp [R,L]
    
    # Generic docs for normal services: /Nagios/Services/SERVICENAME
    RewriteCond ${RageWiki:https://magic.ponies.anchor.net.au/Nagios/Services/$1} ^[23]\d\d
    RewriteRule ^/ragewiki/[^/]+/(.+)$ https://magic.ponies.anchor.net.au/Nagios/Services/$1 [R,L]
    
    # Catch any checks without docs, and send them to the fallback page.
    # Funky regexes to pass the failed service name through to the fallback page.
    # FIXME: Can we use a positive-lookbehind in these things? Would make it slightly tidier.
    RewriteRule ^/ragewiki/([^/]+)$    https://magic.ponies.anchor.net.au/CommonNagiosServiceCheckReference#$1 [NE,R,L]
    RewriteRule ^/ragewiki/.*/([^/]+)$ https://magic.ponies.anchor.net.au/CommonNagiosServiceCheckReference#$1 [NE,R,L]
    

    Special cases with varied names, like our BGP checks, are easily handled by dropping a custom regex into the chain. It’s best if your service names have a consistent format that can be readily pared back to a basic name, but this method is fine for the occasional odd case.
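If the cascade is hard to follow in mod_rewrite syntax, here's roughly the same decision logic as a Python sketch. This is an illustration only, not what runs on our Nagios box: `resolve` and `status_ok` are made-up names, and `page_exists` stands in for the RewriteMap lookup (in the real setup that's a HEAD request against the wiki, judged by the status line). The host names `other` and `router1` and the service names `bgp_peer-1` and `mystery` are invented for the usage example.

```python
import re

WIKI = "https://magic.ponies.anchor.net.au"


def status_ok(status_line):
    """Same test as the RewriteCond pattern ^[23]\\d\\d: a 2xx or 3xx status."""
    return re.match(r"[23]\d\d", status_line) is not None


def resolve(path, page_exists):
    """Return the redirect target for a /ragewiki/ request, or None.

    page_exists is any callable taking a URL and returning True or False.
    """
    match = re.match(r"^/ragewiki/([^/]+)/(.+)$", path)
    if match:
        host, service = match.groups()
        # 1. Server-specific docs: /servers/HOSTNAME/SERVICENAME
        url = f"{WIKI}/servers/{host}/{service}"
        if page_exists(url):
            return url
        # 2. BGP checks with variable names share one generic page
        if re.match(r"^bgp[_-].+$", service):
            url = f"{WIKI}/Nagios/Services/bgp"
            if page_exists(url):
                return url
        # 3. Generic docs for normal services
        url = f"{WIKI}/Nagios/Services/{service}"
        if page_exists(url):
            return url
    # 4. Fallback page, with the last path segment passed through as the anchor
    match = re.match(r"^/ragewiki/(?:.*/)?([^/]+)$", path)
    if match:
        return f"{WIKI}/CommonNagiosServiceCheckReference#{match.group(1)}"
    return None


# Pretend these are the only wiki pages that exist.
existing = {
    f"{WIKI}/servers/fundle/ntype",
    f"{WIKI}/Nagios/Services/ntype",
    f"{WIKI}/Nagios/Services/bgp",
}
for request in ("/ragewiki/fundle/ntype", "/ragewiki/other/ntype",
                "/ragewiki/router1/bgp_peer-1", "/ragewiki/other/mystery"):
    print(request, "->", resolve(request, existing.__contains__))
```

Ordering matters just as it does in the Apache config: the most specific page is tried first, and the fallback never needs an existence check because it always matches.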


Too easy! To give you an idea of what we think good Ragewiki docs look like:

  • What servers does this apply to?
  • Summarise what the nagios check is for (one sentence!)
  • What’s the impact of a failure? Customer visible? Websites are down? Etc.
  • A short procedure on how to confirm the notification and diagnose it further
  • A procedure on how to fix it

That’s it; the page should only be a couple of screens long at the most. If you can’t include all the necessary information, it’s best to put it on a separate page and link to it. We specifically don’t include information about How It Works because it detracts from fixing problems faster.

Ragewiki works great for us, so we’d be interested in hearing your thoughts and comments. It’d also be cool to know if other people have reached the same goal, but in a different way.


Draft RFC for new 7xx HTTP status codes

Published January 31st, 2012 by Barney Desmond

It’s come to our attention that a proposal for additional status codes has been released.

RFC for the 7XX Range of HTTP Status codes – Developer Errors

We’re most in favour of the 73x series. I reckon one of the guys here could hack up a filter in perl to convert those pesky 500-errors from Rails into something a little more meaningful.

Posted in FTW


LCA day 4 – On freedom

Published January 23rd, 2012 by Barney Desmond

It goes without saying that Linuxconf is all about free software, as in beer and as in speech. A number of today’s talks focused on freedom, in the context of access to data and code, and the freedom to use software (and hardware) the way you see fit.

We actually had two great keynote talks on freedom, so I’d like to step back to yesterday’s talk by Karen Sandler (you can see the talk for yourself on YouTube, which I’d highly recommend). Karen was diagnosed with hypertrophic cardiomyopathy, a heart condition that means she could suddenly die at any time. Thankfully there are treatments available, one of which is a pacemaker.

Being the person she is, she immediately asked “what software does it run?”. Long story short, the manufacturer ended up stonewalling on the issue, refusing to provide code or further details even with an NDA. No one had ever asked before, and everything was pushed back with assurances that the devices are safe, and that they’re approved by the FDA.

It might seem like a trivial matter, but it’s a big deal if you step back and consider it. This device is implanted in your body to regulate your heart. In the event of cardiac arrest, your life could be 100% dependent on it functioning properly. I think it’s safe to say that failure is unacceptable.

Okay, you say, but they work very well for a lot of people. This is true. But the devices are known to be imperfect – putting aside the issue that they may not function correctly when needed, there are clear concerns regarding malicious access by an attacker. There’s published research for this on both pacemakers and insulin pumps for diabetics.

The hard questions clearly irked a lot of people, including her doctor, who was greatly upset that she’d even be asking such things. The practical concerns did eventually win out (though she was able to get an older, less advanced device), leading to this statement:

I became a cyborg lawyer with proprietary software connected to my heart.


Switching focus to social networking, Bdale Garbee (possibly best known for free beards) has been working on FreedomBox, personal servers for social networking. The immediate need for another social network isn’t obvious – the key here is the storage and control of your own personal information. It’s your data, it should be kept on your terms.

As it stands, your data in Facebook/Google+/FoospaceEtc. could be stored anywhere in the world. For all the privacy policies and statements, you don’t know where that information is, or who really has access to it (think of legal jurisdictions). Designed with tiny “plug computers” in mind, this decentralisation should make it feasible to run your own server from home. Whether Australian internet will ever be up to the job is another matter…


There’s a lot more we could go on about, if not for lack of time. In all, it was a very successful conference: a talk was given, ponies were node’d, a mobile phone was sent towards the stratosphere on party balloons, and Project Horus had their own successful launch. Next year we’re off to Canberra for LCA; hope to see you there!


LCA day 3 – High Availability

Published January 20th, 2012 by Barney Desmond

Thursday was more of a “practical” day, with plenty of hands-on hacking. This is nothing new, but nowadays you’re more likely to talk about running a bittorrent client on your bluetooth headset than linux on your toaster. There are some genuinely awesome, really cool hacks out there (Android and Arduino is where a lot of it’s at), but they’re unlikely to help us give you 99.8% uptime. :)

Instead, we’ll have a really quick rundown of the high availability (HA) and virtualisation talks, and why it’s a good thing we sent a sysadmin along to them.


Complexity is your biggest enemy when trying to build reliable systems. Complex systems tend to be flaky, and that means they’re unpredictable. Unpredictable systems are bloody hard to support and rely upon. You won’t read this in all the you-beaut cloud services literature, but highly available systems are complex. Really, really complex.

This is all manageable, but it means your staff need to be trained with an intimate understanding of everything, top to bottom. When you’re unfamiliar with it, the HA stack on linux is like the bogeyman. It scares the living daylights out of you, and you try to pretend that if you close your eyes it’ll just go away. This is okay most of the time, but for a company like Anchor it would leave you dependent on a small team of HA gurus when things go wrong.

Thank $DEITY for the High Availability Sprint at LCA. Anchor can train you in The Way Of The Cluster if you so desire, but an enlightenment session from the jedi grandmasters is immeasurably valuable. Knowledge breeds confidence, and these things translate to a more effective sysadmin. If you’re an Anchor customer with an HA system, it means we can support you better, and respond faster when there’s a problem. Everyone wins!


To wrap up, a quick look at the presentation on Ganeti, software for management of a cluster of virtual machines.

We evaluated Ganeti for our needs a couple of years ago as a VM solution, and found that it wasn’t mature enough to really be usable. It’s clearly grown up since then, but I think it might be more interesting to discuss why it’s still no good for us.

Most people can probably look at the featureset and determine whether it’s what they need. Magical on-demand clouds of VMs are the “in thing” at the moment, so what aren’t they good for? Well, it turns out they’re not much good for web-hosting.

This really became evident several months ago when we tasked a sysadmin with evaluating the various cloud management products on the market (free or otherwise). It’s kinda disappointing, but the truth is that we don’t need 100 instances of the same machine. We certainly don’t want them to be ephemeral. The other benefits touted by cloudy VMs, such as live migration and replication, are nice but ultimately not that useful for us.

In the end we developed a system that met our real needs, as plain as they are: really fast to deploy, fully automated, customisable, comprehensively supported and monitored.


LCA day 2

Published January 19th, 2012 by Barney Desmond

Bit of a quiet day today, the highlight was probably the presentations on btrfs and xfs. Btrfs has been developing nicely, and Avi Miller got up to spruik some of the newer features of the filesystem. A bit like ZFS (which isn’t compatible with Linux licensing terms), it pulls in a lot of smarts that are usually the domain of your RAID controller/subsystem. This means more flexibility in how you handle your data, but a lot of new complexity too.

It’s exciting stuff, but we’ll be waiting a bit longer to consider it robust enough to use in production. We’d kill for the integrated snapshotting (great for backups) and data integrity checking (store CRCs with your data) features.

Meanwhile, XFS reports steady progress and positions itself as the filesystem of choice for Really Big systems. Not that anyone would admit to it, but it was clear there was a little bit of rivalry between the two, especially since both talks were back-to-back in the same room. :)

Dave Chinner talked about how they’ve spent a lot of time working through the metadata performance issues that have caused headaches for scaling-up in the past, and reckons XFS should scale linearly, unlike the competition. Probably not something you’ll lose sleep over when deciding how to format your root filesystem, but definitely important for databases and big filestores.


In lieu of other diversions, let’s have a look at the LeoStick, which was included in the bag of goodies for LCA attendees, alongside the requisite stubby coolers and mousepads.

Unless you’ve been living under a really big rock, you’ll know that the Arduino is the go-to platform for hackers wanting to build embedded systems. This is thanks to ease of programming, fast prototyping, and expansion options (need a thermal probe? fingerprint scanner? CCD camera? there’s probably a single shield module with all of those things). The LeoStick is particularly cute in that it comes in USB thumbdrive form-factor. As this is a pre-release board, the more cynical amongst us will note that this is a stroke of marketing genius that should result in some free beta-testing. Heh.

I know a couple of my fair colleagues are handy with a soldering iron; just quietly, this thing may or may not have had something to do with requests from the LCA organisers to stop messing with the exposed USB ports on the electronic door locks around campus.


LCA update, Day 1

Published January 18th, 2012 by Barney Desmond

Anchor’s talk went pretty well by all reports, huzzah!

Actually, it wouldn’t be fair to say it was that easy, so I’ll let the cat out of the bag on this one:

How Anchor's presentation slides for LCA2012 got done in time

Panel 1

T-Rex: Our talk to linux.conf.au got accepted!

Panel 2

{Close-up of T-Rex’s face, he is visibly excited}
T-Rex: It will be AWESOME

Panel 3

{Zoom out to show T-Rex and Dromiceiomimus. T-Rex is about to confidently stomp a tiny house}
Dromiceiomimus: You’ve prepared the talk months in advance, right?
T-Rex: 1337 speakers such as myself need no such preparation!

Panel 4

{Utahraptor replaces Dromiceiomimus in shot, verbally catching T-Rex just as he is about to stomp a tiny woman}
Utahraptor: But what about the slides?

Panel 5

{Now some distance apart, T-Rex and Utahraptor look directly at each other, in tense silence}

Panel 6

T-Rex: Oh uni placement dude?! Can I ask you a favor???


I kid, I kid – they did make the slides themselves, all of them. No uni students were harmed or exploited in the making of this talk.

To wrap up, one talk that covered a topic that doesn’t get much loving was Moving Day: Migrating Big Data from A to B. Mozilla had more than 40TB of data in their crash-reporting system, which demands near 100% uptime, and needed to move it all to a new datacentre – not something to be cowboyed the morning after an all-night bender.

Rigorous planning, automation and testing ensured that everything went smoothly; this talk instilled an idea of how to approach such a mammoth project with confidence.

This is something we handled when GitHub moved to Rackspace, but Mozilla also added a “post-mortem” phase – even if everything goes well (it did), there are lessons to be learnt from the experience, which stands you in good stead for the next time.


Exciting news from LCA miniconfs

Published January 17th, 2012 by Barney Desmond

Florian Haas gave a talk yesterday at the HA miniconf to present Flashcache, a project that was spawned from Facebook and their desire to squeeze more performance out of their databases.

The basic concept is to use any SSD device as a cache in front of slower rotational media. This is similar to commercial products such as LSI’s Cachecade, but implemented as a linux device-mapper module (so you wouldn’t be able to boot from such a setup, but that’s unlikely to be a real concern).

One of the nice things about Flashcache is that it’s presented as a plain block device. As well as making for a robust and understandable system, a practical upshot of this is that you can also replicate your cache with DRBD. In large HA database setups, this would mitigate a lot of the cache warmup penalty that you suffer after a reboot or failover event.

Flashcache is also fairly configurable, and exposes a lot of stuff through procfs rather than being a black box.

At the moment you have to build it as an out-of-tree module, so of course it’s not the kind of thing we’ll be rushing into production any time soon. Based on what we’ve seen in the past, I reckon there’s a good chance we’ll see Flashcache in mainline in a year or two if there’s a concerted push on development.


It came from beneath the raised floor

Published January 17th, 2012 by Barney Desmond

Yes, it’s another post about datacentre horrors. I know what you’re thinking: “Yeah yeah, I’ve seen the one about the cabling”.

Yeah well I used to be a datacentre technician like you, then I took a PCI-slot shiv in the knee.

(Edit: Hrm, it looks like the owner nuked the gallery but the files still exist. You can try the Google cache, or a copy that was nabbed.)
