A change of perspective: Michael Sharp

Published February 28th, 2012 by Hououin Kyouma

Michael’s just moved across from our hardware team to our customer service team. What makes a man say goodbye to rooms of purring servers to help customers with issues? Let’s find out…

New job: Level 1 Customer Support Technician (for almost a month now)

Previously: Datacentre Technician, been with Anchor since August 2009

Changed because: My previous job wasn’t as interesting as I first thought

What’s the most interesting it ever got?
When we provisioned a large shared storage cluster. I received it, processed it, installed it and configured it. No-one else at Anchor had done anything like it before so I got to develop some specialist skills, had a lot of trust placed in me.

Who do you look up to at Anchor?
(The other) Michael’s very good at very technical problems, Barney’s also pretty amazing.

Where do you hang your hat?
I grew up in the glorious Greater West; now I reside in Kensington.

How do you get to work from Kensington?
A motorbike.

Just any motorbike?
No, not just ‘any’ motorbike – this is ‘the’ motorbike: a Suzuki SV650. I feel the need for speed.

Gaming platform of choice: PC

Top titles of all time: Freespace, Mass Effect series

Gravity? Yes, gravity.

Posted in FTW


Q&A with Ryan Brailey

Published February 21st, 2012 by Barney Desmond

Ryan joins us here at Anchor for an eight-week internship. We caught up with him over breakfast to see how it’s going.

Could you tell us a bit about where you’re from and what brings you to Anchor?

Sure. I’m from Camden, and I’m currently in my third year of a B.IT (Networking) at University of Wollongong.

So you’re doing an internship here, what’s a normal day at Anchor like for you?

At the moment I’m working with the Nagios monitoring software and auditing customers’ server support levels: stuff like CPU utilisation, load, memory usage, etc. That and helping our sysadmins monitor and respond to issues.

What’s your next career move?

My degree is in networking so I’m hoping to get into a sysadmin role or something similar. I like being around hardware, but the scope is very narrow if you limit yourself to just hardware.

Your favorite place to work?

The Global Switch datacentre. It’s near the office, just across Darling Harbour in Ultimo.

Anchor – rocks or sucks?

Definitely rocks. My expectations of working here have been met and exceeded. It’s been really impressive to see how much expertise they have, how much custom-written stuff they use to monitor and respond to customer issues.

Last question, whereabouts for your next internship?

Here, please. If I weren’t interning at Anchor I just don’t think I’d be learning as much.

Posted in FTW


A stronger tingling sensation

Published February 20th, 2012 by Barney Desmond

If you’re just joining us, you should probably have a quick read of the previous post in which I talked about tingle, a tool to help you think less when it comes to installing package updates.

Now that we have a tool to make updates quick and easy, we need to use it. The critical thing here is making it work for us, not the other way around. As an example, one of the reasons RHN is no good for our Red Hat systems is that it relies on a daemon periodically checking in to the RHN server, so you don’t have a firm enough idea of just when your updates are going to happen.


There are a number of goals that we need to fulfil. Some are business matters, and others are more general policy.

  • Updates must be performed in a timely manner, once a week is considered acceptable
  • We must be notified if updates aren’t happening, so as to not breach our contractual obligations
  • The customer must be notified at least three business days in advance of updates occurring, seeing as updates incur a small amount of downtime
  • Each machine can have its own designated timeslot for updates (eg. Monday nights at 9pm)
  • Scheduling needs to be centralised, so it’s automatable and easy to modify
  • Being able to suspend updates for a week should be easy, in case the customer asks us to hold off

We already have a lot of supporting systems, so it made sense to integrate tingle into those used for automation and administration. We use Puppet to ensure a consistent state for our managed machines, and an in-house provisioning system for a “source of truth” and record-keeping.


Naturally, everyone will have different ideas about how automated updates should be done. Certainly they’ll have different supporting systems to tie into. What I’ll cover here is very Anchor-specific, but it should give you some ideas if you ever want to undertake such a setup.

  1. When a new server is built by our automated system (the same one we talked about at LCA), it generates a timeslot by hashing some server-specific details and inserts that into our provisioning system.
  2. Puppet runs every four hours (exact times are chosen by a similar hashing function) and queries the provisioning system. If a timeslot has been recorded, puppet sets up cronjobs to apply updates and also email the customer.
  3. Three days before the designated timeslot, tingle checks for any pending updates, and sends a notification accordingly. The server doesn’t know the customer’s address, so the mail goes via a custom Lamson app that queries our admin systems for their email address and sends the message on its merry way.
  4. Barring any hold orders from the customer, tingle does its thing as scheduled.
  5. After a successful tingle run, the host submits a passive check result to our Nagios server. This is very efficient, and fulfils our requirements for monitoring.
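Step 1 can be sketched roughly as follows. This is a minimal illustration, not Anchor's actual code: the use of SHA-256, the hostname as the only input, the 9pm to 11pm window, and the names `weekly_timeslot` and `cron_line` are all assumptions made for the example.

```python
import hashlib

# Hypothetical update window: weekday names and evening start hours.
DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
ALLOWED_HOURS = (21, 22, 23)


def weekly_timeslot(hostname):
    """Derive a stable (day, hour, minute) slot from a server's hostname.

    Hashing means the same host always lands on the same slot, while
    different hosts are spread roughly evenly across the window.
    """
    digest = hashlib.sha256(hostname.encode("utf-8")).digest()
    value = int.from_bytes(digest[:8], "big")

    day_index = value % len(DAYS)
    value //= len(DAYS)
    hour = ALLOWED_HOURS[value % len(ALLOWED_HOURS)]
    value //= len(ALLOWED_HOURS)
    minute = (value % 12) * 5  # five-minute granularity

    return DAYS[day_index], hour, minute


def cron_line(hostname, command="tingle reboot"):
    """Format the slot as a crontab entry (day-of-week 1 = Monday, 7 = Sunday)."""
    day, hour, minute = weekly_timeslot(hostname)
    day_number = DAYS.index(day) + 1
    return f"{minute} {hour} * * {day_number} {command}"


if __name__ == "__main__":
    print(cron_line("misaka.example.com"))
```

Because the slot is a pure function of the hostname, it can be recomputed anywhere without coordination, which is presumably part of the appeal of hashing over assigning slots by hand.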

There you have it, servers that keep themselves patched, predictable, and the customers happy. The only servers we don’t auto-tingle are older Supermicro hardware – they don’t have remote-access cards, so we’d rather be fully attentive when applying their updates. For Dell hardware and virtual servers, it’s tingle à gogo!
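For the passive check in step 5, Nagios accepts results as external commands in the form `[timestamp] PROCESS_SERVICE_CHECK_RESULT;host;service;return_code;output`. The sketch below only builds that line; the service name `tingle` and the function name are made up for the example, and the post doesn't say how the result is actually delivered to the Nagios server.

```python
import time


def passive_check_line(host, service, return_code, output, timestamp=None):
    """Build a Nagios PROCESS_SERVICE_CHECK_RESULT external command line.

    return_code follows the usual plugin convention:
    0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN.
    """
    if return_code not in (0, 1, 2, 3):
        raise ValueError("return_code must be 0, 1, 2 or 3")
    if timestamp is None:
        timestamp = int(time.time())
    return (
        f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;"
        f"{host};{service};{return_code};{output}"
    )


if __name__ == "__main__":
    # Hypothetical host and service names.
    print(passive_check_line("misaka", "tingle", 0, "updates applied"))
```

Pairing a passive service like this with Nagios's freshness checking is one way to be alerted when results stop arriving, which would cover the "notify us if updates aren't happening" goal.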


Let’s wrap it up with a final point about why this is really important. There’s a popular signature you see on forums every now and then: in a nod to the movie Fight Club, one of the lines is “You are not your uptime stats”.

If you’re not keeping your servers patched, you are seriously doing it wrong (we’ll concede that most, but not all systems need an uptime-destroying reboot to patch the kernel). The recent /proc/pid/mem vulnerability in the Linux kernel and its easily-demonstrated exploit, Mempodipper, should be a timely reminder of this.

Some customers get a bit sore about the rebooting, so to put it in perspective: assuming there are package updates every single week, and it takes about 5min to reboot, that’s 260min of (scheduled) downtime per year. That’s 99.95% uptime.

Let’s be pessimistic and double the downtime estimate for each reboot, so you’re up to 520min (8h:40m) per year. 10min per week is bang-on 99.9% uptime (“three nines”), and you know exactly when your downtime is scheduled. If you think you need better, you can set up a second server and have them cover for each other, but you’ll pay the price in added complexity. Realistically there are so many other things that will bite harder than rebooting for updates, you really have no excuse!
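If you'd like to check the arithmetic or plug in your own reboot times, here's a quick sketch. It assumes 52 update weeks and a 365-day year; the function name is just for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def yearly_uptime_percent(minutes_down_per_week, weeks=52):
    """Uptime percentage given a fixed amount of downtime every week."""
    downtime = minutes_down_per_week * weeks
    return 100.0 * (1 - downtime / MINUTES_PER_YEAR)


if __name__ == "__main__":
    for minutes in (5, 10):
        print(f"{minutes} min/week: {yearly_uptime_percent(minutes):.2f}% uptime")
```

With 5min and 10min per week this works out to roughly 99.95% and 99.90% respectively, matching the figures above.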


That tingling sensation

Published February 17th, 2012 by Barney Desmond

This is the first post of a two-part set.

We’re talking about keeping your system up to date, specifically Linux ones.

Maybe you religiously check for updates and apply them every Monday. Maybe you’re a bit perverse and start your morning with a black coffee and a quick emerge world. Or maybe you’re lazy (like me) and only bother doing it whenever you log in and see “66 packages can be updated”.

Whatever you do, keep doing it, and regularly. But what do you do when you have hundreds, maybe thousands of machines that need updates applied? (Hint: we’re lazy, so the answer is always “make a robot do it”)


We used to apply updates manually, which worked well when there weren’t so many servers to deal with, and you could check for any problems along the way. The first step to automation was realising that applying vendor-supplied updates wholesale, without oversight, isn’t really that scary (you might not think it so scary, but we have an aversion to risk and a responsibility to our customers). We maintain RHEL, Debian and Windows servers here, and all three of them have proven to be solid when it comes to not-detonating-your-system.

Well that’s easy, you just need to roll out a cronjob everywhere that runs yum -y update every day, and the problem is solved. And aptitude update ; aptitude -y safe-upgrade for Debian. Better add the --quiet flag as well, so that cron doesn’t send spurious mail, and/or discard the output by redirecting it all to /dev/null.

You could do all this, and then you really hope nothing breaks, ’cause it’s hard to know just what’s happening on your system now.


These annoyances aren’t insurmountable, but they’re tedious and finicky. That’s why our resident Overkill Specialist developed tingle.

tingle abstracts away the little details of your distro’s package manager, making things consistent between systems, and adds some smarts to make things more efficient and flexible for your environment. On a tingle-enabled system you just log in and fire a single command:

tingle reboot

That’s it. Updates will be downloaded, applied, then a reboot will be scheduled to occur in 10min. If you don’t want the reboot (which is necessary for kernel updates) then tingle apply will do the trick.
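To give a feel for what "abstracting away the package manager" means, here's a deliberately tiny sketch. It is not tingle's code or interface: the function names and the distro-family keys are invented for the example, and only the yum and aptitude commands come from the cron example earlier in this post.

```python
# Commands taken from the cron example above; the mapping itself is illustrative.
UPDATE_COMMANDS = {
    "redhat": [["yum", "-y", "update"]],
    "debian": [["aptitude", "update"], ["aptitude", "-y", "safe-upgrade"]],
}


def update_commands(distro_family):
    """Return the list of commands that apply all pending updates."""
    try:
        return UPDATE_COMMANDS[distro_family]
    except KeyError:
        raise ValueError(f"unsupported distro family: {distro_family!r}")


def apply_updates(distro_family, runner):
    """Run each update command in order via `runner`, stopping on the first failure.

    `runner` takes an argument list and returns an exit status, so real
    execution (e.g. subprocess.call) or a dry-run stub can be plugged in.
    """
    for command in update_commands(distro_family):
        status = runner(command)
        if status != 0:
            return status
    return 0


if __name__ == "__main__":
    # Dry run: print what would be executed and report success.
    apply_updates("debian", lambda cmd: print(" ".join(cmd)) or 0)
```

Passing the runner in keeps the distro-specific knowledge in one table and makes it easy to test without touching a real system.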

We make use of tingle’s hookpoints to remove old kernel packages and run puppet after applying updates. Given the rich set of hooking options, you could easily have the server remove itself from a load-balancing cluster before applying updates, or suppress monitoring warnings while it reboots. Whatever you do, the end result is the same: systems that are more reliable and need less human intervention for mundane tasks.


If you’d like to try out tingle, you can find a copy in our GitHub repo at:

https://github.com/anchor/tingle

Tune in for the next post on Monday, when we’ll talk about how we automate the use of tingle. It’s useful on its own, but we still need to solve the very real problems of making it happen, arranging a suitable time for this to happen with the customer, and of course the elephant in the room: monitoring it to make sure you’re meeting your contractual obligations (non-existent updates are the ones that cause zero downtime…)


Dedicated crypto accelerator cards? Please, that’s so last decade

Published February 15th, 2012 by Barney Desmond

Today I’ve been looking over the legacy architecture for a new customer we have coming on board. I think it’s fair to say that they’re of a substantial size.

One of the things that stands out to me is that they have five load-balancers (huh?) on the public-facing end, and then seven nginx frontends terminating SSL traffic and serving static content. Let’s ignore the load-balancers, I think they’re just some cloud-y appliance. The frontends are where it’s at.

These are some pretty substantial VMs (a certain provider’s 2GB instance) running SSL all day and not much else. Their app doesn’t even run on the frontends!

SSL crypto is very much the lifeblood of internet commerce; it’d come to a screeching halt without it these days. It’s just an unfortunate fact that it’s computationally very expensive.

SSLShader – A GPU-accelerated SSL Proxy

Now we’re talking. Unlike using your GPU for swap space, this actually sounds kind of sensible.

The benefit of using a GPU comes from the heavy parallelisation inherent in the architecture, which is great when you want to serve many requests in parallel. Like on a web server. Assuming you can fit them in the chassis (powerful GPUs tend to be two slots wide, which doesn’t jibe well with a 2RU rackmount chassis), GPUs should be quite cost effective, too.

“What about modern CPUs with AES-NI instructions?”, you might ask. It’s good, but it’s really more relevant for bulk crypto.

Every SSL transaction starts with a key-exchange handshake, which typically uses RSA. RSA is extremely computationally expensive, which is why we use it to bootstrap a symmetric cipher like AES. You can go for your life with optimised AES-NI once the key-exchange is done, but the RSA is still a big startup-overhead – SSLShader shows promising results here.
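To make the "bootstrap" idea concrete, here's a toy sketch of the pattern: an expensive public-key step wraps a small session key once, then a cheap symmetric cipher does the bulk work. Everything here is for illustration only. The RSA numbers are the tiny textbook ones (hopelessly insecure), and since Python's standard library has no AES, a SHA-256 keystream XOR stands in for the symmetric cipher. None of this reflects how SSLShader or a real SSL stack is implemented.

```python
import hashlib

# Textbook RSA parameters (far too small to be secure; for illustration only).
P, Q = 61, 53
N = P * Q                 # public modulus: 3233
PHI = (P - 1) * (Q - 1)   # 3120
E = 17                    # public exponent
D = 2753                  # private exponent: (E * D) % PHI == 1


def rsa_encrypt(message_int):
    """Public-key operation: cheap, because E is small."""
    return pow(message_int, E, N)


def rsa_decrypt(cipher_int):
    """Private-key operation: the expensive step the server performs per handshake."""
    return pow(cipher_int, D, N)


def keystream_xor(session_key, data):
    """Toy symmetric cipher standing in for AES: XOR with a SHA-256 keystream."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream.extend(hashlib.sha256(f"{session_key}:{counter}".encode()).digest())
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))


if __name__ == "__main__":
    session_key = 1234                   # client picks a secret (must be < N)
    wrapped = rsa_encrypt(session_key)   # sent to the server during the handshake
    recovered = rsa_decrypt(wrapped)     # one private-key operation on the server
    ciphertext = keystream_xor(session_key, b"GET / HTTP/1.1")
    print(keystream_xor(recovered, ciphertext))
```

With real key sizes the private exponent is thousands of bits long, so `rsa_decrypt` is where the per-connection cost piles up; the symmetric part stays cheap no matter how much data follows.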

SSLShader doesn’t look ready for primetime just yet (code not readily available), but it’s a very exciting piece of research. Whether it’s something we really need in the datacentre is an unanswered question, but it looks like a decent solution to a real problem that some websites will face.

(and just think, you can mine bitcoins when the website’s not busy…)


Meet our newest Anchorite – Minh Duy Do

Published February 14th, 2012 by Barney Desmond

Minh joins us while completing his Bachelor of Business (HR) at the University of Western Sydney, and has already set his sights on achieving some mighty big goals at Anchor. Asked what he wants to achieve in the next twelve months, he thinks for a moment before replying with a smile: “CEO!” ;-)

For the moment Minh will be, as he puts it, “one of the few Anchor team members who aren’t a sysadmin” – working with Jess and Phil in a high performance team, managing new client enquiries.

We wear our über-sysadmin credentials with pride at Anchor, and Minh noticed this when he first interviewed with us.

“Originally, I applied for a level 1 technical role. At other companies, level 1 in any discipline is really easy. But at Anchor they gave me this quiz, and when I flipped through it, by page three it was all programming – not easy programming either!”

“Before I applied for the job I researched Anchor as a company and felt I would like the culture. Everything you see on the Anchor website, how the rest of the industry speaks of them, you can tell they’re proud of their technical skills but otherwise really down-to-earth.”

“People here are trusted with a lot of independence and freedom to do their job on their own terms. In return, the company gets really high productivity and a lot of loyalty from the team.”

In his free time, Minh studies 6th-grade violin and is an avid WoW player (Level 85 Orc Hunter with a phosphorescent stone dragon mount, reprazenting Guild HasNoPants on Dreadmore).


Collaborative hydration session tonight

Published February 8th, 2012 by Barney Desmond

Our pals over at GitHub and Heroku are having a drinkup this evening at King St. Wharf. It’s not strictly our gig, but a few of us will be along to shoot the breeze on sysadminly topics and have a yarn.

More details on GitHub’s page, come along if you like hanging around hardcore technical people (and beer taps).

Posted in FTW


100% FAT-free

Published February 6th, 2012 by Barney Desmond

I wrote some documentation for our sysadmins last week detailing how one should deal with a critical diskspace notification at some ungodly hour of the morning. On the specifics of checking filesystems with the df tool:

“Astute readers will notice that we don’t query btrfs filesystems here; this is because btrfs uses extents, and inodes are a non-issue.”

Well, I wasn’t entirely wrong, but I wasn’t entirely right either.


btrfs is a modern filesystem with lots of shiny new features. It’s definitely not production-ready yet, but like a magpie drawn to shiny things, a couple of us use btrfs on our own machines (it’s what backups are for, right?).

Some time ago I wrote about how an ext filesystem can run out of free inodes and bite you. That happened to me last Thursday, only this time it was btrfs under the hood.


I first noticed the problem when puppet wouldn’t run, saying there was already another instance running. puppet is dumber than a bag of rocks so I pressed on, trying to run aptitude update instead.

root@misaka:~# aptitude update
E: Write error - write (28: No space left on device)

O rly? df disagreed about that. I immediately thought of inode exhaustion, but btrfs isn’t meant to suffer from this problem! To prove it, I touched a few files, successfully wrote some bits, deleted them again – all good.

Their curiosity piqued, my fellow sysadmins cracked open the strace and confirmed what we knew: ENOSPC from the write() call. We were at a loss until someone serendipitously spotted some errors in the syslog:

Feb 2 19:09:31 misaka kernel: [683642.593034] no space left, need 4096, 10694656 delalloc bytes, 696373248 bytes_used, 0 bytes_reserved, 0 bytes_pinned, 0 bytes_readonly, 0 may use 707067904 total
Feb 2 19:09:55 misaka kernel: [683666.684247] no space left, need 4096, 6905856 delalloc bytes, 700162048 bytes_used, 0 bytes_reserved, 0 bytes_pinned, 0 bytes_readonly, 0 may use 707067904 total

A little googling produced a promising bug ticket on Red Hat, “[btrfs] hopeless ENOSPC handling and excessive administration costs”.

The short version for our specific scenario is: df doesn’t expose some exhaustion issues because btrfs doesn’t work like a classic filesystem.
This is where you can start moaning about how btrfs is FitH if you’re so inclined, but I like playing with my shiny toys, thank you.


btrfs has its own version of df for inspecting the filesystem:

root@misaka:~# btrfs filesystem df /var

Metadata, DUP: total=95.12MB, used=15.16MB
System, DUP: total=8.00MB, used=4.00KB
Data: total=674.31MB, used=665.52MB       <-- Under 10MB free!!
Metadata: total=8.00MB, used=0.00
System: total=4.00MB, used=0.00

This would explain why I could create files myself, but stuff like aptitude was failing when it tried to write more than several MB. You'll also notice that there's a lot of allocated-but-unused metadata space in the first line of output.
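If you wanted to keep an eye on this automatically, the output is easy enough to parse. The sketch below handles lines in the format shown above and works out the free space per chunk type. The function name is made up, the regex only covers the KB/MB/GB suffixes seen here, and other versions of the btrfs tools may well print things differently.

```python
import re

# Unit multipliers, expressed in MB, for the suffixes seen in the output above.
# A bare number (e.g. "0.00") has no suffix and is treated as MB.
UNITS_IN_MB = {"KB": 1 / 1024, "MB": 1.0, "GB": 1024.0, "": 1.0}

LINE_RE = re.compile(
    r"^(?P<label>[^:]+):\s*"
    r"total=(?P<total>[\d.]+)(?P<tunit>[KMG]B)?,\s*"
    r"used=(?P<used>[\d.]+)(?P<uunit>[KMG]B)?"
)


def parse_btrfs_df(text):
    """Return {label: {"total_mb", "used_mb", "free_mb"}} for each recognised line."""
    usage = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip prompts, blank lines and anything unrecognised
        total = float(match.group("total")) * UNITS_IN_MB[match.group("tunit") or ""]
        used = float(match.group("used")) * UNITS_IN_MB[match.group("uunit") or ""]
        usage[match.group("label")] = {
            "total_mb": total,
            "used_mb": used,
            "free_mb": total - used,
        }
    return usage


if __name__ == "__main__":
    sample = "Data: total=674.31MB, used=665.52MB"
    for label, numbers in parse_btrfs_df(sample).items():
        print(f"{label}: {numbers['free_mb']:.2f}MB free")
```

Run against the output above, the Data line comes out with a little under 9MB free, which is the number df wasn't showing.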

We have a tool to fix this, and unlike btrfsck it's actually usable. We can rebalance the filesystem to adjust the proportion reserved for data. Some commenters on the bugzilla ticket noted that it caused a kernel panic when they ran it, but that was two years ago. It's probably fixed by now...

root@misaka:~# btrfs filesystem balance /var

# Now when we run `df` again...
Metadata, DUP: total=47.56MB, used=15.20MB  <--- Much less allocated
System, DUP: total=8.00MB, used=4.00KB
Data: total=745.38MB, used=665.52MB         <--- Plenty of free space
Metadata: total=0.00, used=0.00
System: total=4.00MB, used=0.00

Mission Accomplished!

aptitude and puppet run fine now, so all is well. As a note, the rebalancing is (subjectively) not fast: it took 7-8sec on that 1GB filesystem.


To wrap things up, I thought I might extend that filesystem a bit, as some more breathing room would be good. The btrfs volume is on an LVM logical volume, so this is a pretty easy task.

  1. Extend the LVM LV by 512MiB
    lvextend -L +512M /dev/misaka/var
    
  2. Grow the btrfs filesystem to fill the newly-enlarged block device
    btrfs filesystem resize max /var
    
  3. Rebalance the btrfs filesystem (optional?)
    btrfs filesystem balance /var
    

Now, I'm not sure whether the final rebalance is strictly necessary. The system's df tool acknowledges the extra size after the resize operation, but btrfs-df shows no change in its output until the rebalance is done. A little testing would be in order, but I'd rather do it on a dedicated testing machine.

Any other cowboys out there using btrfs? Your data may or may not be intact when the sun rises tomorrow, but boy it's exciting!

Posted in FTW


Channelling your rage

Published February 3rd, 2012 by Barney Desmond

Getting notifications when servers break is always annoying. We use Nagios at Anchor, a very popular solution. “Friggen nagios!” is a pretty common cry.

If you get a lot of notifications in quick succession, your Rage meter starts to build up. When it hits 100% you unleash a special attack and reboot the server.

Rachel's gauge is at 100%, circled in blue crayon. She can now reboot the server with her Static Iris

That’s pretty cool, but it turns out that customers don’t like reboots as much as us, so we looked at ways to reduce the rage. One great way to do this is with better documentation; we call it Ragewiki.


Making use of the notes_url parameter, we provide a link to our wiki documentation directly from Nagios’ web interface. There’s one page for each service, with precise instructions on how to diagnose and fix common problems, as well as a brief description of what the service actually does.

So now when you get that SMS at 3am (PROBLEM – ntype on fundle is CRITICAL), you don’t spend 20 minutes flailing through A Brief History of Time, as told by H.P. Serverbox.


To sweeten the deal a bit, we also allow for host-specific instances of a service, which might need extra-special instructions. We also have a page full of terse legacy documentation that we’d like to fall back on in case the new docs haven’t been written yet. We think it’s a cute little hack so we’d like to share it with you.

The possibilities are up to your own imagination, we just went for the most straightforward option. You could always link to a big red button that reboots the server straight away. :)

  1. Give every service a URL in the Ragewiki, using the notes_url argument. We attach this to the generic service template so that every single service automatically gets a link.
    # RageWiki ftw
    notes_url /ragewiki/$HOSTNAME$/$SERVICEDESC$

    You’ll notice that we’ve parameterised the URL so that each host-service pair is unique

  2. Prepare a rewrite map to check for existence of docs
    This URL will refer to the Apache instance on the nagios server itself. It captures the request starting with /ragewiki/, extracts the hostname and servicename, then builds a suitable redirect.

    Because we want to support per-host pages that may exist, we use a RewriteCond and a smart RewriteMap to check whether the page exists, then redirect accordingly. We use moin as our documentation wiki, with HTTP access control in front of that.

    RewriteLock /var/lock/rewrite.lock
    RewriteMap RageWiki "prg:/usr/bin/xargs -n1 -d '\\\\n' /usr/bin/HEAD -sd -H 'Authorization: Basic EncodedUsernameAndPassword'"

    You may want to read up on Apache’s RewriteMap functionality to make sense of this. The short version: it contacts the wiki and returns the HTTP status line for the suggested page. A 200- or 300-series status code is considered a success – the page exists and should be used.

  3. Finally, use the RewriteMap and generate a suitable redirect
    This is a basic set of cascading rewrites, the first success will terminate further processing.

    # Server-specific docs: /servers/$HOSTNAME/$SERVICENAME
    RewriteCond ${RageWiki:https://magic.ponies.anchor.net.au/servers/$1/$2} ^[23]\d\d
    RewriteRule ^/ragewiki/([^/]+)/(.+)$ https://magic.ponies.anchor.net.au/servers/$1/$2 [R,L]
    
    # Whole lotta BGP goin' on (with variable check names, a variant of generic docs)
    RewriteCond ${RageWiki:https://magic.ponies.anchor.net.au/Nagios/Services/bgp} ^[23]\d\d
    RewriteRule ^/ragewiki/[^/]+/bgp[_-].+$ https://magic.ponies.anchor.net.au/Nagios/Services/bgp [R,L]
    
    # Generic docs for normal services: /Nagios/Services/SERVICENAME
    RewriteCond ${RageWiki:https://magic.ponies.anchor.net.au/Nagios/Services/$1} ^[23]\d\d
    RewriteRule ^/ragewiki/[^/]+/(.+)$ https://magic.ponies.anchor.net.au/Nagios/Services/$1 [R,L]
    
    # Catch any checks without docs, and send them to the fallback page.
    # Funky regexes to pass the failed service name through to the fallback page.
    # FIXME: Can we use a positive-lookbehind in these things? Would make it slightly tidier.
    RewriteRule ^/ragewiki/([^/]+)$    https://magic.ponies.anchor.net.au/CommonNagiosServiceCheckReference#$1 [NE,R,L]
    RewriteRule ^/ragewiki/.*/([^/]+)$ https://magic.ponies.anchor.net.au/CommonNagiosServiceCheckReference#$1 [NE,R,L]
    

    Special cases with varied names, like our BGP checks, are easily handled by dropping a custom regex into the chain. It’s best if your service names have a consistent format that can be readily pared back to a basic name, but this method is fine for the occasional odd case.
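If the rewrite rules are hard to follow, the decision they make can be modelled in a few lines of Python. This is only a model of the cascade, not something we run: `status_of` is a hypothetical stand-in for the RewriteMap program, and the function name is invented for the example.

```python
import re

WIKI = "https://magic.ponies.anchor.net.au"
SUCCESS = re.compile(r"^[23]\d\d")  # same test as the RewriteCond patterns


def ragewiki_redirect(host, service, status_of):
    """Pick the documentation URL for a host/service pair.

    `status_of(url)` stands in for the RewriteMap program: it returns the
    HTTP status line for `url` (e.g. "200 OK" or "404 Not Found").
    """
    # Candidates in the same order as the rewrite rules.
    candidates = [f"{WIKI}/servers/{host}/{service}"]
    if re.match(r"^bgp[_-].+$", service):
        candidates.append(f"{WIKI}/Nagios/Services/bgp")
    candidates.append(f"{WIKI}/Nagios/Services/{service}")

    for url in candidates:
        if SUCCESS.match(status_of(url)):
            return url  # first success wins, like the [L] flag

    # Nothing exists yet: fall back to the legacy page, anchored on the check name.
    anchor = service.rsplit("/", 1)[-1]
    return f"{WIKI}/CommonNagiosServiceCheckReference#{anchor}"


if __name__ == "__main__":
    # Pretend no pages exist, so everything lands on the fallback page.
    print(ragewiki_redirect("fundle", "ntype", lambda url: "404 Not Found"))
```

Writing it out like this also makes it obvious where a new special case would slot in: it's just another candidate URL ahead of the generic one.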


Too easy! To give you an idea of what we think good Ragewiki docs look like:

  • What servers does this apply to?
  • Summarise what the nagios check is for (one sentence!)
  • What’s the impact of a failure? Customer visible? Websites are down? Etc.
  • A short procedure on how to confirm the notification and diagnose it further
  • A procedure on how to fix it

That’s it; the page should only be a couple of screens long at the most. If you can’t include all the necessary information, it’s best to put it on a separate page and link to it. We specifically don’t include information about How It Works because it detracts from fixing problems faster.

Ragewiki works great for us, so we’d be interested in hearing your thoughts and comments. It’d also be cool to know if other people have reached the same goal, but in a different way.
