Getting to know Henry Wang

Published March 30th, 2012 by Barney Desmond

New Anchorite Henry Wang joins our growing account management team under the watchful management eye of Jess Field. The former HP account manager, who has just come back from travelling in the UK for almost two years, says Anchor so far is “max cool!”

Your Background?

My family is from Hong Kong, but I’ve always lived in Strathfield and the inner west (of Sydney). I love it there, particularly all the different and interesting food that’s available!

You’ve just come back from the UK – why there?

My fiancée and I really wanted to travel, and we thought now was the right time, before we settled down and had a family. Unfortunately a family illness meant we had to come home a bit before we were ready, but I’m glad to be back.

Did you work while you were over there?

Yes, as an account manager for Capital Support. We sold everything there in IT, from hardware, PCs, servers, networking, printers, and services including onsite engineers and ad hoc engineering solutions. They had their own cloud solution too.

How are you finding Anchor so far?

Anchor is max cool! The people are friendly and helpful. It’s a very technical environment and a very technical team, and I love these guys. It’s very enjoyable, and I realize I have a lot to learn!

I’m just finishing up my training so I’m learning and absorbing as much as I can, reading and speaking with the various members of the team, even Keiran [our CEO –Ed.]. He’s very open and approachable.

And finally, what do you like doing outside of work?

I love hanging out with friends and I’m a bit of an amateur photographer. I’ve got an “old school” Canon 400D SLR with a 28-135mm lens. I use it to shoot people, landscapes, anything. I find it relaxing to just hang out on weekends and capture great images that interest me.

Posted in FTW


Bugfixing the in-kernel megaraid_sas driver, from crash to patch

Published March 28th, 2012 by Barney Desmond

Today we bring you a technical writeup for a bug that one of our sysadmins, Michael Chapman, found a little while ago. This was causing KVM hosts to mysteriously keel over and die, obviously causing an outage for all VM guests running on the system. The bug was eventually traced to the megaraid_sas driver and the patch has made it to the kernel as of version 3.3.

As you can imagine, not losing a big stack of customer VMs at a time, possibly at any hour of the day, is a pretty exciting prospect. This will be a very tech-heavy post but if you’ve ever gone digging into kernelspace (as a coder, or someone on the ops side of the fence) we hope it’ll pique your interest. We’ll talk about the diagnostic process and introduce some of the new tools that made this possible.

Here we see Michael analysing a rare and dangerous kernel bug

Difficulties faced

The exact circumstances of what we saw weren’t terribly interesting; suffice it to say that you’d lose a whole machine and not have much useful logging to work with. Our VM infrastructure is mostly RHEL 5 and 6, running on Dell R510 hardware.

A handful of KVM hosts were affected while others were rock-solid. Differential diagnosis was frustrated by the fact that you can’t play around with live machines, that the fault couldn’t be triggered at-will, and that we don’t have a huge install base of hosts in which to find patterns. In addition, pretty much every dimension for analysis that we could think of had working and non-working cases.

As an example, we looked at the chassis, drives, RAID cards and kernel version, amongst other things. The problem might manifest on servers A+B+C, all with the same kernel version, but not on server D which has an older kernel. “Aha!” you might think, but then you’d find that server E is also crashing and has the same kernel as server D.

Just for the record, we’re running Red Hat Enterprise Linux 5 and 6 systems here. The tools we’re using are generic, but the usage details may be specific to RHEL.

Having a peek at the crashed system using kdump

Figuring out even the general circumstances of the crashing was not easy. All our Dell servers have out-of-band management cards (DRACs) that let you view the console, but these aren’t always helpful. Often the screen will be blank (default console-blanking in Linux), but even if you plan ahead and disable that, there’s only so much you can do with a crashed system from the console. You need something more.

Enter kdump. kdump is similar to other utilities like diskdump and netdump that let you grab a core from a system and push it to disk or across the network. They work, but they have some prerequisites relating to your storage and networking drivers that can be problematic. kdump dodges these issues by taking another approach. (If you’d prefer the canonical lowdown on kdump, head over to the kernel’s git repo and check out their docs.)

One feature that’s been around for a little while now is kexec. We’ll gloss over the details, but kexec is an interesting hack that lets you reboot into a new kernel by jumping straight into execution of the new binary. It’s a bit of a cheat and there are many ways you could come adrift in the process (who knows what state your devices are in?), but it’s the perfect answer for simpler tasks. kdump leverages kexec to execute some code that will grab a copy of the system state and leave it somewhere safe.

kdump starts by setting up what’s called a “crash kernel”. This is a special, really skinny kernel and initrd with just enough smarts to dump a core for you, then reboot. You boot the system as normal, passing the crashkernel parameter in the kernel line from GRUB. This sets aside a chunk of memory (about 256MiB) that nothing else can touch, and then loads the crash kernel into that space.

When the system crashes (panics), instead of turning into a blazing fireball it uses kexec to jump into the crash kernel. The crash kernel runs makedumpfile to capture a pristine copy of the panicked system and dump it to disk – we set aside an LVM volume for this purpose. Once it’s done, the system reboots as normal, and all your VMs come back online.

The first kernel has already guaranteed that the crash kernel won’t be touched by blocking out that memory, so it’s good to go at the drop of a hat. In turn, because we used kexec we can jump into the crash kernel knowing that all the crashed state is perfectly intact. The crash kernel and associated initrd have enough smarts to load the drivers necessary to mount a filesystem on LVM and write the dump, but they can also push the dump to a remote networked system if you’d prefer.

Our systems are big. Like, “128GiB of RAM” big. It takes a long time to write that much data even to fast RAID arrays, so we pass arguments to makedumpfile (check the manpage) to tell it to not bother with zero pages, cache pages, etc. Remember that any time spent running kdump is time that the VMs are down.
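As a rough illustration of those arguments: makedumpfile’s -d option takes a “dump level”, which is a bitmask of page types to leave out of the dump. The sketch below only builds such a level and formats the option string. The bit values are the ones listed in the makedumpfile manpage (check them against your version); the enum names and the helper function are ours, invented for illustration, and the post doesn’t say which level we actually settled on.

```c
#include <assert.h>
#include <stdio.h>

/* Page types makedumpfile can skip, one bit each, as described for -d in
 * the manpage. The names are made up for this sketch. */
enum dump_exclude {
    EXCLUDE_ZERO_PAGES    = 1,   /* pages filled with zeroes */
    EXCLUDE_CACHE         = 2,   /* cache pages */
    EXCLUDE_CACHE_PRIVATE = 4,   /* private cache pages */
    EXCLUDE_USER_DATA     = 8,   /* user process data */
    EXCLUDE_FREE_PAGES    = 16   /* free pages */
};

/* Hypothetical helper: write the "-d N" option for a set of exclusions.
 * Returns the number of characters written, as snprintf does. */
int format_dump_level(unsigned level, char *buf, size_t len)
{
    assert(level <= 31);         /* only five bits are defined */
    return snprintf(buf, len, "-d %u", level);
}
```

Skipping zero pages and both kinds of cache pages, for instance, gives 1 | 2 | 4 = 7, so the helper produces “-d 7”. Every page type you exclude shortens the dump, but it’s also something you can no longer inspect afterwards.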

Inspecting the crash site

Brilliant, now you’ve actually got something to look at. This sure beats rebooting the system and hoping it doesn’t happen again. But now where to?

This is where you fire up crash (an aptly, but inconveniently named tool). crash takes the debugging power of gdb and adds smarts to make it kernel-aware and more suited to getting this sort of work done. It has its limitations and quirks (seriously, guys, readline isn’t that hard), but it does the job.

That’s well and good, but we don’t know what we’re looking for yet. Using a debug-build kernel we’d managed to extract the eventual cause of the panic: corruption in some kernel data structures. The kernel’s slab allocator manages a series of caches, all of which are visible if you inspect /proc/slabinfo on your system. We were seeing corruption specifically in the buffer_head cache.

Buffer heads are 112 byte structures (the “objsize” heading in slabinfo) carrying metadata pertaining to I/O blocks. The slab allocator packs 36 buffer heads into each slab (“objperslab”), and all the slabs of buffer heads together form the “buffer head cache”. A cache is really just a doubly-linked list.
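Those two slabinfo figures are easy to sanity-check. The sketch below assumes each buffer_head slab occupies a single 4 KiB page; that is our assumption for the arithmetic, not something read off the crashed systems, and the function names are made up.

```c
#include <assert.h>

/* Figures quoted from /proc/slabinfo in the text. */
#define BUFFER_HEAD_OBJSIZE   112u
#define BUFFER_HEADS_PER_SLAB 36u
/* Assumption: one slab is one 4 KiB page. */
#define ASSUMED_SLAB_BYTES    4096u

/* Bytes taken up by the objects packed into one slab. */
unsigned slab_payload_bytes(unsigned objsize, unsigned objperslab)
{
    return objsize * objperslab;
}

/* Bytes of the slab not occupied by objects. */
unsigned slab_leftover_bytes(unsigned slab_bytes, unsigned objsize,
                             unsigned objperslab)
{
    unsigned payload = slab_payload_bytes(objsize, objperslab);
    assert(payload <= slab_bytes);   /* the objects must fit in the slab */
    return slab_bytes - payload;
}
```

With the numbers above, 36 buffer heads take 36 × 112 = 4032 bytes, which leaves 64 bytes of a 4096-byte page unused by objects. A 37th buffer head would not fit.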

Using crash we inspected the dumps of crashed systems. What we saw indicated that the panic was occurring due to corruption in the linkage pointers between slabs. The corruption was apparently contained to the buffer_head cache, but in the slab structures themselves, not the buffer heads contained within the slabs.

Cross-referencing the kernel source, we found that this sort of corruption is sanity-checked for, but the check only happens when the linked list is actually traversed. The corruption could occur at any time but wouldn’t bring down the system until something ran through the buffer_head cache. For extra fun, if the corruption were very precise, it could delay the panic until something traversed the doubly-linked list backwards.
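To show why the damage can sit unnoticed, here is a small userspace sketch of a circular doubly-linked list with a consistency check. It is not the kernel’s list or slab code, and every name in it is ours; it only illustrates that a walker which follows next pointers never looks at prev, so a clobbered prev goes undetected until something checks it or walks the other way.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct node {
    struct node *next;
    struct node *prev;
};

/* Make an empty circular list: the head points at itself. */
void list_init(struct node *head)
{
    head->next = head;
    head->prev = head;
}

/* Link n in just before head, i.e. at the tail of the list. */
void list_add_tail(struct node *n, struct node *head)
{
    assert(n != NULL && head != NULL);
    n->prev = head->prev;
    n->next = head;
    head->prev->next = n;
    head->prev = n;
}

/* A plain forward walk: it only ever reads next pointers. */
size_t list_count_forward(const struct node *head)
{
    size_t n = 0;
    for (const struct node *cur = head->next; cur != head; cur = cur->next)
        n++;
    return n;
}

/* Walk forwards and verify that every neighbour points back at us.
 * Returns false at the first inconsistent link. */
bool list_is_consistent(const struct node *head)
{
    const struct node *cur = head;
    do {
        if (cur->next == NULL || cur->next->prev != cur)
            return false;
        cur = cur->next;
    } while (cur != head);
    return true;
}
```

If you build a three-element list and then overwrite one element’s prev pointer, list_count_forward() still happily returns 3, while list_is_consistent() reports the damage.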

This is a good starting point. Using crash we could see the exact nature of the corruption and where in memory it was occurring. Assuming that perhaps a pointer was being misused, we searched the memory space for anything else pointing at the corrupted area but found nothing. Remembering that there could be an arbitrary amount of time between corruption and panic, this isn’t too surprising.

Caught in the act

Without any fingerprints from the culprit we had to resort to using a live system. Luckily for us, we have SystemTap. SystemTap is a relatively new tool with some similarities to Solaris’ well-regarded DTrace, letting you hook into a running system for analysis and diagnosis. SystemTap scripts resemble C code, and are compiled into dynamically-loaded kernel modules, sidestepping the need to prepare a system ahead of time for debugging (like recompiling your kernel with debugging flags enabled).

Now knowing what we were looking for we began setting up SystemTap hooks on chunks of kernelspace. This included hooks on the slab allocator dealing with the buffer head cache, checksumming functions to attempt to catch corruption when it happened, and a function to effectively perform a fsck on the buffer head cache itself.
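The checksumming idea is simple enough to sketch in plain C. To be clear, this is not SystemTap and not the script we actually ran: it is a userspace illustration with made-up names, using the well-known 32-bit FNV-1a hash as a stand-in checksum. You record a checksum of a region when you know it is good, then recompute it later and compare.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 32-bit FNV-1a hash over a byte range. */
uint32_t fnv1a32(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint32_t h = 2166136261u;        /* FNV offset basis */
    assert(data != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 16777619u;              /* FNV prime */
    }
    return h;
}

/* Hypothetical watch record for one region of memory. */
struct region_watch {
    const void *start;
    size_t len;
    uint32_t expected;
};

/* Remember what the region looks like right now. */
void watch_arm(struct region_watch *w, const void *start, size_t len)
{
    w->start = start;
    w->len = len;
    w->expected = fnv1a32(start, len);
}

/* Nonzero if the region no longer matches the recorded checksum. */
int watch_changed(const struct region_watch *w)
{
    return fnv1a32(w->start, w->len) != w->expected;
}
```

The catch, and the reason this took real effort on a live kernel, is that legitimate updates change the checksum too. The watch has to be re-armed after every sanctioned write, so that any remaining mismatch points at something writing behind the allocator’s back.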

Make no mistake, this is slow work. It’d taken about a full week’s worth of work hours to get to this point, and the cause of the corruption still wasn’t clear. Through a lot of tracing and source-diving we were starting to point the finger at the RAID driver. This isn’t an uncommon line of reasoning – crashes and other “bad events” often line up with “lots of stuff happening” on the system, but the correlation just wasn’t so clear-cut this time.

This did give us an interesting angle to look at. Going back to the differential diagnosis, we realised there was more of a pattern to the symptoms than we’d first thought. The conditions were, however, very specific.

Summarising the findings:

  • The crash is due to corruption in the kernel slab allocator
  • The corruption is only evident when there is a large number of buffer_head objects in memory, such as during high I/O (The corruption may be occurring regardless, however)
  • The corruption appears to be triggered by MegaCli, the megaraid_sas driver, or the PERC device itself, when MegaCli invokes an STP (SATA Tunnelling Protocol) command on a SATA device in the chassis

A SATA device? But we only use SAS drives in these servers. Well, that’s not entirely true…

Newer servers are using CacheCade, a new performance feature that we wrote about several months ago. The SSDs used for CacheCade are SATA devices – this was a “lightbulb moment”. :)

So now we’ve identified MegaCli as the culprit, but why is this happening? It turns out that monitoring is to blame. We poll every system with an LSI MegaRAID card once an hour to check that the RAID arrays are healthy, which uses the MegaCli tool to query the card for its status. When that happens, MegaCli performs an STP command – that’s our trigger for corruption!

Falling down the MegaCli rabbit-hole

We immediately killed the MegaRAID monitoring on all servers. We might miss an array failure, but we can limp through those; another panic-crash would be unacceptable now. We needed to figure out what was going on, and why.

MegaCli isn’t distributed as a source package; you just get a binary blob. We can’t trace exactly what’s happening when the STP command is sent, but we can spot the ioctl calls with strace and pull them apart using SystemTap.

By the end of the day we’d managed to figure out a few more things:

  • The problem occurs on a variety of RHEL kernels: 2.6.32-131.12.1.el6, 2.6.32-131.4.1.el6, 2.6.18-274.7.1.el5
  • The problem occurs with a variety of megaraid_sas drivers: 5.34-rc1, 5.38-rh1, 5.40
  • Only MegaCli 8.01.06 has been shown to invoke this command. MegaCli 8.02.16 does not appear to do so

This last point is the key – finally we’d established the exact conditions under which corruption could be triggered, but it needed testing to be certain.

A note on the diagnosis: Even if the CacheCade difference had occurred to us earlier, it would’ve been masked by the way system updates are handled. Due to their sensitive nature, KVM hosts receive a more conservative update schedule. Hosts that happened to have received the updated MegaCli package would be protected from the issue, but for no obvious reason – userspace components aren’t generally expected to cause issues like this.

Before we continue we’ll have a quick word from our sponsor, the SATA Tunnelling Protocol (STP). STP is used to support SATA devices attached to a SAS fabric. It’s basically an encapsulation layer that makes a SATA device look and behave like a SAS device; it’s necessary because SATA devices don’t support all the features that SAS takes for granted.

Reproducing the corruption

A fix is no good unless you can be confident that it works. Science FTW! To reproduce the problem we devised a method to give a high probability of corruption and a crash occurring, based on the knowledge we had.

Buffer heads seem to be the victim, so let’s have lots of them. We did this by performing a simple dd transfer from /dev/zero to a spare LVM volume, creating millions and millions of buffer_head slabs in the cache.

Then we ran MegaCli to query the RAID card’s health. Sure enough, our SystemTap scripts detected corruption in the buffer_head slabs. That’s a hit.

Then to round it all off we forced a traversal of the buffer_head cache – boom! Just as planned.

Fast-forward a few hours, we verified that the newer version of MegaCli doesn’t cause the corruption, then proceeded to build v8.02.16 packages for all our systems so we could safely get the monitoring back online.

Source-diving

The solution is so close, you can taste it! Poring over the driver code, we did manage to nail down precisely what was happening. What follows is mostly a copy of our internal notes, which get down and dirty with the driver and explain exactly what was going wrong.

The commands sent through from MegaCli contain, amongst other things, a “frame” and an IO vector (a scatter/gather list). The “frame” can have one of a number of different formats (they all have the same header, though); one of the formats is for STP commands.

The IO vector tells the megaraid_sas driver where in the userspace address space to find data being sent to the device, or where to put data received from it. The driver allocates corresponding DMA memory chunks for each entry in the IO vector, and determines what kernel addresses correspond to those chunks. It also handles the copy_from_user/copy_to_user stuff to get the data from userspace into these kernelspace chunks and back again. All good so far.

The DMA addresses for these chunks are all 32-bit. The “command” sent to the device itself consists of the “frame” taken from userspace, along with a *new* IO vector with the DMA addresses.

One complication, however, is that the IO vector in the command can have one of three formats: 32-bit addresses, 64-bit addresses, and “IEEE” (as far as I can tell, that’s IEEE 1212.1), which is a variant of 64-bit addresses.

The driver knows how to interpret the command’s IO vector through some flags in the frame. These flags are sent through from userspace without any interpretation or adjustment by the driver.

Here is where things break down: for the STP commands only, MegaCli appears to turn on the IEEE flag. The driver, however, always fills out the command’s IO vector with 32-bit addresses, i.e. an array of:

    struct megasas_sge32 {
        u32 phys_addr;
        u32 length;
    } __attribute__ ((packed));

The card, however, treats it as an array of IEEE scatter/gather list entries:

    struct megasas_sge_skinny {
        u64 phys_addr;
        u32 length;
        u32 flag;
    } __packed;

NB: “Skinny” appears to be a codename for a particular megaraid model. I don’t know what the difference between “__attribute__ ((packed))” and “__packed” is. [There isn’t one: in the kernel, __packed is a macro that expands to __attribute__((packed)) –Ed.]

For the first entry in the STP IO vector, length == 20 == 0x14. The device therefore sees a 64-bit DMA address of 0x1400000000 + phys_addr instead, and clobbers the wrong memory.

To give an example: say the kernel allocated the DMA address 0x91dfc000. The corresponding kernel virtual address is 0xffff880091dfc000 (the kernel simply maps physical memory one-to-one from 0xffff880000000000). The device ends up writing to DMA address 0x1491dfc000, which has the kernel virtual address 0xffff881491dfc000. The first 20 bytes of whatever was at that page have just been erroneously overwritten.
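The misreading is easy to reproduce in userspace. The sketch below uses stand-ins for the two structures quoted above (with stdint types in place of the kernel’s u32/u64) and copies the raw bytes of a 32-bit IO vector into the IEEE layout, which is our model of what the card does when the IEEE flag is set. It assumes a little-endian host such as x86-64; the function names are ours, and the second 32-bit entry is simply zeroed because the notes don’t say what followed the first.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Userspace stand-ins for the driver structures quoted above. */
struct sge32 {
    uint32_t phys_addr;
    uint32_t length;
} __attribute__((packed));

struct sge_skinny {
    uint64_t phys_addr;
    uint32_t length;
    uint32_t flag;
} __attribute__((packed));

/* Base of the kernel's one-to-one mapping of physical memory, per the text. */
#define DIRECT_MAP_BASE 0xffff880000000000ULL

/* Reinterpret the first 16 bytes of a 32-bit SGE array as a single IEEE
 * entry. One IEEE entry spans exactly two 32-bit entries. */
struct sge_skinny misread_as_skinny(const struct sge32 entries[2])
{
    struct sge_skinny out;
    assert(sizeof(out) == 2 * sizeof(entries[0]));
    memcpy(&out, entries, sizeof(out));
    return out;
}

/* Kernel virtual address for a physical address under that mapping. */
uint64_t direct_map_virt(uint64_t phys)
{
    return DIRECT_MAP_BASE + phys;
}
```

Feeding in the example entry { phys_addr = 0x91dfc000, length = 0x14 } gives an IEEE phys_addr of 0x1491dfc000: the 32-bit length has landed in the upper half of the 64-bit address. direct_map_virt() then maps that to 0xffff881491dfc000, matching the address in the example.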

The really lazy sysadmin’s version:

  1. MegaCli prepares some memory to receive results from the card
  2. MegaCli says to the card “Please do something and then write the results back to THIS memory address that I just set up and zeroed-out for you”
  3. MegaCli specifies the address in one format, but sets a flag indicating that it’s in another
  4. The card does as it’s told, reads the flag to interpret the address format, then writes the results to the wrong location in memory. There’s no protection against this because we’re in kernelspace
  5. MegaCli doesn’t notice/care because it was probably expecting zeroes in the result anyway
  6. Assuming the memory address was in use, some poor sucker just got zeroes splattered over their slab headers

Making the patch

So, what’s the long-term solution? For now we’re using a MegaCli binary that appears not to invoke any STP commands, but it’d be even better if those STP commands weren’t corrupting memory.

A simple patch to the megaraid_sas driver can ensure that the correct flags are sent to the device:

--- megaraid_sas-v00.00.05.40.orig/megaraid_sas_base.c	2011-07-16 08:01:59.000000000 +1000
+++ megaraid_sas-v00.00.05.40/megaraid_sas_base.c	2011-11-10 11:09:23.461592780 +1100
@@ -5994,6 +5994,7 @@
 	memcpy(cmd->frame, ioc->frame.raw, 2 * MEGAMFI_FRAME_SIZE);
 	cmd->frame->hdr.context = cmd->index;
 	cmd->frame->hdr.pad_0 = 0;
+	cmd->frame->hdr.flags &= ~(MFI_FRAME_SGL64 | MFI_FRAME_SENSE64 | MFI_FRAME_IEEE);

 	/*
 	 * The management interface between applications and the fw uses

And this is what was passed upstream; it was accepted in January.

MegaCli is still the ultimate cause, passing along data structures with the wrong type-flags set, but the driver shouldn’t be passing opaque structures along blindly either. At least one of these issues is actually easily fixable.
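The one-line fix is a plain bit-mask operation, and the point worth noticing is what it leaves alone. The sketch below is a userspace illustration only: the flag names match the patch, but the numeric values and the 16-bit width are written from memory of the driver’s header and should be treated as assumptions, and MFI_FRAME_DIR_READ is included purely as an example of a flag that must survive.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values; check megaraid_sas.h before relying on them. */
#define MFI_FRAME_SGL64    0x0002
#define MFI_FRAME_SENSE64  0x0004
#define MFI_FRAME_DIR_READ 0x0010
#define MFI_FRAME_IEEE     0x0020

/* Clear the bits describing address formats the driver never emits,
 * leaving every other flag supplied by userspace untouched. */
uint16_t sanitise_frame_flags(uint16_t flags)
{
    uint16_t format_bits = MFI_FRAME_SGL64 | MFI_FRAME_SENSE64 | MFI_FRAME_IEEE;
    assert((format_bits & MFI_FRAME_DIR_READ) == 0);  /* masks must not overlap */
    return (uint16_t)(flags & ~format_bits);
}
```

A frame arriving with the IEEE and read-direction flags set comes out with only the direction flag, so the card interprets the IO vector as the 32-bit entries the driver actually wrote.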

Other consequences of the bug

We suspect, but can’t confirm, that this caused filesystem corruption in one of the VM guests. It’s particularly insidious because it’s very hard to draw a correlation between issues on the host and issues on a guest.

It’s worth pointing out that while this was only observed corrupting the buffer_head cache, the bug really has the potential to cause a multitude of other problems. Specifically, various offsets provided from userspace are used by the driver without any checks. If these offsets are maliciously chosen, the driver can be induced to write to arbitrary kernel memory. Not an easy attack if you want something more precise than a DoS, but you have to work for your supper.
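For the unchecked-offset problem, the usual defence is to validate every userspace-supplied offset against the size of the buffer it indexes before using it. The following is a generic sketch of that check, not the driver’s code and not part of the patch above; the function names and the idea of storing a 32-bit value into a frame are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* True only if [off, off + len) lies entirely inside a buffer of buf_size
 * bytes. Written so that off + len is never computed and cannot overflow. */
bool range_in_buffer(uint32_t off, uint32_t len, size_t buf_size)
{
    if (off > buf_size)
        return false;
    return len <= buf_size - off;
}

/* Hypothetical caller: store a value at an offset chosen by userspace.
 * Returns 0 on success, -1 if the offset would land outside the frame. */
int store_u32_at(unsigned char *frame, size_t frame_size,
                 uint32_t off, uint32_t value)
{
    assert(frame != NULL);
    if (!range_in_buffer(off, sizeof(value), frame_size))
        return -1;               /* reject rather than write out of bounds */
    memcpy(frame + off, &value, sizeof(value));
    return 0;
}
```

Comparing len against buf_size - off, rather than testing off + len against buf_size, matters here: with attacker-chosen 32-bit values the sum can wrap around and slip past a naive check.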

Conclusion

That was a really nasty one and a half weeks that we had to deal with KVM hosts crashing at random. Michael’s diagnosis and solution were a monumental piece of work, comprehensively tearing everything apart to definitively identify the root cause. It’s safe to say that this wouldn’t have been solvable in any reasonable way without tools like kdump, crash and SystemTap.

We hope you’ve enjoyed the write-up; there’s a lot to digest. If you’ve got any questions or something doesn’t make sense, feel free to leave a comment or drop us a mail and we’ll do our best to elucidate. Likewise, any general feedback is also appreciated.


We want your feedback, or “How to win a shiny new iPad”

Published March 27th, 2012 by Barney Desmond

There have been a lot of changes going on at Anchor recently and we’d really love to know how we’re doing. To help us know what you think is important, we’d appreciate it if you took a look at our feedback survey.

The calculator-monkeys tell me it’ll take you less than 10 minutes to fill out. In fact, the average human takes about 7.26 minutes for this survey, so you should see if you can beat them. In return, you could win a New iPad™.

Fill out our feedback survey now!

We’ve had a great response from those customers that have already entered. If you’re one of them, thank you very much. If you haven’t yet, we hope you will. We’ll be using your feedback to figure out what we should direct our efforts towards to improve our service even more.


Entries close next Friday afternoon, at close-of-business in Sydney – that’s the 6th of April. If you’ve got any questions or would like to talk to us, just give us a call (1300 883 979) or send us an email (support@anchor.net.au). Good luck!


Winner of the Silicon Beach $500 startup lucky door prize

Published March 23rd, 2012 by Barney Desmond

Thanks to everyone who came along to Silicon Beach last Friday and entered the draw. It was great to see a good turnout and have a chat with everyone.

As promised, we’ve drawn the winner: TJ Tan at Moojive, you’re up!

Anchor is a big believer in helping startups get off the ground, which is why we stump up for a few beers at Silicon Beach. Oh, and we’re also putting $500 towards Moojive’s startup bills. Congratulations, guys! :)


Flying high with TestFlight

Published March 22nd, 2012 by Barney Desmond

Regular readers will have seen that TestFlight went live on our new infrastructure last week. Now that that’s bedded down nicely, we can talk about the more fun, technical aspects of the project.

When we sat down to analyse things we identified a handful of core goals for developing the new architecture:

  • Scalability
  • Consolidation
  • Management

In very-roughly that order of priority. Let’s have a quick look at what we started with, when TestFlight was hosted in the Linode cloud.

Mmm, tiers

Scalability and consolidation go hand in hand in the redesign. TestFlight’s tiered layout was solidly architected from the beginning and that hasn’t changed – clearly defined and separated layers make it easier to design fault-tolerant systems, and let you scale different areas appropriately.

Each tier now has just a handful of really beefy machines. Virtualisation is a fantastic tool for many things, but pulling services out of the cloud and onto real hardware yields much more predictable (and better) performance. This is especially important in the data-storage layer, where cloud services can have atrociously variable latency and throughput.

Incidentally, this can have an interesting effect on the frontend components – if your frontends have to wait for backend I/O, speeding up the backends can mean quicker turnaround on the frontends and the ability to service requests with fewer frontend nodes.

Alternatively, that makes it cheaper to scale-up. When we started looking at TestFlight we figured out how much diskspace we thought they’d need. Then we turned around and suddenly discovered that they had four times as much data. The architecture and hardware chosen make it easy to deal with this, and with zero downtime. When it happens, we can throw more Dell R510 chassis at the problem, or push it out to some MD-series storage arrays.

Simplification: That’s a high-level view of things, but we’ve also pushed a few changes under the hood. In the interests of simplicity, the MongoDB instance has been shunted out, meaning there are fewer things to look after. Likewise we’ve pared back the HTTP path for requests, which would previously go through stud, on to nginx, then proxy back to a separate node running their app. nginx now handles SSL unwrapping itself, and proxies to the app on the same machine. Tiers are good, but they can be a little overdone and lead to an extra measure of complexity.

Robustness: TestFlight uses MySQL, Redis and Memcache for data storage, playing to their strengths in different areas of the stack. Master-slave replicated MySQL has been replaced by a single high-availability (HA) instance, making for a failover that should require little or no human intervention in the event of a problem. We’re making use of DRBD for the block storage and pacemaker+corosync for the HA stack.

In addition, Redis and Memcache are now HA services. Previously an outage on those would have either caused everything to come tumbling down, or at best hobble along in a sub-optimal fashion.

“By partnering with Anchor not only do we get their world class expertise, we also get Nagios, Puppet, and 24/7 monitoring for the costs comparable to unmanaged cloud hosting.”
-Trystan Kosmynka

Monitoring: As part of taking TestFlight under our wing we’ve rationalised their monitoring setup. When at Linode they had no fewer than four service monitors and alerting systems to keep an eye on their app. We’ve tidied this up and replaced it with Nagios for the monitoring and notifications, tied to pnp4nagios for generation of pretty graphs. Make no mistake, the previous solution worked well for their team, but it’s not necessary now that they have someone else to look after things for them.

One of the cool things about doing it our way is the fully automated monitoring setup. Building a new server usually means adding it to your nagios/pingdom/scout config. Instead, we have puppet do that for us. When a TestFlight box gets built, it exports purpose-specific resources into puppet, which the nagios server collects. Removing the human element means it takes less time, and there’s no chance of accidentally forgetting to monitor an important service.

Wrap-up: So that’s TestFlight’s new home in a nutshell. Unfortunately we can’t talk numbers at this stage, but if you’re curious about the technical specifics feel free to drop a question in the comments.


New promo campaign, focusing on the people

Published March 21st, 2012 by Barney Desmond

We’re the people behind the hardware, and now we’re telling the world!

Anchor has launched a series of online ads aimed at highlighting our unique blend of expert yet friendly customer service.



The campaign features real Anchor staff and uses geek humour to highlight our expertise and the deep knowledge of the Anchor team. It’s not something we’ve done before, so we’re excited to see what sort of response they get.

They’re aimed at CTOs, CEOs and even digital marketers at potential customer firms, and aim to bring home the message of our dedication and expertise while having some fun.



Shane Cox: Our new secret sales weapon

Published March 20th, 2012 by Barney Desmond

Anchor is expanding and growing fast right now, and to take advantage of the opportunities out there we’ve taken a big step. We’ve hired Shane Cox, our first ever Business Development Manager.

Shane is an out-and-out sales guy, thrown in amongst a bunch of proud-of-it geeks. How’s he coping? We caught up with him at the end of his first week in the job.

Tell us a bit about your background

I’ve been in sales pretty much my whole life, mostly geeky sales. I’ve sold software systems before and I’ve also worked with companies to improve their overall business processes, helping to streamline them to deliver more efficiency and revenue.

I’ve run my own business in the consulting and HR space, I’ve worked in software sales and I’ve even had a stint on the AV floor of the Harvey Norman / Domayne group.

Cool, so what’s your impression of Anchor so far?

It’s just great to come into an organization where there’s wonderful technical ability and that has such a great reputation in the market. It looks like a casual atmosphere, but these people are dedicated and know their stuff.

What potential opportunities do you see for the role itself?

I think there’s a lot of potential to smooth out Anchor’s sales processes so that the leads we get are translated more often into contracts.

There’s a reason why technical people are often not good at sales – they’re two very different disciplines, but in both there are a lot of things you need to know to get the right outcome.

If we can improve those processes a bit and market our products effectively we’ll take advantage of that great reputation for service we’re building.

You’ve just started – what’s your first order of business?

My first job is to get to know the Anchor products inside and out. The team here has been great at getting me up to speed and making sure my technical knowledge is up to scratch – man, some of those tests are hard!

But when I hit the ground running, I think there’s a lot of potential in our digital agency partner programs, they’ve been quite successful to date. Perhaps there’s also some growth geographically as well. There’s a lot happening for Anchor in the US and maybe Europe at the moment, but there’s also a lot we can do at home.

At the end of the day, my job is to provide new contracts for the company – and I can’t wait to get stuck into that.

Posted in FTW


Chocks away for TestFlight!

Published March 17th, 2012 by Barney Desmond

We pushed the big red button for TestFlight last night, and they’re now up and running on our shiny new infrastructure. If you’re a TestFlight user this is already paying dividends – Trystan, their head-technical-honcho, reports “Our concurrent traffic significantly increased as a result of the migration”.

The taskforce we put together for this project has done a great job, and it really affirms TestFlight’s decision to come to Anchor.

We found Anchor by researching scalability and load balancer best practices. We have people all over the world, so we thought why not work with an epic team in Sydney who really know their stuff.

If you don’t know them, TestFlight helps iOS developers beta-test their applications. They make it easy to host and deploy the beta apps, find testers and manage feedback. Right now they support over 70,000 developers, testing 130,000 apps with more than 280,000 testers. If that weren’t enough, they’ve recently been acquired by Burstly and just launched TestFlight Live, a product to track mobile app usage.

At the start of the project we sat down with the TestFlight crew to plan everything and develop the new architecture. Working with Luke, Trystan and Jon was fantastic. They’re sharp guys, really technical, and know their app 100% inside-out – that’s a massive help when you’re about to rebuild everything.

Our boss, Keiran Holloway, said of the project:
We took our usual, rigorously structured approach to the problem. One of the unique challenges we faced with TestFlight was the need to scale up big. The number of devs and builds they host is constantly increasing, there’s additional load coming online from TestFlight Live, and the storage requirements have exploded – it’s a moving target even as we’re trying to plan for it.

Everything’s been running very smoothly so far; you couldn’t ask for a much better transition to a new system. Congratulations on your success so far, TestFlight!


Meet the Specialist

Published March 6th, 2012 by Barney Desmond

Sarah Kowalik

Sarah’s been part of our Customer Support Team for more than a year; we thought it high time you got to know her a little better.

What were you doing before joining Anchor?

I was studying an opto-electronics degree before coming over to the dark side, doing a BSc with an IT major at Macquarie University.

How many generations of your family are geeks?

I’m second-generation geek.

Aren’t you a geek migrant?

Yes, we moved from Adelaide to Sydney, following my Dad’s IT career (he now does sales).

Is it true you’re a member of the Ubuntu tribe?

And proud of it, I’ve been contributing to Ubuntu since 2005. I’ve been involved in many things including release management, and am a core developer of both Ubuntu and Kubuntu. I’ve been sponsored to attend developer summits in Spain and the Googleplex in the US.

Gaming platform: PC

Game genre: Simulations

Reading: Dick Francis on my Kindle 3

Best thing about being one of ‘the people behind the hardware’?

I like taking what the customers tell me, and then waving the magic wand to fix the issue without confusing them unnecessarily.

Posted in FTW


Q&A with Ryan Brailey

Published February 21st, 2012 by Barney Desmond

Ryan joins us here at Anchor for an eight-week internship. We caught up with him over breakfast to see how it’s going.

Could you tell us a bit about where you’re from and what brings you to Anchor?

Sure. I’m from Camden, and I’m currently in my third year of a B.IT (Networking) at University of Wollongong.

So you’re doing an internship here, what’s a normal day at Anchor like for you?

At the moment I’m working with the Nagios monitoring software and auditing customers’ server support levels: stuff like CPU utilisation, load, memory usage, etc. That and helping our sysadmins monitor and respond to issues.

What’s your next career move?

My degree is in networking so I’m hoping to get into a sysadmin role or something similar. I like being around hardware, but the scope is very narrow if you limit yourself to just hardware.

Your favorite place to work?

The Global Switch datacentre. It’s near the office, just across Darling Harbour in Ultimo.

Anchor – rocks or sucks?

Definitely rocks. My expectations of working here have been met and exceeded. It’s been really impressive to see how much expertise they have, how much custom-written stuff they use to monitor and respond to customer issues.

Last question, whereabouts for your next internship?

Here, please. If I weren’t interning at Anchor I just don’t think I’d be learning as much.

Posted in FTW
