“After a few conversations with Anchor, it was clear their team was a phenomenal fit for TestFlight. We would love to take credit for everything, but the reality is that Anchor brought a wealth of experience to the table. They took our stack, analysed it and came up with a plan of attack which involved some fantastic technologies only well-bearded individuals should touch.”
– Trystan Kosmynka, Co-Founder and CTO, TestFlight
Making an iPhone app is easy. Making a great iPhone app that millions of people use and enjoy every single day is much harder.
That’s what TestFlight helps developers do, by hooking them up with beta testers and giving them the tools to manage feedback and fix problems.
TestFlight came to Anchor when they knew they needed help with growth. Their developers were spending too much time doing sysadmin work, which was hurting productivity and turnaround on new features. In the same way that TestFlight helps devs avoid distractions and focus on what they’re good at, our job was to make TestFlight’s server-worries go away.
We sure weren’t disappointed by what we saw: with the launch of their new TestFlight Live product at around the same time, activity on the servers jumped several-fold straight away, and it has only been accelerating since!
Getting our feet wet
By far the biggest challenge facing the TestFlight architecture has been their job-processing system. Every time new data comes in, a few data-crunching jobs are created in TestFlight’s Redis datastore. These jobs are processed asynchronously and the results are pushed into a MySQL instance. TestFlight uses Celery for job queueing. It’s an ideal solution because results are needed soon, but not right now, and it lets them scale horizontally.
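The pattern described above, where incoming data enqueues jobs and a pool of workers drains them a little later, can be sketched with nothing but Python’s standard library. This is not Celery and not TestFlight’s code: the in-process queue stands in for Redis, the results list stands in for MySQL, and the names `crunch` and `run_workers` are hypothetical.

```python
import queue
import threading


def crunch(payload):
    # Hypothetical stand-in for a data-crunching job.
    return sum(payload)


def run_workers(jobs, worker_count=4):
    """Process (job_id, payload) pairs asynchronously and return sorted results."""
    pending = queue.Queue()          # stands in for the Redis-backed job queue
    results = []                     # stands in for the MySQL results table
    results_lock = threading.Lock()  # protects the shared results list

    def worker():
        while True:
            job = pending.get()
            if job is None:          # sentinel: no more work for this worker
                return
            job_id, payload = job
            outcome = crunch(payload)
            with results_lock:
                results.append((job_id, outcome))

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for t in threads:
        t.start()
    for job in jobs:                 # producers enqueue and move on immediately
        pending.put(job)
    for _ in threads:                # one sentinel per worker
        pending.put(None)
    for t in threads:
        t.join()
    return sorted(results)
```

Raising `worker_count` is the in-process analogue of scaling horizontally: the producer side does not change, only the number of consumers.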
TestFlight is pushing some serious data – we’re talking several hundred new jobs incoming every second and a few tens of millions processed every day. Celery works fantastically, but TestFlight were hitting the wall. Our analysis of the problem showed that lock contention, due to their historical use of the default MyISAM storage engine, was limiting the rate at which jobs could be processed.
MySQL is respectably quick, but MyISAM is terrible when it comes to multiple concurrent accesses. In short, the workers were spending a lot of time twiddling their thumbs while waiting their turn to get at the database.
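A rough back-of-the-envelope model shows why adding workers stops helping under a table-level lock. It assumes each job spends a fixed time computing and a fixed time writing its result, and it treats row-level locking as contention-free, which is an idealisation. The function name and all the numbers are illustrative assumptions, not TestFlight’s measurements.

```python
def throughput(workers, compute_s, write_s, table_lock):
    """Jobs per second for a simplified worker model.

    Each job spends compute_s seconds on CPU work and write_s seconds
    writing its result. With a table-level lock only one write can run
    at a time, so writes cap throughput at 1 / write_s however many
    workers there are.
    """
    unconstrained = workers / (compute_s + write_s)
    if table_lock:
        return min(unconstrained, 1.0 / write_s)
    return unconstrained
```

With the made-up values of 30 ms of compute and 10 ms of writing per job, eight workers are capped at around 100 jobs per second under a table lock, against roughly 200 without one. In this model the extra workers simply queue for the lock.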
Converting tables to the InnoDB engine brings some great performance benefits in terms of concurrent access, but conversion takes a while and normally blocks access – that’s a killer for a site that’s active around the clock.
This is where the Percona toolkit comes to the rescue, with the “online schema change” script. When you’re dealing with tables that are several tens of gigabytes in size it’s a lifesaver for getting things done without downtime.
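As a minimal sketch, an engine conversion with this tool could be driven from Python by building the argument list below. The article does not give the exact invocation or the table names, so the `ENGINE=InnoDB` alter clause, the helper name `build_osc_command`, and any database and table names passed in are assumptions for illustration.

```python
def build_osc_command(database, table, execute=False):
    """Build an argument list for pt-online-schema-change.

    The database and table names come from the caller; nothing in the
    article says which tables were converted. The default is a dry run,
    which does not alter the original table.
    """
    mode = "--execute" if execute else "--dry-run"
    return [
        "pt-online-schema-change",
        "--alter", "ENGINE=InnoDB",   # assumed alter clause for the conversion
        f"D={database},t={table}",    # DSN naming the table to rebuild
        mode,
    ]
```

The resulting list can be handed to `subprocess.run` on a host where Percona Toolkit is installed. A real run would usually also need connection details such as host and user in the DSN, and starting with the dry run is a sensible precaution.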
With some of the most heavily used tables converted, utilisation improved dramatically as workers were no longer waiting around for their turn. We were able to add more worker processes and watched the throughput climb accordingly.
Based on our before-and-after measurements we saw a 3x increase in peak throughput, and much better utilisation of CPU resources, which were previously sitting idle. Not bad for some planning and brainpower.
The best thing about the changes is that they made effective use of the servers they already had, needing only a modest investment of time and no additional money. For a young company (or even a mature one) this sort of agility is especially valuable in staying competitive.
Capacity planning for the future
In making these improvements for TestFlight we’ve reshaped their use of computing resources to even things out, meaning they get better value from hardware purchases as they expand.
We’re constantly provisioning more capacity for TestFlight thanks to their fantastic explosion of data to store and process. We’ve no doubt there are changes in store for TestFlight as they grow, and Anchor will be there to guide them.