Know Thy Enemy
Published August 4th, 2009 by matt
Long before any code gets written or any servers deployed, a quiet yet crucial job is being performed. The poor tech who is doing this work won’t get much credit, and almost certainly none of the glory, but if this job isn’t done properly, then none of what gets done later will be of much use.
I am, of course, talking about… requirements gathering (bom bom bommmmmm).
In the case of project starbug, the requirements gathering work sits at the “fairly straightforward” end of the spectrum — but it’s by no means easy. What makes the job easier than average is that the site is currently operational, and our primary job is making sure that the new server farm that we’re building will (a) match what is currently running (in terms of system setup), and (b) have enough capacity for future growth.
The system configuration to support the customer’s application is fairly easy to achieve for this project: the customer knows their software and what it requires really, really well, and we’ve got the existing setup to examine to try and work out how the pieces fit together if it isn’t immediately obvious. The main task here is to make sure that all the requirements are documented thoroughly. Yeah, writing docs isn’t the glamorous end of the job, but it is important, and is something that pays dividends down the line. More on that in another article, though.
The capacity issue is a lot trickier. The new architecture we’re building is completely different to the current architecture (which isn’t scaling well, which is why it’s being left behind), so it’s hard to derive any direct performance metrics by just looking at what hardware is already in use (especially since the current setup uses virtualisation a little too heavily, which makes comparisons based on hardware spec even harder).
Based on a cursory examination of the bottlenecks in the existing system, along with previous knowledge of the system behaviour, I decided that the primary bottleneck of the system is disk I/O. This site isn’t your typical large-scale website; it does a lot more file management than is typical. As a result, the key thing we need to ensure in our new hardware setup is that there is sufficient disk I/O capacity.
Memory constraints (large app servers, mostly) take a close second in the “what is going to kill us here” stakes, as the current infrastructure is using somewhere north of 100GB of RAM (spread across all the various servers that are being used). We want to provision this plus some extra, as moah RAMs == moah disk caching, and moah disk caching == better effective disk I/O. Win all round!
CPU, on the other hand, is practically never an issue. The servers run a lot of separate processes, but they’re almost always waiting on stuff coming from the disk, so with the current state of the art in server CPUs being quad core, we really shouldn’t have a CPU bottleneck.
Although I said earlier that memory ran a close second to disk I/O for the title of “biggest performance bottleneck”, there was really no competition for which one of these was going to keep me up at night. Solving the memory problem is easy: modern chassis can easily accommodate 32GB (or more) of RAM. There was never any doubt that we’d be using at least a half dozen machines, so stocking them all with 32GB of RAM should be plenty.
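To put a number on "should be plenty", here is a quick back-of-the-envelope check in Python. It only uses the figures mentioned above (roughly 100GB in use today, 32GB per machine, six machines); the function name is made up for illustration, and the real server count could well be higher.

```python
# Figures taken from the text; treat them as rough estimates, not measurements.
CURRENT_RAM_GB = 100      # "somewhere north of 100GB" across the current servers
RAM_PER_SERVER_GB = 32    # what a modern chassis can easily accommodate
SERVER_COUNT = 6          # "at least a half dozen machines"


def ram_headroom(current_gb, per_server_gb, servers):
    """Return (total provisioned GB, spare GB, provisioned-to-current ratio)."""
    total = per_server_gb * servers
    return total, total - current_gb, total / current_gb


total, spare, ratio = ram_headroom(CURRENT_RAM_GB, RAM_PER_SERVER_GB, SERVER_COUNT)
print(f"{total} GB provisioned, {spare} GB spare, {ratio:.2f}x current usage")
```

With these inputs the new farm would carry a bit under twice the RAM currently in use, and the extra is what ends up available for disk caching.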
No, the worry was always going to be the file I/O capacity, and making sure that we had both the speed and the storage capacity we needed. While the site doesn’t need petabytes of storage, it does need a decent amount of space, and it all needs to be pretty quick. What’s annoying (but understandable) about storage systems is that you can either have a lot of capacity (1.5TB SATA drives are common as MCSEs) or you can have a lot of speed (15k SAS drives max out at 300GB). We could get the storage space we needed with 300GB drives, but would they be quick enough?
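The capacity-versus-speed trade-off can be sketched as a small calculation: work out how many drives you need to hit the space target, how many you need to hit the I/O target, and take the larger of the two. Everything numeric below is an assumption for illustration only. The 2000GB and 3000 IOPS targets are invented, and the per-drive IOPS figures are loose rules of thumb rather than anything measured on this project. RAID overhead is ignored entirely.

```python
import math


def drives_needed(capacity_gb, iops, drive_gb, drive_iops):
    """Return (drives for capacity, drives for IOPS, drives actually required).

    Ignores RAID overhead, hot spares and controller caching, all of which
    matter in practice.
    """
    by_capacity = math.ceil(capacity_gb / drive_gb)
    by_iops = math.ceil(iops / drive_iops)
    return by_capacity, by_iops, max(by_capacity, by_iops)


# Hypothetical targets, NOT the real project numbers.
NEED_GB = 2000
NEED_IOPS = 3000

# Drive sizes are from the text; the IOPS-per-drive values are assumed.
options = {
    "15k SAS 300GB": (300, 180),
    "SATA 1.5TB": (1500, 80),
}

for name, (size_gb, per_drive_iops) in options.items():
    cap, io, total = drives_needed(NEED_GB, NEED_IOPS, size_gb, per_drive_iops)
    print(f"{name}: {cap} for space, {io} for speed, so {total} drives")
```

The point of the exercise is that whichever constraint needs more spindles wins. With the made-up inputs above, both drive types end up limited by speed rather than space, which is exactly the situation that makes a measured I/O figure so important.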
To try and make some sort of an apples-to-apples comparison, I needed to have a number that represented how much I/O was being done at present, and which could be compared to what our new hardware infrastructure is capable of.
In the end, what I went with was running the sar tool on a number of the existing machines to try and get an idea of how much disk I/O is being requested by the machines. There are a number of things that might make this comparison inaccurate, but in the end I decided that there wasn’t really any better metric.
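For anyone wanting to do something similar, here is a minimal sketch of boiling sar output down to a couple of numbers. It assumes the text came from something like `sar -b`, which reports a `tps` (transfers per second) column; the exact flags and columns used on this project aren't recorded here, and the column layout varies between sysstat versions, so the parser finds the column by name instead of by position. The sample text is a made-up placeholder, not real data from the site.

```python
def parse_sar_tps(text):
    """Extract the 'tps' values from `sar -b`-style text output."""
    samples = []
    offset = None  # position of the tps column, counted from the end of the line
    for line in text.splitlines():
        fields = line.split()
        if "tps" in fields:
            # Header line. Index from the end so that a one-token (24h) or
            # two-token (AM/PM) timestamp at the start doesn't matter.
            offset = fields.index("tps") - len(fields)
            continue
        if offset is None or not fields or fields[0].startswith("Average"):
            continue
        if len(fields) < -offset:
            continue  # too short to be a data row (e.g. a restart marker)
        try:
            samples.append(float(fields[offset]))
        except ValueError:
            continue  # not a numeric data row
    return samples


def summarise(samples):
    """Return (mean, peak) of the samples."""
    return sum(samples) / len(samples), max(samples)


# Placeholder input showing the expected shape only; the values are invented.
SAMPLE = """\
12:00:01 AM       tps      rtps      wtps   bread/s   bwrtn/s
12:10:01 AM    100.00     40.00     60.00    800.00   1200.00
12:20:01 AM    300.00    120.00    180.00   2400.00   3600.00
12:30:01 AM    200.00     80.00    120.00   1600.00   2400.00
Average:       200.00     80.00    120.00   1600.00   2400.00
"""

mean_tps, peak_tps = summarise(parse_sar_tps(SAMPLE))
print(f"mean {mean_tps:.1f} tps, peak {peak_tps:.1f} tps")
```

Whether you size against the mean or the peak is a judgment call; for a system where disk I/O is the bottleneck, the peak (summed across all the machines) is probably the safer number to carry forward.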
The key thing was to try and get the statistics at the same “layer” of the stack in both cases: here, the point where the kernel passes the I/O request off to the disk (or RAID controller). The benefits of this are that it’s a single statistic to compare, and it’s not ridiculously impossible to synthesise a load at this level for benchmarking purposes (obviously, running the live site on the new infrastructure to benchmark the new hardware isn’t a real winning strategy). When all’s said and done, though, these benchmarks are an estimate, and are unlikely to be completely accurate. That needs to be kept in mind when doing the hardware estimations later on.
All of this information gathering and benchmarking takes a pile of effort, but without it there’s no chance whatsoever that any sizeable infrastructure will be correctly sized for the job it needs to do. I was surprised in this case at how little hardware we ended up needing. On a previously sized system I worked on, however, the initial guesstimates turned out to be an order of magnitude too low (the system ended up with some thirty-odd servers instead of the five initially ordered). Without a comprehensive analysis of the reality of the situation, you’re either going to end up with a poorly performing site, or else a pile of unused hardware.