Hunting the Performance Wumpus
The scenario is probably familiar to a lot of sysadmins and developers out there: you get a communication in the middle of the night from Someone Very Important that "the system is running slow". Sighing, you log into the server to take a look... but where to start?
As an aid to sleep-deprived geeks everywhere, we've written this article to give you some ideas for tracking down performance problems on your Linux-based dedicated server.
Why things run slow
A computer system is a complex interaction between lots of different pieces. These pieces work together by passing data down the line, like a car factory. The data moves from one station to the next, getting another piece bolted on. Just like an assembly line, too, the number of cars that can be produced per day is limited by the slowest part of the production line. It doesn't matter that almost everyone can do 10,000 cars a day if one part of the assembly line can only do 10. You'll get 10 cars a day, and everyone else will spend most of their time standing around twiddling their thumbs.
The part of a system that is holding up everyone else is usually known as a bottleneck — when you're pouring liquid from a bottle, the limiting factor for how fast the liquid comes out is the size of the neck. In a computer system, if you want to make things go faster, then you need to identify the bottleneck in the system and fix it. Your system will then run faster, to the limit of the next bottleneck, which you fix too. You just keep repeating this process until performance is acceptable.
Step 1: Find the problem
The hard part of solving a performance problem is identifying who, exactly, is at fault. It could be that some piece of hardware is too slow or undersized for the job, but often the problem is some piece of software doing something grossly inefficient.
Reproduce the problem
While you're tracking down and fixing the problem, you'll be making the problem happen repeatedly. In order to know that you're on the right track, and to judge when the problem is finally fixed, you need to be able to reproduce the problem reliably.
First up, quantify the problem. What exactly is it that you want to improve? Is it the time required to service a single request, the number of concurrent requests that can be handled, or perhaps the number of requests per second that can be dealt with over a sustained period? The crucial thing is to know exactly what it is that you're trying to optimise, so you can tell when you've succeeded in your quest.
Once you know the problem, you need to work out how to make it happen "on command". If your website is running slow, for instance, then the chances are that making a particular web request (or combination, or volume, of web requests) is going to be the way to reproduce the problem.
If your initial attempts to reliably reproduce the problem don't bear fruit, consider external factors. If the site is only slow at certain times of the day, for example, perhaps a background script is running periodically and consuming significant resources?
Script whatever is required to reproduce the problem you want to fix, so that re-testing the system is as simple as running one command. The more you have to do to re-run the test, the slower your testing will be, and the more bored and frustrated you'll get. A test that is quick to run and return is better than one that takes hours, too, for much the same reasons.
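As a minimal sketch of what "one command" might look like, the following Python times a repeatable action several times and boils the results down to a couple of comparable numbers. The function names (time_runs, run_command, summarise) are my own inventions for illustration, and the choice of median and worst case as the summary figures is an assumption about what you care about.

```python
import statistics
import subprocess
import time


def run_command(argv):
    """Run an external command, discarding its output.

    check=True raises if the command fails, so a broken test is not
    mistaken for a fast one.
    """
    subprocess.run(argv, check=True,
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)


def time_runs(action, runs=5):
    """Call action() several times; return each wall-clock duration in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        action()
        durations.append(time.perf_counter() - start)
    return durations


def summarise(durations):
    """Reduce a list of durations to figures worth comparing between test runs."""
    return {
        "runs": len(durations),
        "median": statistics.median(durations),
        "worst": max(durations),
    }
```

To time a slow web request you might call summarise(time_runs(lambda: run_command(["curl", "-s", "-o", "/dev/null", "http://localhost/slow-page"]))), where the URL is a placeholder for whatever request reproduces your problem.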
View from the top
Once you've got some way of causing the problem to happen "on command", then you can start tracking it down. My first "go to" tool, regardless of the situation, is the top command. This handy tool gives you a great deal of information about the state of the system. It's not very detailed, for the most part, and the information isn't always 100% accurate, but it's almost always enough to point you in the right direction. I think of it as my "performance monitoring tourist information centre": it's rarely where I want to end up, but there's a helpful guy there that'll tell me how to get to where I do want to go, and it has all sorts of useful maps of the area.
To run top, just log in to the server (using SSH) and run the command top; it's that simple. It'll give you a whole screenful of information that you can interpret to send you on your way.
There are many parts to top's output; the following is a list of the parts of the output that I find useful, and what they mean.
Load Average: In the top right hand corner is the label load average, followed by three decimal numbers. The load average is the average number of processes that are either running or waiting to run (on Linux it also counts processes blocked in uninterruptible sleep, which usually means waiting on disk I/O), averaged over the last minute, five minutes, and fifteen minutes respectively. When the numbers are all about the same (whether high or low), then the load on the system is consistent over a long(ish) period of time. When the first number is larger than the others, then the load is rising, and when the first number is smaller, the system load is falling.
A common problem people have is that they try to read too much into the load average. Don't try to make direct comparisons (especially ratios) between load averages at different times; it's not going to give you good advice. Also, you can have a poorly performing system with a low load average, and an acceptably performing system with a high load average, so higher isn't always bad news. The load average is just there to tell you, right now, whether there's a lot of different processes running.
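If you'd rather have a script make the rising/falling call for you, here is a small sketch of the comparison described above. The tolerance parameter is an assumed fudge factor of my own (so that tiny differences count as "about the same"); os.getloadavg() is the standard library call that returns the same three figures top shows.

```python
import os


def load_trend(one, five, fifteen, tolerance=0.1):
    """Classify the direction of load from the three load-average figures.

    tolerance is an assumed fudge factor: differences smaller than this
    fraction of the 15-minute figure (or of 1.0, if that is larger) are
    treated as "about the same".
    """
    margin = tolerance * max(fifteen, 1.0)
    if one > five + margin and one > fifteen + margin:
        return "rising"
    if one < five - margin and one < fifteen - margin:
        return "falling"
    return "steady"


def current_trend():
    """Read the live load averages (Unix only) and classify them."""
    return load_trend(*os.getloadavg())
```

Note that this only tells you which way the load is heading; as the text says, it says nothing about whether the system is actually performing badly.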
CPU Usage: The third line (and possibly the next few lines too, depending on your version and configuration of top) indicates how your CPU(s) are being used. The summary version is a single line labelled "Cpu(s)", while the detailed (per-CPU) information is given in multiple lines, labelled "Cpu0", "Cpu1", and so on. You can toggle between the two modes by pressing 1.
The CPU info line(s) have a bunch of different percentage numbers. Different versions of top have different sets, but the common ones (and those which we're typically interested in) are:
User time (us): This is how much of the CPU's time is spent processing userspace code (that is, your program, standard library calls, that sort of thing). If this is at or near 100%, then something on the system is burning a lot of pure CPU time — either your program, or some system management program, and it's doing things that are "pure computation" — not I/O, just calculations.
System time (sy): This is how much of the CPU's time is involved in doing things in the kernel. The kernel manages the disks, network devices, console (keyboard/monitor), memory management, and so on. If this is high then some program on the system is doing a lot of "kernel level" things, like I/O.
Waiting time (wa): As you probably know, disks aren't nearly as fast as the rest of your system, so when you need to get something off a disk or write something to one, the system has to wait a relative eternity for it to finish. The "waiting time" is just the percentage of time that the CPU sits idle waiting for the disk to finish doing its thing. If this is high, then something is working the disks hard: either an application, or the dreaded swap (bom bom, bommmmm).
Idle time (id): How long the CPU spends just lounging around, not doing anything productive.
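For the curious, the percentages above come from tick counters the kernel exposes in /proc/stat; a tool samples them twice and looks at the difference. The sketch below shows that arithmetic under the assumption that the first line of /proc/stat lists user, nice, system, idle, iowait, irq, softirq and steal ticks in that order (true of reasonably recent Linux kernels). The function names are mine, not part of any tool.

```python
def parse_cpu_line(line):
    """Turn the aggregate 'cpu' line from /proc/stat into a dict of tick counters."""
    fields = line.split()
    if fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    names = ["user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal"]
    values = [int(v) for v in fields[1:1 + len(names)]]
    return dict(zip(names, values))


def cpu_percentages(before, after):
    """Compute top-style us/sy/wa/id percentages between two samples."""
    deltas = {name: after[name] - before[name] for name in before}
    total = sum(deltas.values())
    if total <= 0:
        raise ValueError("the two samples must be taken at different times")

    def pct(ticks):
        return round(100.0 * ticks / total, 1)

    return {
        "us": pct(deltas["user"]),
        "sy": pct(deltas["system"]),
        "wa": pct(deltas["iowait"]),
        "id": pct(deltas["idle"]),
    }
```

In practice you would read the first line of /proc/stat, sleep for a second or so, read it again, and pass both parsed samples to cpu_percentages.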
Memory Usage: Just below the CPU info are a couple of lines, labelled "Mem" and "Swap". These lines give you some (very) basic idea of where memory is being used on the system. By themselves they don't tell you very much, but if your CPU percentages are skewed in certain ways, they can be of some use in narrowing down the root cause. These tests are only really valid when there is very little (less than about 20,000k) free main memory (the third value in the first line). If you've got lots of free main memory, then something which was using a lot of memory probably just exited, and your analysis won't be accurate.
High waiting time, very low buffers/cache: If the CPU is spending a lot of time waiting for the disk, and you have very low values for the buffers and cache (typically less than 20,000k each) then it's likely that part of the reason your disks are slow is that there is very little memory available to cache disk data, and so the system is constantly going back to disk to re-read data it just read a little while ago. Reduce your memory usage or install more RAM.
High waiting time, large amounts of swap used: The chances are that the system is swapping heavily (that is, writing a lot of pages of memory to disk so that other chunks of memory can be read from disk and worked on by a program). Again, reduce your memory usage or install more RAM, because your system is running out.
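The two rules of thumb above are easy enough to automate. This sketch parses text in the format of /proc/meminfo and applies them; the 20,000k threshold comes from the text, while the high_wait_pct and large_swap_kb cut-offs are assumptions of mine that you should adjust to taste.

```python
LOW_KB = 20_000  # the "very little memory" threshold from the text, in kB


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (mostly kB)."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts:
            info[key.strip()] = int(parts[0])
    return info


def memory_hints(info, iowait_pct, high_wait_pct=20.0, large_swap_kb=100_000):
    """Apply the rules of thumb; both keyword thresholds are assumed values."""
    if info["MemFree"] >= LOW_KB:
        return ["plenty of free memory, so these checks are not reliable right now"]
    hints = []
    if iowait_pct >= high_wait_pct:
        if info["Buffers"] < LOW_KB and info["Cached"] < LOW_KB:
            hints.append("too little memory left over for the disk cache")
        if info["SwapTotal"] - info["SwapFree"] > large_swap_kb:
            hints.append("probably swapping heavily")
    return hints
```

Feed it the contents of /proc/meminfo and the "wa" percentage you read off top.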
Process list: The list of processes that are running on the system is what takes up most of the space on the screen, starting just below the "highlighted" line. The contents will change every couple of seconds, as top collects a new set of data and displays it for you. By default, the list of processes is sorted by the amount of CPU time (the sum of the "user" and "system" CPU time) they're using, although you can change the sort order to pretty much anything you like. Things to look out for in the process list are:
One process using 100% (or more) of CPU: If the CPU info showed that the CPUs were largely being consumed in user time, then there should be one (or more) processes in the list that are using all that CPU time, and they should be at the top of the list. If you're bottlenecked on CPU, knowing which process is using it all is obviously crucial.
One process using most of the memory: If the CPU and memory usage showed that there was memory pressure, then you should press M (capital-m) to sort by memory usage. The big memory hogs should be clearly displayed at the top of the list. Don't worry so much about exactly how much memory they're using; it's more useful to identify who the hogs are, so they can be optimised. What's nice about this view is that you can watch it over time as your program runs, so you can see (broad) trends in usage, which can be helpful in tracking down why the memory is being used, and how to fix it.
Hardware limitations
Most of the time, performance problems are prima facie caused by insufficient hardware resources. When you identify that a particular hardware component is the bottleneck, you either need to reduce the amount of work that that component does, or else you need to upgrade the component so it can provide more performance. Since the amount of performance you can buy in any class of hardware has a ceiling, it's important to be familiar with the techniques for optimising your software's use of system resources.
Even if you're confident that you can optimise your application, though, what do you optimise for? There's not much point in trying to use less memory if your bottleneck is transferring data over the network. So, to complement the section on interpreting top, above, here's my checklist of hardware components and how to work out if it's the problem, in rough order of what's most likely to be the cause of a random performance bottleneck.
Memory Capacity
This is, by far, the most common performance problem I've seen. Programmers are generally taught to program like memory is infinite, and that there's no reason to optimise for memory usage in the general case. This is pretty good advice on the whole, as premature optimisation is the root of all evil — but when you hit a problem which is larger than your available memory, all hell breaks loose.
Initial indicators of a machine being overloaded memory-wise are that the cache and buffers in top are small (anything less than 30,000k or so is "small" in this case), a fair chunk of swap is being used, and the system has high IO wait (due to all that swapping). Run sar -W 1 0 to verify that the system is, indeed, actively swapping (as opposed to doing a lot of other disk activity) and if you get constant non-zero values in that output, then you've found your white whale.
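If sar isn't installed, the same swap-in and swap-out figures can be derived from the pswpin and pswpout counters in /proc/vmstat. The following is a rough sketch of that calculation, not a replacement for sar; the function names are my own.

```python
def swap_counters(vmstat_text):
    """Pick the pswpin/pswpout page counters out of /proc/vmstat-style text."""
    counters = {}
    for line in vmstat_text.splitlines():
        name, _, value = line.partition(" ")
        if name in ("pswpin", "pswpout"):
            counters[name] = int(value)
    return counters


def swap_rates(before, after, seconds):
    """Pages swapped in and out per second between two samples."""
    return {
        "pswpin/s": (after["pswpin"] - before["pswpin"]) / seconds,
        "pswpout/s": (after["pswpout"] - before["pswpout"]) / seconds,
    }


def actively_swapping(rate_samples):
    """True if every sample shows some swap traffic (constant non-zero values)."""
    return all(r["pswpin/s"] > 0 or r["pswpout/s"] > 0 for r in rate_samples)
```

Sample /proc/vmstat every second or so for a little while; if actively_swapping stays true across the samples, the machine really is swapping rather than just doing other disk work.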
Hard disk bandwidth
Hard drives are by far and away the slowest component in your system, and it's very easy to massacre your disks without really realising it — lots of things on the system need to do things on the disk, and sometimes your application doesn't even do anything to the disk directly (think database queries, or debug logging) but still causes lots of disk activity. It doesn't take much to kill performance with some ill-chosen operations. It's easy to blame "the database" for being slow, when in fact it's actually that your hard drives have been worked so hard they're now molten slag on the bottom of the case.
Performance problems related to hard drives are often very closely related to those caused by memory exhaustion (above): running out of memory causes swap to be used, which increases load on the disk and reduces the memory available to the disk cache which, again, increases how hard the disk gets driven. Conversely, a little bit of light swapping isn't the end of the world for a lightly loaded system, but if you're swapping onto a disk that's working pretty hard already, that can really kill performance.
To diagnose a problem as being hard drive-related, top's IO wait percentage is again the first stop. Consistently high IO wait means lots of disk activity. If there's plenty of cache/buffers, and sar -W 1 0 shows lots of zeroes (and possibly the occasional blip) then the disk is getting thrashed, but it's not swap. Running iostat -dx 1 will show you all the partitions and how hard they're working (look at %util). If %util is consistently at or around 100 for any partition or disk, you can definitively say that the disks are getting thrashed.
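To take some of the mystery out of %util: it is the share of wall-clock time during which the device had I/O in flight. The sketch below derives it from two samples of /proc/diskstats, assuming the usual Linux layout in which the tenth counter after the device name is the number of milliseconds the device has spent doing I/O. It illustrates the arithmetic only; the function names are mine.

```python
def io_ticks(diskstats_text):
    """Map device name -> milliseconds spent doing I/O, from /proc/diskstats text."""
    ticks = {}
    for line in diskstats_text.splitlines():
        fields = line.split()
        if len(fields) >= 13:
            # Layout: major, minor, name, then the counters. The tenth
            # counter (index 12) is the total time the device was busy, in ms.
            ticks[fields[2]] = int(fields[12])
    return ticks


def utilisation(before, after, elapsed_ms):
    """Percentage of the sampling interval each device spent busy."""
    return {
        dev: round(100.0 * (after[dev] - before[dev]) / elapsed_ms, 1)
        for dev in after
        if dev in before
    }
```

A device that reports close to 100 here, sample after sample, is the one getting thrashed.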
One side note on disks getting thrashed, though, before we move on. If the disk has high %util, but the actual throughput (rsec/s and wsec/s) is pretty low, then it's possible you've got a hardware fault or RAID rebuild going on. A hardware error might show up on a smartctl run (smartctl -a /dev/sda or whatever), looking at things like the reallocated sector count, but SMART isn't real, well, smart, so don't trust it too much. A RAID rebuild should show up in your RAID management (you are monitoring your hardware RAID setup, aren't you?). A software RAID rebuild will be shown in /proc/mdstat.
CPU speed
Modern CPUs are stupendously fast, with several concurrent cores, and it is really quite rare that the true bottleneck is the CPU. It does happen on occasion, though, mostly with programs that do a lot of number crunching or other data processing rather than external interaction. If you've got a few of these sorts of processes running, your other processes can wither a little due to lack of regular CPU time.
The symptoms of CPU exhaustion are that the user and system components of your CPU time are consistently at 100%, with zero idle time and near-zero IO wait, with the CPU usage dominated by one process (or a related set of processes).
I have never seen a web application, in and of itself, peg the CPU. I usually see this situation when there's CPU intensive background processes running (like video transcoding, that seems to be all the rage these days).
Network bandwidth
The funny thing about network issues is that they're frequently blamed when they're not at fault, but when the network is the actual root cause of a problem, it often goes undetected, with some other component getting the blame. Actual network problems fly under the radar mostly because the things that you care about use the network, but you rarely have much to do with it directly.
Detecting that you've got a network bottleneck is something of a process of elimination. If you've got a definite slowdown, but your IO wait is low and your user CPU utilisation isn't much chop, then there's a chance you've got a network problem.
Ultimately, you're best off running sar -n ALL 1 0 and looking at the rxkB/s and txkB/s columns. If either is getting up anywhere near 80% of your network's nominal capacity (remembering that sar will give you kilobytes per second, while your network is almost certainly rated in bits per second), then you've probably got a problem. If other things are talking to the machine you're talking to, then at 80% you definitely have a problem. Also, divide the rxkB/s and txkB/s by the corresponding rxpck/s and txpck/s to get the average packet size in kilobytes; if that number is less than about 0.8 on a network with significant traffic, then it's possible (though not certain) that something out there is choking on all the tiny packets you're sending.
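The unit conversion is where people usually trip up, so here is the arithmetic as a short sketch. It assumes that sar's kilobytes are 1024 bytes and that the link is rated in decimal megabits per second; both function names are made up for this example.

```python
def link_utilisation(kb_per_s, link_mbit):
    """Fraction of the link's nominal capacity in use.

    Assumes kB means 1024 bytes and the link rating is in decimal
    megabits per second (so a gigabit link is link_mbit=1000).
    """
    bits_per_s = kb_per_s * 1024 * 8
    return bits_per_s / (link_mbit * 1_000_000)


def average_packet_kb(kb_per_s, packets_per_s):
    """Average packet size in kB, or None when there is no traffic."""
    if packets_per_s == 0:
        return None
    return kb_per_s / packets_per_s
```

For example, 100,000 kB/s on a gigabit link works out to just under 82% of nominal capacity, which is past the 80% mark mentioned above.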
If you've got an apparent network problem (your app has been demonstrated to be waiting on the network) but your network traffic isn't really all that high, then either you've got errors (again, sar -n ALL 1 0 will show those statistics), something else on the network is chewing all the bandwidth (find it and kill it!), or it's not the network that's your bottleneck, it's the performance of whatever you're talking to on the far end of the network. Restart this analysis on that machine, or e-mail a copy of this article to whoever's responsible for that machine, and get them to fix it.
Memory bandwidth
This is the most esoteric and by far the least likely performance problem out there, but I've seen it once or twice, so it's worth mentioning. People don't really consider that main memory ("RAM") is significantly slower than the CPU, and in certain corner cases, this can cause problems.
Basically, if your CPUs are maxed, and you replace just the CPUs with significantly faster versions, and you get much the same performance as before (including the apparent CPU bottleneck), then there's a reasonable chance that your RAM is the bottleneck. The problem, of course, is that there aren't a lot of options for any given motherboard/CPU as far as RAM speeds go. If, for some reason, you do happen to have the slow version of a particular form factor of RAM, then give the faster version a go. Sometimes it'll work, sometimes it won't. If it doesn't, it's likely that nothing short of a newer chassis (mobo/CPU/RAM) will clear the bottleneck, or else just not hammering the CPU so much.
Bugz in the codez
There's one performance problem that I want to touch on that is totally unrelated to hardware, and it will probably stump you as thoroughly as it did me the first time I hit it.
The symptoms are puzzling — typically no hardware component is even minorly implicated in a performance problem (lots of spare memory, disks near-idle, CPUs not doing much, and network lights barely flickering), yet it takes some ungodly amount of time for your application to respond.
If you're making calls to external web services, it's pretty likely that they're at fault. If the performance problem is transient, then upgrade "pretty likely" to "almost certainly". Do some manual requests to the web service and see what the performance is like, then hassle the admin of that machine to fix things up, or consider getting your information some other way.
That's not buggy code, though. The real performance entertainment comes when you're using synchronisation primitives in heavily threaded code, and you're not doing it particularly well. This seems to hit Java- and .NET-based web services particularly hard, because they've drunk the thread-based concurrency kool-aid heavily. What is almost certainly happening is conceptually simple, and fiendish to prove: each of the active threads is constantly waiting for some other thread to finish with some critical resource, so everyone spends more time saying "May I?" and bumping into each other than actually doing useful work.
What's really insidious about this problem is that it rarely-to-never shows up in testing, because you're typically hitting your app only once or twice in parallel. There, you don't see the problem, because there's no contention for the critical resource. It's only when you're trying to service a number of concurrent requests that things get ugly. You can even get through light loads without apparent difficulty, again because the number of lock contentions versus work is low. But once you hit some threshold where lock contention gets heavy... look out. It gets worse because as things slow down, each request takes longer, which creates a backlog, which increases contention for the critical resource... and you get the idea.
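To make the effect concrete, here is a toy reproduction. It is written in Python rather than Java or .NET purely for brevity, and time.sleep stands in for "slow work"; the point is only to contrast doing the slow step while holding the lock with doing it outside the lock. The function name and its parameters are invented for this sketch.

```python
import threading
import time


def run_workers(workers, work_seconds, hold_lock_during_work):
    """Run several threads and return the total wall-clock time taken."""
    lock = threading.Lock()
    results = []

    def worker():
        if hold_lock_during_work:
            # Coarse locking: the slow step happens inside the critical
            # section, so the threads queue up one behind another.
            with lock:
                time.sleep(work_seconds)
                results.append(1)
        else:
            # Fine locking: only the shared update is protected, so the
            # slow steps overlap.
            time.sleep(work_seconds)
            with lock:
                results.append(1)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    start = time.perf_counter()
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return time.perf_counter() - start
```

With four workers and 50ms of "work" each, the coarse version takes at least four times as long as a single piece of work, while the machine sits almost entirely idle: exactly the puzzling symptom described above.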
If you're not the developer of the software in question, then tell the developer to fix their code. If you are the developer, then fix your code!
Step 2: Fix the problem
Having identified which aspect of the system is at the heart of the problem, you can now work towards fixing it so it's no longer a problem. Your choices are either to improve the performance of the hardware component that is the bottleneck, or modify your software so that it relies less on that hardware component.
Upgrading or otherwise tuning the hardware is the quickest and usually the cheapest solution, but you can only upgrade your hardware so much before you get to the limit of available performance. Spending developer time fixing the problem on the software end is somewhat more costly up front, but usually provides better long-term benefits, and in many cases can provide orders of magnitude performance benefits if your developers are any good.
We'll go through each of the hardware components that were identified above, and describe how you can upgrade the hardware or modify your software to relieve the pressure on that component. If you identified which component of your system is causing the problem previously, jump to the corresponding section below for tips on fixing the problem.
Memory capacity
If you've got insufficient memory, then try adding more. It's cheap as, well, memory chips, and quick to do. If your machine is a couple of years old and is maxed out on RAM, upgrade to a new chassis. Server motherboards these days can accommodate at least 64GB of RAM (if not 128GB or 256GB) and RAM prices for the larger sticks will come down before you need to upgrade again, giving you memory options into the future.
My general answer to the question "how much RAM should I add?" is typically "how much can you afford?", however at the very least you should add as much RAM as is currently in swap, rounded up to the nearest whole stick of memory. Double that if you want a bit of headroom.
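That rule of thumb is simple arithmetic, sketched here for completeness. The function name and the idea of passing the stick size as a parameter are my own; the rounding and doubling follow the text.

```python
import math


def ram_to_add_gb(swap_used_gb, stick_gb, headroom=False):
    """Minimum RAM to add: swap in use, rounded up to whole sticks.

    With headroom=True the figure is doubled, as suggested above.
    """
    sticks = math.ceil(swap_used_gb / stick_gb)
    minimum = sticks * stick_gb
    return 2 * minimum if headroom else minimum
```

So with 5GB sitting in swap and 4GB sticks, you would add at least 8GB, or 16GB if you want the headroom.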
On the software side, memory overconsumption is easy to do, and sometimes not so easy to fix. There are two things which software developers are taught, very early on, which cause most memory consumption issues:
- Assume you've got infinite memory; and
- Garbage collection will solve all your problems.
These are good rules of thumb, because they prevent premature optimisation stopping any useful work getting done, right up until they rear up and bite you in the arse.
If the memory usage of the application starts small and gradually but inexorably rises, then it's probably memory leaks causing the problem. Software written in a garbage-collected language isn't immune to memory leaks; they're just a bigger pain to track down. Explicit memory-management languages, like C, are the poster children for memory leaks, and there is a plethora of tools available to find the problems. Similar tools exist for most popular GC languages (like Java and C#), but using them all is beyond the scope of this article.
Assuming that you've got infinite memory and don't have to worry about the size of your data structures is the more entertaining failure mode. If your program either starts off consuming inordinate amounts of memory, or it runs along happily with a constant, reasonable size for a while, then suddenly leaps up and grabs a monster chunk of RAM, then it's likely that you've got a data structure problem.
Unfortunately, profiling for overly-large data structures isn't something a general purpose tool can really help with, as what's at fault and what's just regular "big" is very application specific. You really need to know what the program is doing, and why, to see what is unreasonable growth. The problems I've seen in the past, though, include gems such as:
Trying to work with too much data at once. If you needed to take the average of a set of numbers stored in a database, or process the contents of a log file, you might be tempted to read all the data into memory at once and then do your processing. This works fine for small data sets, but when you suddenly need to average a few trillion numbers, or process a 20GB log file, you're stuffed. Instead, in cases where you don't know in advance how much data might be involved, try reading and processing in small chunks (one number at a time, or one line at a time). It is a rare bulk data manipulation algorithm that cannot be chunked in some way.
Redundant data. When you need to relate a piece of data to another piece, refer to the original rather than taking a copy. With enough cross-referencing, you can end up with hundreds of copies of all your data, and you will be consuming orders of magnitude more memory than you need. The worst case is when every object has a copy of every other object; in this case, adding an Nth object to the set means that N-1 duplicate copies of that object will be created, which for a large object (and/or large values of N), you're going to be chewing huge amounts of memory.
If you don't need it, don't keep it. Some people are pack rats: they feel a need to keep far too much junk, on the off-chance that one day they might need it again, and then you'll see that they were right to keep it all... Caching intermediate results isn't a bad idea, in moderation, but it's easy to take it waaay too far. Analyse what is too expensive to re-compute, and only cache that. Also implement a cache purging mechanism, so that you don't have infinite cache growth. If this all sounds like too much work, either punt to the filesystem (cache in files), a dedicated in-memory cache program (memcached), or just don't bother caching until performance really warrants it.
Disk is too slooooow. Related to "If you don't need it, don't keep it" and "Trying to work with too much data at once", this typically occurs when someone was trying to pre-optimise the performance of the program, and assumed that since disk is slow, it would be better if we never read from it, and instead read all the data files into memory at once before the program started doing its thing. This is a nice idea, but when your data files get large, the theory breaks down. More intelligent on-disk data structures, facilities like mmap(2), and using the operating system's disk cache, are all better strategies than pre-fetching everything "just in case".
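As an illustration of the "small chunks" advice in the first item of that list, this sketch averages a set of numbers one at a time, so memory use stays flat no matter how much data there is. It assumes one number per line of input; the function names are mine.

```python
def numbers_from_lines(lines):
    """Yield one float per non-blank line, without holding the rest in memory."""
    for line in lines:
        line = line.strip()
        if line:
            yield float(line)


def streaming_mean(numbers):
    """Average any iterable of numbers while keeping only a count and a total."""
    count = 0
    total = 0.0
    for value in numbers:
        count += 1
        total += value
    if count == 0:
        raise ValueError("no numbers to average")
    return total / count
```

Because a Python file object hands back one line at a time, streaming_mean(numbers_from_lines(open_file)) works just as well on a 20GB file as on a 20-line one.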
It's possible, too, that it's not your program that's taking all the memory, but is instead some system process. Databases are the most common culprit, and usually can be tuned to use less memory, although that might have bad effects on database performance.
Hard disk bandwidth
There are a few options on the hardware side here, from faster hard drives and more hard drives to esoteric things like battery-backed write caching. We've got a skeleton article on tuning disk IO performance, as well as an article on the best way to benchmark your disk IO, so that you know if the changes you're making are helping or hindering.
Of course, there's not a lot of benefit in optimising your disks if you don't know which of them is causing the problem. You can see what every block device is doing using iostat -dx 1 (sar -d 1 0 provides similar output, but I prefer the layout of iostat for this job).
What will come out of iostat every second will look something like this:
Device:  rrqm/s  wrqm/s   r/s   w/s  rsec/s  wsec/s  avgrq-sz  avgqu-sz   await  svctm  %util
sda        0.00    0.35  0.00  0.23    0.03    4.70     20.26      0.00    1.78   0.50   0.01
sdb        0.05    3.03  0.07  2.56   16.17   44.69     23.14      0.15   58.30  13.77   3.62
dm-0       0.00    0.00  0.03  2.29    0.79   18.31      8.23      0.25  108.29   3.16   0.73
dm-1       0.00    0.00  0.08  2.56   14.86   20.46     13.37      0.15   56.26  11.18   2.95
dm-2       0.00    0.00  0.01  0.74    0.51    5.90      8.60      0.15  206.40   1.79   0.13
dm-3       0.00    0.00  0.00  0.59    0.03    4.70      8.04      0.00    1.50   0.20   0.01
The important columns, initially, are Device and %util. If the %util for a given device is consistently high, then that device is getting hammered. Devices in this list can be "layered"; that is, if you have a logical volume (which is contained on a partition on a disk) that is being overworked, all of the logical volume (dm-N), the partition, and the disk will show heavy usage. Software RAID volumes will also show several partitions and disks with the same usage patterns.
Irritatingly, iostat usually doesn't turn its block device references into useful mount points. However, a bit of manual inspection of fstab will tell you what a regular block device maps to (sda1 or md1, for instance). For dm-N devices, you need to run ls -l /dev/mapper. The N in dm-N will map to a number just to the left of the date (which is the device minor number), and the corresponding filename should make it obvious which logical volume is at issue.
iostat will tell you which disk partition is at fault, which can be handy if you're looking at upgrading some disks. If you're out to find the culprit, though, you're more interested in which process is generating the load, and for that you want something like iotop or blktrace. We've got a whole wiki article on using iotop and blktrace to trace IO usage on Linux, which will give you all the help you need in narrowing down who is causing the problems.
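If squinting at minor numbers gets tedious, the dm-N to name mapping can be scripted. Note that this sketch uses a different route from the ls -l /dev/mapper approach above: it reads the name that device-mapper publishes under /sys/block/dm-N/dm/name, which I'm assuming is present on your kernel. The function name is my own.

```python
import os


def dm_names(sys_block="/sys/block"):
    """Map each dm-N device to its device-mapper name via sysfs."""
    names = {}
    for entry in sorted(os.listdir(sys_block)):
        name_file = os.path.join(sys_block, entry, "dm", "name")
        if entry.startswith("dm-") and os.path.isfile(name_file):
            with open(name_file) as handle:
                names[entry] = handle.read().strip()
    return names
```

The result is a dictionary keyed by the dm-N names that iostat prints, so you can look up the logical volume behind whichever device has the alarming %util.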
If it's your program, then run through our article on using the disk more efficiently, and see if any of that could help solve the problem. If you're using a database server, though, and that's taking up most of the IO, then you'll want to look into perhaps configuring the database server better, or optimising your queries or schema to be more efficient. Most database servers come with really pessimistic default configurations, and need to be hand-tuned to your workload in order to work most efficiently. The schema and queries, also, are by default a fairly poorly optimised area, and could usually do with some optimisation work.
CPU speed
If you truly have a CPU-bound application, then congratulations: it's rare for a single program to fully utilise a CPU core these days, so you're a member of an elite club.
Hardwarily speaking, there's not much to be done other than buy bigger, faster, better CPUs. The traps for the unwary are numerous and subtle, though, so don't just leap into a numbers war. CPUs from different manufacturers, as well as different models and even variants in a single product line, can all perform significantly differently depending on the workload they're being given. It's worth benchmarking your particular workload on a few different CPUs to get a feel for how they compare in your situation.
On the software side, you may be able to trade space for time by caching data on disk or in RAM, or reducing the amount of work your app is doing by some clever code optimisations. Using algorithms that are more efficient, or that cost more in storage while using less CPU, can provide massive (orders of magnitude) performance improvements on large data sets. Don't make these sorts of changes blindly, though — have good profiling data, and strong representative test cases, to back up your experimentation. It is very easy to micro-optimise your code to run faster in one situation, at the expense of killing the performance for a number of other scenarios. Don't shoot yourself in the foot.
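One low-effort way to trade space for time in RAM is a bounded memoisation cache. The sketch below uses Python's functools.lru_cache; the prime-counting function is just a stand-in for whatever is expensive in your application, and the maxsize of 256 is an arbitrary assumption. Bounding the cache matters, because an unbounded one simply converts a CPU problem into a memory problem.

```python
from functools import lru_cache


@lru_cache(maxsize=256)  # bounded, so the cache cannot grow without limit
def count_primes_below(limit):
    """Count primes below limit with a simple sieve (a stand-in for slow work)."""
    if limit < 3:
        return 0
    sieve = bytearray([1]) * limit
    sieve[0] = sieve[1] = 0
    for i in range(2, int(limit ** 0.5) + 1):
        if sieve[i]:
            # Knock out every multiple of i, starting from i squared.
            sieve[i * i::i] = bytearray(len(range(i * i, limit, i)))
    return sum(sieve)
```

Repeated calls with the same argument are answered from the cache, and count_primes_below.cache_info() reports hits and misses so you can check with real numbers whether the cache is earning its keep.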
For certain workloads, utilising parallel processing can give big speed improvements too. With multi-core CPUs becoming the norm, if you're not already using multi-threaded or multi-process models, the chances are you're not using the full potential of your system. Parallel programming isn't really something that can be exhaustively covered in an article like this, but there are plenty of resources online, for practically every programming language under the sun, to help you get started.
Network bandwidth
The hardware upgrade for a saturated network is pretty simple: a faster network connection. If it's a LAN interface that you're filling up, then upgrade to Gigabit Ethernet if you haven't already. Turning on jumbo frames, if your hardware supports them, can provide a good performance increase too. If you are saturating a gigabit network, then you're doing awfully well. If you really need more bandwidth, then take a look at channel bonding, where multiple physical network connections are aggregated together to give a larger "virtual" connection. At some point, though, your bonded connections aren't going to be the bottleneck any more; CPU or bus IO will be, so consider software fixes instead.
Connections to the Internet are usually much slower than your LAN, and hence are more likely to get congested. You can buy a bigger pipe, but that can get costly very quickly.
If there are lots of things going on on your machine, then you can identify what is hogging the network using iftop, which shows you a top-like list of bandwidth usage between pairs of hosts.
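If you just want a quick check of how close an interface is to saturation, one option is to sample the kernel's per-interface byte counters yourself. The sketch below is an assumption-laden illustration rather than a replacement for iftop: it reads the Linux-specific /proc/net/dev file twice, one second apart, and prints the receive and transmit rate for each interface. It reports totals per interface only, not per connection.

```python
import os
import time

PROC_NET_DEV = "/proc/net/dev"  # Linux-specific counter file


def parse_net_dev(text: str) -> dict:
    """Map interface name -> (rx_bytes, tx_bytes) from /proc/net/dev content."""
    counters = {}
    for line in text.splitlines()[2:]:  # the first two lines are headers
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        # Field 0 is received bytes; field 8 is transmitted bytes.
        counters[name.strip()] = (int(fields[0]), int(fields[8]))
    return counters


def throughput(before: dict, after: dict, seconds: float) -> dict:
    """Bytes per second (rx, tx) for each interface present in both samples."""
    return {
        name: ((after[name][0] - rx) / seconds, (after[name][1] - tx) / seconds)
        for name, (rx, tx) in before.items()
        if name in after
    }


if __name__ == "__main__" and os.path.exists(PROC_NET_DEV):
    def sample():
        with open(PROC_NET_DEV) as handle:
            return parse_net_dev(handle.read())

    first = sample()
    time.sleep(1.0)
    for name, (rx, tx) in throughput(first, sample(), 1.0).items():
        print(f"{name}: rx {rx * 8 / 1e6:.2f} Mbit/s, tx {tx * 8 / 1e6:.2f} Mbit/s")
```

Comparing these figures against the link speed gives a rough idea of whether the network really is the bottleneck before you start changing software.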
Fixing up your software to do the same job with less network traffic is an interesting game of trade-offs. First off, what is actually in all that traffic? Is there anything that can perhaps be left out? Not sending something at all is always the best optimisation.
If the traffic is essential, try reducing the global impact of the traffic, by placing the services that need to talk to each other closer together, network topologically speaking. So, if the machines are talking over the Internet, move them into the same data centre or give them a dedicated link they can use. If they're on the same LAN already, look at putting the services on the same machine so they can talk using local sockets rather than network sockets.
Can the data be compressed, perhaps? A lot of protocols, especially common ones like HTTP, have an option to compress the data before sending it over the network, a practice that can reduce network traffic by up to 70% for compressible data such as text. Data that is already compressed, like most image and video formats, won't shrink much further. Compression trades CPU time for network bandwidth, which is typically a good trade-off. If you're using your own hand-crafted network protocol, work some compression in there right from the start, to save yourself the annoyance of retrofitting it and supporting graceful degradation.
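To get a feel for what compression might buy you, it helps to measure it on a sample of your own traffic. The following sketch uses Python's standard gzip module on a made-up, repetitive JSON payload; the record fields and the compression level are assumptions for the example, and real savings depend entirely on what your data looks like.

```python
import gzip
import json


def build_payload(rows: int) -> bytes:
    """Build a repetitive JSON document (made-up data for illustration)."""
    records = [
        {"id": i, "status": "active", "region": "ap-southeast"}
        for i in range(rows)
    ]
    return json.dumps(records).encode("utf-8")


def compress_report(payload: bytes, level: int = 6):
    """Return (compressed bytes, fraction of bytes saved) for a payload."""
    packed = gzip.compress(payload, compresslevel=level)
    saved = 1 - len(packed) / len(payload)
    return packed, saved


if __name__ == "__main__":
    payload = build_payload(1000)
    packed, saved = compress_report(payload)
    assert gzip.decompress(packed) == payload  # lossless round trip
    print(f"{len(payload)} bytes -> {len(packed)} bytes ({saved:.0%} saved)")
```

Higher compression levels spend more CPU for somewhat smaller output, so if the CPU is already busy it may be worth trying a lower level and comparing.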
If the communication is repetitive in nature, think about caching at the recipient end of the connection. This trades local storage usage (RAM or disk) for network traffic, which is often a good trade-off if you can manage it. However, if the data becomes stale too quickly, or you're not up to implementing a good cache replacement and expiry algorithm, consider your other options first.
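As a sketch of the simplest kind of expiry policy, the class below keeps each fetched value for a fixed number of seconds and goes back to the source once it is stale. The class name, the 30-second lifetime and the fetch_from_network function are all hypothetical, and there is deliberately no size limit or replacement policy here, which a real cache would need.

```python
import time


class TTLCache:
    """Tiny cache whose entries expire after `ttl` seconds (illustrative only)."""

    def __init__(self, ttl: float, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock   # injectable so expiry can be tested
        self._store = {}     # key -> (expiry time, value)

    def get(self, key, fetch):
        """Return the cached value, calling fetch(key) if missing or stale."""
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # still fresh: no network round trip needed
        value = fetch(key)   # stale or absent: go back to the source
        self._store[key] = (now + self.ttl, value)
        return value


if __name__ == "__main__":
    calls = []

    def fetch_from_network(key):
        """Hypothetical stand-in for a remote request."""
        calls.append(key)
        return f"payload for {key}"

    cache = TTLCache(ttl=30.0)
    cache.get("/report", fetch_from_network)
    cache.get("/report", fetch_from_network)  # served from the local cache
    print(f"remote fetches: {len(calls)}")
```

Choosing the lifetime is the hard part: too short and you save little traffic, too long and you serve stale data.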
Step 3: Repeat
Once you've improved the performance of your application by removing a bottleneck, it will run faster than it did before. However, there will again be a limit to how fast you can process requests. If your system performance is not acceptable after a round of tuning, you'll need to repeat the whole process of determining the bottleneck and removing it. This may need to happen several times until you get your system to the point where the performance is acceptable for your purposes.
A word on clustering
One thing we haven't mentioned at all in this article is clustering your application for performance reasons, sometimes referred to as "horizontal scaling". This is because changing an application architecture from a single machine to a clustered setup is not an undertaking to be considered lightly. The way you design something like that is very, very different from the way you put together a single-machine web service, and dealing with all the consequences and corner cases is far beyond what can be covered in a general article like this.
There is, however, a definite limit to how much you can do on a single machine. If your application is starting to push those boundaries, and you expect further growth in the future, then find someone who knows about these things and start talking to them now. Anchor provides large-scale clusters for several of our customers, and we can give you plenty of advice on how to structure your system to take advantage of a cluster of machines.
Further Reading
Jeff Atwood of Coding Horror has a Windows-oriented version of this same topic, discussing how to use the built-in tools in Windows Server 2008 to find performance problems.
