Andrew Tridgell’s rsync utility is widely used for pushing files around between servers. Anyone can copy files across the network; what makes rsync special is that it compares the files on each end and transfers only the differences, instead of pushing the whole file across the wire when only a few bytes need to be updated.
We use rsync to back up servers every single day, though we recently found a few big files where rsync was going to take an eternity to perform what should’ve been several hours of work. So, we busted out the butterfly nets and went to catch us some bugs.
rsync in a nutshell
Most servers don’t change a great deal on a day to day basis, so transferring just the differences is a very smart way to perform nightly incremental backups. You just copy yesterday’s image and update the small parts that have changed, and rsync is the perfect tool for this.
A newer development in our backups is the use of the btrfs filesystem, which we talked about a little earlier this year. One of btrfs’ killer features is copy-on-write snapshots, which effectively give you copies for free and only store the changes when you modify data. In normal usage, rsync makes a new copy of the file with the changes applied, but this defeats btrfs’ copy-on-write benefits. To make this right, we use the --inplace option, which tells rsync to write directly to the original file, letting btrfs work its magic.
Assuming that a relatively small proportion of your data changes every day, what this lets you do is store many backup snapshots while only consuming modest amounts of diskspace. This is especially important for heavy hitters, such as database servers.
Using --inplace changes things
An inplace copy can be problematic because the changes on the receiving side are non-atomic. If you cancel the operation in the middle of a transfer, the destination will only be partially written and can’t be rolled back. This is fine for backups, but it points to rsync needing to change its behaviour when referring to reused chunks of data.
To give an example, imagine a file containing chunks of data A, B, C and D. File 1 contains these chunks in order, and already exists on the receiving end. File 2 exists on the sender’s end and contains the same chunks, but in reverse order, DCBA.
Now imagine that we want to copy File 2 to File 1. In normal operation, rsync will populate a temporary file with the necessary data, and replace the target file with the temporary file once it’s completed. It’s very efficient because we don’t even need to send any real file data across the wire, so long as we have room to build the temporary file on the receiving end.
So far so good. What’s different about using --inplace? When data is written inplace, earlier parts of the destination file are unavailable to later parts of the same file (they’ve already been overwritten). The resultant operation looks something like this:
It’s not ideal, but gets the job done without turning this into an NP-hard compsci problem. rsync can reuse some of the data on the receiving end, and then needs to re-send the data for the latter half of the file. rsync is able to keep track of which chunks are still “available” and figure it out for us.
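To make the ABCD to DCBA example concrete, here's a minimal Python sketch of the planning step. It isn't rsync's code: `plan_inplace` is a made-up name, and the rule "a destination chunk is reusable only if it sits at or after the current write position" is a simplification of what rsync really tracks.

```python
def plan_inplace(dest_chunks, src_chunks):
    """Plan an in-place update that turns dest_chunks into src_chunks.

    Returns a list of ("copy", dest_index) or ("send", chunk) operations.
    Simplifying assumption: a destination chunk is reusable only if its
    index is >= the position being written, because everything before
    that point has already been overwritten.
    """
    # Remember where each chunk lives in the original destination file.
    locations = {}
    for index, chunk in enumerate(dest_chunks):
        locations.setdefault(chunk, []).append(index)

    ops = []
    for write_pos, wanted in enumerate(src_chunks):
        usable = [i for i in locations.get(wanted, []) if i >= write_pos]
        if usable:
            ops.append(("copy", usable[0]))   # reuse data already on the receiver
        else:
            ops.append(("send", wanted))      # must go across the wire
    return ops


for pos, op in enumerate(plan_inplace(list("ABCD"), list("DCBA"))):
    print(pos, op)
# 0 ('copy', 3)
# 1 ('copy', 2)
# 2 ('send', 'B')
# 3 ('send', 'A')
```

The first half of the file is rebuilt from chunks that are still intact further along; the second half wants chunks that have already been clobbered, so they have to be re-sent.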
Things get interesting when we start thinking about how to transfer this knowledge across the wire.
Building a hash table
If you’re a regular reader you’ll recall that we talked about a hash DoS attack against btrfs a little while ago. Some of those details are going to sound quite familiar.
During the initial negotiation, the receiver builds a hash table representing its copy of the data, and transmits it to the sender. When the sender starts scanning its copy of the data, it looks up each chunk in the hash table. If the receiver already has that chunk, the sender directs it to reuse the data it already has; otherwise it sends new data to the receiver. To match blocks of data that don’t perfectly align with the chunk size, rsync also uses a rolling checksum on the sender’s side that is calculated for every byte.
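Here's a rough Python sketch of that idea, assuming the two-part weak checksum described in the original rsync technical report. The function names and table layout are ours, MD5 stands in as the strong hash purely for illustration, and real rsync differs in plenty of details (for one, it skips ahead a whole block after a match rather than continuing byte by byte).

```python
import hashlib

M = 1 << 16  # modulus for each half of the weak checksum in this sketch


def weak_checksum(block):
    """Compute the weak checksum halves (a, b) of a bytes block from scratch."""
    length = len(block)
    a = sum(block) % M
    b = sum((length - i) * byte for i, byte in enumerate(block)) % M
    return a, b


def roll(a, b, outgoing, incoming, length):
    """Slide the window one byte: drop `outgoing`, append `incoming`."""
    a = (a - outgoing + incoming) % M
    b = (b - length * outgoing + a) % M
    return a, b


def combine(a, b):
    """Pack both halves into one 32-bit value, used as the hash table key."""
    return a | (b << 16)


def build_table(dest, block_size):
    """Receiver side: map weak checksum -> list of (chunk_index, strong_hash)."""
    table = {}
    for index in range(len(dest) // block_size):
        block = dest[index * block_size:(index + 1) * block_size]
        entry = (index, hashlib.md5(block).digest())
        table.setdefault(combine(*weak_checksum(block)), []).append(entry)
    return table


def find_matches(src, block_size, table):
    """Sender side: slide one byte at a time, return (src_offset, chunk_index)."""
    matches = []
    if len(src) < block_size:
        return matches
    a, b = weak_checksum(src[:block_size])
    offset = 0
    while True:
        # Walk the chain for this checksum; confirm with the strong hash.
        for index, strong in table.get(combine(a, b), []):
            window = src[offset:offset + block_size]
            if hashlib.md5(window).digest() == strong:
                matches.append((offset, index))
                break
        if offset + block_size >= len(src):
            break
        a, b = roll(a, b, src[offset], src[offset + block_size], block_size)
        offset += 1
    return matches


table = build_table(b"abcdefghijkl", 4)
print(find_matches(b"XXefghYabcd", 4, table))  # [(2, 1), (7, 0)]
```

The point of `roll` is that moving the window costs a couple of additions instead of re-summing the whole block, which is what makes a per-byte lookup affordable at all.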
This sounds like a lot of work, but the rolling checksum is efficient to calculate, and checking for matches in the hash table is usually a fast operation. Mostly, it’s considered more efficient to calculate lots of checksums because the network is assumed to be slow and costly.
That’s a normal file. The chain in each bucket is very shallow, and the buckets are all roughly the same size, so performance is fast and consistent. Now what about a special file, like a big cat picture? A reeeeeally big cat picture, with lots of redundancy in the middle.
Now we’re talking. The chain for Midsection chunks is huge because longcat is so long and consistent. This isn’t bad on its own – for a regular copy (not --inplace), rsync will immediately find usable Midsection chunks so there’s no penalty for having an imbalanced hash table.
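If you'd like to see the imbalance for yourself, here's a small Python sketch. It's only an approximation: `zlib.adler32` stands in for rsync's own weak checksum, and the block size and bucket count are arbitrary numbers picked for the demo.

```python
import random
import zlib
from collections import defaultdict


def build_buckets(data, block_size, num_buckets):
    """Chain every chunk index into the bucket chosen by its checksum."""
    buckets = defaultdict(list)
    for index in range(len(data) // block_size):
        block = data[index * block_size:(index + 1) * block_size]
        buckets[zlib.adler32(block) % num_buckets].append(index)
    return buckets


def longest_chain(buckets):
    """Length of the deepest chain in the table."""
    return max(len(chain) for chain in buckets.values())


BLOCK = 64
rng = random.Random(0)


def noise(num_blocks):
    """Pseudo-random bytes, standing in for ordinary file content."""
    return bytes(rng.randrange(256) for _ in range(BLOCK * num_blocks))


normal = noise(120)
# Head and tail are ordinary data; the midsection is 100 identical zero blocks.
longcat = noise(10) + bytes(BLOCK * 100) + noise(10)

print("normal file, longest chain:", longest_chain(build_buckets(normal, BLOCK, 256)))
print("longcat, longest chain:    ", longest_chain(build_buckets(longcat, BLOCK, 256)))
```

The normal file should report a chain only a few entries deep, while longcat's identical midsection blocks all share one checksum and so pile into a single chain at least 100 entries long.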
Where it all falls down
We’re going to put together the pieces that we’ve discussed: We know that traversing a long chain in the hash table is costly, compared to the expected case of just one or two items. We also know that using --inplace forces rsync to disregard chunks that are “behind” the point in the file that’s currently being written to (they’ve been dirtied by the sender). Those chunks will need to be sent across the wire, once the sender finishes searching the hash table for possible matches.
The problem we ran into occurs under fairly specific circumstances:
- The destination file is already “poisoned” by large numbers of identical blocks, resulting in a heavily imbalanced hash table
- We want to sync many changes that would use those poisonous blocks
- The region to be synced occurs after all of the poisonous blocks, i.e. we’ve made changes to the end of the file
This might sound unlikely, but it’s not unreasonable. In our case it was a 1.2 TiB MySQL data file, which results in roughly 1.3 million buckets and 1 million chunks. There happen to be a lot of zeroes early on in the file, which creates the imbalanced hash table – many of those 1 million chunks will be chained in a single bucket. Most of the activity in this MySQL data file occurs at the end, where more zeroes had been written on the sender’s side. rsync hits this section of the file and calculates the rolling checksums as normal. For each checksum, it refers to the hash table, hits the all-zeroes chain, then furiously traverses the chain, skipping over unusable chunks. Things are now possibly hundreds of times slower than normal, and the backup job has been running for over a week with no sign of finishing any time soon; not sustainable.
For this example we’ve created a hybrid longcat that can be further extended (the legs and tail will be added at a later spacetime), and we want to rsync this to the backup server. The sender slides along 1 byte at a time, calculating the rolling checksum and performing a hash table lookup each time. Once it proceeds far enough to establish that a block can’t be sourced from the receiver’s end, it will transmit the block across the wire.
The problem here is that rsync was not designed for modern networks. When rsync was first designed and released in 1996, using the network was expensive – typical domestic connections were around 28.8Kbps with a latency of a few hundred milliseconds, and you paid for the time that you spent connected. rsync optimised heavily to reduce the number of roundtrips necessary to get the job done.
The thing is, modern networks aren’t like that at all: they’re high bandwidth and low latency. We can throw that Midsection chunk across to the backup server in about 10ms, and the entire 1.2 TiB longcat can be transmitted in a number of hours. It simply doesn’t make sense to be precious about the network any more.
Giving things a boost
There are any number of things we could do to improve rsync’s performance, but they’d be very invasive and would probably involve protocol-level changes to rsync, which isn’t practical. We’re looking for simple and effective improvements.
From detailed inspection of rsync’s behaviour and the compiled binary, we nailed down the exact region of code that was burning all the CPU time on the sender. As mentioned, this is the loop that rolls along the file and queries the hash table. We know where the receiver is up to in the file, so we can identify chunks that are earlier in the file and can be considered out of scope. Normally there’s no reason to remove them, but in this particular case there are big gains to be had if we can skip them entirely, as it will shorten the chain.
Any improvement here will have big returns because the chain is traversed many, many times. “Surely”, we thought, “rsync already does this”; Tridge has coded many little improvements into rsync to wring out every last on-the-wire saving, but apparently not this one!
We added a small test to the chain traversal code that checks whether the chunk is still usable if we’re using --inplace mode. If not, we remove it from the chain, which is quick because it’s a singly-linked list in memory. The first time we hit the chain full of unusable Midsection chunks, we strip most of them out in a single pass and they’ll never be checked again. That’s a big win.
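The effect is easy to model. What follows is a toy Python sketch, not the actual patch (rsync is written in C): `Node`, `lookup` and `total_visits` are hypothetical names, every node in the chain is assumed to share one checksum, and "unusable" is simplified to "the chunk's offset is behind the write position".

```python
class Node:
    """One entry in a singly-linked hash chain (hypothetical layout)."""

    def __init__(self, offset, next_node=None):
        self.offset = offset
        self.next = next_node


def build_chain(offsets):
    """Build a chain whose nodes appear in the given order."""
    head = None
    for offset in reversed(offsets):
        head = Node(offset, head)
    return head


def lookup(head, write_pos, prune):
    """Walk the chain looking for a chunk at or beyond write_pos.

    Returns (new_head, found_offset_or_None, nodes_visited). With
    prune=True, unusable nodes are unlinked as they are encountered.
    """
    visited = 0
    prev = None
    node = head
    while node is not None:
        visited += 1
        if node.offset >= write_pos:
            return head, node.offset, visited
        if prune:
            # Unlink the dead node so later lookups never see it again.
            if prev is None:
                head = node.next
            else:
                prev.next = node.next
        else:
            prev = node
        node = node.next
    return head, None, visited


def total_visits(num_chunks, num_lookups, prune):
    """Count chain nodes touched when every chunk is behind the write position."""
    head = build_chain(range(num_chunks))
    total = 0
    for _ in range(num_lookups):
        head, _found, visited = lookup(head, write_pos=num_chunks, prune=prune)
        total += visited
    return total


print(total_visits(1000, 500, prune=False))  # 500000
print(total_visits(1000, 500, prune=True))   # 1000
```

Without pruning, every one of the 500 lookups walks all 1000 dead entries. With pruning, the first lookup pays for the walk and the rest find an empty chain.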
How big are the gains?
In our particular case, the customer’s backups now run to completion in about 3 days instead of operating on a galactic timescale. This is still a fair way from a naive full copy, but we get to keep the benefits of btrfs’ copy-on-write functionality (which is quite significant in itself), and it’s not bad given the amount of work that rsync needs to do with the rolling checksums and other comparisons. In benchmarking we saw improvements of up to 150x in pathologically broken worst-case scenarios.
We want to share the love around, of course, so we’ve pushed it upstream. We’re happy to say that the patch has been accepted and was merged earlier this month.
There are a few small details that we’ve glossed over here for the sake of clarity and brevity. If you’re really curious (hint: think about the ordering of chunks in the chain, and whether a previous chunk of the destination file is really unusable), feel free to shake us down for more details.
Update: We received a nice mail from Mark Curtis, the author of the original inplace patch for rsync, pointing out that inplace copies weren’t part of rsync in its original incarnation. Credit for that patch belongs to Mark, and it doesn’t look like Tridge was involved in integrating the patch either. We’ll admit to knowingly taking a bit of licence there in dropping Tridge’s name; the truth is that it made for a good headline that gets more attention. And cats, you guys love posts with cats.