Pulling apart Ceph’s CRUSH algorithm

February 7, 2013

As you’ve probably noticed, we’ve been evaluating Ceph in recent months for our petabyte-scale distributed storage needs. It’s a pretty great solution and works well, but it’s not the easiest thing to set up and administer properly.

One of the bits we’ve been grappling with recently is Ceph’s CRUSH map. In certain circumstances, which aren’t clearly documented, it can fail to do its job and leave you without the redundancy you asked for.

How CRUSH maps work

The CRUSH algorithm is one of the jewels in Ceph’s crown, providing a deterministic way for clients to locate and distribute data on disks across the cluster. This avoids the need for an index server to coordinate reads and writes.

Clusters with index servers, such as the MDS in Lustre, funnel all operations through the index, which creates a single point of failure and a potential performance bottleneck. As such, there was a conscious effort to avoid this model when Ceph was being designed.

As the Ceph administrator you build a CRUSH map to inform the cluster about the layout of your storage network. This lets you demarcate failure boundaries and specify how redundancy should be managed.

As an example, say each server in your cluster has 2 disks, there are 20 servers in a rack, 8 racks in a row at the datacentre, and you have 10 rows. Perhaps you’re worried that an individual server will die, so you write the CRUSH map to ensure that data is replicated to at least two other servers. Perhaps you’re worried about the failure of a whole rack due to the loss of a power feed. You can handle that too: just tell the CRUSH map to ensure that replicas are stored in other racks, independent of the primary copy.
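
To make that concrete, here’s a rough sketch of the relevant pieces of a decompiled CRUSH map for the rack-level case. The bucket names, ids and weights below are made up for illustration, and the device and type declarations a real map needs are left out:

    # One bucket per server, holding its two disks (OSDs).
    host server-01 {
        id -1          # buckets get negative ids
        alg straw
        hash 0         # rjenkins1
        item osd.0 weight 1.00
        item osd.1 weight 1.00
    }

    # One bucket per rack, holding its 20 servers.
    rack rack-a {
        id -101
        alg straw
        hash 0
        item server-01 weight 2.00
        # ... the other 19 servers ...
    }

    root default {
        id -200
        alg straw
        hash 0
        item rack-a weight 40.00
        # ... the other racks and rows ...
    }

    # Put each replica under a different rack; change "rack" to "host"
    # if surviving the loss of a single server is all you need.
    rule data {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type rack
        step emit
    }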

So the CRUSH map lets you specify constraints on the placement of your data. As long as these are fulfilled, it’s up to Ceph to spread the data out evenly to ensure smooth scaling and consistent performance.

Where this falls down

If you’ve been thinking “I’d use a hash function to evenly distribute the data once the constraints have been resolved”, you’d be right! As the documentation states, CRUSH is meant to pseudo-randomly distribute data across the cluster in a uniform manner.

Our testing cluster currently comprises three servers, with some 20TB of raw storage capacity across a dozen disks. We want two copies of our data, and have designated each server as a failure boundary. What we saw during testing was that the cluster would get hung up trying to meet our redundancy demands, and just stop replicating data.
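
Concretely, our setup boils down to something like the following – the commands assume the stock “data” and “rbd” pools, and the important part is a pool size of 2 combined with a CRUSH rule that won’t put both copies on the same host:

    # Ask for two copies of everything in the pools we care about.
    ceph osd pool set data size 2
    ceph osd pool set rbd size 2

    # The per-server failure boundary corresponds to a rule step like:
    #   step chooseleaf firstn 0 type host

    # Placement groups sitting around degraded or stuck are what
    # "stopped replicating data" looks like from the outside.
    ceph -s
    ceph health detail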

We’ve talked about hash functions recently; they’re difficult to get right. Ceph currently supports the use of one hash function, named “rjenkins1” after Robert Jenkins, the author of the paper describing the methodology.

It’s difficult to discern the intent just from looking at the code, but our gut feeling is that the mixing function isn’t being used quite as intended.
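
For the curious, the heart of rjenkins1 is Bob Jenkins’ classic 32-bit mixing step, which CRUSH applies repeatedly to bucket ids, replica numbers and object hashes to make its pseudo-random placement decisions. The snippet below is our own paraphrase of that mix rather than a verbatim copy of Ceph’s source, and the little wrapper around it is purely illustrative:

    #include <stdint.h>
    #include <stdio.h>

    /* Bob Jenkins' 96-bit mix (from his lookup2 hash): a, b and c are
     * repeatedly folded into one another with subtractions, XORs and
     * shifts so that small input changes scramble all three words. */
    #define mix(a, b, c) do {                       \
            a -= b; a -= c; a ^= (c >> 13);         \
            b -= c; b -= a; b ^= (a << 8);          \
            c -= a; c -= b; c ^= (b >> 13);         \
            a -= b; a -= c; a ^= (c >> 12);         \
            b -= c; b -= a; b ^= (a << 16);         \
            c -= a; c -= b; c ^= (b >> 5);          \
            a -= b; a -= c; a ^= (c >> 3);          \
            b -= c; b -= a; b ^= (a << 10);         \
            c -= a; c -= b; c ^= (b >> 15);         \
    } while (0)

    /* Toy two-input hash in the rjenkins1 style (illustrative only). */
    static uint32_t toy_hash2(uint32_t x, uint32_t y)
    {
            uint32_t a = x, b = y, c = 0xdeadbeef; /* arbitrary seed */
            mix(a, b, c);
            return c;
    }

    int main(void)
    {
            /* Neighbouring inputs should give wildly different outputs. */
            printf("%08x %08x\n", toy_hash2(1, 42), toy_hash2(2, 42));
            return 0;
    }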

Looking for a workaround

We spent a good number of frustrating hours searching for the cause of the observed behaviour – this issue simply wasn’t documented! If the algorithm can’t calculate a suitable location to store extra copies of the data, it’ll jiggle the numbers a little and try again, until it eventually gives up.

What we eventually came across was a number of tunable parameters that were undocumented at the time. By default, Ceph will retry 19 times if it can’t find a suitable replica location. It’s now documented that this is insufficient for some clusters.

We found this out for ourselves, and things got moving again once we bumped choose_total_tries up to 100, but it’s annoying that so much was seemingly broken by default.
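
If you’re hitting the same wall, the knob lives in the CRUSH map itself. Our workflow was roughly the following (file names are just placeholders, and as we explain in the next section, the compile step doesn’t go entirely smoothly):

    # Pull down the live CRUSH map and decompile it to text.
    ceph osd getcrushmap -o /tmp/crushmap
    crushtool -d /tmp/crushmap -o /tmp/crushmap.txt

    # Near the top of the decompiled map, add or adjust:
    #   tunable choose_total_tries 100
    # (the legacy default is the 19 retries mentioned above)

    # Recompile and push the map back to the cluster.
    crushtool -c /tmp/crushmap.txt -o /tmp/crushmap.new
    ceph osd setcrushmap -i /tmp/crushmap.new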

Perhaps you’ve seen this for yourself if you’ve deployed a Ceph cluster – now you know why. 🙂

Being unsafe is the only way to be sure!

Woah there, you can’t just go and use those tunables willy-nilly. When we tried, we were greeted with a scary and rather unhelpful error message – no really, the message is stored in a variable called scary_tunables_message – and it doesn’t tell you what you should do or why the tunables shouldn’t be used.

After yet more digging through source code we discovered the cause – these tunables are unsafe if you’re running an older version of the OSD daemon, and need to be enabled with the --enable-unsafe-tunables flag.
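
With the flag added, the compile step from the earlier workflow goes through (file names are placeholders again):

    crushtool -c /tmp/crushmap.txt -o /tmp/crushmap.new --enable-unsafe-tunables
    ceph osd setcrushmap -i /tmp/crushmap.new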

This isn’t unreasonable, but it’s exceedingly frustrating that it wasn’t mentioned anywhere. Even now the --help output from crushtool makes no mention of the flag, and neither does the manpage.

Improvements ahead

This particular problem will go away once the cluster grows, or when we deploy a larger one – this is admittedly a small testing cluster by Ceph’s standards, but we don’t think it’s an unreasonable size for anyone looking to seriously evaluate Ceph.

We’d be remiss if we didn’t mention that the docs are getting a lot better, too. In recent months they’ve been filled out quite a lot, with lots of questions being answered. It was quite reassuring to see a lot of discussion about Ceph at Linuxconf recently, and we’re certainly looking forward to the continued progress in future.

Have you deployed a petabyte-scale storage cluster today? We’re hiring.