Pulling apart Ceph’s CRUSH algorithm

By February 7, 2013Technical

As you’ve probably noticed, we’ve been evaluating Ceph in recent months for our petabyte-scale distributed storage needs. It’s a pretty great solution and works well, but it’s not the easiest thing to setup and administer properly.

One of the bits we’ve been grappling with recently is Ceph’s CRUSH map. In certain circumstances, which aren’t entirely clearly documented, it can fail to do the job and lead to a lack of guaranteed redundancy.

How CRUSH maps work

The CRUSH map algorithm is one of the jewels in Ceph’s crown, and provides a mostly deterministic way for clients to locate and distribute data on disks across the cluster. This avoids the need for an index server to coordinate reads and writes.

Clusters with index servers, such as the MDS in Lustre, funnel all operations through the index, which creates a single-point of failure and potential performance bottlenecks. As such, there was a conscious effort to avoid this model when Ceph was being designed.

As the Ceph administrator you build a CRUSH map to inform the cluster about the layout of your storage network. This lets you demarcate failure boundaries and specify how redundancy should be managed.

As an example, each server in your cluster has 2 disks, and there are 20 servers in a rack. There’s 8 racks in a row at the datacentre, and you have 10 rows. Perhaps you’re worried that an individual server will die, so you specify the CRUSH map to ensure that data is replicated to at least two other servers. Perhaps you’re worried about the failure of a whole rack due to the loss of a power feed. You can do that too, just tell the CRUSH map to ensure that replicas are stored in other racks, independent of the primary copy.

So the CRUSH map lets you specify constraints on the placement of your data. As long as these are fulfilled, it’s up to Ceph to spread the data out evenly to ensure smooth scaling and consistent performance.

Where this falls down

If you’ve been thinking “I’d use a hash function to evenly distribute the data once the constraints have been resolved”, you’d be right! As the documentation states, CRUSH is meant to pseudo-randomly distribute data across the cluster in a uniform manner.

Our testing cluster currently comprises three servers, with some 20TB of raw storage capacity across a dozen disks. We want two copies of our data, and have designated each server as a failure boundary. What we saw during testing was that the cluster would get hung up trying to meet our redundancy demands, and just stop replicating data.

We’ve talked about hash functions recently, they’re difficult to get right. Ceph currently supports the use of one hash function, named “rjenkins1” after the author of a paper describing the methodology.

It’s difficult to discern the intent just from looking at the code, but our gut feeling is that the mixing function isn’t being used quite as intended.

Looking for a workaround

We spent a good number of frustrating hours searching for the cause of the observed behaviour – this issue simply wasn’t documented! If the algorithm can’t calculate a suitable location to store extra copies of the data, it’ll jiggle the numbers a little and try again, until it eventually gives up.

What we eventually came across was a number of tunable parameters that were undocumented at the time. By default, Ceph will retry 19 times if it can’t find a suitable replica location. It’s now documented that this is insufficient for some clusters.

We found this for ourselves and things got moving once we tweaked choose_total_tries to a value of 100, but it’s annoying that there was this much seemingly broken by default.

Perhaps you’ve seen this for yourself if you’ve deployed a Ceph cluster – now you know why. 🙂

Being unsafe is the only way to be sure!

Woah there, you can’t just go and use those tunables willy-nilly. We were greeted with a scary and rather unhelpful error message if we tried to use those tunables – no really, the message is stored in a variable called scary_tunables_message, and doesn’t tell you what you should do or why you shouldn’t use the tunables.

After yet more digging through source code we discovered the cause – these tunables are unsafe if you’re running an older version of the OSD daemon, and need to be enabled with the --enable-unsafe-tunables flag.

This isn’t unreasonable, but it’s exceedingly frustrating that this wasn’t mentioned anywhere. Even now the --help output from crushtool doesn’t make any mention of this, nor the manpage.

Improvements ahead

This particular problem will go away once the cluster grows in size or we deploy a larger cluster – this is admittedly a small testing cluster by Ceph’s standards, but we don’t think it’s an unreasonable size for anyone looking to evaluate Ceph as a solid solution.

We’d be remiss if we didn’t mention that the docs are getting a lot better, too. In recent months they’ve been filled out quite a lot, with lots of questions being answered. It was quite reassuring to see a lot of discussion about Ceph at Linuxconf recently, and we’re certainly looking forward to the continued progress in future.

Have you deployed a petabyte-scale storage cluster today? We’re hiring.

One Comment

  • gregsfortytwo says:

    The issue you’ve observed here is a fundamental trade-off when using pseudo-random placement with replication like Ceph does. Consider the case where you ask it to give you two replicas, and you have two hosts to choose from. Obviously you can look at that and decide on a sensible replication strategy — but the algorithm needs to do it programmatically. So it conceptually takes as input the cluster description, the rules for placement, and the number of copies (2 in our example):

    pick_some_hosts(cluster, rules, copies)

    In order to generate the requested number of copies, it’s just iteratively calling a hash function:

    pick_some_hosts(cluster, rules, copies) {
      list hosts;
      while (--copies)
        hosts.push_back(pick_a_host(cluster, rules, copies));
      return hosts

    Notice that the parameters input to pick_a_host change on every iteration of the loop, so it’s generating different hashes…hopefully. The problem of course is that sometimes pick_a_host(cluster, rules, 2) and pick_a_host(cluster, rules, 1) might spit back the same number!

    In order to handle that and more complicated cases, CRUSH actually loops until it gets different numbers, and instead of passing pick_a_host the copy number it passes in which attempt-to-get-out-a-unique-number it’s on. But it can’t retry forever, so it specifies a maximum number of retry attempts to make before backing out. Hopefully if you back out, what actually happens is you were choosing an OSD from within a host and then it goes and picks a different host to pick out a new OSD from. But this is all statistical probabilities, so it’s possible that you get fantastically unlucky and the first 20 attempts to pick a new OSD all pick the same OSD. The odds of getting that unlucky are dramatically higher if you are requesting n copies and you only have n buckets to choose from than if you want n copies and you have 2*n buckets to choose from, so people tend to encounter this in POC deployments but not in their real systems.

    There are alternative ways of hashing that escape this problem, but they are all a lot harder to scale from 100TB to 10PB than CRUSH and Ceph are. And remember, we advertise Ceph as “petabyte-scale” storage, at which point you simply don’t see these kinds of issues. ;)

    [email protected]

Leave a Reply