Pulling apart Ceph’s CRUSH algorithm

As you’ve probably noticed, we’ve been evaluating Ceph in recent months for our petabyte-scale distributed storage needs. It’s a pretty great solution and works well, but it’s not the easiest thing to set up and administer properly.

One of the bits we’ve been grappling with recently is Ceph’s CRUSH map. In certain circumstances, which aren’t entirely clearly documented, it can fail to do the job and lead to a lack of guaranteed redundancy.

How CRUSH maps work

The CRUSH map algorithm is one of the jewels in Ceph’s crown, and provides a mostly deterministic way for clients to locate and distribute data on disks across the cluster. This avoids the need for an index server to coordinate reads and writes.

Clusters with index servers, such as the MDS in Lustre, funnel all operations through the index, which creates a single point of failure and a potential performance bottleneck. As such, there was a conscious effort to avoid this model when Ceph was being designed.

As the Ceph administrator you build a CRUSH map to inform the cluster about the layout of your storage network. This lets you demarcate failure boundaries and specify how redundancy should be managed.

As an example, say each server in your cluster has 2 disks, and there are 20 servers in a rack. There are 8 racks in a row at the datacentre, and you have 10 rows. Perhaps you’re worried that an individual server will die, so you write the CRUSH map to ensure that data is replicated to at least two other servers. Perhaps you’re worried about the failure of a whole rack due to the loss of a power feed. You can handle that too: just tell the CRUSH map to ensure that replicas are stored in other racks, independent of the primary copy.

So the CRUSH map lets you specify constraints on the placement of your data. As long as these are fulfilled, it’s up to Ceph to spread the data out evenly to ensure smooth scaling and consistent performance.
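
To make the example layout concrete, here is a small Python sketch that models it and checks whether a set of replicas respects a failure boundary. The Location tuple and the independent helper are our own illustration and are not Ceph data structures; a real CRUSH map expresses the same idea as a hierarchy of buckets.

```python
from collections import namedtuple
from itertools import product

# Hypothetical model of the example layout; not a Ceph data structure.
Location = namedtuple("Location", "row rack server disk")

ROWS, RACKS_PER_ROW, SERVERS_PER_RACK, DISKS_PER_SERVER = 10, 8, 20, 2


def all_disks():
    """Enumerate every disk in the example datacentre."""
    return [Location(*loc) for loc in product(
        range(ROWS), range(RACKS_PER_ROW),
        range(SERVERS_PER_RACK), range(DISKS_PER_SERVER))]


def independent(replicas, level):
    """True if no two replicas share a failure boundary at `level`.

    A boundary is identified by the path down to that level,
    e.g. (row, rack) for a rack.
    """
    depth = {"rack": 2, "server": 3}[level]
    boundaries = [loc[:depth] for loc in replicas]
    return len(set(boundaries)) == len(boundaries)


print(len(all_disks()))  # 10 * 8 * 20 * 2 = 3200 disks

same_rack = [Location(0, 0, 1, 0), Location(0, 0, 2, 1)]
other_rack = [Location(0, 0, 1, 0), Location(0, 1, 1, 0)]
print(independent(same_rack, "server"))  # True: different servers
print(independent(same_rack, "rack"))    # False: both in row 0, rack 0
print(independent(other_rack, "rack"))   # True: different racks
```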

Where this falls down

If you’ve been thinking “I’d use a hash function to evenly distribute the data once the constraints have been resolved”, you’d be right! As the documentation states, CRUSH is meant to pseudo-randomly distribute data across the cluster in a uniform manner.

Our testing cluster currently comprises three servers, with some 20TB of raw storage capacity across a dozen disks. We want two copies of our data, and have designated each server as a failure boundary. What we saw during testing was that the cluster would get hung up trying to meet our redundancy demands, and just stop replicating data.

We’ve talked about hash functions recently; they’re difficult to get right. Ceph currently supports the use of one hash function, named “rjenkins1” after the author of a paper describing the methodology.

It’s difficult to discern the intent just from looking at the code, but our gut feeling is that the mixing function isn’t being used quite as intended.

Looking for a workaround

We spent a good number of frustrating hours searching for the cause of the observed behaviour – this issue simply wasn’t documented! If the algorithm can’t calculate a suitable location to store extra copies of the data, it’ll jiggle the numbers a little and try again, until it eventually gives up.

What we eventually came across was a number of tunable parameters that were undocumented at the time. By default, Ceph will retry 19 times if it can’t find a suitable replica location. It’s now documented that this is insufficient for some clusters.

We found this for ourselves and things got moving once we tweaked choose_total_tries to a value of 100, but it’s annoying that there was this much seemingly broken by default.
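
To illustrate the retry behaviour, here is a toy sketch in Python. It is emphatically not CRUSH: the place function, the SHA-256 stand-in hash and the three-server layout are all our own assumptions. The only idea borrowed from above is a bounded retry budget, with total_tries standing in for choose_total_tries.

```python
import hashlib


def _hash(*parts):
    """Deterministic stand-in hash (SHA-256, not Ceph's rjenkins1)."""
    data = ":".join(str(p) for p in parts).encode()
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


def place(pg_id, hosts, replicas, total_tries):
    """Pick up to `replicas` OSDs for a PG, at most one per host.

    `hosts` maps a host name to its list of OSD ids. Every attempt,
    successful or not, consumes one try from the shared budget. The
    result is shorter than `replicas` if the budget runs out first.
    """
    host_names = sorted(hosts)
    chosen_hosts = set()
    chosen_osds = []
    attempt = 0
    while len(chosen_osds) < replicas and attempt < total_tries:
        host = host_names[_hash(pg_id, "host", attempt) % len(host_names)]
        attempt += 1
        if host in chosen_hosts:
            continue  # collision: perturb the input and try again
        osds = hosts[host]
        chosen_osds.append(osds[_hash(pg_id, "osd", attempt) % len(osds)])
        chosen_hosts.add(host)
    return chosen_osds


# Three servers with four disks each, loosely modelled on our test cluster.
cluster = {f"server-{h}": [h * 4 + d for d in range(4)] for h in range(3)}

for budget in (1, 2, 3, 19, 100):
    short = sum(len(place(pg, cluster, 2, budget)) < 2 for pg in range(1000))
    print(f"total_tries={budget}: {short} of 1000 PGs under-replicated")
```

With only a handful of hosts, collisions are common, so a small budget leaves some PGs without their second copy. The real algorithm is considerably more involved, so treat the numbers this prints as a feel for the mechanism rather than a prediction for an actual cluster.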

Perhaps you’ve seen this for yourself if you’ve deployed a Ceph cluster – now you know why. :)

Being unsafe is the only way to be sure!

Woah there, you can’t just go and use those tunables willy-nilly. We were greeted with a scary and rather unhelpful error message if we tried to use those tunables – no really, the message is stored in a variable called scary_tunables_message, and doesn’t tell you what you should do or why you shouldn’t use the tunables.

After yet more digging through source code we discovered the cause – these tunables are unsafe if you’re running an older version of the OSD daemon, and need to be enabled with the --enable-unsafe-tunables flag.

This isn’t unreasonable, but it’s exceedingly frustrating that this wasn’t mentioned anywhere. Even now the --help output from crushtool doesn’t make any mention of this, nor the manpage.

Improvements ahead

This particular problem will go away once the cluster grows in size or we deploy a larger cluster – this is admittedly a small testing cluster by Ceph’s standards, but we don’t think it’s an unreasonable size for anyone looking to evaluate Ceph as a solid solution.

We’d be remiss if we didn’t mention that the docs are getting a lot better, too. In recent months they’ve been filled out quite a lot, with lots of questions being answered. It was quite reassuring to see a lot of discussion about Ceph at Linuxconf recently, and we’re certainly looking forward to the continued progress in future.

Have you deployed a petabyte-scale storage cluster today? We’re hiring.

Bughunting in Ceph’s radosgw: ETags

RADOS Gateway (henceforth referred to as radosgw) is an add-on component for Ceph, large-scale clustered storage now mainlined in the Linux kernel. radosgw provides an S3-compatible interface for object storage, which we’re evaluating for a future product offering.

We’ve spent the last few days digging through the radosgw source trying to nail some pesky bugs. For once, the clients don’t appear to be breaking spec; it’s radosgw itself.

We’re using DragonDisk as our S3-alike client – what works? PUTing and GETing files works, obviously. Setting the Content-Type metadata returns a failure, and renaming a directory almost works – it gets duplicated to the new name, but the old copy hangs around.

Wireshark to the rescue! We started pulling apart packet dumps, and it quickly became evident that setting Content-Type on objects was actually working fine, so what gives? We put it side by side with some known-good traffic to S3 and quickly spotted the quirks.

radosgw wasn’t returning the new object’s ETag in the HTTP response headers. The Amazon S3 docs are hopelessly imprecise in some areas, but it seems the ETag is mandatory. Aha!

ETag: "0fc9ecb587a7ead6e2468ff759eb034b"

Quotes around my ETag? Why!?

We weren’t done yet though: DragonDisk was still reporting “invalid response”. So we looked a bit closer and discovered that the ETag is also meant to be in the response body, which is a brief XML document. Easy enough to fix, but somehow this had never cropped up in development.

<ETag>&quot;623c61392f8dbf45e2dfad30a1e5233e&quot;</ETag>

That’s not a typo – apparently the official way to do this is to steal the ETag directly from the headers, leave the quotes in (which is pointless for both HTTP and XML), and then convert them to HTML entities! Someone at Amazon really wasn’t paying attention when they developed the initial spec and code.
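
The two forms above can be reproduced with a few lines of Python. This is a sketch under one assumption we haven’t stated elsewhere: that the ETag for a simple upload is the hex MD5 digest of the object body. The function names are ours and don’t correspond to anything in radosgw.

```python
import hashlib
from xml.sax.saxutils import escape


def etag_for(body: bytes) -> str:
    """Hex MD5 of the body wrapped in double quotes, as sent in the header."""
    return '"' + hashlib.md5(body).hexdigest() + '"'


def etag_xml_element(etag: str) -> str:
    """Embed the already-quoted ETag in an XML element, escaping the quotes."""
    return "<ETag>" + escape(etag, {'"': "&quot;"}) + "</ETag>"


etag = etag_for(b"")
print("ETag: " + etag)
# ETag: "d41d8cd98f00b204e9800998ecf8427e"
print(etag_xml_element(etag))
# <ETag>&quot;d41d8cd98f00b204e9800998ecf8427e&quot;</ETag>
```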

We’re not quite done yet. Our ETags are all sorted, but DragonDisk is still reporting invalid response from radosgw sometimes. That’s going to be another fun instalment.


A crash course in Ceph, a distributed replicated clustered filesystem

We’ve been looking at Ceph recently; it’s basically a fault-tolerant distributed clustered filesystem. If it works, that’s like a nirvana for shared storage: you have many servers, each one pitches in a few disks, and then there’s a filesystem that sits on top that’s visible to all servers in the cluster. If a disk fails, that’s okay too.

Those are really cool features, but it turns out that Ceph is really more than just that. To borrow a phrase, Ceph is like an onion – it’s got layers. The filesystem on top is nifty, but the coolest bits are below the surface.

If Ceph proves to be solid enough for use, we’ll need to train our sysadmins all about Ceph. That means pretty diagrams and explanations, which we thought would be more fun to share with you.

Diagram

This is the logical diagram that we came up with while learning about Ceph. It might help to keep it open in another window as you read a description of the components and services.

[Diagram: Ceph’s major components]

Ceph components

We’ll start at the bottom of the stack and work our way up.

OSDs

OSD stands for Object Storage Device, and roughly corresponds to a physical disk. An OSD is actually a directory (e.g. /var/lib/ceph/osd-1) that Ceph makes use of, residing on a regular filesystem, though it should be assumed to be opaque for the purposes of using it with Ceph.

Use of XFS or btrfs is recommended when creating OSDs, owing to their good performance, featureset (support for XATTRs larger than 4KiB) and data integrity.

We’re using btrfs for our testing.

Using RAIDed OSDs

A feature of Ceph is that it can tolerate the loss of OSDs. This means we can theoretically achieve fantastic utilisation of storage devices by obviating the need for RAID on every single device.

However, we’ve not yet determined whether this is awesome. At this stage we’re not using RAID, and just letting Ceph take care of block replication.

Placement Groups

Also referred to as PGs, the official docs note that placement groups help ensure performance and scalability, as tracking metadata for each individual object would be too costly.

A PG collects objects from the next layer up and manages them as a collection. It represents a mostly-static mapping to one or more underlying OSDs. Replication is done at the PG layer: the degree of replication (number of copies) is asserted higher, up at the Pool level, and all PGs in a pool will replicate stored objects into multiple OSDs.

As an example in a system with 3-way replication:

  • PG-1 might map to OSDs 1, 37 and 99
  • PG-2 might map to OSDs 4, 22 and 41
  • PG-3 might map to OSDs 18, 26 and 55
  • Etc.

Any object that happens to be stored on PG-1 will be written to all three OSDs (1,37,99). Any object stored in PG-2 will be written to its three OSDs (4,22,41). And so on.

Pools

A pool is the layer at which most user-interaction takes place. This is the important stuff like GET, PUT, DELETE actions for objects in a pool.

Pools contain a number of PGs, not shared with other pools (if you have multiple pools). The number of PGs in a pool is defined when the pool is first created, and can’t be changed later. You can think of PGs as providing a hash mapping for objects into OSDs, to ensure that the OSDs are filled evenly when adding objects to the pool.
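
As a rough sketch of that hash mapping, the Python below sends an object name to exactly one PG and then looks up that PG’s OSDs. The table reuses the example PGs from earlier. CRC32 is only a stand-in hash chosen for this illustration, and the function names are ours; in Ceph the PG-to-OSD step is computed by CRUSH rather than read from a table.

```python
import zlib

# Example PG-to-OSD table from the text above.
PG_TO_OSDS = {1: [1, 37, 99], 2: [4, 22, 41], 3: [18, 26, 55]}


def pg_for(object_name: str, pg_num: int) -> int:
    """Map an object name to one PG id in 1..pg_num (stand-in hash)."""
    return zlib.crc32(object_name.encode()) % pg_num + 1


def osds_for(object_name: str) -> list:
    """Return the OSDs holding the object: primary first, then replicas."""
    return PG_TO_OSDS[pg_for(object_name, len(PG_TO_OSDS))]


for name in ("vm-disk-001", "backup.tar", "photo.jpg"):
    print(name, "-> PG", pg_for(name, 3), "-> OSDs", osds_for(name))
```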

CRUSH maps

CRUSH mappings are specified on a per-pool basis, and serve to skew the distribution of objects into OSDs according to administrator-defined policy. This is important for ensuring that replicas don’t end up on the same disk/host/rack/etc, which would defeat the entire point of having replica copies.

A CRUSH map is written by hand, then compiled and passed to the cluster.

Still confused?

This may not make much sense at the moment, and that’s completely understandable. Someone on the Ceph mailing list provided a brief summary of the components which we found helpful for clarifying things:

* Many objects will map to one PG
* Each object maps to exactly one PG
* One PG maps to a list of OSDs. The first one in the list is the primary and the rest are replicas
* Many PGs can map to one OSD

A PG represents nothing but a grouping of objects; you configure the number of
PGs you want, number of OSDs * 100 is a good starting point, and all of your
stored objects are pseudo-randomly evenly distributed to the PGs.

So a PG explicitly does NOT represent a fixed amount of storage; it
represents 1/pg_num'th of the storage you happen to have on your OSDs.
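
The sizing rule and the “evenly distributed” claim from that summary are easy to try out. The sketch below assumes a hypothetical cluster of 12 OSDs and uses SHA-256 as a stand-in for Ceph’s own hash; the object names are made up.

```python
import hashlib
from collections import Counter


def suggested_pg_num(num_osds: int) -> int:
    """Starting point quoted above: number of OSDs * 100."""
    return num_osds * 100


def pg_of(name: str, pg_num: int) -> int:
    """Pseudo-randomly assign an object name to a PG (stand-in hash)."""
    digest = hashlib.sha256(name.encode()).digest()
    return int.from_bytes(digest[:8], "big") % pg_num


pg_num = suggested_pg_num(12)  # 12 OSDs -> 1200 PGs
counts = Counter(pg_of(f"obj-{i}", pg_num) for i in range(120_000))

print(pg_num)
print(sum(counts.values()))  # every object landed in exactly one PG
print(min(counts.values()), max(counts.values()))  # spread around the mean of 100
```

Each PG ends up holding roughly 1/pg_num of the objects, which is the point of the last paragraph of the summary: a PG is a share of whatever you store, not a fixed amount of space.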

Ceph services

Now we’re into the good stuff. Pools full of objects are well and good, but what do you do with them now?

RADOS

What the lower layers ultimately provide is a RADOS cluster: Reliable Autonomic Distributed Object Store. At a practical level this translates to storing opaque blobs of data (objects) in high performance shared storage.

Because RADOS is fairly generic, it’s ideal for building more complex systems on top. One of these is RBD.

RBD

As the name suggests, a RADOS Block Device (RBD) is a block device stored in RADOS. RBD offers useful features on top of raw RADOS objects. From the official docs:

  • RBDs are striped over multiple PGs for performance
  • RBDs are resizable
  • Thin provisioning means on-disk space isn’t used until actually required

RBD also takes advantage of RADOS capabilities such as snapshotting and cloning, which would be very handy for applications like virtual machine disks.
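
To show how striping and thin provisioning fit together, here is a toy block device in Python that is cut into fixed-size objects and only allocates an object the first time something is written to it. The ThinImage class and the 4 MiB object size are assumptions for this sketch, and it doesn’t model resizing, snapshots or cloning.

```python
OBJECT_SIZE = 4 * 1024 * 1024  # assumed object size for this sketch


class ThinImage:
    """Toy block device stored as fixed-size objects, allocated on first write."""

    def __init__(self, size: int):
        self.size = size
        self.objects = {}  # object index -> bytearray; absent means unallocated

    def _check(self, offset: int, length: int) -> None:
        if offset < 0 or offset + length > self.size:
            raise ValueError("access outside image")

    def write(self, offset: int, data: bytes) -> None:
        self._check(offset, len(data))
        for i, byte in enumerate(data):
            idx, within = divmod(offset + i, OBJECT_SIZE)
            obj = self.objects.setdefault(idx, bytearray(OBJECT_SIZE))
            obj[within] = byte

    def read(self, offset: int, length: int) -> bytes:
        self._check(offset, length)
        out = bytearray()
        for pos in range(offset, offset + length):
            idx, within = divmod(pos, OBJECT_SIZE)
            obj = self.objects.get(idx)
            out.append(obj[within] if obj is not None else 0)
        return bytes(out)


image = ThinImage(size=10 * 1024**3)    # 10 GiB image, nothing allocated yet
image.write(5 * 1024 * 1024, b"hello")  # offset 5 MiB lands in object 1
print(sorted(image.objects))            # [1]
print(image.read(5 * 1024 * 1024, 5))   # b'hello'
print(image.read(0, 4))                 # b'\x00\x00\x00\x00'
```

In a real cluster each of those objects would hash to its own PG, which is how a single image ends up spread over many OSDs.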

CephFS

CephFS is a POSIX-compliant clustered filesystem implemented on top of RADOS. This is very elegant because the lower layer features of the stack provide really awesome filesystem features (such as snapshotting), while the CephFS layer just needs to translate that into a usable filesystem.

CephFS isn’t considered ready for prime-time just yet, but RADOS and RBD are.

We’re excited!

Anchor is mostly interested in the RBD service that Ceph provides. To date our VPS infrastructure has been very insular, with each hypervisor functioning independently. This works fantastically and avoids putting all our eggs in one basket, but the lure of shared storage is strong.

Our hypervisor of choice, KVM, already has support for direct integration with RBD, which makes it a very attractive option if we want to use shared storage. Shared storage for a VPS enables live migration between hypervisors (moving a VPS to another hypervisor without downtime), which is unbelievably cool.

CephFS is also something we’d like to be able to offer our customers when it matures. We’ve found sharing files between multiple servers in a highly-available fashion to be clunky at best. We’ve so far avoided solutions like GFS and Lustre due to the level of complexity, so we’re hoping CephFS will be a good option at the right scale.

Further reading

We wouldn’t dare to suggest that our notes here are complete or infallibly accurate. If you’re interested in Ceph, the following resources are worth a read.

Got any questions, comments or want to report a mistake? Feel free to let us know in the comments below, or send us a mail.

Ceph got y’all hot and bothered and wanting to help us build fantastic new infrastructure? We’re hiring.

Resonance Cachecade

We don’t normally post about hardware wankery, but this little piece of shininess appeared for free in some of the newer Dell servers we’ve been ordering, and it actually sounds like it’s not an awful hack.

Cachecade is an LSI technology (Dell PERC cards are rebranded gear) that adds a read-cache tier to the RAID logic, in the form of solid-state disks. While SSDs are still too expensive for mass-scale primary storage, they’re cheap enough that you can burn a few hundred bucks and get 50GB worth of faster reads.

The real benefit of this style of read-cache should be for random block reads, where SSDs proverbially drop excrement over rotational media from a great height. The jury is still out for us – we’ve just started using cachecade on a couple of VM hypervisors and a customer DB server, but we’re hoping to see some noticeable impact even on a qualitative basis.

In truth, the performance improvements will be difficult for us to quantify on our own workloads. You can apparently get performance reporting if you purchase the new LSI® MegaRAID® CacheCade™ Pro 2.0, but I’d bet that it’s not exposed through something sane (like SNMP) and you’ll be forced to use the perennially-awful MegaCLI tool to get at the data.


Secure encrypted storage for your hosted server, VM, VPS, cloud server or $BUZZ_WORD

So you’ve just provisioned your shiny new OS instance with your host of choice, loaded in your confidential data and away you go without a worry in the world right? If your data consists only of captioned photos of cute furry animals, then all is well. Perhaps however, your data is worth just a wee bit more than that (not that we don’t ♥ cute furry animals!).

Depending on your host and product used, your data could be sitting on anywhere from locally attached disks, a NAS/SAN or some clustered distributed block device/filesystem with no way to easily determine who has access to it, what snapshots exist, what will happen to failed media, etc. For certain customers with certain sensitive applications, that is simply not an acceptable risk.

To protect your data at rest, the most reassuring option is to utilise disk encryption within your operating system where only you possess the key. Whilst this can present a bit of an operational challenge (i.e. how does the key get entered if the server is rebooted), the show stopper question is going to be “Can I actually run a production database with heavy I/O load on encrypted storage?”. This post seeks to answer that question or to at least help you figure it out for your circumstances.

Database server

Our test server is a Dell PowerEdge R410 with:

  • 2 × Intel Xeon E5620 CPUs
  • 16 GB of RAM
  • Dell PERC H700 hardware RAID controller
  • 2 × Dell 50GB SSDs in a hardware RAID 1 array for the database
  • 2 × 300GB 15K SAS in a hardware RAID 1 array for the operating system
  • Red Hat Enterprise Linux Server 6.0 (kernel 2.6.32-71.18.1.el6.x86_64)

The interesting thing to note in this setup is that the CPUs contain hardware accelerated AES encryption via the AES instruction set (Intel’s marketing name is AES-NI). You can check if your CPU and kernel support this feature by running at the command line:

grep aes /proc/cpuinfo

If you get output, then congratulations, your CPU supports the feature!
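
If you’d rather do the same check from a script, a minimal Python sketch follows. Unlike a bare grep, it only matches aes as a whole word on a flags line. The has_aes_flag name and the sample text are ours; on a real machine you would pass in the contents of /proc/cpuinfo.

```python
def has_aes_flag(cpuinfo_text: str) -> bool:
    """Return True if any 'flags' line lists the 'aes' CPU flag."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags" and "aes" in value.split():
            return True
    return False


# Illustrative sample; on a real system use open("/proc/cpuinfo").read().
sample = "processor\t: 0\nflags\t\t: fpu vme sse2 aes avx\n"
print(has_aes_flag(sample))  # True
```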

Encrypted storage

The recommended way to do block device level encryption under Linux is to use the device mapper target dm-crypt. This is a straightforward component that is built into the Linux kernel and well supported in all modern Linux distributions. To set up full disk encryption, you should use your distribution’s installer when you initially provision your server. If you are not encrypting your entire OS (sans /boot) and data, then you will need to use the utility cryptsetup on the storage device you want to encrypt. For our testing, we are just encrypting the SSDs as follows:

cryptsetup luksFormat /dev/sdb -c aes-xts-plain -s 512

This command formats the storage device with a LUKS meta-data header. This header contains useful information including:

  • the fact that it contains encrypted data (rather than just random data)
  • a UUID
  • the algorithm used
  • the key size
  • the key (securely encrypted with a passphrase that you provide at format time)
  • checksums and other useful bits
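
For the curious, the fixed part of the header can be decoded with Python’s struct module. The field layout below is our reading of the LUKS1 on-disk format and should be treated as an assumption, as should the made-up sample values; for real devices, cryptsetup luksDump is the tool to use.

```python
import struct

# Assumed LUKS1 layout (big-endian): magic, version, cipher name, cipher mode,
# hash spec, payload offset, key bytes, MK digest, MK salt, MK iterations, UUID.
LUKS1_HEADER = struct.Struct(">6sH32s32s32sII20s32sI40s")
LUKS_MAGIC = b"LUKS\xba\xbe"


def _text(raw: bytes) -> str:
    """Decode a NUL-padded ASCII field."""
    return raw.split(b"\0", 1)[0].decode("ascii")


def parse_luks1_header(blob: bytes) -> dict:
    """Extract the human-readable fields from the start of a LUKS1 device."""
    (magic, version, cipher_name, cipher_mode, hash_spec, payload_offset,
     key_bytes, _digest, _salt, _iterations, uuid) = LUKS1_HEADER.unpack_from(blob)
    if magic != LUKS_MAGIC:
        raise ValueError("no LUKS header found")
    return {
        "version": version,
        "cipher": _text(cipher_name) + "-" + _text(cipher_mode),
        "hash": _text(hash_spec),
        "key_bits": key_bytes * 8,
        "payload_offset_sectors": payload_offset,
        "uuid": _text(uuid),
    }


# Synthetic header matching the luksFormat options above (values made up).
sample = LUKS1_HEADER.pack(
    LUKS_MAGIC, 1, b"aes", b"xts-plain", b"sha1", 4096, 64,
    bytes(20), bytes(32), 1000, b"00000000-0000-0000-0000-000000000000")
info = parse_luks1_header(sample)
print(info["cipher"], info["key_bits"])  # aes-xts-plain 512
```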

For high security, we are using a 512-bit key with AES in XTS cipher mode (requires kernel >= 2.6.24). Note that XTS splits the key in two, so this is effectively AES-256 with a second 256-bit key for the tweak. A small word of warning: whilst cryptsetup allows you to set up encryption on devices without any sort of identifying metadata header, DON’T ever do that. If you need to hide the fact that you are using encryption there are better ways to do that. Contact us for details if you really want to go down that path.

Once you have formatted the device, you then need to activate it by running the command:

cryptsetup luksOpen /dev/sdb testing

This will then make the unencrypted block device available for you to use (e.g. with mkfs) as /dev/mapper/testing. To have the operating system set up the device at boot, you will need to put an entry into /etc/crypttab. Check the man page for details.

Finally, for best performance you want the implementation of AES that is most optimised for your hardware. You can check /proc/crypto to see which algorithms and drivers are available and their priority (a higher number means a higher priority). In order of preference, you want AES-NI first, then AES-x86_64, and the generic AES implementation last. The highest priority crypto implementation available at the time an encrypted block device is activated (via cryptsetup luksOpen) is the one that gets used. Higher priority drivers loaded into a running kernel afterwards have no effect on devices that are already active.
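
Picking out the winning driver by eye gets tedious, so here is a hedged Python sketch that parses text in the /proc/crypto format and reports the highest-priority driver for an algorithm. The function names are ours, and the sample entries and priority values are illustrative rather than copied from our test server.

```python
import re


def parse_proc_crypto(text: str) -> list:
    """Split /proc/crypto-style text into one dict per blank-line-separated entry."""
    entries = []
    for block in re.split(r"\n\s*\n", text.strip()):
        entry = {}
        for line in block.splitlines():
            key, sep, value = line.partition(":")
            if sep:
                entry[key.strip()] = value.strip()
        if entry:
            entries.append(entry)
    return entries


def preferred_driver(text: str, algorithm: str = "aes"):
    """Return the driver with the highest priority for `algorithm`, or None."""
    candidates = [e for e in parse_proc_crypto(text)
                  if e.get("name") == algorithm and "priority" in e]
    if not candidates:
        return None
    return max(candidates, key=lambda e: int(e["priority"])).get("driver")


# Illustrative sample; on a real system use open("/proc/crypto").read().
sample = """\
name         : aes
driver       : aes-generic
priority     : 100

name         : aes
driver       : aes-asm
priority     : 200

name         : aes
driver       : aes-aesni
priority     : 300
"""
print(preferred_driver(sample))  # aes-aesni
```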

Benchmarking

The simple benchmark tool zcav is fine for our purposes. It only measures the sustained read/write speed. For an I/O device, this benchmark would normally be of limited value as the device bottleneck is usually with the rate of I/O requests that can be handled. For an encrypted block device, the bottleneck is instead in the throughput (i.e. the sustained read/write speeds) as the higher the throughput, the more work that the CPU has to do.

Results

The output from zcav was fed into gnuplot to generate the following graphs:

Read benchmark results

The obvious thing to note in this diagram is that the new AES instructions make a huge difference to performance. The impact of the encryption is relatively low and should be quite acceptable for a production system. An interesting thing to note is the relatively low performance of the RAID array. Being RAID 1, the RAID controller should have been able to balance the read requests over both drives to get about double what it is getting. I haven’t looked into the RAID settings though to see if this can be tuned.

Write benchmark results

The throughput here with the AES instructions in use is almost identical.

Conclusion

The results certainly look promising for even relatively high end workloads. One thing that is not measured here is the latency difference. It should be relatively straightforward to calculate the latency that will be added to every I/O request.
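
As a back-of-the-envelope version of that calculation, the Python below assumes each request is encrypted or decrypted in one go on a single core, so the added latency is simply the request size divided by the cipher throughput. The 500 MB/s figure is a hypothetical placeholder, not a number from our graphs, and fixed per-request overheads are ignored.

```python
def added_latency_us(request_bytes: int, cipher_bytes_per_second: float) -> float:
    """Microseconds of extra latency to encrypt or decrypt one request."""
    return request_bytes / cipher_bytes_per_second * 1_000_000


# A 4 KiB request at a hypothetical 500 MB/s of cipher throughput.
print(f"{added_latency_us(4096, 500e6):.3f} us")  # 8.192 us
```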

The numbers above can also be massively improved upon with a newer kernel featuring the dm-crypt scalability patches (the benchmark above was limited to using 1 core out of 8!).

The nice thing about using the built-in dm-crypt solution is that you can use it on physical as well as virtual machines. If you are after something a bit more turn-key, you may wish to look at a system with drives (and a controller and BIOS) that support self encrypting drives as an alternative.
