A crash course in Ceph, a distributed replicated clustered filesystem
We’ve been looking at Ceph recently. It’s basically a fault-tolerant distributed clustered filesystem. If it works, that’s like a nirvana for shared storage: you have many servers, each one pitches in a few disks, and then there’s a filesystem that sits on top that’s visible to all servers in the cluster. If a disk fails, that’s okay too.
Those are really cool features, but it turns out that Ceph is really more than just that. To borrow a phrase, Ceph is like an onion – it’s got layers. The filesystem on top is nifty, but the coolest bits are below the surface.
If Ceph proves to be solid enough for use, we’ll need to train our sysadmins all about Ceph. That means pretty diagrams and explanations, which we thought would be more fun to share with you.
This is the logical diagram that we came up with while learning about Ceph. It might help to keep it open in another window as you read a description of the components and services.
We’ll start at the bottom of the stack and work our way up.
OSD stands for Object Storage Device, and roughly corresponds to a physical disk. An OSD is actually a directory (e.g. /var/lib/ceph/osd-1) that Ceph makes use of, residing on a regular filesystem, though it should be assumed to be opaque for the purposes of using it with Ceph.
Use of XFS or btrfs is recommended when creating OSDs, owing to their good performance, featureset (support for XATTRs larger than 4KiB) and data integrity.
We’re using btrfs for our testing.
Using RAIDed OSDs
A feature of Ceph is that it can tolerate the loss of OSDs. This means we can theoretically achieve fantastic utilisation of storage devices by obviating the need for RAID on every single device.
However, we’ve not yet determined whether this is awesome. At this stage we’re not using RAID, and just letting Ceph take care of block replication.
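To put rough numbers on the utilisation argument, here is a back-of-the-envelope sketch. The cluster size, disk size and the `usable_capacity` helper are all made up for illustration, and the sum ignores filesystem overhead and any headroom you’d keep free for recovery.

```python
def usable_capacity(disks, disk_tb, replicas, raid_efficiency=1.0):
    """Usable TB after RAID overhead (if any) and Ceph replication.

    raid_efficiency is the fraction of raw space left after RAID:
    1.0 for plain disks, 0.5 for mirrored (RAID1) pairs.
    """
    raw_tb = disks * disk_tb
    return raw_tb * raid_efficiency / replicas


# Hypothetical cluster: 12 disks of 2 TB each, 3-way replication in Ceph.
plain = usable_capacity(12, 2, replicas=3)
mirrored = usable_capacity(12, 2, replicas=3, raid_efficiency=0.5)
print(f"plain disks as OSDs: {plain} TB usable")
print(f"RAID1 under each OSD: {mirrored} TB usable")
```

With these assumed figures, mirroring underneath Ceph halves the usable space again on top of what replication already costs, which is the trade-off we’re still weighing up.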
Placement groups are also referred to as PGs. The official docs note that placement groups help ensure performance and scalability, as tracking metadata for each individual object would be too costly.
A PG collects objects from the next layer up and manages them as a collection. It represents a mostly-static mapping to one or more underlying OSDs. Replication is done at the PG layer: the degree of replication (number of copies) is asserted higher, up at the Pool level, and all PGs in a pool will replicate stored objects into multiple OSDs.
As an example in a system with 3-way replication:
- PG-1 might map to OSDs 1, 37 and 99
- PG-2 might map to OSDs 4, 22 and 41
- PG-3 might map to OSDs 18, 26 and 55
Any object that happens to be stored in PG-1 will be written to all three OSDs (1, 37, 99). Any object stored in PG-2 will be written to its three OSDs (4, 22, 41). And so on.
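The example above can be modelled in a few lines of Python. This is only a toy: the dictionaries and the `write_object` and `osds_holding` helpers are our own invention for illustration, not anything from Ceph itself.

```python
# Toy model: each placement group maps to a fixed list of OSD ids,
# taken from the 3-way replication example above.
PG_TO_OSDS = {
    "PG-1": [1, 37, 99],
    "PG-2": [4, 22, 41],
    "PG-3": [18, 26, 55],
}

# Each OSD is modelled as a plain dict of object name -> bytes.
osds = {osd_id: {} for ids in PG_TO_OSDS.values() for osd_id in ids}


def write_object(pg, name, data):
    """Store one copy of the object on every OSD backing the given PG."""
    for osd_id in PG_TO_OSDS[pg]:
        osds[osd_id][name] = data
    return PG_TO_OSDS[pg]


def osds_holding(name):
    """Return the sorted ids of all OSDs that hold a copy of the object."""
    return sorted(osd_id for osd_id, store in osds.items() if name in store)


write_object("PG-1", "hello.txt", b"hello")
print(osds_holding("hello.txt"))
```

Losing any one of those OSDs still leaves two copies of the object, which is the whole point of replicating at the PG layer.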
A pool is the layer at which most user-interaction takes place. This is the important stuff like GET, PUT, DELETE actions for objects in a pool.
Pools contain a number of PGs, not shared with other pools (if you have multiple pools). The number of PGs in a pool is defined when the pool is first created and, at the time of writing, can’t be changed later. You can think of PGs as providing a hash mapping for objects into OSDs, to ensure that the OSDs are filled evenly when adding objects to the pool.
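To make the “hash mapping” idea concrete, here is a minimal sketch of spreading object names across a fixed number of PGs. It assumes a plain MD5-modulo scheme, which is a stand-in and not the hash Ceph actually uses; `PG_NUM` and `pg_for_object` are hypothetical names.

```python
import hashlib
from collections import Counter

PG_NUM = 8  # hypothetical number of PGs, chosen when the pool is created


def pg_for_object(name, pg_num=PG_NUM):
    """Map an object name to a PG index with a stable hash.

    Assumption: MD5-mod-N stands in for Ceph's real hash function.
    """
    digest = hashlib.md5(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % pg_num


# Hash 10,000 made-up object names and count how many land in each PG.
names = [f"object-{i}" for i in range(10000)]
counts = Counter(pg_for_object(n) for n in names)
for pg in sorted(counts):
    print(f"PG-{pg}: {counts[pg]} objects")
```

Because the same name always hashes to the same PG, any client can work out where an object lives without consulting a central lookup table, and the counts should come out roughly even.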
CRUSH mappings are specified on a per-pool basis, and serve to skew the distribution of objects into OSDs according to administrator-defined policy. This is important for ensuring that replicas don’t end up on the same disk/host/rack/etc, which would defeat the entire point of having replica copies.
A CRUSH map is written by hand, then compiled and passed to the cluster.
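The sketch below illustrates the kind of policy a CRUSH map expresses: pick replicas pseudo-randomly but never put two on the same host. To be clear, this is not the CRUSH algorithm, just a simple hash-ranking scheme we’ve swapped in to show the idea. The `OSD_HOST` layout and the `score` and `place` functions are hypothetical.

```python
import hashlib

# Hypothetical cluster layout: OSD id -> host it lives on.
OSD_HOST = {
    0: "host-a", 1: "host-a",
    2: "host-b", 3: "host-b",
    4: "host-c", 5: "host-c",
    6: "host-d", 7: "host-d",
}


def score(pg, osd_id):
    """Deterministic pseudo-random score for a (PG, OSD) pair."""
    key = f"{pg}:{osd_id}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(key).digest()[:8], "big")


def place(pg, replicas=3):
    """Pick `replicas` OSDs for a PG, using at most one OSD per host."""
    chosen, used_hosts = [], set()
    # Rank OSDs by score, then walk the ranking, skipping hosts already used.
    for osd_id in sorted(OSD_HOST, key=lambda o: score(pg, o), reverse=True):
        host = OSD_HOST[osd_id]
        if host not in used_hosts:
            chosen.append(osd_id)
            used_hosts.add(host)
        if len(chosen) == replicas:
            break
    return chosen


for pg in range(4):
    osd_ids = place(pg)
    print(pg, osd_ids, [OSD_HOST[o] for o in osd_ids])
```

The placement is computed from the PG id and the cluster layout alone, so every node arrives at the same answer independently. A real CRUSH map generalises the “one per host” rule to arbitrary hierarchies such as racks and rows.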
This may not make much sense at the moment, and that’s completely understandable. Someone on the Ceph mailing list provided a brief summary of the components which we found helpful for clarifying things:
Now we’re into the good stuff. Pools full of objects are well and good, but what do you do with them now?
What the lower layers ultimately provide is a RADOS cluster: Reliable Autonomic Distributed Object Store. At a practical level this translates to storing opaque blobs of data (objects) in high performance shared storage.
Because RADOS is fairly generic, it’s ideal for building more complex systems on top. One of these is RBD.
As the name suggests, a RADOS Block Device (RBD) is a block device stored in RADOS. RBD offers useful features on top of raw RADOS objects. From the official docs:
- RBDs are striped over multiple PGs for performance
- RBDs are resizable
- Thin provisioning means on-disk space isn’t used until actually required
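A rough sketch of how those three properties can fall out of storing a block device as a set of fixed-size objects follows. This is a toy in-memory model, not how RBD is implemented: `ToyBlockDevice` is a made-up class, and the 4 MiB object size is an assumption for the example.

```python
OBJECT_SIZE = 4 * 1024 * 1024  # 4 MiB per backing object (assumed for this sketch)


class ToyBlockDevice:
    """A block device whose contents live in fixed-size objects."""

    def __init__(self, size):
        self.size = size
        self.objects = {}  # object index -> bytearray, created on first write

    def resize(self, new_size):
        """Grow or shrink the device; shrinking drops objects past the new end."""
        for index in [i for i in self.objects if i * OBJECT_SIZE >= new_size]:
            del self.objects[index]
        self.size = new_size

    def write(self, offset, data):
        if offset < 0 or offset + len(data) > self.size:
            raise ValueError("write outside device bounds")
        pos = 0
        while pos < len(data):
            # Work out which object this byte falls in, and where inside it.
            index, start = divmod(offset + pos, OBJECT_SIZE)
            chunk = data[pos:pos + OBJECT_SIZE - start]
            obj = self.objects.setdefault(index, bytearray(OBJECT_SIZE))
            obj[start:start + len(chunk)] = chunk
            pos += len(chunk)

    def read(self, offset, length):
        if offset < 0 or offset + length > self.size:
            raise ValueError("read outside device bounds")
        out = bytearray()
        while len(out) < length:
            index, start = divmod(offset + len(out), OBJECT_SIZE)
            take = min(length - len(out), OBJECT_SIZE - start)
            obj = self.objects.get(index)
            # Objects that were never written read back as zeros.
            out += obj[start:start + take] if obj is not None else bytes(take)
        return bytes(out)

    def used_bytes(self):
        return len(self.objects) * OBJECT_SIZE


dev = ToyBlockDevice(size=1024 * OBJECT_SIZE)  # 1024 objects' worth of space
print(dev.used_bytes())                        # nothing allocated until written
dev.write(OBJECT_SIZE - 2, b"abcd")            # straddles objects 0 and 1
print(sorted(dev.objects), dev.read(OBJECT_SIZE - 2, 4))
```

Each backing object would be hashed into a PG like any other object, which is how a single device ends up spread across many PGs and OSDs. Space is only consumed for objects that have actually been written, and resizing is just a matter of changing the size and discarding objects that fall off the end.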
RBD also takes advantage of RADOS capabilities such as snapshotting and cloning, which would be very handy for applications like virtual machine disks.
CephFS is a POSIX-compliant clustered filesystem implemented on top of RADOS. This is very elegant because the lower layer features of the stack provide really awesome filesystem features (such as snapshotting), while the CephFS layer just needs to translate that into a usable filesystem.
CephFS isn’t considered ready for prime-time just yet, but RADOS and RBD are.
Anchor is mostly interested in the RBD service that Ceph provides. To date our VPS infrastructure has been very insular, with each hypervisor functioning independently. This works fantastically and avoids putting all our eggs in one basket, but the lure of shared storage is strong.
Our hypervisor of choice, KVM, already has support for direct integration with RBD, which makes it a very attractive option if we want to use shared storage. Shared storage for a VPS enables live migration between hypervisors (moving a VPS to another hypervisor without downtime), which is unbelievably cool.
CephFS is also something we’d like to be able to offer our customers when it matures. We’ve found sharing files between multiple servers in a highly-available fashion to be clunky at best. We’ve so far avoided solutions like GFS and Lustre due to the level of complexity, so we’re hoping CephFS will be a good option at the right scale.
We wouldn’t dare to suggest that our notes here are complete or infallibly accurate. If you’re interested in Ceph, the following resources are worth a read.
- Florian Haas, one of The HA Guys, has a good little intro to the technical virtues of Ceph
- The official Ceph docs are remarkably good once you know what you’re looking for, and have been updated recently
- More details about Pools
- More details about Placement Groups
- A description of how data is arranged and placed on the OSDs
Got any questions, comments or want to report a mistake? Feel free to let us know in the comments below, or send us a mail.