A crash course in Ceph, a distributed replicated clustered filesystem

September 14, 2012 Technical, General

We’ve been looking at Ceph recently; it’s basically a fault-tolerant distributed clustered filesystem. If it works, that’s like nirvana for shared storage: you have many servers, each one pitches in a few disks, and there’s a filesystem that sits on top that’s visible to all servers in the cluster. If a disk fails, that’s okay too.

Those are really cool features, but it turns out that Ceph is really more than just that. To borrow a phrase, Ceph is like an onion – it’s got layers. The filesystem on top is nifty, but the coolest bits are below the surface.

If Ceph proves to be solid enough for use, we’ll need to train our sysadmins all about Ceph. That means pretty diagrams and explanations, which we thought would be more fun to share with you.


This is the logical diagram that we came up with while learning about Ceph. It might help to keep it open in another window as you read a description of the components and services.

Ceph’s major components

Ceph components

We’ll start at the bottom of the stack and work our way up.


OSDs

OSD stands for Object Storage Device, and roughly corresponds to a physical disk. An OSD is actually a directory (e.g. /var/lib/ceph/osd-1) that Ceph makes use of, residing on a regular filesystem, though it should be assumed to be opaque for the purposes of using it with Ceph.

Use of XFS or btrfs is recommended when creating OSDs, owing to their good performance, feature set (support for XATTRs larger than 4KiB) and data integrity.

We’re using btrfs for our testing.

Using RAIDed OSDs

A feature of Ceph is that it can tolerate the loss of OSDs. This means we can theoretically achieve fantastic utilisation of storage devices by obviating the need for RAID on every single device.

However, we’ve not yet determined whether this is awesome. At this stage we’re not using RAID, and just letting Ceph take care of block replication.

Placement Groups

Also referred to as PGs, the official docs note that placement groups help ensure performance and scalability, as tracking metadata for each individual object would be too costly.

A PG collects objects from the next layer up and manages them as a collection. It represents a mostly-static mapping to one or more underlying OSDs. Replication is done at the PG layer: the degree of replication (number of copies) is set higher up, at the pool level, and all PGs in a pool will replicate stored objects onto multiple OSDs.

As an example in a system with 3-way replication:

  • PG-1 might map to OSDs 1, 37 and 99
  • PG-2 might map to OSDs 4, 22 and 41
  • PG-3 might map to OSDs 18, 26 and 55
  • Etc.

Any object that happens to be stored on PG-1 will be written to all three OSDs (1,37,99). Any object stored in PG-2 will be written to its three OSDs (4,22,41). And so on.
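
The mapping described above can be sketched in a few lines of Python. This is a toy model, not Ceph’s real placement algorithm: the PG-to-OSD assignments are hard-coded to mirror the hypothetical example above (a real cluster computes them with CRUSH), and crc32 stands in for Ceph’s actual object hash.

```python
import zlib

# Hypothetical PG -> OSD assignments from the example above; in a real
# cluster these are computed by CRUSH, not hard-coded.
PG_MAP = {
    "pg-1": [1, 37, 99],
    "pg-2": [4, 22, 41],
    "pg-3": [18, 26, 55],
}

def osds_for_object(obj_name, pg_map):
    """Each object maps to exactly one PG; a write goes to all of that
    PG's OSDs (the first is the primary, the rest are replicas)."""
    pgs = sorted(pg_map)
    pg = pgs[zlib.crc32(obj_name.encode()) % len(pgs)]  # stand-in hash
    return pg, pg_map[pg]

pg, osds = osds_for_object("my-object", PG_MAP)
print(pg, osds)  # whichever PG the hash picks, the write hits all 3 OSDs
```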


Pools

A pool is the layer at which most user interaction takes place: the important stuff like GET, PUT and DELETE actions for objects in a pool.

Pools contain a number of PGs, not shared with other pools (if you have multiple pools). The number of PGs in a pool is defined when the pool is first created, and can’t be changed later. You can think of PGs as providing a hash mapping for objects into OSDs, to ensure that the OSDs are filled evenly when adding objects to the pool.
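
That hash-style distribution is easy to demonstrate (the hash function and pg_num here are stand-ins for illustration, not Ceph’s real placement code):

```python
import zlib
from collections import Counter

PG_NUM = 8  # fixed when the pool is created

def pg_for(obj_name, pg_num=PG_NUM):
    # Stand-in for Ceph's placement hash: a stable hash modulo pg_num
    return zlib.crc32(obj_name.encode()) % pg_num

# Hashing spreads objects roughly evenly across the pool's PGs, which in
# turn keeps the underlying OSDs evenly filled.
counts = Counter(pg_for("obj-%d" % i) for i in range(10000))
print(sorted(counts.items()))
```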

CRUSH maps

CRUSH mappings are specified on a per-pool basis, and serve to skew the distribution of objects into OSDs according to administrator-defined policy. This is important for ensuring that replicas don’t end up on the same disk/host/rack/etc, which would defeat the entire point of having replica copies.

A CRUSH map is written by hand, then compiled and passed to the cluster.
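
The policy itself can be illustrated with a toy placement function. The topology below is invented for the example; a real CRUSH map describes your actual hosts, racks and placement rules:

```python
# Invented topology: six OSDs spread over three hosts.
OSD_HOST = {
    0: "host-a", 1: "host-a",
    2: "host-b", 3: "host-b",
    4: "host-c", 5: "host-c",
}

def pick_replicas(candidates, num_replicas, osd_host):
    """Greedily pick OSDs so that no two replicas share a host --
    the kind of constraint a CRUSH rule enforces."""
    chosen, used_hosts = [], set()
    for osd in candidates:
        if osd_host[osd] not in used_hosts:
            chosen.append(osd)
            used_hosts.add(osd_host[osd])
        if len(chosen) == num_replicas:
            return chosen
    raise RuntimeError("not enough failure domains for that replica count")

print(pick_replicas([0, 1, 2, 3, 4, 5], 3, OSD_HOST))  # [0, 2, 4]
```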

Still confused?

This may not make much sense at the moment, and that’s completely understandable. Someone on the Ceph mailing list provided a brief summary of the components which we found helpful for clarifying things:

* Many objects will map to one PG
* Each object maps to exactly one PG
* One PG maps to a list of OSDs. The first one in the list is the primary and the rest are replicas
* Many PGs can map to one OSD

A PG represents nothing but a grouping of objects; you configure the number of PGs you want (number of OSDs * 100 is a good starting point), and all of your stored objects are pseudo-randomly evenly distributed to the PGs.

So a PG explicitly does NOT represent a fixed amount of storage; it represents 1/pg_num ‘th of the storage you happen to have on your OSDs.
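
Putting numbers to that quote, with a hypothetical cluster size:

```python
num_osds = 12          # hypothetical cluster
osd_size_tb = 2.0      # hypothetical disk size

pg_num = num_osds * 100              # the rule-of-thumb starting point
total_tb = num_osds * osd_size_tb
per_pg_tb = total_tb / pg_num        # each PG holds 1/pg_num'th of the total

print(pg_num, per_pg_tb)  # 1200 0.02
```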

Ceph services

Now we’re into the good stuff. Pools full of objects are all well and good, but what do you do with them?


RADOS

What the lower layers ultimately provide is a RADOS cluster: a Reliable Autonomic Distributed Object Store. At a practical level this translates to storing opaque blobs of data (objects) in high-performance shared storage.

Because RADOS is fairly generic, it’s ideal for building more complex systems on top. One of these is RBD.


RBD

As the name suggests, a RADOS Block Device (RBD) is a block device stored in RADOS. RBD offers useful features on top of raw RADOS objects. From the official docs:

  • RBDs are striped over multiple PGs for performance
  • RBDs are resizable
  • Thin provisioning means on-disk space isn’t used until actually required

RBD also takes advantage of RADOS capabilities such as snapshotting and cloning, which would be very handy for applications like virtual machine disks.
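
Thin provisioning is easy to picture with a toy model: the image is carved into fixed-size chunks (4 MiB is RBD’s default object size), and a chunk only consumes real space once something is written to it. This sketch glosses over the actual I/O path:

```python
CHUNK = 4 * 1024 * 1024  # RBD's default object size, 4 MiB

class ThinImage:
    """Toy thin-provisioned image: chunks are only allocated on write."""
    def __init__(self, size_bytes):
        self.size = size_bytes   # logical (provisioned) size
        self.chunks = {}         # chunk index -> backing storage

    def write(self, offset, data):
        # Simplified: assume each write fits inside a single chunk
        self.chunks.setdefault(offset // CHUNK, bytearray(CHUNK))

    def used_bytes(self):
        return len(self.chunks) * CHUNK

img = ThinImage(10 * 1024**3)        # a "10 GiB" image
img.write(0, b"boot sector")
img.write(5 * 1024**3, b"some data")
print(img.size, img.used_bytes())    # 10 GiB provisioned, 8 MiB used
```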


CephFS

CephFS is a POSIX-compliant clustered filesystem implemented on top of RADOS. This is very elegant: the lower layers of the stack provide really awesome storage features (such as snapshotting), while the CephFS layer just needs to translate them into a usable filesystem.

CephFS isn’t considered ready for prime-time just yet, but RADOS and RBD are.

We’re excited!

Anchor is mostly interested in the RBD service that Ceph provides. To date our VPS infrastructure has been very insular, with each hypervisor functioning independently. This works fantastically and avoids putting all our eggs in one basket, but the lure of shared storage is strong.

Our hypervisor of choice, KVM, already has support for direct integration with RBD, which makes it a very attractive option if we want to use shared storage. Shared storage for a VPS enables live migration between hypervisors (moving a VPS to another hypervisor without downtime), which is unbelievably cool.

CephFS is also something we’d like to be able to offer our customers when it matures. We’ve found sharing files between multiple servers in a highly-available fashion to be clunky at best. We’ve so far avoided solutions like GFS and Lustre due to the level of complexity, so we’re hoping CephFS will be a good option at the right scale.

Further reading

We wouldn’t dare to suggest that our notes here are complete or infallibly accurate. If you’re interested in Ceph, the following resources are worth a read.

Got any questions, comments or want to report a mistake? Feel free to let us know in the comments below, or send us a mail.

Ceph got y’all hot and bothered and wanting to help us build fantastic new infrastructure? We’re hiring.