A crash course in Ceph, a distributed replicated clustered filesystem

September 14, 2012 | Technical

We’ve been looking at Ceph recently; it’s basically a fault-tolerant, distributed, clustered filesystem. If it works, that’s like nirvana for shared storage: you have many servers, each one pitches in a few disks, and there’s a filesystem that sits on top that’s visible to all servers in the cluster. If a disk fails, that’s okay too.

Those are really cool features, but it turns out that Ceph is really more than just that. To borrow a phrase, Ceph is like an onion – it’s got layers. The filesystem on top is nifty, but the coolest bits are below the surface.

If Ceph proves to be solid enough for use, we’ll need to train our sysadmins all about Ceph. That means pretty diagrams and explanations, which we thought would be more fun to share with you.

Diagram

This is the logical diagram that we came up with while learning about Ceph. It might help to keep it open in another window as you read a description of the components and services.

Ceph’s major components

Ceph components

We’ll start at the bottom of the stack and work our way up.

OSDs

OSD stands for Object Storage Device, and roughly corresponds to a physical disk. An OSD is actually a directory (e.g. /var/lib/ceph/osd-1) that Ceph makes use of, residing on a regular filesystem, though its contents should be treated as opaque for the purposes of using it with Ceph.

Use of XFS or btrfs is recommended when creating OSDs, owing to their good performance, feature set (support for xattrs larger than 4 KiB) and data integrity.

We’re using btrfs for our testing.

Using RAIDed OSDs

A feature of Ceph is that it can tolerate the loss of OSDs. This means we can theoretically achieve fantastic utilisation of storage devices by obviating the need for RAID on every single device.

However, we’ve not yet determined whether going without RAID is actually a good idea in practice. At this stage we’re not using RAID, and are just letting Ceph take care of block replication.

Placement Groups

Placement groups, also referred to as PGs, help ensure performance and scalability: the official docs note that tracking metadata for each individual object would be too costly.

A PG collects objects from the next layer up and manages them as a collection. It represents a mostly-static mapping to one or more underlying OSDs. Replication is done at the PG layer: the degree of replication (the number of copies) is set higher up, at the pool level, and all PGs in a pool will replicate stored objects across multiple OSDs.

As an example, in a system with 3-way replication:

  • PG-1 might map to OSDs 1, 37 and 99
  • PG-2 might map to OSDs 4, 22 and 41
  • PG-3 might map to OSDs 18, 26 and 55
  • Etc.

Any object that happens to be stored in PG-1 will be written to all three of its OSDs (1, 37 and 99). Any object stored in PG-2 will be written to its three OSDs (4, 22 and 41). And so on.
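To make that mapping concrete, here’s a rough sketch of the idea in Python (our own illustration with made-up numbers, not Ceph’s actual placement code, which uses a stable hash plus CRUSH): an object’s name determines its PG, and the PG determines which OSDs hold the copies.

    import hashlib

    # Purely illustrative PG-to-OSD map, matching the example above (not real cluster data).
    pg_to_osds = {
        0: [1, 37, 99],   # PG-1
        1: [4, 22, 41],   # PG-2
        2: [18, 26, 55],  # PG-3
    }
    pg_num = len(pg_to_osds)

    def pg_for_object(name):
        # Hash the object's name and reduce it to a PG number. Ceph's real mapping
        # is a stable hash rather than a simple modulo, but the principle is the same:
        # the name alone decides which PG (and hence which OSDs) the object lands on.
        digest = int(hashlib.md5(name.encode()).hexdigest(), 16)
        return digest % pg_num

    pg = pg_for_object("my-object")
    print("my-object lands in PG", pg, "and will be written to OSDs", pg_to_osds[pg])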

Pools

A pool is the layer at which most user interaction takes place. This is where the important operations happen: GET, PUT and DELETE actions for objects in a pool.

Pools contain a number of PGs, not shared with other pools (if you have multiple pools). The number of PGs in a pool is defined when the pool is first created, and can’t be changed later. You can think of PGs as providing a hash mapping for objects into OSDs, to ensure that the OSDs are filled evenly when adding objects to the pool.
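As a sketch of what that user interaction looks like, here’s the sort of thing you can do with the python-rados bindings. It assumes a running cluster, a readable /etc/ceph/ceph.conf, and a pool we’ve called ‘data’ (the pool name is our own, purely for illustration).

    import rados

    # Connect to the cluster described by the local ceph.conf (assumed to exist).
    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()

    # An I/O context is bound to a single pool; 'data' is a hypothetical pool name.
    ioctx = cluster.open_ioctx('data')
    try:
        ioctx.write_full('greeting', b'Hello, RADOS!')   # PUT an object into the pool
        print(ioctx.read('greeting'))                    # GET it back
        ioctx.remove_object('greeting')                  # DELETE it again
    finally:
        ioctx.close()
        cluster.shutdown()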

CRUSH maps

CRUSH mappings are specified on a per-pool basis, and serve to skew the distribution of objects into OSDs according to administrator-defined policy. This is important for ensuring that replicas don’t end up on the same disk/host/rack/etc, which would defeat the entire point of having replicated copies.

A CRUSH map is written by hand, then compiled and passed to the cluster.
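To illustrate the kind of constraint a CRUSH rule expresses (this is emphatically not the CRUSH algorithm itself, just the idea of host-aware placement over a made-up topology):

    # Made-up OSD-to-host topology, for illustration only.
    osd_host = {1: 'host-a', 2: 'host-a', 3: 'host-b', 4: 'host-b', 5: 'host-c'}

    def pick_replica_osds(candidate_osds, copies):
        # Pick OSDs for the replicas such that no two replicas share a host,
        # which is the sort of policy a CRUSH map lets you express declaratively.
        chosen, hosts_used = [], set()
        for osd in candidate_osds:
            if osd_host[osd] not in hosts_used:
                chosen.append(osd)
                hosts_used.add(osd_host[osd])
            if len(chosen) == copies:
                break
        return chosen

    print(pick_replica_osds([1, 2, 3, 4, 5], 3))   # -> [1, 3, 5], one OSD per host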

Still confused?

This may not make much sense at the moment, and that’s completely understandable. Someone on the Ceph mailing list provided a brief summary of the components which we found helpful for clarifying things:

* Many objects will map to one PG
* Each object maps to exactly one PG
* One PG maps to a list of OSDs. The first one in the list is the primary and the rest are replicas
* Many PGs can map to one OSD

A PG represents nothing but a grouping of objects; you configure the number of
PGs you want, number of OSDs * 100 is a good starting point, and all of your
stored objects are pseudo-randomly evenly distributed to the PGs.

So a PG explicitly does NOT represent a fixed amount of storage; it
represents 1/pg_num’th of the storage you happen to have on your OSDs.
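As a quick worked example of that rule of thumb (the OSD count here is made up for illustration):

    # "Number of OSDs * 100 is a good starting point" from the summary above.
    num_osds = 12                 # hypothetical cluster size
    pg_num = num_osds * 100       # -> 1200 PGs as a first guess
    print(pg_num)                 # commonly rounded to a nearby power of two, e.g. 1024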

Ceph services

Now we’re into the good stuff. Pools full of objects are all well and good, but what do you do with them?

RADOS

What the lower layers ultimately provide is a RADOS cluster: a Reliable Autonomic Distributed Object Store. At a practical level this translates to storing opaque blobs of data (objects) in high-performance shared storage.

Because RADOS is fairly generic, it’s ideal for building more complex systems on top. One of these is RBD.

RBD

As the name suggests, a RADOS Block Device (RBD) is a block device stored in RADOS. RBD offers useful features on top of raw RADOS objects. From the official docs:

  • RBDs are striped over multiple PGs for performance
  • RBDs are resizable
  • Thin provisioning means on-disk space isn’t used until actually required

RBD also takes advantage of RADOS capabilities such as snapshotting and cloning, which would be very handy for applications like virtual machine disks.
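As a sketch of those RBD features in action via the python-rbd bindings (again assuming a running cluster, a readable /etc/ceph/ceph.conf and a pool named ‘rbd’; the image name is our own):

    import rados
    import rbd

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('rbd')            # pool that will hold the block device's objects
    try:
        rbd.RBD().create(ioctx, 'test-image', 10 * 1024**3)   # 10 GiB image, thinly provisioned
        image = rbd.Image(ioctx, 'test-image')
        try:
            image.write(b'hello from RBD', 0)    # write at offset 0, like any block device
            image.resize(20 * 1024**3)           # grow the image to 20 GiB
            image.create_snap('first-snapshot')  # snapshot it via the underlying snapshot support
        finally:
            image.close()
    finally:
        ioctx.close()
        cluster.shutdown()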

CephFS

CephFS is a POSIX-compliant clustered filesystem implemented on top of RADOS. This is very elegant because the lower layers of the stack already provide really awesome features (such as snapshotting), while the CephFS layer just needs to translate them into a usable filesystem.

CephFS isn’t considered ready for prime-time just yet, but RADOS and RBD are.

We’re excited!

Anchor is mostly interested in the RBD service that Ceph provides. To date our VPS infrastructure has been very insular, with each hypervisor functioning independently. This works fantastically and avoids putting all our eggs in one basket, but the lure of shared storage is strong.

Our hypervisor of choice, KVM, already has support for direct integration with RBD, which makes it a very attractive option if we want to use shared storage. Shared storage for a VPS enables live migration between hypervisors (moving a VPS to another hypervisor without downtime), which is unbelievably cool.

CephFS is also something we’d like to be able to offer our customers when it matures. We’ve found sharing files between multiple servers in a highly-available fashion to be clunky at best. We’ve so far avoided solutions like GFS and Lustre due to the level of complexity, so we’re hoping CephFS will be a good option at the right scale.

Further reading

We wouldn’t dare to suggest that our notes here are complete or infallibly accurate. If you’re interested in Ceph, the official documentation is well worth a read.

Got any questions, comments or want to report a mistake? Feel free to let us know in the comments below, or send us a mail.

Ceph got y’all hot and bothered and wanting to help us build fantastic new infrastructure? We’re hiring.

3 Comments

  • bradj says:

    Sounds like a more advanced version of Gluster. Gluster has “bricks” which are plain directories; it will replicate each file x times to other bricks, then you just mount the clustered filesystem. They have an NFS server which provides transparent access for anything that already supports NFS.

    A very common feature request is to have what ceph is calling “CRUSH maps”. You could do this manually though by running multiple clusters and only allowing a single brick from each cluster to be on each host.

    Are you going to have a separate background network for replication like Google does in their GoogleFS clusters?

    I am definitely going to play with Ceph 🙂 Btw, not fair to test the stability of clustered storage and then base it on the ‘beta’ btrfs filesystem 🙂

  • bradj says:

    I take that back; it looks like they recommend XFS or btrfs only, and in my book XFS’s inability to shrink makes it an annoying filesystem to use in a dynamic environment 🙂

  • Barney Desmond says:

    Hi bradj,

    With our basic understanding of Gluster (we did a little internal evaluation of it), “a more advanced version of Gluster” sounds about fair. It sounds a little less flexible/abstracted than Ceph, in that it stores actual files straight into the filesystem on the bricks, but it’s cool that it has an already-working NFS frontend for clients (what with CephFS not being considered production-ready).

    We’re thinking we’d have some form of separate frontend and backend networks for a Ceph deployment. That said, Ceph clients need to be on the same network as Ceph servers, so the distinction is less clear. In a VM hosting environment, the backend would be between the storage nodes and compute nodes, while the frontend is between compute nodes and the rest of the internet.

    It’s kind of a tough call with XFS and btrfs at the moment. XFS is really quite solid, but just doesn’t have the shiny features that btrfs does. Meanwhile btrfs is a whole lot of fun to learn, and the docs are incomplete in places.

    Inability to shrink is annoying, but we’ve found we rarely need to do it. In a Ceph context it should never be a hassle – just blow away the OSD and remake it from scratch. 🙂