Vaultaire: a Ceph-based, immutable TSDB
Anchor has been working on building a massively scalable data vault for metrics data. One of our engineers has written a blog post about what worked – and what didn’t – in the first version, and what we’ve learned from it.
Here at Anchor we are reworking the way we store systems metrics. This has inspired the development of a new immutable and lossless Time Series Database (TSDB), a concept for which we have coined the term “a data vault for metrics”.
We currently use Round-Robin Databases (RRD) to store metrics; unfortunately, aside from the simplicity and constrained size of RRDs, they don’t have much going for them. In particular, RRDtool:
- by design, cannot import historical data;
- sacrifices data over time to save space; and
- requires one file per metric, which is neither highly available nor scalable.
We wanted a tool that can store unlimited amounts of data forever, without loss, reliably and immutably, in a manner such that it can be efficiently retrieved.
The off-the-shelf shelf looking a little empty
We looked far and wide for a tool that would fill this order, but came up short:
- OpenTSDB, discounted due to its reliance on Hadoop and HBase, which involves too much Java infrastructure for our tastes;
- Whisper (Graphite’s back-end), discounted due to scalability prospects and its data loss policy.
We went on to consider writing wrappers around various datastores:
- Riak, discounted after testing proved write performance to be slow due to Riak lacking a bulk insert API;
- Redis, which we are great fans of at Anchor. Unfortunately Redis just doesn’t scale infinitely (we have a lot of experience with this). By design, it’s a single-instance, memory-limited datastore – not what you use for durability or ever-growing data sets;
- Cassandra, which everyone else is in love with, but from experience we know to be a nightmare to maintain at scale;
- InfluxDB, which is similar to what we need but is not quite ready for production use yet and doesn’t provide the scaling properties we require;
- PostgreSQL is lovely, and we’re big fans, but it would be wasteful — we don’t require indexing or relations; and
- ElasticSearch would also be wasteful and almost as much overhead as Hadoop.
It might come as a surprise that we weren’t interested in indexing or relations, but the scalability of the system we wanted would be predicated on not needing them. Thus we decided to build our own data vault, to be backed by Ceph.
We boiled down the attributes that the system must have: scalability, immutability, losslessness and reliability. We identify with the Unix philosophy: “Write programs that do one thing and do it well”; scalability through modularity would be the theme of our design.
Decentralised workers pulling work from a central broker is the pattern we chose for the reasons of scalability and pain-free configuration. ZeroMQ made implementing this pattern easy. Unlimited scaling is a stretch, but by avoiding indices and partitioning our data intelligently, we can get close.
The data partitioning problem can become quite complex (think indices), so we decided to punt on the issue initially, and write each metric to its own Ceph object. Writing each metric to its own object was bound to create overhead, but after running some preliminary benchmarks, we thought we would be okay. Temporal locality is provided by writing to a new set of objects every 10,000 seconds.
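The bucketing described above can be sketched in a few lines. This is only an illustration under stated assumptions: the naming convention (metric identifier joined to the bucket start time) and the names `BUCKET_SECONDS`, `bucket_start` and `object_name` are hypothetical, not Vaultaire's actual object layout. Only the 10,000-second bucket width comes from the design itself.

```python
BUCKET_SECONDS = 10_000  # bucket width taken from the design described above


def bucket_start(timestamp: int) -> int:
    """Round a Unix timestamp down to the start of its 10,000-second bucket."""
    return timestamp - (timestamp % BUCKET_SECONDS)


def object_name(metric_id: str, timestamp: int) -> str:
    """Hypothetical naming scheme: one object per metric per time bucket."""
    return f"{metric_id}_{bucket_start(timestamp)}"


if __name__ == "__main__":
    # Two points from the same metric, 5,000 seconds apart, may land in
    # the same object or in neighbouring ones depending on the bucket edge.
    print(object_name("cpu.load", 1_400_012_345))
    print(object_name("cpu.load", 1_400_017_345))
```

Because the object name is a pure function of the metric and the timestamp, a reader can work out which objects cover a time range without consulting any index.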
Immutability is a cornerstone of this design. Enforcing the fact that a metric written at a given time cannot be modified allows us to make the promise that if you write some bytes to Vaultaire, you will always be able to retrieve them.
Ceph takes care of our horizontally scaling data store; we just need a data format. Lossless storage is easy (you just don’t destroy the data), but it comes at a price; with all that data being stored forever, we need a lot of storage space. This leaves us with the responsibility of designing, or picking, an efficient file format. We decided to go with Google protobufs, which turned out to be a mistake, but more on that later.
Reliable transmission is handled at the application level by using a parallel Selective Repeat ARQ: we associate a unique identifier with each packet of metrics and acknowledge each transmission back to the client. The use of ZeroMQ as a reliable messaging service means that this fail-safe mechanism is rarely exercised.
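To make the acknowledgment scheme concrete, here is a minimal sketch of the sender-side bookkeeping for a Selective Repeat ARQ. It is not the libmarquise implementation: the `Sender` class, its method names and the explicit `now` parameter are assumptions made for illustration, and the actual network I/O is left out. The point it shows is that each packet is tracked by its own identifier, so acknowledgments can arrive in any order and only the overdue packets are re-sent.

```python
import itertools


class Sender:
    """Tracks unacknowledged packets so that only those are re-sent."""

    def __init__(self, timeout: float):
        self.timeout = timeout
        self._ids = itertools.count()
        self.pending = {}  # packet id -> (payload, time last sent)

    def send(self, payload: bytes, now: float) -> int:
        """Register a packet as in flight and return its unique identifier."""
        packet_id = next(self._ids)
        self.pending[packet_id] = (payload, now)
        return packet_id

    def ack(self, packet_id: int) -> None:
        """Acks may arrive in any order; duplicate acks are ignored."""
        self.pending.pop(packet_id, None)

    def due_for_resend(self, now: float) -> list:
        """Return ids whose ack is overdue and restart their timers."""
        overdue = [
            pid
            for pid, (_, sent) in self.pending.items()
            if now - sent >= self.timeout
        ]
        for pid in overdue:
            payload, _ = self.pending[pid]
            self.pending[pid] = (payload, now)
        return overdue
```

A caller would invoke `due_for_resend` periodically and retransmit the payloads for the returned identifiers; packets that were acknowledged in the meantime never appear in that list.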
To summarise, the design works like this: a client library sends metrics to a broker; writer daemons connected to the broker receive the metrics and store them in Ceph. To read data, a client sends requests to a broker; reader daemons connected to the broker receive the requests and reply with stored metrics from Ceph.
The first three months of development saw a prototype delivered — along with a few issues. In developing Vaultaire, we were able to push a promising amount of code back to the open source community, including:
- Haskell bindings to librados;
- Exposing atomic write support in the librados C API;
- An implementation of the xxhash algorithm for Haskell; and,
- Machiavelli, a web-based data visualisation tool.
As developers we enjoy writing good software, and enjoy being recognised for that contribution. By giving back to the open source community we have motivated ourselves to write better, more flexible and more modular software. This makes us happy.
The final repositories of Vaultaire were thus:
- chateau, a message broker;
- libmarquise, the client library;
- vaultaire, the reader and writer daemons;
- chevalier, for indexing and searching our list of metrics in ElasticSearch;
- descartes, an interpolating HTTP JSON API for Vaultaire;
- Golang and Haskell language bindings to libmarquise.
Hark! A bottleneck!
Experimentation identified a couple of places where reality didn’t quite live up to expectations. In theory, Ceph provides us with a flat, horizontally scaling namespace of objects that can be written to concurrently. If we need more capacity or speed, we just add more nodes to the cluster, and so on ad infinitum.
In our testing the bottleneck was not the speed of the disks or network, nor the total number of objects. Rather, much to our surprise, it was the number of concurrent writes we could have in-flight to Ceph. More precise benchmarks quickly identified the problem: as the number of operations per second (op/s) increased past the capacity of the cluster, they would simply back up and complete later — sometimes much later.
We managed to track this down to the underlying journalling mechanism of Ceph and, in particular, its interaction with the btrfs filesystem we had been using. After migrating our Ceph cluster to xfs, the problem all but went away. Unfortunately, the number of op/s we were chewing through was still on the high side and could potentially impact the quality of service for others trying to utilise the same Ceph cluster.
Fix it! Fix it!
In a rushed attempt to make the problem go away, we decided to coalesce these writes together in a buffer for a certain period before writing out to Ceph. This brought up to a 2:1 reduction in op/s at the price of a memory usage blowout, dire scaling prospects, and, worse yet, higher latency on acknowledgment of writes. This was not an acceptable solution, and as we put more load on the system, memory usage rose in unison. Memory usage wasn’t the only problem either. When receiving a spike of traffic, the clients would have to wait a long time to receive acknowledgments. After waiting a while, the clients would give up and re-send, compounding the problem.
Maybe that was a bad idea
Whilst keeping copious amounts of work in memory is almost always a bad idea, the concept of buffering writes was sound. In this vein, we implemented a journalling daemon called “bufferd”. This daemon has one job: eat bytes off the wire and append them to a journal, without processing the data itself. This worked surprisingly well, and gave our workers a fixed pool of work to process at their leisure. Furthermore, acknowledgments now came back reliably and quickly.
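The idea of appending bytes to a journal without interpreting them can be sketched as follows. This is not bufferd's code or its on-disk format: the 4-byte length prefix, the file layout and the function names `append_frame` and `read_frames` are assumptions chosen to keep the example small. What it illustrates is that the write path does no parsing, so the acknowledgment can go out as soon as the bytes are durable, while workers read the frames back later at their own pace.

```python
import os
import struct


def append_frame(path: str, payload: bytes) -> None:
    """Append one length-prefixed frame to the journal without parsing it."""
    with open(path, "ab") as journal:
        journal.write(struct.pack("<I", len(payload)) + payload)
        journal.flush()
        os.fsync(journal.fileno())  # make the bytes durable before acknowledging


def read_frames(path: str):
    """Yield the payloads back in the order they were appended."""
    with open(path, "rb") as journal:
        while True:
            header = journal.read(4)
            if len(header) < 4:
                return  # end of journal
            (length,) = struct.unpack("<I", header)
            yield journal.read(length)
```

Calling `fsync` on every frame is the simplest way to show the durability guarantee; a real daemon would more likely batch several frames per sync to keep throughput up.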
The journal saved the day for us, but we are under no delusions that it is a good solution. We should not need a journal; it is a band-aid for our biggest design flaw: a complex file format.
Variable-length file formats are really slow.
Rumour has it that Google stores all of their logs as protobufs and can perform map-reduce jobs on them in this format. This provided us with confidence when adopting protobufs for our on-disk format.
The problem with variable-length file formats is that they make memory allocation difficult. The Haskell library that we are using to parse protobufs gets around this problem by allocating “objects”, or data types. The library in question is a work of art in terms of elegance and conciseness, but it brought garbage collectors to their knees when executing at scale.
We did make a concerted effort to improve the performance of the library. We actually made a lot of progress, which will make protobuf parsing much faster for everyone who uses Haskell. Unfortunately though, it’s not fast enough for us; we’re currently burning up 16 CPU cores writing all of our metrics to disk.
Needless to say, it’s time to iterate on what we’ve learned.
What we learned
- It’s important to speculate on bottlenecks, identify them and pin down their locations early. This might seem like make-work, but by thinking about the bottlenecks early you will know precisely where time spent optimising actually pays off.
- Allowing someone to change the “type” of a metric on the fly (from say, real numbers, to a string) added a lot of unnecessary complexity to the interpolation of any given metric. In the re-design we will be restricting a given measurement stream to one type. This restriction has allowed us to design a much simpler and faster file format.
- Simple, fixed wire formats are much faster to parse. Reduce complexity everywhere it can be afforded; always go for the simplest option, not the fastest or nicest. You can optimise later, at the bottlenecks, because you know where those are!
- Protocol Buffers are slow for file formats, but good for serialising requests. Yes, we have a lot to say on how bad protobufs were for us as a file format. However, when used in the manner they were designed for, protobufs prove to be an excellent tool for standardising on request formats between various components.
- Polyglot development can be counterproductive, especially in a small team. Yes, some languages are better suited to particular problems, but this must be balanced against the drawbacks. Drawbacks include: hesitancy to collaborate, a lack of knowledgeable reviewers in code review, and developer “rabbit holing”. When taken in perspective, using multiple languages for a given project can only reduce unity and overall understanding of each other’s work. For this reason, the re-design of Vaultaire is being written entirely in the language we have chosen for all our future work, Haskell.
- Metrics are really useful! We even added them to our client-side library, which allowed us to see when clients were getting backed up on failed and deferred requests, and measure things like end-to-end latency, across the whole cluster.
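The point in the list above about simple, fixed wire formats can be illustrated with a small sketch. The record layout here (an 8-byte address, an 8-byte timestamp and an 8-byte float value) is a hypothetical example, not the format used by either version of Vaultaire. With every record the same size, decoding needs no per-field length parsing, and the nth record sits at a known byte offset.

```python
import struct

# Hypothetical layout: 8-byte address, 8-byte timestamp, 8-byte float value,
# all little-endian, so every record is exactly 24 bytes.
RECORD = struct.Struct("<QQd")


def encode(points) -> bytes:
    """Pack (address, timestamp, value) tuples into one contiguous buffer."""
    return b"".join(RECORD.pack(addr, ts, value) for addr, ts, value in points)


def decode(buf: bytes) -> list:
    """Unpack every record in the buffer in a single pass."""
    return list(RECORD.iter_unpack(buf))


def nth(buf: bytes, n: int):
    """Random access: record n starts at byte n * RECORD.size."""
    return RECORD.unpack_from(buf, n * RECORD.size)
```

The trade-off is flexibility: a fixed layout like this only works once each measurement stream is restricted to a single type, as described in the list above.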
On the upside
The current stable version of Vaultaire has been running in production for almost four months now without losing a single data point. We are able to search and visualise all of the metrics in the Vault, and it’s not too slow at all: retrieving a range of data points for a two-week period happens in under five seconds.
Vaultaire has now been re-designed from scratch and development of a new version, version two, is almost finished. Initial benchmarks of the new version show several orders of magnitude improvement in performance, and it’s much simpler.
Initially our data vault is for internal consumption, but as we become happy with its stability and add authentication functionality we intend to provide this capability as a service to our customers.