One of our customers is deploying a big new app, and they’re using a lot of cool technologies that we haven’t used before. Awesome!
Building the cluster is straightforward enough, but we’re not satisfied with just that; we never are. How does it work? We want to know what’s going on, we need to know. That’s what makes our support so good, a proper understanding of the things that we run.
One of the pieces of software at the core of the new app is Apache Kafka. It sounds a bit like an AMQP implementation, but that’s all we know at this stage.
David, a member of our projects team, loves tinkering with thing, so he set about putting together a little tech demo. It didn’t need to be fancy at all, just something functional to play with, so we can examine Kafka’s behaviour when we run various scenarios in the client code.
We’re getting a little ahead of ourselves though, just what is Kafka?
Kafka is a distributed pub-sub messaging system. Its distinction is that it’s designed to handle really large amounts of data with high throughput in a realtime manner.
The customer will be using Kafka to push messages between the client-facing frontends, and the backend processing/datastore machines. Kafka nodes can be clustered for availability and extra capacity, which we’ve done.
Kafka’s message handling model, David decided, really sounded like IRC. We could make use of some Web2.0 technologies and knock together a nifty little browser-based chat system, with enough backend capacity to handle about three-gajillion users sending umpty-billion messages per second.
It was to be called Pollchat, because it seemed like a good idea at the time.
We can’t talk directly to Kafka with Websockets, so we need an intermediary. David hacked up a shim in Node.js to marshal messages between the Websockets clients and the Kafka backends.
Node.js is great because it’s very high performance and light on resource usage. If we had to scale-up this chat system, we could deploy more Node.js servers very easily. The Node.js server also hosts the client’s webpage, something lightweight like nginx is ideal for this.
Then there’s Kafka, sitting there ready to push messages around. A chatroom (“channel” in IRC parlance) corresponds to a “topic” for publishing to when it comes to Kafka. When a client joins a chatroom, they subscribe to that topic. Anything they say in the room is sent as a message to that topic, which Kafka duly distributes to all the subscribed clients.
One of the nifty things about Kafka is that it’s designed to be persistent; this means that you get chat history for free. When a client connects, they can request
--from-beginning and get everything that the server has stored. In practice you don’t keep everything forever, you can tell Kafka to discard old data based on size or age.
That’s all there is to it, it’s pretty simple. We got Pollchat running in the office and had about twenty people chatting to each other with fantastically hilarious randomly chosen names (the hand-picked dictionary contains things like toxicologist, Münchhausen, bylaw, rotogravure, breathless and cumquat).
It’s a cute little demo of how easy it is to throw things together and produce a useful system. In the process we learnt more about how the configuration and replication works, and how we can monitor the whole system for health. Extending the code is easy, which allowed us to see what happens when multiple connections to separate Kafka nodes are involved.
Of course we’re publishing the code, so if you want to try it out, have a gander at the pollchat repo on Github. If you follow the quickstart setup guides for Kafka you shouldn’t have too much trouble getting things working.
In testing we noticed that there’s about 1sec of lag between posting a message and it appearing on all the other clients, which we figured out was due to how often Kafka commits messages to disk. Because Kafka assures that messages don’t get lost, they need to be written out to disk before they’re forwarded to subscribers. The developers have chosen to flush messages to disk every second, which explains the lag that we saw.
We think it’s an interesting way to do things, but it gets the job done. As they note, the focus is on throughput and not latency, so while it’s not ideally suited to this sort of usage, it gets the job done.