Overview of Checkpoint and Restore – live-migrating processes on a Linux system

February 1, 2013 Technical, General

We’re attending Linuxconf 2013 this week, being held down in our fair capital Canberra. There have been some great talks so far, and we thought we’d share one of the most interesting with you.

In a nutshell, Checkpoint and Restore In Userspace (CRIU) is the ability to take a point-in-time snapshot of a running process (checkpoint), and revive it later, either on the same system or another system (restore). We’ll go over the difficulties in pulling this off, and what it’s good for.

Problems – the rabbit hole goes much deeper

At first blush, this sounds simple enough – dump the process’ memory and stash it away, then later restore it and fix up a few references in the kernel. Too easy! Not so fast: there are a lot of subtle problems to be solved.

Early work on CRIU took a very naive approach to the problem. As well as dumping memory, it made use of procfs, syscalls, and netlink sockets to gather information from the kernel.

File descriptors are a good example of this; for the process to keep running it needs all its file handles and sockets to be available when it’s restored. CRIU doesn’t manage the actual files or filesystems (that’s your job), but it provides for a degree of translation to take into account things like moved/covered mountpoints and similar.
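To make the procfs side of this concrete, here’s a minimal sketch of collecting the state of a process’ open file descriptors on Linux. It reads the `/proc/<pid>/fd` symlinks and the `pos` and `flags` fields from `/proc/<pid>/fdinfo/<fd>`. The function names are our own, and this is not how CRIU itself is written; it only shows the kind of information procfs exposes.

```python
import os


def parse_fdinfo(text):
    """Parse the 'pos' and 'flags' fields of a /proc/<pid>/fdinfo/<fd> file."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return {
        "pos": int(fields["pos"]),         # current file offset
        "flags": int(fields["flags"], 8),  # open(2) flags, printed in octal
    }


def describe_fds(pid="self"):
    """Return {fd: {"target", "pos", "flags"}} for a process (Linux only)."""
    result = {}
    for name in os.listdir(f"/proc/{pid}/fd"):
        try:
            target = os.readlink(f"/proc/{pid}/fd/{name}")
            with open(f"/proc/{pid}/fdinfo/{name}") as f:
                info = parse_fdinfo(f.read())
        except OSError:
            continue  # descriptor was closed while we were looking
        info["target"] = target
        result[int(name)] = info
    return result
```

Calling `describe_fds()` with no arguments inspects the calling process. A real checkpoint needs a good deal more than this (socket state, locks, and so on), which is where the other interfaces come in.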

This’ll get you most of the way there, and you can massage the information back into the kernel to perform a restore. For things that didn’t fit these interfaces, some small patches to the kernel completed the functionality.

So how about PIDs? If you want to migrate the process to another host, the PID had better be unused on the destination system. A process’ PID can’t change, as it has parent and child relationships with other processes, which generally rely on knowing the PID. Consider also that some daemons write a pidfile to aid other processes in interacting with them.
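As a rough illustration of the "is this PID free on the destination?" question, here’s a small sketch using the standard signal-0 trick. The helper name `pid_in_use` is ours, and the check is inherently racy (another process could grab the PID a moment later), so treat it as a pre-flight check rather than a guarantee.

```python
import os


def pid_in_use(pid):
    """Return True if some process currently holds this PID."""
    try:
        os.kill(pid, 0)  # signal 0 delivers nothing; it only checks the target
    except ProcessLookupError:
        return False     # no such process: the PID is free right now
    except PermissionError:
        return True      # the process exists but belongs to another user
    return True
```

For example, `pid_in_use(os.getpid())` is always true, since the calling process obviously exists.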

Some parts of a process’ state aren’t even directly tied to the process itself in the kernel. Examples of this are outstanding signals and various buffers (e.g. data on network sockets that hasn’t been read yet). These all need to be tracked down, stashed away, and carefully set up again to prevent any nasty surprises when the process is woken up. Code was written to allow extended peeking on sockets, so they can be inspected and replicated.
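The basic idea of peeking can be shown with the ordinary `MSG_PEEK` flag, which copies unread data out of a socket’s receive buffer without consuming it. This sketch uses only that standard flag on a local socket pair; the extended peeking added for CRIU goes further than this, and we aren’t reproducing it here.

```python
import socket


def peek_unread(sock, max_bytes=65536):
    """Copy pending data out of the receive buffer without consuming it."""
    return sock.recv(max_bytes, socket.MSG_PEEK)


# A connected pair of local sockets stands in for a real client and server.
a, b = socket.socketpair()
a.sendall(b"pending request")

snapshot = peek_unread(b)  # what a checkpoint would need to stash away
consumed = b.recv(65536)   # the application still sees the same bytes later

a.close()
b.close()
```

Because the peek leaves the buffer untouched, the process being checkpointed carries on none the wiser.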

What about interprocess matters? Pipes are a common IPC technique, and they need to be correctly plumbed on the destination. kcmp is a system call written for CRIU to find matching references in the kernel so this can be done, and it has been merged into mainline.
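Python’s standard library has no wrapper for kcmp, so the sketch below uses a plainly different technique to show the matching problem: on Linux, both ends of a pipe report the same device and inode number from `fstat`, so descriptors can be grouped by pipe that way. This is a userspace approximation with helper names of our own; kcmp answers a finer-grained question (whether two descriptors share the same kernel object), which this does not.

```python
import os


def pipe_identity(fd):
    """Identify the pipe behind a descriptor by its (device, inode) pair."""
    st = os.fstat(fd)
    return (st.st_dev, st.st_ino)


def group_pipe_ends(fds):
    """Group descriptors that refer to the same underlying pipe."""
    groups = {}
    for fd in fds:
        groups.setdefault(pipe_identity(fd), []).append(fd)
    return list(groups.values())


r1, w1 = os.pipe()
r2, w2 = os.pipe()

# Shuffle the order to show that grouping doesn't depend on creation order.
groups = group_pipe_ends([r1, r2, w1, w2])

for fd in (r1, w1, r2, w2):
    os.close(fd)
```

Each resulting group holds the read and write end of one pipe, which is the pairing a restore has to recreate.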

Finally, CRIU introduces what it calls TCP Repair Mode to solve networking problems. Rather than attempt to directly mangle connection states in the networking stack, repair mode stops the socket from putting any traffic on the wire, allowing normal syscalls (e.g. connect, send) to update the connection’s state without producing any external effects.
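Repair mode is switched on per socket with the `TCP_REPAIR` socket option. Here’s a minimal sketch of toggling it. The constant’s value (19) comes from the Linux `tcp.h` header and is hard-coded as a fallback in case Python’s `socket` module doesn’t export it. The call needs the `CAP_NET_ADMIN` capability, so on an unprivileged account we’d expect it to be refused, and the helper simply reports that.

```python
import socket

# Value from linux/tcp.h; used if the socket module lacks the constant.
TCP_REPAIR = getattr(socket, "TCP_REPAIR", 19)


def try_set_repair(sock, enabled):
    """Try to switch TCP repair mode on or off; return True on success."""
    try:
        sock.setsockopt(socket.IPPROTO_TCP, TCP_REPAIR, 1 if enabled else 0)
        return True
    except OSError:
        # Typically EPERM without CAP_NET_ADMIN, or an unsupported kernel/OS.
        return False


sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
entered = try_set_repair(sock, True)
if entered:
    # Leave repair mode again before closing so the socket behaves normally.
    try_set_repair(sock, False)
sock.close()
```

The sketch stops at entering and leaving repair mode. Restoring an actual connection also means putting back sequence numbers and queued data, which we haven’t attempted here.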

As we’re sure you can see, there’s a lot to do. To wrap up the rest of the list, CRIU deals with: multitasking and multithreading, on x86-64 and ARM; all types of memory maps; process groups, sessions and terminals; namespaces; process credentials; open files (including shared and unlinked files); TCP, UDP and Unix sockets; TCP connections. What’s more, all the functionality to support this is in the kernel as of version 3.7.

Why we’re excited

It’s fair to ask just what this is good for. In short, a lot of things.

What CRIU represents is a way to bundle up all of an application’s state cleanly and later redeploy it. The first application we thought of was High Availability (HA).

One widely used HA deployment method is to use a pair of servers, in an Active/Standby configuration. This incurs a noticeable amount of downtime in the event of a failure, as the standby server can’t seamlessly pick up the pieces. Enter CRIU.

You could use CRIU to constantly send process snapshots to the Standby server. When a failure of the Active server is detected, the last complete CRIU snapshot can be restored on the Standby server, and service continues seamlessly.
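The "last complete snapshot" part deserves care, because the Active server might die halfway through writing one. Here’s one way it could be handled, as a sketch under our own assumptions: each snapshot lives in its own directory, directory names sort chronologically, and a marker file (the hypothetical `COMPLETE`) is written only after a snapshot has fully arrived. None of this is CRIU functionality; it’s the bookkeeping we’d wrap around it.

```python
import os
import tempfile

MARKER = "COMPLETE"  # hypothetical marker file, written last


def mark_complete(snapshot_dir):
    """Flag a snapshot directory as fully transferred."""
    with open(os.path.join(snapshot_dir, MARKER), "w") as f:
        f.write("ok\n")


def latest_complete_snapshot(root):
    """Return the newest snapshot directory that has a marker, or None."""
    candidates = sorted(
        name for name in os.listdir(root)
        if os.path.exists(os.path.join(root, name, MARKER))
    )
    return os.path.join(root, candidates[-1]) if candidates else None


with tempfile.TemporaryDirectory() as root:
    for name in ("0001", "0002", "0003"):
        os.mkdir(os.path.join(root, name))
    mark_complete(os.path.join(root, "0001"))
    mark_complete(os.path.join(root, "0002"))
    # 0003 was still being written when the Active server failed.
    chosen = os.path.basename(latest_complete_snapshot(root))
```

Here the Standby would restore from `0002` and ignore the half-written `0003`.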

To give a specific example, MySQL with a SAN storage backend would be an ideal use case. The use of shared storage ensures the same data is available to both servers, and CRIU can migrate the service if one server is going down, open connections and all.

CRIU isn’t a panacea, but we can see plenty of opportunities in our fully managed environments to provide an improved HA experience. Even in non-HA environments, it’d let us perform maintenance with greater flexibility and less disruption to customers. That’s definitely something worth looking forward to.

Excited about running smarter systems that are just plain cool? We’re hiring.