Bugfixing KVM live migration

November 17, 2015 · Technical

Here at Anchor we really love our virtualization. Our virtualization platform of choice, KVM, lets us provide a variety of different VPS products to meet our customers’ requirements.

Our KVM hosting platform has evolved considerably over the six years it’s been in operation, and we’re always looking at ways we can improve it. One important aspect of this process of continual improvement, and one I am heavily involved in, is the testing of software upgrades before they are rolled out. This post describes a recent problem encountered during this testing, the analysis that led to discovering its cause, and how we have fixed it. Strap yourself in, this might get technical.

The bug’s first sightings

Until now, we have built most of our KVM hosts on Red Hat Enterprise Linux 6 — it’s fast, stable, and supported for a long time. Since the release of RHEL 7 a year ago we have been looking at using it as well, perhaps even eventually replacing all our existing RHEL 6 hypervisors.

Of course, a big change like this can’t be made without a huge amount of testing. One set of tests is to check that “live migration” of virtual machines works correctly, both between RHEL 7 hypervisors and from RHEL 6 to RHEL 7 and back again.

Live migration is a rather complex affair. Before I describe live migration, however, I ought to explain a bit about how KVM works. KVM is itself just a Linux kernel module. It provides access to the underlying hardware’s virtualization extensions, which allows guests to run at near-native speeds without emulation. However, we need to provide our guests with a set of “virtual hardware” — things like a certain number of virtual CPUs, some RAM, some disk space, and any virtual network connections the guest might need. This virtual hardware is provided by software called QEMU.

When live migrating a guest, it is QEMU that performs all the heavy lifting:

  1. QEMU synchronizes any non-shared storage for the guest (the synchronization is maintained for the duration of the migration).
  2. QEMU synchronizes the virtual RAM for the guest across the two hypervisors (again for the duration of the migration). But remember, this is a live migration, which means the guest could be continually changing the contents of RAM and disk, so…
  3. QEMU waits for the amount of “out-of-sync” data to fall below a certain threshold, at which point it pauses the guest (i.e. it turns off the in-kernel KVM component for the guest).
  4. QEMU synchronizes the remaining out-of-sync data, then resumes the guest on the new hypervisor.

Since the guest is only paused while synchronizing a small amount of out-of-sync RAM (and an even smaller amount of disk), we can limit the impact of the migration upon the guest’s operation. We’ve tuned things so that most migrations can be performed with the guest paused for no longer than a second.

So this is where our testing encountered a problem. We had successfully tested live migrations between RHEL 7 hypervisors, as well as from those running RHEL 6 to those running RHEL 7. But when we tried to migrate a guest from a RHEL 7 hypervisor to a RHEL 6 one, something went wrong: the guest remained paused after the migration! What could be the problem?

Some initial diagnosis

The first step in diagnosing any problem is to gather as much information as you can. We have a log file for each of our QEMU processes. Looking at the log file for the QEMU process “receiving” the live migration (i.e. on the target hypervisor) I found this:

KVM: entry failed, hardware error 0x80000021

If you're running a guest on an Intel machine without unrestricted mode
support, the failure can be most likely due to the guest entering an invalid
state for Intel VT. For example, the guest maybe running in big real mode
which is not supported on less recent Intel processors.

RAX=ffffffff8101c980 RBX=ffffffff818e2900 RCX=ffffffff81855120 RDX=0000000000000000
RSI=0000000000000000 RDI=0000000000000000 RBP=0000000000000000 RSP=ffffffff81803ef0
R8 =0000000000000000 R9 =0000000000000000 R10=0000000000000000 R11=0000000000000000
R12=ffffffff81800000 R13=0000000000000000 R14=00000000ffffffed R15=ffffffff81a27000
RIP=ffffffff81051c02 RFL=00000246 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0000 0000000000000000 ffffffff 00c00100
CS =0010 0000000000000000 ffffffff 00a09b00 DPL=0 CS64 [-RA]
SS =0018 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
DS =0000 0000000000000000 ffffffff 00c00100
FS =0000 0000000000000000 ffffffff 00c00100
GS =0000 ffff88003fc00000 ffffffff 00c00100
LDT=0000 0000000000000000 ffffffff 00c00000
TR =0040 ffff88003fc10340 00002087 00008b00 DPL=0 TSS64-busy
GDT=     ffff88003fc09000 0000007f
IDT=     ffffffffff574000 00000fff
CR0=8005003b CR2=00007f6bee823000 CR3=000000003d2c0000 CR4=000006f0
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000 
DR6=00000000ffff0ff0 DR7=0000000000000400
EFER=0000000000000d01
Code=00 00 fb c3 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 fb f4  66 66 66 66 2e 0f 1f 84 00 00 00 00 00 f4 c3 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00

What appears to have happened here is that the entire migration process worked correctly up to the point at which the QEMU process needed to resume the guest… but when it tried to do so, the guest failed to start properly. QEMU dumps out the guest’s CPU registers when this occurs. “Hardware error 0x80000021” is unfortunately a rather generic error code — it simply means “invalid guest state”. But what could be wrong with the guest state? It was just running a moment ago on the other hypervisor; how did the migration make it invalid, if live migration is supposed to copy every part of the guest state intact?

Given that all of our other migration tests were passing, what I needed to do was compare this “bad” migration with one of the “good” ones. In particular, I wanted to get the very same register dump out of a “good” migration, so that I could compare it with this “bad” migration’s register dump.

QEMU itself does not seem to have the ability to do this (after all, if a migration is successful, why would you need a register dump?), which meant I would have to change the way QEMU works. Rather than patching the QEMU software then and there, I found it easiest to modify its behaviour through GDB. By attaching a debugger to the QEMU process, I could have it stop at just the right moment, dump out the guest’s CPU registers, then continue on as if nothing had occurred:

# gdb -p 8332
...
(gdb) break kvm_cpu_exec
Breakpoint 1 at 0x7f25ec576050: file /usr/src/debug/qemu-2.4.0/kvm-all.c, line 1788.
(gdb) commands
Type commands for breakpoint(s) 1, one per line.
End with a line saying just "end".
>call cpu_dump_state(cpu, stderr, fprintf, CPU_DUMP_CODE)
>disable 1
>continue
>end
(gdb) continue
Continuing.
[Thread 0x7f2596fff700 (LWP 8339) exited]
[New Thread 0x7f25941ff700 (LWP 8357)]
[New Thread 0x7f2596fff700 (LWP 8410)]
[New Thread 0x7f25939fe700 (LWP 8411)]
[Thread 0x7f25939fe700 (LWP 8411) exited]
[Thread 0x7f2596fff700 (LWP 8410) exited]
[Switching to Thread 0x7f25d8533700 (LWP 8336)]

Breakpoint 1, kvm_cpu_exec (cpu=cpu@entry=0x7f25ee8cc000) at /usr/src/debug/qemu-2.4.0/kvm-all.c:1788
1788    {
[Switching to Thread 0x7f25d7d32700 (LWP 8337)]

Success! This produced a new register dump:

RAX=ffffffff8101c980 RBX=ffffffff818e2900 RCX=ffffffff81855120 RDX=0000000000000000
RSI=0000000000000000 RDI=0000000000000000 RBP=0000000000000000 RSP=ffffffff81803ef0
R8 =0000000000000000 R9 =0000000000000000 R10=0000000000000000 R11=0000000000000000
R12=ffffffff81800000 R13=0000000000000000 R14=00000000ffffffed R15=ffffffff81a27000
RIP=ffffffff81051c02 RFL=00000246 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=1
ES =0000 0000000000000000 ffffffff 00000000
CS =0010 0000000000000000 ffffffff 00a09b00 DPL=0 CS64 [-RA]
SS =0018 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
DS =0000 0000000000000000 ffffffff 00000000
FS =0000 0000000000000000 ffffffff 00000000
GS =0000 ffff88003fc00000 ffffffff 00000000
LDT=0000 0000000000000000 000fffff 00000000
TR =0040 ffff88003fc10340 00002087 00008b00 DPL=0 TSS64-busy
GDT=     ffff88003fc09000 0000007f
IDT=     ffffffffff574000 00000fff
CR0=8005003b CR2=00007f0817db3000 CR3=000000003a45d000 CR4=000006f0
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000 
DR6=00000000ffff0ff0 DR7=0000000000000400
EFER=0000000000000d01
Code=00 00 fb c3 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 fb f4  66 66 66 66 2e 0f 1f 84 00 00 00 00 00 f4 c3 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00

So now I was able to compare this “good” register dump with the previous “bad” one. The most important differences seemed to be related to the “segment registers”:

bad:  ES =0000 0000000000000000 ffffffff 00c00100
good: ES =0000 0000000000000000 ffffffff 00000000

bad:  DS =0000 0000000000000000 ffffffff 00c00100
good: DS =0000 0000000000000000 ffffffff 00000000

bad:  FS =0000 0000000000000000 ffffffff 00c00100
good: FS =0000 0000000000000000 ffffffff 00000000

bad:  GS =0000 ffff88003fc00000 ffffffff 00c00100
good: GS =0000 ffff88003fc00000 ffffffff 00000000

bad:  LDT=0000 0000000000000000 ffffffff 00c00000
good: LDT=0000 0000000000000000 000fffff 00000000

Those fields at the end contained different values in the “bad” and “good” migrations. Could they be the cause of the “invalid guest state”?

Memory segmentation

To understand what’s going on here, we need to know a bit about how x86 memory segmentation works. Once upon a time, this was really simple: a 16-bit CS (code segment), DS (data segment) or SS (stack segment) register was simply shifted by 4 bits and added to a 16-bit offset in order to form a 20-bit absolute address.
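That 8086-era translation can be written in a couple of lines (the function name is mine; the mask models the 20-bit address wraparound of the original chips):

```c
#include <assert.h>
#include <stdint.h>

/* 8086-style real-mode translation: the segment is shifted left by
 * 4 bits and added to the offset, truncated to a 20-bit address. */
static uint32_t real_mode_addr(uint16_t segment, uint16_t offset)
{
    return (((uint32_t)segment << 4) + offset) & 0xFFFFF;
}
```

For instance, `real_mode_addr(0xB800, 0x0000)` gives `0xB8000`, the traditional location of text-mode video memory.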

But “protected mode” (introduced in the Intel 80286) complicated things greatly. Instead of a 16-bit segment number, each segment register held:

  1. a 16-bit “segment selector”;
  2. a “base address” for the segment;
  3. the segment’s “size”;
  4. a set of “flags” to keep track of things like whether the segment can be written to, whether the segment is actually present in physical RAM, and so on.

These are the four fields you can see in the segment registers shown above.

But hang on… this guest wasn’t running in “protected mode”. It was a 64-bit guest running a 64-bit operating system; it was running in what’s called “long mode”, and for the most part long mode doesn’t have segmentation. The particular values in the segment registers listed above are mostly irrelevant, because the CPU isn’t actively using those registers.

So at this point I knew that the segment registers had different flags in the “bad” migration than they did in the “good” migration. But if the registers weren’t being used, why would the flags matter?

“Unusable” memory segments

It took a fair bit of trawling through QEMU and kernel source code and Intel’s copious documentation before I found the answer. It turns out that there is a hidden flag, not visible in these register dumps, indicating whether a particular segment is “usable” or not. The usable flags are not part of the register dumps because they’re not really part of a guest’s CPU state; instead, they’re used by a hypervisor to tell the host CPU which of a guest’s segment registers should be loaded when a guest is started — and most importantly, this includes the times a guest is resumed immediately following a migration.
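For reference, the structure KVM uses to exchange a segment register with userspace carries that hidden flag explicitly. It looks roughly like this (paraphrased from `struct kvm_segment` in the Linux UAPI headers, with stdint types substituted):

```c
#include <assert.h>
#include <stdint.h>

/* Approximation of struct kvm_segment from <linux/kvm.h>: the
 * fields QEMU prints in its register dumps, plus the "unusable"
 * flag that is exchanged with the kernel but never printed. */
struct kvm_segment_approx {
    uint64_t base;
    uint32_t limit;
    uint16_t selector;
    uint8_t  type;
    uint8_t  present, dpl, db, s, l, g, avl;
    uint8_t  unusable;  /* the hidden flag at the heart of this bug */
    uint8_t  padding;
};
```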

Next up, I needed to see how KVM and QEMU dealt with these “unusable” segments. So long as each register’s “unusable” flag is preserved across the migration, the complete guest state should be recoverable afterwards.

Interestingly, it seems that QEMU does not track the “unusable” flag for each segment. The two functions (set_seg and get_seg) responsible for translating between KVM’s and QEMU’s representations of these segment registers would throw away an “unusable” flag when retrieving it from the kernel, and always clear it when loading the register back into the kernel. How could this ever have worked correctly?

This was finally answered when I looked at the kernel versions involved:

  • On the RHEL 6 kernel, when retrieving a guest’s segment registers the kernel would automatically clear the flags for a segment if the segment was marked “unusable”. When loading the guest’s segment registers again, it would treat a segment with a cleared set of flags as if it were “unusable”, even if QEMU had not said so.
  • On the RHEL 7 kernel, however, the kernel would not touch the flags at all when they were retrieved. On loading the segment registers again, it would treat a segment as “unusable” only if QEMU said so, or if one specific flag — the “segment is present” flag — were not set.

Although these kernels have different behaviour, they both work correctly if you stick to one kernel in a guest migration. But if you try to migrate a guest from a RHEL 7 hypervisor to a RHEL 6 hypervisor, the flags aren’t cleared and the new kernel doesn’t know the register should be automatically marked unusable. The result is that the guest tries to use an invalid segment register, so the hardware throws an “invalid guest state” error. Bingo — that’s exactly what we’d seen!

The fix

The fix turned out to be quite simple: have QEMU clear the flags of any segment registers that are marked unusable, and have it ensure that segment registers whose “present” flags are cleared are also marked unusable when loading them into the kernel:

diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index 80d1a7e..588df76 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -997,7 +997,7 @@ static void set_seg(struct kvm_segment *lhs, const SegmentCache *rhs)
     lhs->l = (flags >> DESC_L_SHIFT) & 1;
     lhs->g = (flags & DESC_G_MASK) != 0;
     lhs->avl = (flags & DESC_AVL_MASK) != 0;
-    lhs->unusable = 0;
+    lhs->unusable = !lhs->present;
     lhs->padding = 0;
 }
 
@@ -1006,14 +1006,18 @@ static void get_seg(SegmentCache *lhs, const struct kvm_segment *rhs)
     lhs->selector = rhs->selector;
     lhs->base = rhs->base;
     lhs->limit = rhs->limit;
-    lhs->flags = (rhs->type << DESC_TYPE_SHIFT) |
-                 (rhs->present * DESC_P_MASK) |
-                 (rhs->dpl << DESC_DPL_SHIFT) |
-                 (rhs->db << DESC_B_SHIFT) |
-                 (rhs->s * DESC_S_MASK) |
-                 (rhs->l << DESC_L_SHIFT) |
-                 (rhs->g * DESC_G_MASK) |
-                 (rhs->avl * DESC_AVL_MASK);
+    if (rhs->unusable) {
+        lhs->flags = 0;
+    } else {
+        lhs->flags = (rhs->type << DESC_TYPE_SHIFT) |
+                     (rhs->present * DESC_P_MASK) |
+                     (rhs->dpl << DESC_DPL_SHIFT) |
+                     (rhs->db << DESC_B_SHIFT) |
+                     (rhs->s * DESC_S_MASK) |
+                     (rhs->l << DESC_L_SHIFT) |
+                     (rhs->g * DESC_G_MASK) |
+                     (rhs->avl * DESC_AVL_MASK);
+    }
 }
 
 static void kvm_getput_reg(__u64 *kvm_reg, target_ulong *qemu_reg, int set)

With both of these changes in place, a migration would work even if we were migrating to or from an “old” version of QEMU without the fix. Moreover, it would mean we could get the fix rolled out without having to change the kernels involved.

At present we are still testing these changes; however, we look forward to working with the upstream QEMU developers to have them added to the mainline version of QEMU.

In writing this blog post I’ve skipped over many of the dead ends I went down in solving this problem. While the fix ended up reasonably straightforward (well, as much as can be expected when you’re dealing with kernels and hypervisors), it was a fun and educational journey getting there.

Got a question or comment? We’d love to hear from you!

2 Comments

  • Yuri says:

    This basically doesn’t work here. The guest is indeed now unsuspended, but it fails to continue operation. Connecting to console / graph display shows nothing.

    I hope I understood this properly and the patch is needed for EL7 only.

    • Michael Chapman says:

      Hi Yuri,

      I’m not sure about the problem you’re hitting — we haven’t seen anything like that in any of our tests. You might want to see whether saving and restoring a guest from disk works. That uses much of the same code that’s used in migration, but it’s a fair bit simpler (no network connections, no need to block-migrate non-shared disks, etc.)

      This patch isn’t exactly EL7-specific — if you were migrating a guest from EL7 to EL7 you wouldn’t actually hit this bug. It specifically fixes a bug that only arises when migrating a guest from a new kernel version to an older kernel version (e.g. from EL7 to EL6). However, the patch is written in such a way that the bug will be avoided even if only one of the source or target QEMU binaries contains the patch — ideally you’d have both patched, of course, but that’s not always possible.
