A most unusual temporal issue

Most of us have suffered time-related woes, dealing with servers that just can’t keep their clock straight even with NTP. Well we’ve just come across a new one, and NTP isn’t going to save the day – what happens when you have multiple misbehaving clocks?

We first noticed the problem when reloading the firewall on the host, a KVM server that had been virtualised from hardware. We use filtergen for ruleset generation, and one of its postreload scripts restarts fail2ban, an anti-bruteforce tool. One part of fail2ban’s initscript makes a call to sleep 1 to give things time to settle.

We noticed that the invocation of sleep was taking more than a second, much more. Something wasn’t right and the automatic 15sec rollback was kicking in, having not received explicit confirmation from the operator. We tested some more, this wasn’t a filtergen or fail2ban problem.

[email protected]:~# time sleep 1

real	1m25.376s      ಠ_ಠ
user	0m0.000s
sys	0m0.001s

This was reminding us of the leap second issue that we dealt with last year, in which timers weren’t being handled at the correct time, causing some big problems. Could the system’s clock be jumping all over the place and causing sleep 1 to last far too long?

Running date in a tight loop from the shell (about 1000 times per second) we could see time advancing normally, for the most part. There was a “blip” every several invocations showing the wrong time, and it was pretty regular.

Sat Feb 30 16:37:04 EST 2013
Sat Feb 30 16:37:04 EST 2013
Sat Feb 30 16:35:39 EST 2013    <-- anomaly
Sat Feb 30 16:37:04 EST 2013
Sat Feb 30 16:37:04 EST 2013
Sat Feb 30 16:37:04 EST 2013
Sat Feb 30 16:37:04 EST 2013
Sat Feb 30 16:35:39 EST 2013    <-- anomaly
Sat Feb 30 16:37:04 EST 2013

This is when we considered that we might be dealing with multiple clocks. In the case of sleep 1, the operating system sets an interrupt to wake the program after the specified period has elapsed. The interrupt is a hardware feature, and in a multicore system there’s no guarantee that the interrupt will be handled by the same CPU that set it initially.

Suspecting that a CPU clock was out of whack, we dived into the toolbox and pulled out taskset, a utility for pinning processes to a specific CPU. It’s basic but would do the job for diagnosis. The VM has been assigned six CPU cores, numbered 0 through to 5. Pinning the process to each CPU in turn, we asked for the current timestamp.

[email protected]:~# for i in `seq 0 5`; do echo -n "$i: "; taskset -c $i date; done

0: Sat Feb 30 16:48:28 EST 2013
1: Sat Feb 30 16:49:52 EST 2013
2: Sat Feb 30 16:49:52 EST 2013
3: Sat Feb 30 16:49:52 EST 2013
4: Sat Feb 30 16:49:52 EST 2013
5: Sat Feb 30 16:49:52 EST 2013

Aha! This explains the symptoms perfectly – the clock on the first CPU is running slow by about 85 seconds. It’s conceivable that the interrupt is set on one of the other CPUs, while the timer interrupt is eventually serviced by CPU 0, adding 85sec to the process from start to finish.

We don’t know the cause, but we’ve got a good handle on the problem now. A reboot will probably fix it, but that will need to be scheduled with the customer for a later time.

Meanwhile, there’s one more test that we’d like to do: perhaps we can “resync” the CPUs so they agree on the same time. Sufficiently modern kernels support CPU hotplugging through a sysfs interface, and Linux will actually let you disable and enable CPUs on the fly, whether or not you have CPUs that are physically hotpluggable! (Probably not)

By echoing a 0 to the control file you can direct the OS to stop scheduling processes on the CPU and deactivate it. Reactivation is effected by echoing a 1 to the file. You can’t take CPU 0 offline as the system and architecture has some special dependencies on it, but everything else is fair game.

for i in 1 2 3 4 5 ; do echo 0 > "/sys/devices/system/cpu/cpu${i}/online" ; done

filtergen safereload

for i in 1 2 3 4 5 ; do echo 1 > "/sys/devices/system/cpu/cpu${i}/online" ; done

We disabled all the CPUs, reloaded the firewall, then enabled all the CPUs again; the firewall reload didn’t hang! As further confirmation, testing showed that date and sleep 1 behaved as expected when all CPUs bar one were disabled.

As for the cause of this problem, we’re stumped. We could attempt to go diving to find out more, but this server is a special-case production system, so there’s only so much we can do without disrupting normal operation. At the very least, we’ve got a satisfactory workaround for our specific issue until the machine can be rebooted and a kernel update applied.