It’s been an exciting weekend for those running Linux systems, with many popular websites taken offline as a result of a kernel bug relating to the addition of a leap second at the end of June. This follows hot on the heels of an outage that saw many Instagram users tragically unable to upload pictures of their breakfast.
There’s no shortage of stories in the media about the problem if you care to look around, but we were rather disappointed by the lack of coverage with much substance. We’ve also not yet found an explanation that can be understood by someone who isn’t intimately familiar with the internals of the Linux kernel. So we put on our detective hat and started looking.
Update: if you are familiar with the innards of Linux, LWN.net has published an explanatory writeup with a bit more clarity on the technical details.
In case you’re unfamiliar with the problem, leap seconds are periodically added to our atomic clocks (what we measure time with) to ensure they correspond to the solar time that we observe as humans. This is necessary because the earth’s rotation is slowing down, ever so slightly, and we correct our timekeepers to match.
Exactly how computers handle this varies a little, but generally speaking, the Network Time Protocol (NTP) informs computers of an impending leap second and then it’s up to the operating system’s kernel to deal with it. When handled incorrectly, applications on the server can misbehave. This is what we saw over the weekend, and according to reports it affected a number of prominent companies including Reddit, Mozilla, Gawker Media, Foursquare and LinkedIn.
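As an aside, you can ask the NTP daemon whether it has flagged an impending leap second. A rough check, assuming the classic ntpd’s `ntpq` utility is installed:

```shell
# Query ntpd's leap indicator: "leap=01" means a second will be inserted
# at the end of the current UTC day; "leap=00" means nothing is pending.
if command -v ntpq >/dev/null 2>&1; then
    leap_status="$(ntpq -c 'rv 0 leap' 2>&1)"
else
    leap_status="ntpq not installed; cannot query leap status"
fi
echo "$leap_status"
```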
No one seemed to have any further details on the problem though, and we’d seen some symptoms that we suspected were related, so we went digging. Based on what we’ve read, it looks like there are possibly two problems here, though they could be two facets of the same problem.
One, which Red Hat’s engineers are also researching, appears to be a possible livelock in the kernel that occurs when the kernel is informed of an impending leap second by NTP. The second is a mismatch between two kernel timekeeping facilities, following the addition of a leap second.
After our monitoring systems picked up some odd behaviour over the weekend, we’re fairly sure we saw symptoms of the latter. It’s not entirely clear at this stage, but our understanding is as follows:
- The kernel contains multiple clocks for different purposes; for the sake of explanation, assume that we have two clocks
- Following a leap second, only one of these clocks is adjusted. One clock is now running fast by 1sec (the uncorrected clock), while the other is on-time (the corrected clock)
- Some applications set timers, saying “please wake me up after a certain number of seconds have passed”. The duration of the timer (say, “five seconds from now”) is converted to a specific time on the clock (e.g. “10:00am + 5sec”). The corrected clock is used to calculate when the timer will expire
- The kernel goes around checking all the timers when woken by an interrupt, to see if any of them have expired yet. The uncorrected clock is used to check the timers, meaning that the timers expire 1sec earlier than they should
- Some timers are set with a duration of less than 1sec, like 0.5sec. This means that as soon as they’re checked, they’re considered to have expired. This is unexpected, but shouldn’t be a huge problem on its own
- It’s normal for some applications to set these timers repeatedly. They’ll ask to be woken up, check if they have any work to do, then go back to sleep for less than a second
- The catch is that setting a timer and checking it again isn’t free; it takes a little bit of CPU power. An application whose timers all expire immediately will loop setting and checking them as fast as the CPU can manage, leaving no computing power for anything else
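The pattern described above is easy to sketch in shell. A `sleep 0.5` ends up in the kernel’s nanosleep path, arming exactly the sort of sub-second timer that was misfiring (fractional sleep is a GNU coreutils extension, which is fine on Linux):

```shell
# Toy version of the wake/check/sleep pattern. Each sleep arms a kernel
# timer 0.5s in the future; under the leap second bug those timers expired
# immediately, turning this loop into a busy spin that ate a whole CPU.
start=$(date +%s)
for i in 1 2 3 4; do
    : # ... check for pending work here ...
    sleep 0.5
done
elapsed=$(( $(date +%s) - start ))
echo "4 iterations took ~${elapsed}s (about 2s on a healthy kernel)"
```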
Though serious, our problem has a straightforward solution: once we resync the uncorrected clock to match reality, the timers stop expiring immediately and the load subsides.
Using the low-level system call `settimeofday` is enough to nudge the uncorrected clock back into sync, and the easiest way to do that is with the system’s `date` utility. We have a lot of systems that might need fixing though; at least several hundred. It’d take forever to log in to every single one and run the command!
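The one-liner that did the rounds at the time simply reads the current time and writes it straight back, which reaches `settimeofday` under the hood. Setting the clock needs root, so this sketch only prints the command when run unprivileged:

```shell
# Read the wall-clock time and immediately set the clock to it; the write
# path through settimeofday() clears the kernel's stale leap second state.
now="$(LC_ALL=C date)"
if [ "$(id -u)" -eq 0 ]; then
    date -s "$now"
else
    echo "not root; would run: date -s \"$now\""
fi
```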
Enter Puppet. We added a small job to our manifests to run the command, then waited for Puppet to do its periodic run on all of our managed servers. The number of unmanaged servers is quite small, so we could manually log in to any of those and mop up as needed.
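We won’t reproduce our exact manifest change here, but the idea fits in a one-shot `exec` resource along these lines (the resource name and paths are illustrative, not our production code):

```puppet
# Illustrative one-shot job: re-sync the kernel clock on every managed host.
exec { 'clear-leap-second-state':
  command => '/bin/sh -c \'date -s "$(date)"\'',
  path    => ['/bin', '/usr/bin', '/sbin'],
}
```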
Puppet can be a frustrating beast at times, but it works fantastically when you have a mountain of servers to maintain. The time saved by having such a tool is tremendous in a case like this – once the issue was identified, it was about half an hour’s work to update our Puppet manifests and roll them out to production.
Pretty much everyone affected by the bug will have cleared the symptoms by correcting the kernel’s clock or rebooting – there’s no conclusive kernel patch yet because the problem is still being discussed on the kernel developers’ mailing list. We expect it’ll probably be fully thrashed out and explained sometime this week, so we look forward to reading all about it.
Update: that was quick! LWN.net has published an explanatory writeup. No mention of the livelock issue though.