Leap seconds and Linux

On June 30, 2012, a leap second was inserted into UTC, which caused a fair amount of difficulty for companies across the Internet. An explanation of leap seconds, the problems they expose in the Linux kernel, and the available solutions follows.

What are leap seconds?

A leap second is a one-second adjustment applied to UTC to keep it within 0.9 seconds of UT1 (mean solar time). It can be positive or negative, and is implemented by adding 23:59:60 or skipping 23:59:59 on the last day of a given month (usually June 30 or December 31). Since the UTC standard was established in 1972, 25 leap seconds have been scheduled, and all of them have been positive.

Because they depend on climatic and geologic events that affect the Earth's moment of inertia (mostly tidal friction), leap seconds are irregularly spaced and unpredictable. The International Earth Rotation and Reference Systems Service (IERS) is responsible for deciding when leap seconds will occur, and announces them about six months in advance. The most recent leap second was inserted on June 30, 2012 at 23:59:60 UTC. It has been announced that there will not be a leap second on December 31, 2012.

What problems do leap seconds cause?

Leap seconds are problematic in computing for a number of reasons. For example, computing the elapsed seconds between two UTC dates in the past requires a table of leap seconds, which must be updated whenever a new one is announced. It is also impossible to calculate exact time intervals for UTC dates farther in the future than the leap second announcement horizon. More practical problems arise in distributed systems that depend on accurate timestamps for series data.
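As a minimal sketch of the table lookup described above, the following Python counts the leap seconds crossed between two UTC datetimes and adds them to the ordinary calendar difference. The names LEAP_SECOND_BOUNDARIES and elapsed_seconds are made up for this illustration, and the table is deliberately partial: it holds only the three most recent insertions, so it is only valid for intervals after 2005.

```python
from datetime import datetime, timezone

# UTC midnights that immediately follow an inserted leap second.
# Partial table (assumption for this sketch): a real one must list every
# leap second since 1972 and be updated whenever the IERS announces one.
LEAP_SECOND_BOUNDARIES = [
    datetime(2006, 1, 1, tzinfo=timezone.utc),  # after 2005-12-31 23:59:60
    datetime(2009, 1, 1, tzinfo=timezone.utc),  # after 2008-12-31 23:59:60
    datetime(2012, 7, 1, tzinfo=timezone.utc),  # after 2012-06-30 23:59:60
]


def elapsed_seconds(start, end):
    """Seconds that really elapsed between two UTC datetimes (start <= end)."""
    naive = (end - start).total_seconds()
    # Every positive leap second crossed adds one real second that plain
    # calendar arithmetic cannot see.
    crossed = sum(1 for b in LEAP_SECOND_BOUNDARIES if start < b <= end)
    return naive + crossed


before = datetime(2012, 6, 30, 23, 59, 59, tzinfo=timezone.utc)
after = datetime(2012, 7, 1, 0, 0, 0, tzinfo=timezone.utc)
print(elapsed_seconds(before, after))  # 2.0
```

Plain subtraction reports one second for this interval; the extra second is 23:59:60.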

In particular, there have been problems with the implementation of leap second handling in the Linux kernel itself. The leap second of June 30, 2012 caused outages at reddit (Apache Cassandra), Mozilla (Hadoop), Qantas Airlines, and other sites. Generally speaking, leap second problems on Linux hosts are characterized by high CPU usage in certain processes immediately after a leap second is applied to the local clock.

In one particular case, tgtd (scsi-target-utils) on CentOS 6 hosts began generating an average of 14,000 log messages per second:

Jun 30 23:59:59 host kernel: Clock: inserting leap second 23:59:60 UTC
Jun 30 23:59:59 host tgtd: work_timer_evt_handler(89) failed to read from timerfd, Resource temporarily unavailable
Jun 30 23:59:59 host tgtd: work_timer_evt_handler(89) failed to read from timerfd, Resource temporarily unavailable
Jun 30 23:59:59 host tgtd: work_timer_evt_handler(89) failed to read from timerfd, Resource temporarily unavailable

This caused the root file systems of approximately 600 hosts to fill up before the issue was mitigated.
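To give a feel for how quickly that log rate eats disk space, here is a rough back-of-the-envelope sketch in Python. The line length comes from one of the tgtd messages quoted above and the rate is the reported average. The 10 GiB of free space is purely a hypothetical figure, and the calculation assumes the syslog daemon writes every message unmodified, with no rate limiting or duplicate suppression.

```python
# One of the tgtd messages quoted above, plus its trailing newline.
SAMPLE_LINE = (
    "Jun 30 23:59:59 host tgtd: work_timer_evt_handler(89) failed to read "
    "from timerfd, Resource temporarily unavailable\n"
)
MESSAGES_PER_SECOND = 14_000  # average rate reported above


def seconds_to_fill(free_bytes, line=SAMPLE_LINE, rate=MESSAGES_PER_SECOND):
    """Seconds until free_bytes is used up at a steady logging rate."""
    bytes_per_second = len(line.encode("utf-8")) * rate
    return free_bytes / bytes_per_second


# Hypothetical host with 10 GiB free on the root file system.
free = 10 * 1024**3
print(f"{seconds_to_fill(free) / 60:.0f} minutes until full")
```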

Why do these problems occur?

The last leap second exposed a kernel bug that can affect any threaded application. It is most apparent with applications that use sub-second CLOCK_REALTIME timeouts in a loop, usually connected with futexes.

On July 3, 2007, commit 746976a301ac9c9aa10d7d42454f8d6cdad8ff2b (2.6.22) removed the clock_was_set() call from second_overflow() to prevent a deadlock. As a result of this patch, the following occurs when a leap second is added to UTC:

  • The leap second occurs and CLOCK_REALTIME is set back by one second
  • clock_was_set() is not called by second_overflow(), so the hrtimer base.offset value for CLOCK_REALTIME is not updated
  • The hrtimer code’s sense of wall time for CLOCK_REALTIME is now one second ahead of the timekeeping core’s
  • At interrupt time, hrtimer code expires all CLOCK_REALTIME timers that are set for ($interrupt_time + 1 second) and before

At this point, all TIMER_ABSTIME CLOCK_REALTIME timers expire one second early. Even worse, all sub-second TIMER_ABSTIME CLOCK_REALTIME timers will return immediately. Any application that uses such timer calls in a loop will experience load spikes. This situation persists until clock_was_set() is called, for example via settimeofday().
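The sequence above can be illustrated with a toy model in Python. This is not kernel code: the class TimeModel, its attributes, and the helper expires_at_once are invented for this sketch. It only mimics the idea that the timekeeping core and the hrtimer base each hold their own offset from monotonic time to wall time.

```python
class TimeModel:
    """Toy model: timekeeping core plus hrtimer's cached CLOCK_REALTIME offset."""

    def __init__(self, monotonic, wall):
        self.monotonic = monotonic               # never steps
        self.wall_offset = wall - monotonic      # timekeeping core's offset
        self.hrtimer_offset = self.wall_offset   # hrtimer's cached copy

    def wall_time(self):
        return self.monotonic + self.wall_offset

    def insert_leap_second(self, notify_hrtimer):
        self.wall_offset -= 1.0                  # wall clock set back one second
        if notify_hrtimer:
            self.clock_was_set()

    def clock_was_set(self):
        self.hrtimer_offset = self.wall_offset   # resynchronise the cached copy

    def timer_expired(self, abs_wall_deadline):
        # The timer code compares deadlines against its own idea of wall time.
        return self.monotonic + self.hrtimer_offset >= abs_wall_deadline


def expires_at_once(model, timeout):
    """True if an absolute timer `timeout` seconds ahead fires immediately."""
    return model.timer_expired(model.wall_time() + timeout)


buggy = TimeModel(monotonic=1000.0, wall=5000.0)
buggy.insert_leap_second(notify_hrtimer=False)
print(expires_at_once(buggy, 0.5))  # True: sub-second timeout returns at once
print(expires_at_once(buggy, 2.0))  # False: still waits, but one second short
buggy.clock_was_set()
print(expires_at_once(buggy, 0.5))  # False: behaviour is back to normal
```

A loop that repeatedly waits on a 0.5 second timeout never sleeps while the two offsets disagree, which matches the CPU spikes described above.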

On July 13, 2012, Linus merged several commits in d55e5bd0201a2af0182687882a92c5f95dbccc12 (3.5-rc7) which, beyond simply providing clock_was_set_delayed() in hrtimer to resolve the problem, included other rework of hrtimer and timekeeping.

Affected Kernels

This problem has existed since kernel 2.6.22. All kernels from 2.6.22 up to the fix in 3.5-rc7 are presumably affected. All RHEL 5.x kernels already include a patch to avoid this bug. Unfortunately, Red Hat either neglected to patch, or mispatched, RHEL 6 for the same issue. All earlier RHEL 6 kernels are vulnerable to this problem; patches are available in the following updates:

  • RHEL 6.3: kernel-2.6.32-279.5.2
  • RHEL 6.2 Extended Updates: kernel-2.6.32-220.25.1.el6
  • RHEL 6.1 Extended Updates: kernel-2.6.32-131.30.2

In Debian and its derivatives, this issue is patched in the following kernel updates:

  • Debian 6.x (squeeze): linux-image-2.6.32-46
  • Debian 7.x (wheezy): linux-image-3.2.29-1

Resolution

The most prudent fix is to apply a patched kernel package to the affected host, or to upgrade to upstream 3.5-rc7 or later. If a given host cannot be patched, it is possible to call settimeofday() manually after a leap second is applied by issuing either of the following commands:

date -s "`LC_ALL=C date`"
date `date +'%m%d%H%M%C%y.%S'`

Doing so will resolve any present issues on the host in question.
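The same idea, setting the clock to the value it already has, can be sketched from Python with the standard time module. This assumes that setting CLOCK_REALTIME through clock_settime() reaches clock_was_set() on the affected kernels in the same way the date commands above do. The function name and its injectable get_time and set_time parameters are inventions of this example, added so the logic can be exercised without root.

```python
import time


def reset_realtime_clock(get_time=time.clock_gettime,
                         set_time=time.clock_settime):
    """Set CLOCK_REALTIME to its current value and return that value.

    The clock barely moves; the point is the side effect of setting it,
    which is assumed to make the kernel resynchronise its timers.
    """
    now = get_time(time.CLOCK_REALTIME)
    set_time(time.CLOCK_REALTIME, now)
    return now


# On an affected host this must run as root (it needs CAP_SYS_TIME):
# reset_realtime_clock()
```

Without sufficient privileges the real call raises PermissionError, so the date commands above remain the simpler option for a one-off fix.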

Another interesting approach to solving this problem was devised by Google, which calls it “Leap Smear”. Because Google runs its own stratum 2 NTP servers, it patched NTP to not issue the LI (leap indicator) and instead “smear” the leap second by modulating a ‘lie’ over a time window w before midnight:

lie(t) = (1.0 - cos(pi * t / w)) / 2.0
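The formula is easy to evaluate directly. The Python sketch below is a straight transcription of it; the 20-hour window is a hypothetical value chosen only to produce sample numbers, not Google's actual setting.

```python
import math


def lie(t, w):
    """Fraction of the leap second applied t seconds into a window of w seconds."""
    if not 0.0 <= t <= w:
        raise ValueError("t must lie within the smear window [0, w]")
    return (1.0 - math.cos(math.pi * t / w)) / 2.0


w = 20 * 3600.0  # hypothetical 20-hour smear window, in seconds
for hours in (0, 5, 10, 15, 20):
    print(hours, round(lie(hours * 3600.0, w), 3))
```

The offset starts at 0, reaches half a second at the midpoint of the window, and ends at one full second. It changes slowly near both ends and fastest in the middle, so clients never see an abrupt step.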

You can read more about the leap smear technique at their blog.