Linux KVM: Openvswitch on Debian Wheezy

Among a great many other things, openvswitch is an alternative to bridge-utils for managing your virtual networking stacks for KVM. It supports VLANs, LACP, QoS, sFlow, and so forth. Listed below are the steps required to get openvswitch running on Debian 7.0 (wheezy).

This article is written with the presumption that you are running a source-installed kernel (3.6.6 with the openvswitch module in this case), and want to use the latest openvswitch from git.

Install prerequisites

Apply any available updates, get all the build dependencies for openvswitch, and install module-assistant.

apt-get update && apt-get dist-upgrade
apt-get install build-essential
apt-get build-dep openvswitch
apt-get install module-assistant

Prep your environment

The kernel bridge module that bridge-utils manages conflicts with the brcompat module in openvswitch. Let’s remove bridge-utils and at the same time stop libvirt and KVM for a bit.

apt-get remove --purge bridge-utils
/etc/init.d/libvirt-bin stop
/etc/init.d/qemu-kvm stop

Build openvswitch

Clone the openvswitch git repo and build Debian packages from it.

git clone git://openvswitch.org/openvswitch
cd openvswitch
dpkg-buildpackage -b

Install the packages you just built.

cd ../
dpkg -i openvswitch-switch_1.9.90-1_amd64.deb openvswitch-common_1.9.90-1_amd64.deb \
openvswitch-brcompat_1.9.90-1_amd64.deb openvswitch-datapath-source_1.9.90-1_all.deb \
openvswitch-controller_1.9.90-1_amd64.deb openvswitch-pki_1.9.90-1_all.deb
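
The 1.9.90-1 in those file names is simply what the git checkout produced at the time of writing, so a newer checkout will build packages with a different version string. Here is a minimal sketch that picks up whatever was built instead of hard-coding the version. The helper name ovs_debs is hypothetical (it is not part of dpkg or openvswitch), and it assumes the packages landed in the parent directory of the source tree.

```shell
# Hypothetical helper: print the built .deb for each package installed in
# this article, whatever version dpkg-buildpackage stamped on it.
ovs_debs() {
    dir=${1:-..}
    for pkg in switch common brcompat datapath-source controller pki; do
        for deb in "$dir"/openvswitch-"$pkg"_*.deb; do
            # An unmatched glob stays literal, so only print real files.
            if [ -e "$deb" ]; then
                printf '%s\n' "$deb"
            fi
        done
    done
}

# Usage (as root), from inside the openvswitch source tree:
#   dpkg -i $(ovs_debs ..)
```

The unquoted command substitution in the usage line relies on word splitting, so it will not cope with paths that contain spaces.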

Build openvswitch-datapath for your running kernel.

module-assistant auto-install openvswitch-datapath

Configure brcompat to load on startup.

sed -i 's/# BRCOMPAT=no/BRCOMPAT=yes/' /etc/default/openvswitch-switch

Verify your configuration

At this point you should reboot and verify that the proper modules are loaded, the service starts normally, and the status output is correct.

~$ lsmod | grep brcompat
brcompat               12982  0 
openvswitch            73431  1 brcompat

~$ /etc/init.d/openvswitch-switch restart
[ ok ] Killing ovs-brcompatd (5439).
[ ok ] Killing ovs-vswitchd (5414).
[ ok ] Killing ovsdb-server (5363).
[ ok ] Starting ovsdb-server.
[ ok ] Configuring Open vSwitch system IDs.
[ ok ] Starting ovs-vswitchd.
[ ok ] Starting ovs-brcompatd.

~$ /etc/init.d/openvswitch-switch status
ovsdb-server is running with pid 6281
ovs-vswitchd is running with pid 6332
ovs-brcompatd is running with pid 6357
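
If you have more than a couple of hosts to check, the module test above can be scripted. This is only a sketch: ovs_modules_loaded is a hypothetical helper that reads lsmod output on standard input, so it can be run against a live host or a saved capture.

```shell
# Hypothetical helper: read `lsmod` output on stdin and succeed only if
# both the openvswitch and brcompat modules are listed.
ovs_modules_loaded() {
    awk '$1 == "openvswitch" { ovs = 1 }
         $1 == "brcompat"    { compat = 1 }
         END { exit !(ovs && compat) }'
}

# Usage on a live host:
#   lsmod | ovs_modules_loaded && echo "modules OK"
```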

And that’s it! You now have a working openvswitch installation upon which you can do all the usual things you did with bridge-utils, and so much more.
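
As a first thing to try, here is a rough sketch of creating a bridge natively with ovs-vsctl rather than through the brcompat layer. The bridge name br0 and interface eth1 are placeholders, and ovs_bridge_up is a hypothetical wrapper. Setting DRY_RUN=1 prints the commands instead of running them, which is worth doing first since attaching the interface that carries your SSH session will likely cut you off.

```shell
# Sketch only: the bridge and port names are placeholders. With DRY_RUN=1
# the commands are printed instead of executed.
ovs_bridge_up() {
    bridge=$1
    port=$2
    run=
    if [ -n "${DRY_RUN:-}" ]; then
        run=echo
    fi
    $run ovs-vsctl add-br "$bridge"            # create the virtual switch
    $run ovs-vsctl add-port "$bridge" "$port"  # attach an interface to it
    $run ovs-vsctl show                        # print the resulting layout
}

# Usage (as root):
#   DRY_RUN=1 ovs_bridge_up br0 eth1
#   ovs_bridge_up br0 eth1
```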

Leap seconds and Linux

On June 30, 2012 a leap second was inserted into UTC, which caused a fair amount of difficulty for companies across the Internet. What follows is an explanation of leap seconds, the problems with their handling in the Linux kernel, and the available solutions.

What are leap seconds?

A leap second is a one-second adjustment that is applied to UTC in order to prevent it from deviating more than 0.9 seconds from UT1 (mean solar time). It can be positive or negative, and is implemented by adding 23:59:60 or skipping 23:59:59 on the last day of a given month (usually June 30 or December 31). Since the UTC standard was established in 1972, 25 leap seconds have been scheduled, and all of them have been positive.

Since they are dependent on climatic and geologic events that affect the Earth’s moment of inertia (mostly tidal friction), leap seconds are irregularly spaced and unpredictable. The International Earth Rotation and Reference Systems Service (IERS) is responsible for deciding when leap seconds will occur, and announces them about six months in advance. The most recent leap second was inserted on June 30, 2012 at 23:59:60 UTC. It has been announced that there will not be a leap second on December 31, 2012.

What problems do leap seconds cause?

Leap seconds are problematic in computing for a number of reasons. As an example, computing the elapsed seconds between two UTC dates in the past requires a table of leap seconds, which must be updated whenever one is announced. It is also impossible to calculate accurate time intervals for UTC dates farther in the future than the interval of leap second announcements. There are also more practical problems in distributed systems that depend on accurate timestamping of series data.
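
To make the table requirement concrete, here is a small sketch of an elapsed-time calculation that accounts for leap seconds. It assumes GNU date. The helper name utc_elapsed and the LEAP_TABLE variable are hypothetical, and the table deliberately lists only the three most recent insertions, so it is not complete.

```shell
# Sketch assuming GNU date. Each entry is the UTC day that begins right
# after an inserted leap second. Only the three most recent insertions are
# listed, so this is NOT a complete table.
LEAP_TABLE="2006-01-01 2009-01-01 2012-07-01"

# Hypothetical helper: seconds actually elapsed between two UTC timestamps.
utc_elapsed() {
    start=$(date -u -d "$1" +%s)
    end=$(date -u -d "$2" +%s)
    leaps=0
    for day in $LEAP_TABLE; do
        t=$(date -u -d "$day 00:00:00" +%s)
        # Count a leap second if it was inserted between start and end.
        if [ "$t" -gt "$start" ] && [ "$t" -le "$end" ]; then
            leaps=$((leaps + 1))
        fi
    done
    echo $((end - start + leaps))
}

# Usage:
#   utc_elapsed "2012-06-30 23:59:59" "2012-07-01 00:00:01"
```

For the interval in the usage line, which spans the June 30, 2012 insertion, plain subtraction of the two epoch values gives 2 while the table-aware result is 3.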

In particular, there have been problems with the implementation of leap second handling in the Linux kernel itself. The leap second on June 30, 2012 caused outages at reddit (Apache Cassandra), Mozilla (Hadoop), Qantas Airlines, and other sites. Generally speaking, leap second problems on Linux hosts are characterized by high CPU usage of certain processes immediately after application of a leap second to the local clock.

In one particular case, tgtd (scsi-target-utils) on CentOS 6 hosts began generating an average of 14,000 log messages per second:

Jun 30 23:59:59 host kernel: Clock: inserting leap second 23:59:60 UTC
Jun 30 23:59:59 host tgtd: work_timer_evt_handler(89) failed to read from timerfd, Resource temporarily unavailable
Jun 30 23:59:59 host tgtd: work_timer_evt_handler(89) failed to read from timerfd, Resource temporarily unavailable
Jun 30 23:59:59 host tgtd: work_timer_evt_handler(89) failed to read from timerfd, Resource temporarily unavailable

This caused the root file system of approximately 600 hosts to become full before the issue was mitigated.

Why do these problems occur?

The last leap second exposed a kernel bug that can affect any threaded application. It is most apparent with applications that use sub-second CLOCK_REALTIME timeouts in a loop, usually connected with futexes.

On July 3, 2007, commit 746976a301ac9c9aa10d7d42454f8d6cdad8ff2b (2.6.22) removed the clock_was_set() call from second_overflow() to prevent a deadlock. Because of this patch, the following occurs when a leap second is added to UTC:

  • The leap second occurs and CLOCK_REALTIME is set back by one second
  • clock_was_set() is not called by second_overflow(), so the hrtimer base.offset value for CLOCK_REALTIME is not updated
  • The hrtimer code’s sense of CLOCK_REALTIME wall time is now one second ahead of the timekeeping core’s
  • At interrupt time, hrtimer code expires all CLOCK_REALTIME timers that are set for ($interrupt_time + 1 second) and before

At this point all TIMER_ABSTIME CLOCK_REALTIME timers expire one second early. Even worse, any TIMER_ABSTIME CLOCK_REALTIME timer set less than one second ahead returns immediately. Any application that uses such timer calls in a loop will experience load spikes. This situation persists until clock_was_set() is called, for example via settimeofday().
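
The arithmetic behind the early expiry is simple enough to model in a few lines. This is a toy illustration only, not kernel code: timer_expired is a hypothetical function and all values are made-up millisecond counts.

```shell
# Toy model, not kernel code. All times are in milliseconds.
#   $1 = true CLOCK_REALTIME "now"
#   $2 = absolute expiry time requested by the application
#   $3 = error in the hrtimer base offset (1000 right after the leap
#        second, 0 once clock_was_set() has run)
timer_expired() {
    hrtimer_now=$(( $1 + $3 ))   # what the hrtimer code believes "now" is
    if [ "$hrtimer_now" -ge "$2" ]; then
        echo yes
    else
        echo no
    fi
}

now=60000
timer_expired "$now" $((now + 500)) 1000   # stale offset: already expired
timer_expired "$now" $((now + 500)) 0      # offset corrected: still pending
```

With the stale offset, a loop that keeps asking to sleep until "now plus 500 ms" never actually sleeps, which matches the load spikes described above.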

On July 13, 2012, Linus merged several commits in d55e5bd0201a2af0182687882a92c5f95dbccc12 (3.5-rc7) which, beyond simply providing clock_was_set_delayed() in hrtimer to resolve the problem, included other rework of hrtimer and timekeeping.

Affected Kernels

This problem has existed since kernel 2.6.22. All kernels from 2.6.22 to 3.5-rc7 are presumably affected. All RHEL 5.x kernels already include a patch to avoid this bug. Unfortunately, Red Hat either neglected to patch, or mispatched, RHEL 6 for the same issue. RHEL 6 kernels are vulnerable to this problem until patched; fixes are available in the following updates:

  • RHEL 6.3: kernel-2.6.32-279.5.2
  • RHEL 6.2 Extended Updates: kernel-2.6.32-220.25.1.el6
  • RHEL 6.1 Extended Updates: kernel-2.6.32-131.30.2

In Debian and its derivatives this issue is patched in the following kernel updates:

  • Debian 6.x (squeeze): linux-image-2.6.32-46
  • Debian 7.x (wheezy): linux-image-3.2.29-1
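
For a quick first pass over a fleet, the upstream version range can be checked with a few lines of shell. Treat this strictly as a hint: upstream_kernel_affected is a hypothetical helper, it assumes GNU sort for the -V option, it only looks at the upstream version number, and it makes no attempt to tell the 3.5 release candidates apart. A distribution kernel it flags may already carry one of the backported fixes listed above.

```shell
# Hypothetical helper, assuming GNU sort (-V). Succeeds if the UPSTREAM
# part of the given version is at least 2.6.22 and below 3.5.
# Distribution kernels carry backported fixes, so this is only a hint.
upstream_kernel_affected() {
    ver=${1%%-*}    # drop any "-279.5.2.el6" or "-rc7" style suffix
    lowest=$(printf '%s\n' 2.6.22 "$ver" 3.5 | sort -V | sed -n 1p)
    highest=$(printf '%s\n' 2.6.22 "$ver" 3.5 | sort -V | sed -n 3p)
    [ "$lowest" = 2.6.22 ] && [ "$highest" = 3.5 ] && [ "$ver" != 3.5 ]
}

# Usage:
#   upstream_kernel_affected "$(uname -r)" && echo "check the package version"
```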

Resolution

Quite obviously, the most prudent fix is to apply a patched kernel package to the affected host, or upgrade to an upstream kernel newer than 3.5-rc7. If a given host cannot be patched, it is possible to manually call settimeofday() after a leap second is applied by issuing either of the following:

date -s "`LC_ALL=C date`"
date `date +'%m%d%H%M%C%y.%S'`

Doing so will resolve any present issues on the host in question.
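
When many hosts are involved, it can help to apply the workaround only where the kernel actually logged a leap second insertion. The following is a sketch under that assumption: leap_second_logged is a hypothetical helper that matches the kernel message shown in the tgtd log excerpt earlier.

```shell
# Hypothetical helper: read kernel log text on stdin and succeed if it
# contains the leap second insertion message.
leap_second_logged() {
    grep -q 'Clock: inserting leap second'
}

# Usage (as root) on an unpatched host. Setting the clock to its current
# value is what triggers settimeofday():
#   dmesg | leap_second_logged && date -s "$(LC_ALL=C date)"
```

The message stays in the kernel ring buffer for some time, so this is meant to be run once after the event rather than on a schedule.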

Another interesting approach to solving this problem was devised by Google, which they call “Leap Smear”. Since Google runs its own stratum 2 NTP servers, they patched NTP to not issue the LI (leap indicator) and instead “smear” the leap second by modulating a ‘lie’ (an offset applied to the time they serve) over a time window w before midnight:

lie(t) = (1.0 - cos(pi * t / w)) / 2.0
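
To get a feel for the shape of that curve, the formula can be evaluated directly with awk. This is just the formula above wrapped in a hypothetical shell function named lie; the 1000-second window in the usage line is an arbitrary example, not a value taken from Google.

```shell
# lie(t) for t seconds into a smear window of w seconds. pi is derived
# with atan2 because awk has no built-in constant for it.
lie() {
    awk -v t="$1" -v w="$2" 'BEGIN {
        pi = atan2(0, -1)
        printf "%.6f\n", (1.0 - cos(pi * t / w)) / 2.0
    }'
}

# Usage, with an arbitrary 1000-second window:
#   lie 0 1000; lie 500 1000; lie 1000 1000
```

The value starts at 0 when the window opens, passes 0.5 at the midpoint, and reaches 1 (the full leap second) at the end of the window, changing most slowly at both ends.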

You can read more about the leap smear technique at their blog.