Debian 7 on the Samsung Series 9 Ultrabook

I recently purchased an upgrade to my aging laptop: a Samsung Series 9 NP900X3C-A01US 13″ Ultrabook. I won’t go too much into aesthetics except to say that this laptop is everything the reviews say it is. It’s light, sturdy, stylish, fast, and sips power. It is, almost down to the PCB, Samsung’s answer to the 13″ MacBook Air. I am happier with it so far than I have been with any laptop I’ve owned… and I’ve owned quite a few.

At any rate, throwing Debian 7.0 (wheezy) on this laptop was trivial and almost everything “just works”. There are a few things I had to tweak for power saving, function keys, and so on, and I wanted to outline them here. If you own one of these machines, implement the items below to get the most out of it.

Use the latest kernel
I am running 3.7.4 from kernel.org on this ultrabook. Always use the latest available stable kernel on laptops. This is doubly true on very new models like the Series 9 if you want all the hardware to be well supported. Some hardware won’t work under the default wheezy kernel on this model, and there are continual improvements in power management landing in the kernel. One example of something that didn’t work properly under the default wheezy kernel was detecting when the lid was closed.

Use tmpfs
Debian doesn’t yet default to putting some things on tmpfs that should be there. In /etc/default/tmpfs, set RAMTMP=yes to mount /tmp on tmpfs. I also like to add an entry to /etc/fstab to mount /home/someuser/.cache/google-chrome on tmpfs as well. Both of these changes speed up access to temporary/cache data and help save power.

tmpfs /home/someuser/.cache/google-chrome tmpfs mode=1777,noatime 0 0
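
For reference, the corresponding line in /etc/default/tmpfs is simply:

RAMTMP=yes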

Enable discard support
This laptop comes with a 128GB SanDisk SSD U100. If your SSD supports TRIM (and this one does) and you are using ext4 (and you should be!) you can enable TRIM support in the file system by adding ‘discard’ to all the mount points in /etc/fstab.

/dev/mapper/lvm-root / ext4 discard,errors=remount-ro 0 1

If, as in the example above, you are also using LVM then you should configure it to issue discards to the underlying physical volume. To do so, set “issue_discards=1” in /etc/lvm/lvm.conf.
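
Before relying on discard it is worth confirming that the drive actually advertises TRIM. Assuming the SSD shows up as /dev/sda, something like this will tell you:

hdparm -I /dev/sda | grep -i trim

For reference, the issue_discards option sits in the devices section of lvm.conf.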

Use NOOP scheduler
Schedulers are getting smarter these days, so this might not be necessary any more. I am still in the habit of setting noop as the scheduler for non-rotational storage devices, though. I like to add a udev rule that sets noop if the device advertises itself as non-rotational. You could just set the default elevator to noop, but that would also affect, say, a USB SATA disk that you might plug in some day.

cat > /etc/udev/rules.d/60-schedulers.rules << EOF
# set noop scheduler for non-rotating disks
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="0", ATTR{queue/scheduler}="noop"
EOF
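
After reloading the udev rules (or rebooting), you can confirm which scheduler is active for a given disk; the one in brackets is in use and the output should look something like this:

cat /sys/block/sda/queue/scheduler
[noop] deadline cfq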

i915 power saving
The i915 kernel module for the Intel HD 4000 graphics chipset supports some extra power saving options that you can take advantage of. To enable them, add the following to /etc/default/grub, in the same spot where the "quiet" option currently exists.

GRUB_CMDLINE_LINUX_DEFAULT="quiet i915.i915_enable_rc6=1 i915.i915_enable_fbc=1 i915.lvds_downclock=1"

To see more information about what those options do, execute "modinfo i915".
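
Remember that edits to /etc/default/grub do not take effect until you regenerate the grub configuration and reboot:

update-grub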

Disable onboard LAN
This is a truly portable notebook, so you generally shouldn’t be using the onboard LAN much. You can save some power by disabling it in the BIOS.

Extend battery life
This laptop has such good battery life that you should be able to live with it quite comfortably in "battery extender" mode. This mode only lets the battery charge up to 80% and greatly extends the useful life of the battery. Enable it in the BIOS.

If you are in a situation where you know you're going to need maximum battery life (say, while waiting to board a very long flight) you can disable battery extender mode via a file in /sys. Letting the battery charge to 100% should give you about another hour of run time.

echo 0 > /sys/devices/platform/samsung/battery_life_extender
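
To put the battery back into extender mode afterward, write a 1 to the same file:

echo 1 > /sys/devices/platform/samsung/battery_life_extender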

Enable touchpad tapping
Xorg uses the wrong driver for the touchpad by default. If you want to enable tap, double-tap, and so on, you'll need to drop a config file in place for Xorg.

mkdir -p /etc/X11/xorg.conf.d
cat > /etc/X11/xorg.conf.d/50-snaptics.conf << EOF
Section "InputClass"
    Identifier "touchpad"
    Driver "synaptics"
    MatchIsTouchpad "on"
    Option "TapButton1" "1"
    Option "TapButton2" "2"
    Option "TapButton3" "3"
    #Option "VertEdgeScroll" "on"
    #Option "VertTwoFingerScroll" "on"
    #Option "HorizEdgeScroll" "on"
    #Option "HorizTwoFingerScroll" "on"
    #Option "CircularScrolling" "on"
    #Option "CircScrollTrigger" "2"
    #Option "EmulateTwoFingerMinZ" "40"
    #Option "EmulateTwoFingerMinW" "8"
    #Option "CoastingSpeed" "0"
EndSection
EOF
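
After restarting X, you can sanity-check that the synaptics driver picked up the options with synclient (shipped in the xserver-xorg-input-synaptics package):

synclient -l | grep TapButton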

Coming Soon...

Enable silent mode binding
Something here about binding Fn-F11 to enable/disable silent mode.

Enable keyboard backlight bindings
Something here about enabling backlight keys Fn-F9 and Fn-F10

Enable wifi binding
Something here about Fn-F12

Turn off bluetooth radio by default
Related to the above, but only turn off bluetooth radio during boot up

Use Powertop
Something here about enabling powertop's tunables on boot up

Deciphering Linux page allocation failures

I find myself diving into Linux kernel memory management more and more these days. I thought I’d write up some helpful tips on decoding something you might see every once in a while: page allocation failures. In this particular case, we’ll look at the following example:

Dec  6 04:30:13 host kernel: echo: page allocation failure. order:9, mode:0xd0

What you see here is the following:

  • Dec 6 04:30:13 time stamp
  • host host name
  • kernel: the process that generated the message. In this case, it was the kernel itself
  • echo: the command that caused the message to be generated
  • page allocation failure. the message itself
  • order:9, the number of pages that were requested, as a power of 2
  • mode:0xd0 flags passed to the kernel memory allocator.

Regarding “order:9”, the kernel allocates pages in powers of 2. order:9 simply means it requested 2^9 pages (512), of whatever size they are. To see the size of your memory pages you can issue:

getconf PAGESIZE

In the case of this host, memory pages are 4096 bytes so the kernel was attempting to allocate 2097152 bytes (2MB). “mode:0xd0” is the flag passed to the kernel memory allocator. You can find all possible modes in include/linux/gfp.h.
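
On kernels of this era, 0xd0 decodes to plain GFP_KERNEL; the relevant flag values from gfp.h add up like so:

__GFP_WAIT (0x10) | __GFP_IO (0x40) | __GFP_FS (0x80) = 0xd0 = GFP_KERNEL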

“echo” caused the page allocation failure, which led to this call trace:

Dec  6 04:30:13 host kernel: Call Trace:
Dec  6 04:30:13 host kernel:  [<ffffffff8020f895>] __alloc_pages+0x2b5/0x2ce
Dec  6 04:30:13 host kernel:  [<ffffffff80212dac>] may_open+0x65/0x22f
Dec  6 04:30:13 host kernel:  [<ffffffff8023def3>] __get_free_pages+0x30/0x69
Dec  6 04:30:13 host kernel:  [<ffffffff884a9aa0>] :ip_conntrack:alloc_hashtable+0x33/0x7a
Dec  6 04:30:13 host kernel:  [<ffffffff884a9b56>] :ip_conntrack:set_hashsize+0x49/0x12a
Dec  6 04:30:13 host kernel:  [<ffffffff8029a32b>] param_attr_store+0x1a/0x29
Dec  6 04:30:13 host kernel:  [<ffffffff8029a37f>] module_attr_store+0x21/0x25
Dec  6 04:30:13 host kernel:  [<ffffffff802fdc83>] sysfs_write_file+0xb9/0xe8
Dec  6 04:30:13 host kernel:  [<ffffffff802171a7>] vfs_write+0xce/0x174
Dec  6 04:30:13 host kernel:  [<ffffffff802179df>] sys_write+0x45/0x6e
Dec  6 04:30:13 host kernel:  [<ffffffff80260106>] system_call+0x86/0x8b
Dec  6 04:30:13 host kernel:  [<ffffffff80260080>] system_call+0x0/0x8b

The first line is where the kernel failed, in __alloc_pages, which is no surprise. As we go a bit deeper into the stack trace you can see that the calling function was :ip_conntrack:alloc_hashtable, so we died during an attempt to allocate 2MB for the ip_conntrack hash table.
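
Given the sysfs_write_file and param_attr_store frames, the offending “echo” was almost certainly a write to the conntrack hashsize module parameter, something along these lines (the value shown is purely illustrative):

echo 262144 > /sys/module/ip_conntrack/parameters/hashsize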

After the above, the kernel dumps a fair amount of information (Mem-info) about the memory state of the host. If you’re interested in the kernel code involved, see show_mem() in lib/show_mem.c, and show_free_areas() in mm/page_alloc.c.

Dec  6 04:30:13 host kernel: Mem-info:
Dec  6 04:30:13 host kernel: DMA per-cpu:
Dec  6 04:30:13 host kernel: cpu 0 hot: high 186, batch 31 used:32
Dec  6 04:30:13 host kernel: cpu 0 cold: high 62, batch 15 used:57
Dec  6 04:30:13 host kernel: cpu 1 hot: high 186, batch 31 used:96
Dec  6 04:30:13 host kernel: cpu 1 cold: high 62, batch 15 used:11
Dec  6 04:30:13 host kernel: cpu 2 hot: high 186, batch 31 used:90
Dec  6 04:30:13 host kernel: cpu 2 cold: high 62, batch 15 used:53
Dec  6 04:30:13 host kernel: cpu 3 hot: high 186, batch 31 used:102
Dec  6 04:30:13 host kernel: cpu 3 cold: high 62, batch 15 used:7
Dec  6 04:30:13 host kernel: cpu 4 hot: high 186, batch 31 used:136
Dec  6 04:30:13 host kernel: cpu 4 cold: high 62, batch 15 used:14
Dec  6 04:30:13 host kernel: cpu 5 hot: high 186, batch 31 used:39
Dec  6 04:30:13 host kernel: cpu 5 cold: high 62, batch 15 used:3
Dec  6 04:30:13 host kernel: cpu 6 hot: high 186, batch 31 used:163
Dec  6 04:30:13 host kernel: cpu 6 cold: high 62, batch 15 used:12
Dec  6 04:30:13 host kernel: cpu 7 hot: high 186, batch 31 used:74
Dec  6 04:30:13 host kernel: cpu 7 cold: high 62, batch 15 used:0
Dec  6 04:30:13 host kernel: DMA32 per-cpu: empty
Dec  6 04:30:13 host kernel: Normal per-cpu: empty
Dec  6 04:30:13 host kernel: HighMem per-cpu: empty
Dec  6 04:30:13 host kernel: Free pages:       19348kB (0kB HighMem)
Dec  6 04:30:13 host kernel: Active:67877 inactive:18034 dirty:110 writeback:0 unstable:0 free:5017 slab:11928 mapped-file:3125 mapped-anon:28202 pagetables:854
Dec  6 04:30:13 host kernel: DMA free:21268kB min:2916kB low:3644kB high:4372kB active:271488kB inactive:70236kB present:532480kB pages_scanned:35 all_unreclaimable? no
Dec  6 04:30:13 host kernel: lowmem_reserve[]: 0 0 0 0
Dec  6 04:30:13 host kernel: DMA32 free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
Dec  6 04:30:13 host kernel: lowmem_reserve[]: 0 0 0 0
Dec  6 04:30:13 host kernel: Normal free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
Dec  6 04:30:13 host kernel: lowmem_reserve[]: 0 0 0 0
Dec  6 04:30:13 host kernel: HighMem free:0kB min:128kB low:128kB high:128kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
Dec  6 04:30:13 host kernel: lowmem_reserve[]: 0 0 0 0
Dec  6 04:30:13 host kernel: DMA: 2090*4kB 1339*8kB 99*16kB 17*32kB 5*64kB 5*128kB 0*256kB 2*512kB 1*1024kB 0*2048kB 0*4096kB = 24208kB
Dec  6 04:30:13 host kernel: DMA32: empty
Dec  6 04:30:13 host kernel: Normal: empty
Dec  6 04:30:13 host kernel: HighMem: empty
Dec  6 04:30:13 host kernel: 61160 pagecache pages
Dec  6 04:30:13 host kernel: Swap cache: add 428356, delete 421485, find 49760550/49818299, race 0+197
Dec  6 04:30:13 host kernel: Free swap  = 1986220kB
Dec  6 04:30:13 host kernel: Total swap = 2096472kB
Dec  6 04:30:13 host kernel: Free swap:       1986220kB
Dec  6 04:30:13 host kernel: 133120 pages of RAM
Dec  6 04:30:13 host kernel: 22508 reserved pages
Dec  6 04:30:13 host kernel: 42739 pages shared
Dec  6 04:30:13 host kernel: 6835 pages swap cached

Finally we see a message about falling back to vmalloc.

Dec  6 04:30:13 host kernel: ip_conntrack: falling back to vmalloc.

Since the kernel attempts to kmalloc() a contiguous block of memory and fails, it falls back to vmalloc(), which can allocate non-contiguous blocks of memory.

Linux KVM: Openvswitch on Debian Wheezy

Among a great many other things, openvswitch is an alternative to bridge-utils for managing the virtual networking stack under KVM. It supports VLANs, LACP, QoS, sFlow, and so forth. Listed below are the steps required to get openvswitch running on Debian 7.0 (wheezy).

This article is written with the presumption that you are running a source-installed kernel (3.6.6 with the openvswitch module in this case), and want to use the latest openvswitch from git.

Install prerequisites

Apply any available updates, get all the build dependencies for openvswitch, and install module-assistant.

apt-get update && apt-get dist-upgrade
apt-get install build-essential
apt-get build-dep openvswitch
apt-get install module-assistant

Prep your environment

bridge-utils has a kernel module that conflicts with the brcompat module in openvswitch. Let’s remove it, and while we’re at it, stop libvirt and KVM for a bit.

apt-get remove --purge bridge-utils
/etc/init.d/libvirt-bin stop
/etc/init.d/qemu-kvm stop

Build openvswitch

Clone the openvswitch git repo and build debian packages from it.

git clone git://openvswitch.org/openvswitch
cd openvswitch
dpkg-buildpackage -b

Install the packages you just built.

cd ../
dpkg -i openvswitch-switch_1.9.90-1_amd64.deb openvswitch-common_1.9.90-1_amd64.deb \
openvswitch-brcompat_1.9.90-1_amd64.deb openvswitch-datapath-source_1.9.90-1_all.deb \
openvswitch-controller_1.9.90-1_amd64.deb openvswitch-pki_1.9.90-1_all.deb

Build openvswitch-datapath for your running kernel.

module-assistant auto-install openvswitch-datapath

Configure brcompat to load on startup.

sed -i 's/# BRCOMPAT=no/BRCOMPAT=yes/' /etc/default/openvswitch-switch

Verify your configuration

At this point you should reboot and verify that the proper modules are loaded, the service starts normally, and the status output is correct.

user@host:~$ lsmod | grep brcompat
brcompat               12982  0 
openvswitch            73431  1 brcompat

user@host:~$ /etc/init.d/openvswitch-switch restart
[ ok ] Killing ovs-brcompatd (5439).
[ ok ] Killing ovs-vswitchd (5414).
[ ok ] Killing ovsdb-server (5363).
[ ok ] Starting ovsdb-server.
[ ok ] Configuring Open vSwitch system IDs.
[ ok ] Starting ovs-vswitchd.
[ ok ] Starting ovs-brcompatd.

user@host:~$ /etc/init.d/openvswitch-switch status
ovsdb-server is running with pid 6281
ovs-vswitchd is running with pid 6332
ovs-brcompatd is running with pid 6357

And that’s it! You now have a working openvswitch installation upon which you can do all the usual things you did with bridge-utils, and so much more.
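
From here, bridges are managed with ovs-vsctl. For example, creating a bridge and attaching a physical interface to it (interface names here are just placeholders) looks like this:

ovs-vsctl add-br br0
ovs-vsctl add-port br0 eth0
ovs-vsctl show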

leap seconds and Linux

On June 30, 2012, a leap second was inserted into UTC, which caused a fair amount of difficulty for companies across the Internet. An explanation of leap seconds, the problems they cause in the Linux kernel, and the available solutions follows.

What are leap seconds?

A leap second is a one second adjustment that is applied to UTC in order to prevent it from deviating more than 0.9 seconds from UT1 (mean solar time). It can be positive or negative and is implemented by adding 23:59:60 or skipping 23:59:59 on the last day of a given month (usually June 30 or December 31). Since the UTC standard was established in 1972, however, 25 leap seconds have been scheduled and all of them have been positive.

Since they are dependent on climatic and geologic events that affect the Earth’s moment of inertia (mostly tidal friction), leap seconds are irregularly spaced and unpredictable. The International Earth Rotation and Reference Systems Service (IERS) is responsible for deciding when leap seconds will occur, and announces them about six months in advance. The most recent leap second was inserted on June 30, 2012 at 23:59:60 UTC. It has been announced that there will not be a leap second on December 31, 2012.

What problems do leap seconds cause?

Leap seconds are problematic in computing for a number of reasons. As an example, computing the elapsed seconds between two UTC dates in the past requires a table of leap seconds, which must be updated whenever a new one is announced. It is also impossible to calculate accurate time intervals for UTC dates farther in the future than the horizon of leap second announcements. There are also more practical problems in distributed systems that depend on accurate time stamping of series data.

In particular, there have been problems with the implementation of leap second handling in the Linux kernel itself. When the last leap second occurred on June 30, 2012, it caused outages at reddit (Apache Cassandra), Mozilla (Hadoop), Qantas Airlines, and other sites. Generally speaking, leap second problems on Linux hosts are characterized by high CPU usage in certain processes immediately after the leap second is applied to the local clock.

In one particular case, tgtd (scsi-target-utils) on CentOS 6 hosts began generating an average of 14,000 log messages per second:

Jun 30 23:59:59 host kernel: Clock: inserting leap second 23:59:60 UTC
Jun 30 23:59:59 host tgtd: work_timer_evt_handler(89) failed to read from timerfd, Resource temporarily unavailable
Jun 30 23:59:59 host tgtd: work_timer_evt_handler(89) failed to read from timerfd, Resource temporarily unavailable
Jun 30 23:59:59 host tgtd: work_timer_evt_handler(89) failed to read from timerfd, Resource temporarily unavailable

This caused the root file system of approximately 600 hosts to become full before the issue was mitigated.

Why do these problems occur?

The last leap second exposed a kernel bug that can affect any threaded application. It is most apparent with applications that use sub-second CLOCK_REALTIME timeouts in a loop, usually connected with futexes.

On July 3, 2007 commit 746976a301ac9c9aa10d7d42454f8d6cdad8ff2b (2.6.22) removed clock_was_set() in seconds_overflow() to prevent a deadlock. Due to this patch the following occurs when a leap second is added to UTC:

  • The leap second occurs and CLOCK_REALTIME is set back by one second
  • clock_was_set() is not called by seconds_overflow() so the hrtimer base.offset value for CLOCK_REALTIME is not updated
  • CLOCK_REALTIME’s sense of wall time is now one second ahead of the timekeeping core’s
  • At interrupt time, hrtimer code expires all CLOCK_REALTIME timers that are set for ($interrupt_time + 1 second) and before

At this point all TIMER_ABSTIME CLOCK_REALTIME timers now expire one second early. Even worse, all sub-second TIMER_ABSTIME CLOCK_REALTIME timers will return immediately. Any applications that use such timer calls in a loop will experience load spikes. This situation persists until clock_was_set() is called, for example, via settimeofday().

On July 13, 2012, Linus merged several commits in d55e5bd0201a2af0182687882a92c5f95dbccc12 (3.5-rc7) which, beyond simply providing clock_was_set_delayed() in hrtimer to resolve the problem, included other rework of hrtimer and timekeeping.

Affected Kernels

This problem has existed since kernel 2.6.22, so all kernels from 2.6.22 to 3.5-rc7 are presumably affected. All RHEL 5.x kernels already include a patch to avoid this bug. Unfortunately, Red Hat either neglected to patch, or mispatched, RHEL 6 for the same issue. All RHEL 6 kernels are vulnerable to this problem, with patches available in the following updates:

  • RHEL 6.3: kernel-2.6.32-279.5.2
  • RHEL 6.2 Extended Updates: kernel-2.6.32-220.25.1.el6
  • RHEL 6.1 Extended Updates: kernel-2.6.32-131.30.2

In Debian and its derivatives this issue is patched in the following kernel updates:

  • Debian 6.x (squeeze): linux-image-2.6.32-46
  • Debian 7.x (wheezy): linux-image-3.2.29-1

Resolution

Quite obviously, the most prudent fix is to apply a patched kernel package to the affected host, or upgrade to an upstream kernel newer than 3.5-rc7. If a given host cannot be patched, it is possible to manually trigger settimeofday() after a leap second is applied by issuing either of the following:

date -s "`LC_ALL=C date`"
date `date +'%m%d%H%M%C%y.%S'`

Doing so will resolve any present issues on the host in question.
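
If the host runs ntpd, you can also check whether a leap second insertion is currently armed by querying the leap indicator (leap=01 means an insertion is pending at the end of the day):

ntpq -c "rv 0 leap"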

Another interesting approach to solving this problem was devised by Google, which they call “Leap Smear”. Since Google runs its own stratum 2 NTP servers, they patched NTP to not issue the LI (leap indicator) and instead “smear” the leap second by modulating a ‘lie’ over a time window w before midnight:

lie(t) = (1.0 - cos(pi * t / w)) / 2.0

You can read more about the leap smear technique at their blog.

Scalable DNS with AnyCast

A while back I was faced with a problem. The existing recursive DNS infrastructure in our datacenters was built on a traditional, common scaling design. We had a handful of DNS servers spread across the datacenters and stuck behind different flavors of load balancing. There were six or seven different resolver IPs that hosts were pointed at, depending on various factors. They all ran ISC BIND. Some used Cisco load balancers, some Zeus, some ldirectord. These aren’t necessarily bad solutions to the problem of scaling and high availability, but we were running into problems when reaching the 40,000 query per second range. Interrupt-driven kernels will experience livelock without some kind of trickery to avoid it. Load balancers are expensive and can be a single point of failure if you don’t double up on hardware, and so forth. The whole setup was complicated, expensive, and didn’t scale as easily as it could.

There had to be a better way.

After some research and design discussions I came up with a pretty elegant solution that solved all the above problems and was radically cheaper and simpler.

The Solution
The final implementation used a simple, and at this point well proven, technology: AnyCast. At the time, AnyCast was fairly new on the infrastructure scene, but it had a number of advantages. First of all, it’s simple. You need no special hardware beyond the layer 3 switches you probably already have. The implementation is just a few lines in IOS and you’re up and running with a route and an SLA check; see the sketch below. Since the switches handle all the load balancing, you can get rid of all that expensive load balancing gear with its added complexity.
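
To give a feel for how little is involved, the switch side amounts to an SLA probe against the resolver’s real address and a tracked static route for the AnyCast service address. A rough sketch, with made-up addresses and syntax that varies a bit by IOS version:

ip sla 10
 icmp-echo 192.0.2.53
ip sla schedule 10 life forever start-time now
track 10 ip sla 10 reachability
ip route 10.10.53.53 255.255.255.255 192.0.2.53 track 10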

Now that you have no load balancers to worry about, you can just throw cheap, entry-level nodes into every datacenter and point the AnyCast routes at them. In our case we just used cheap dual-core boxes with 2GB of RAM each. Nothing special. This is horizontal scaling at its finest.

The final trick was to get rid of ISC BIND and replace it with unbound. Now, don’t get me wrong, ISC BIND works great and we could have continued to use it. There were a couple of considerations that drove the decision for unbound, however. First of all, it performs nearly an order of magnitude better on the same hardware. Second, it does one thing and does it well: recursive queries and caching. Because of that, its configuration is much simpler as well.
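
As an illustration of how simple, a minimal unbound.conf for this role is only a handful of lines (addresses here are made up; the AnyCast address is bound on the loopback):

server:
    interface: 10.10.53.53
    interface: 127.0.0.1
    access-control: 10.0.0.0/8 allow
    num-threads: 2
    msg-cache-size: 256m
    rrset-cache-size: 512m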

The deployment today consists of 16 AnyCast endpoints that are servicing an aggregate load of about 80,000 queries per second and could easily support much more than that. Initial performance testing showed that those cheap dual core hosts can support a query load of about 20,000 queries per second each.

A nice, clean setup that is simple and cheap. Perfect.

Design Considerations
There are a few things to be aware of when designing a system like this however.

  • CEF and XOR : Cisco gear has to decide where to route inbound queries when there are multiple endpoints with identical route distances. In practice this ends up being pseudo-load balancing, because it is not round robin: the switch decides where to route packets by XOR’ing the last two octets of the source and destination IPs. The balance of traffic across a pool of endpoints ends up being pretty close to even, but it’s not perfect. You have to be aware of this slight traffic imbalance when capacity planning.
  • Number of endpoints : CEF on Cisco devices currently only supports a maximum of 16 endpoints per device. In practice this isn’t much of a limitation; it’s just something to be aware of.
  • More general capacity planning : If you ever lose the route to a switch, however unlikely, AnyCast will fail ALL the traffic destined for those endpoints over to the next lowest cost route. If you don’t plan for that, you’ll send too much traffic to the next cluster of nodes, which will DoS it and make its SLA checks fail. AnyCast will then send all the original traffic, plus all the traffic for cluster number two, on to the third cluster, and so forth. A cascading failure of your whole infrastructure can result.
  • Troubleshooting : It’s somewhat more complicated to know where traffic from a given DNS query is being routed. You have to dig around a lot to figure this out if there is a problem. It’s not impossible… just not as straightforward as designs that have a single cluster with a single virtual IP taking in all the inbound queries.

Beyond those few considerations though, a setup like this is quite reliable, endlessly scalable, and offers the ability to have a single DNS query target for all hosts across all datacenters.

Pretty nice.

Building scalable metric collection

The Problem

Say you have thousands of hosts and want to collect tens of metrics from each one for analysis and alerting. Tools like cacti and munin simply don’t scale to these levels. The very paradigm they operate under (a centralized database and data pull) is inherently flawed when working with data sets of this size. Furthermore, they are fairly inflexible when you consider the almost daily changing requirements of engineers and developers. Generating customized graphs for one-off views of interesting metrics is difficult at best.

At my employer we currently monitor about 18,000 hosts, and the number is constantly growing. Centralized polling systems like cacti and munin are in use, but only on subsets of hosts, for the very reasons already stated. Try plugging 10,000 devices into cacti and it will die in a fire pretty quickly no matter how good the hardware is. Some modest numbers:

18000 hosts
10 metrics per host
10 second collection interval
13 months retention
------------------------------
606,528,000,000 data points

Approximately six hundred billion data points to store, index, search, and somehow render on demand. Using RRD-style databases (as cacti and other tools do) you can get that number way down if you are willing to sacrifice granularity on older data points. Let’s assume that decreases our data set by two orders of magnitude. That’s still 6,000,000,000 data points. No small challenge.

The Goal

The ideal statistics collection system would be completely distributed. Independent collection agents run on each host using data push, or dedicated hosts use data pull where that isn’t possible (SNMP devices). Those agents send metrics into a data store that has no central controller, is distributed with no single point of failure, replicates and mirrors seamlessly, and scales linearly. To generate graphs, the ideal tool would also work in a distributed fashion with no single point of failure and use a robust API to render graphs or serve up other data formats (JSON, CSV, etc.) on demand.

The Solution

There are many, many tools out there in the Open Source world for doing one or more of these things. Some scale, some don’t; they have varying levels of maturity and are written in a wide array of languages. You know the story.

All of the tools you’ll find in the Open Source world fall into one or more of the following five categories:

    Collection
    Transport
    Processing
    Storage
    Presentation

Collection
collectd is particularly well suited to the role of collection. It is written in quite clean and well designed C. It uses a data push model and everything is done through plugins. There are hooks for perl, python, and other languages. As an added bonus it can push nagios alerts too.

Transport
While collectd is perfectly capable of sending data to multicast addresses or even relational databases, that isn’t a good fit for this problem. The primary concern is that a pool of hosts running carbon-cache may or may not receive the data, which creates consistency issues across the data store. While pushing the data into, say, cassandra would be pretty elegant here, there is nothing readily written to do that, and it’s an open question whether the chosen front end can interface well with a cassandra DB. A more straightforward solution in this case is to use a message queuing system like RabbitMQ. All the collectd daemons push their data to RabbitMQ, which then decides which carbon-cache hosts to push given metrics to. You can then make your own decisions about replication logic, datacenter boundaries, etc.
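
As a sketch of what the sending side might look like, recent collectd versions ship an amqp plugin that can publish metrics to RabbitMQ in graphite’s line format; hostnames and credentials below are placeholders:

LoadPlugin amqp
<Plugin amqp>
  <Publish "metrics">
    Host "rabbitmq.example.com"
    Port "5672"
    VHost "/"
    User "collectd"
    Password "secret"
    Exchange "collectd"
    Format "Graphite"
  </Publish>
</Plugin>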

Storage
There is a great application stack out there that handles storing and retrieving metrics from collectd: graphite. The storage part of that stack is handled by carbon and whisper. Carbon collects inbound data and writes it to whisper databases, which are quite similar in structure to RRDs. While carbon has mechanisms for replication and sharding built in, using a message queue is a more elegant solution and offers some interesting advantages for data pool consistency.
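
Retention then becomes a carbon policy. For the numbers above, something along these lines in storage-schemas.conf would keep 10-second samples for a month and roll up to one-minute samples for roughly 13 months (the pattern and roll-ups are just an example):

[collectd]
pattern = ^collectd\.
retentions = 10s:30d,60s:390d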

Presentation
Graphite is the obvious choice here. It requires no pre-existing knowledge of the data sets it is going to graph. It has a simple URL API for rendering graphs or retrieving data as raw text, JSON, CSV, etc. It also allows you to apply useful functions to the rendered data set.
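
A render call is just an HTTP GET against that API; for example (the metric path is invented for illustration):

http://graphite.example.com/render?target=collectd.web01.load.load.shortterm&from=-24hours&format=json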

Conclusions

Using the above stack provides a linearly scalable, cross-datacenter solution for collecting, storing, and fetching on demand very large numbers of metrics for any operational use. A pilot installation is being turned up as I write this. I will come back with updates and more detailed information if things deviate greatly.

Other Interesting Tools
OpenTSDB
reconnoiter
esper
d3
statsd
hadoop
hbase

Bacula 3.0.1 for Mac OS X

I won’t be doing a build of Bacula for Mac OS X again any time soon, if at all, for a number of reasons:

  1. I no longer have access to Mac OS X 10.4 running on an Intel chip
  2. I am actively transitioning to BackupPC

Below you’ll find the method I have been using to construct all the .pkg installers you’ll find on this site. If you have any questions about the process, please post them in the comments.

Read more

list all files with resource forks on Mac OS X

This is mostly just a note to myself. If you ever have the need to find all files that contain HFS resource forks on Mac OS X, just use this bit of find magic:

find . -type f -exec test -s {}/..namedfork/rsrc \; -print

working with initrd.img files

You may have occasion to edit the contents of an initrd.img file. If so, here is how:

Extract the contents of the image (from within an empty working directory)

gunzip < your-initrd.img | cpio -i --make-directories

Now make your edits and then repackage the initrd

find . | cpio -o -H newc | gzip -9 > your-new-initrd.img
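
On Debian-based systems you can sanity-check the rebuilt image with lsinitramfs (part of initramfs-tools) before putting it into use:

lsinitramfs your-new-initrd.img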

install debian directly onto an AoE root filesystem

Something that just about no one out there seems to be doing (yet) is trying to install Debian directly onto network block devices. The Debian installer doesn’t support it (yet), grub doesn’t support it (usually), and it’s just generally not an easy thing to do.

Now, there are quite a few ways around this problem. You can install to a ‘real’ computer and migrate the installation to a network block device. You can use debootstrap in place of the actual Debian installation system. You can use a combination of these two methods, NFS root filesystems, TFTP hacks, etc. All of these solutions are lacking, in my opinion. I want to run the ‘real’ Debian installer against a network block device and boot my physical hardware using only the built-in PXE booting capability of the BIOS.

Taking all these issues as a personal challenge, I’ve outlined below how to go about using the regular old Debian Lenny installer directly against an AoE block device.
Read more
