Building scalable metric collection

The Problem

Say you have thousands of hosts and want to collect tens of metrics from each one for analysis and alerting. Tools like cacti and munin simply don't scale to these levels. The very paradigm they operate under (a centralized database and data pull) is inherently flawed when working with data sets of this size. Furthermore, they are fairly inflexible in the face of the almost daily changing requirements of engineers and developers. Generating customized graphs for one-off views of interesting metrics is difficult at best.

At my employer we currently monitor about 18,000 hosts, and the number is constantly growing. Centralized polling systems like cacti and munin are in use, but only on subsets of hosts, for the very reasons already stated. Try plugging 10,000 devices into cacti and it will die in a fire pretty quickly no matter how good the hardware is. Some modest numbers:

18000 hosts
10 metrics per host
10 second collection interval
13 months retention
------------------------------
606,528,000,000 data points

Approximately six hundred billion data points to store, index, search, and somehow render on demand. Using RRD-style databases (as cacti and other tools do) you can get that number way down if you are willing to sacrifice granularity on older data points. Let's assume that decreases our data set by two orders of magnitude. That's still 6,000,000,000 data points. No small challenge.
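As a sanity check, the arithmetic works out roughly like this (treating 13 months as 390 days):

    # Back-of-the-envelope arithmetic for the numbers above.
    hosts = 18000
    metrics_per_host = 10
    interval_s = 10
    retention_s = 13 * 30 * 24 * 3600   # ~13 months in seconds

    points = hosts * metrics_per_host * retention_s // interval_s
    print(f"{points:,}")         # 606,528,000,000
    print(f"{points // 100:,}")  # ~6 billion after RRD-style rollups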

The Goal

The ideal statistics collection system would be completely distributed. Independent collection agents run on each host using data push, or dedicated hosts use data pull where that isn't possible (SNMP devices, for example). Those agents send metrics into a data store that has no central controller, is distributed with no single point of failure, replicates and mirrors seamlessly, and scales linearly. To generate graphs, the ideal tool would also work in a distributed fashion with no single point of failure and use a robust API to render graphs or serve up other data formats (JSON, CSV, etc.) on demand.

The Solution

There are many, many tools out there in the Open Source world for doing one or more of these things. Some scale, some don't; they have varying levels of maturity and are written in a wide array of languages. You know the story.

All of the tools you'll find in the Open Source world fall into one or more of the following five categories:

    Collection
    Transport
    Processing
    Storage
    Presentation

Collection
collectd is particularly well suited to the role of collection. It is written in quite clean and well-designed C. It uses a data push model and everything is done through plugins. There are hooks for Perl, Python, and other languages. As an added bonus it can push Nagios alerts too.
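As a rough illustration of the plugin model, here is a minimal sketch of a read plugin using collectd's Python hook; the plugin name and the metric it reports are made up for the example:

    # Minimal collectd Python read plugin (runs inside collectd's embedded
    # interpreter, so the 'collectd' module is only importable there).
    import collectd

    def read_callback(data=None):
        # Hypothetical metric: dispatch the 1-minute load average as a gauge.
        load1 = float(open('/proc/loadavg').read().split()[0])
        vals = collectd.Values(plugin='example_load', type='gauge')
        vals.dispatch(values=[load1])

    collectd.register_read(read_callback)

collectd calls the registered read callback once per collection interval and handles shipping the dispatched values to whatever write plugins are configured.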

Transport
While collectd is perfectly capable of sending data to multicast addresses or even relational databases, neither is a good fit for this problem. The primary concern is that a pool of hosts running carbon-cache may or may not receive the data, which creates consistency issues across the data store. Pushing the data into, say, cassandra would be pretty elegant here, but there is nothing readily available to do that, and it's an open question whether the chosen front end could interface well with a cassandra data store. A more straightforward solution in this case is a message queuing system like RabbitMQ. All the collectd daemons push their data to RabbitMQ, which then decides which carbon-cache hosts to push a given metric to. You can then make your own decisions about replication logic, datacenter boundaries, and so on.
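A minimal sketch of that routing layer might look like the following, assuming a "metrics" queue in RabbitMQ and carbon-cache listening on its default plaintext port (2003); the host names and the trivial routing policy are placeholders, not a finished design:

    # Consume metric lines from RabbitMQ and forward them to a carbon-cache
    # instance over carbon's plaintext protocol ("metric.path value timestamp").
    import socket
    import pika

    carbon = socket.create_connection(('carbon-cache-01.example.com', 2003))

    def on_message(channel, method, properties, body):
        carbon.sendall(body + b'\n')
        channel.basic_ack(delivery_tag=method.delivery_tag)

    connection = pika.BlockingConnection(
        pika.ConnectionParameters('rabbitmq.example.com'))
    channel = connection.channel()
    channel.queue_declare(queue='metrics', durable=True)
    channel.basic_consume(queue='metrics', on_message_callback=on_message)
    channel.start_consuming()

A real deployment would run several of these consumers and shard or duplicate metrics across carbon-cache hosts according to whatever replication policy you choose.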

Storage
There is a great application stack out there called Graphite that handles storing and retrieving metrics from collectd. The storage part of that stack is handled by carbon and whisper. Carbon collects inbound data and writes it to whisper databases, which are quite similar in structure to RRD files. While carbon has mechanisms for replication and sharding built into it, using a message queue is a more elegant solution and offers some interesting advantages for data pool consistency.
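For a sense of what carbon is doing underneath, here is a small sketch using the whisper library directly; the file name and retention periods are illustrative only, not a recommended schema:

    # Create a whisper database with 10-second resolution for one day,
    # rolled up to 1-minute resolution for 13 months, then write and read back.
    import time
    import whisper

    archives = [(10, 8640), (60, 561600)]   # (seconds_per_point, points)
    whisper.create('cpu_load.wsp', archives,
                   xFilesFactor=0.5, aggregationMethod='average')

    whisper.update('cpu_load.wsp', 1.23, timestamp=int(time.time()))

    (start, end, step), values = whisper.fetch('cpu_load.wsp',
                                               int(time.time()) - 3600)
    print(step, values[-5:])

In practice carbon-cache manages these files for you, with the retention and aggregation rules defined in its configuration rather than in code.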

Presentation
Graphite is the obvious choice here. It requires no pre-existing knowledge of the data sets it is going to graph. It has a simple URL API for rendering graphs or retrieving the underlying data in raw, JSON, CSV, and other formats. It also allows you to apply useful functions to the data set as it is rendered.
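For example, pulling a day of data back out as JSON is just an HTTP GET against the render API; the host and metric path below are placeholders:

    # Fetch a metric from Graphite's render API as JSON and print the last points.
    import json
    import urllib.request

    url = ('http://graphite.example.com/render'
           '?target=collectd.web01.load.load.shortterm'
           '&from=-24hours&format=json')

    with urllib.request.urlopen(url) as resp:
        series = json.load(resp)

    for s in series:
        print(s['target'], s['datapoints'][-3:])

Wrapping the target in one of Graphite's functions, such as summarize(), applies server-side aggregation before the data is returned.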

Conclusions

Using the above stack provides a linearly scalable, cross-datacenter solution for collecting, storing, and fetching on demand a very large number of metrics for any operational use. A pilot installation is being turned up as I write this. I will come back with updates and more detailed information if things deviate greatly.

Other Interesting Tools
OpenTSDB
reconnoiter
esper
d3
statsd
hadoop
hbase