Scalable DNS with AnyCast

A while back I was faced with a problem. The existing recursive DNS infrastructure in the datacenters was built on a traditional, common scaling design. We had a handful of DNS servers spread across the datacenters and stuck behind different flavors of load balancing. There were six or seven different resolver IPs that hosts were pointed at depending on various factors. They all ran ISC BIND. Some used Cisco load balancers, some Zeus, some ldirectord. These aren’t necessarily bad solutions to the problem of scaling and high availability, but we were running into problems when reaching the 40,000 query per second range. Interrupt-driven kernels will experience livelock at that rate without some kind of trickery to avoid it. Load balancers are expensive, and they become a single point of failure unless you double up on hardware and so forth. The whole setup was complicated, expensive, and didn’t scale as easily as it could.

There had to be a better way.

After some research and design discussions I came up with a pretty elegant solution that solved all the above problems and was radically cheaper and simpler.

The Solution
The final implementation used a simple, and at this point well proven, technology: AnyCast. At the time AnyCast was fairly new on the infrastructure scene but it had a number of advantages. First of all it’s simple. You need no special hardware beyond the layer 3 switches you probably already have. The implementation is just a few lines in IOS and you’re up and running with a route and SLA check. Since the switches handle all the load balancing you can get rid of all that expensive load balancing gear with its added complexity.
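
The SLA check is just the switch periodically confirming that an endpoint still answers queries and withdrawing its route when it doesn’t. Conceptually it boils down to a probe like the following minimal Python sketch (the resolver address and probe name are hypothetical stand-ins; on the Cisco side this is an IP SLA probe tied to the route with object tracking rather than a script):

    # Conceptual stand-in for the per-endpoint SLA check: send one A query,
    # exit 0 if the resolver answers sanely, 1 otherwise.
    import socket
    import struct
    import sys

    ANYCAST_RESOLVER = "10.0.0.53"   # hypothetical anycast service address
    PROBE_NAME = "example.com"       # hypothetical record to resolve
    TIMEOUT = 2.0

    def build_query(name, txid=0x1234):
        # Header: ID, flags (RD=1), QDCOUNT=1, AN/NS/ARCOUNT=0
        header = struct.pack("!HHHHHH", txid, 0x0100, 1, 0, 0, 0)
        # Question: length-prefixed labels, QTYPE=A (1), QCLASS=IN (1)
        qname = b"".join(bytes([len(p)]) + p.encode() for p in name.split(".")) + b"\x00"
        return header + qname + struct.pack("!HH", 1, 1)

    def resolver_healthy(server, name):
        query = build_query(name)
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.settimeout(TIMEOUT)
        try:
            sock.sendto(query, (server, 53))
            reply, _ = sock.recvfrom(512)
        except socket.timeout:
            return False
        finally:
            sock.close()
        # Same transaction ID, QR bit set, RCODE 0 => the resolver answered
        txid, flags = struct.unpack("!HH", reply[:4])
        return txid == 0x1234 and flags & 0x8000 and (flags & 0x000F) == 0

    if __name__ == "__main__":
        sys.exit(0 if resolver_healthy(ANYCAST_RESOLVER, PROBE_NAME) else 1)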

Now that you have no load balancers to worry about you can just throw cheap, entry-level nodes into every datacenter and point the AnyCast routes at them. In our case we used cheap dual-core boxes with 2GB of RAM each. Nothing special. This is horizontal scaling at its finest.

The final trick was to get rid of ISC BIND and replace it with unbound. Now, don’t get me wrong, ISC BIND works great and we could have continued to use it. There were a couple of considerations that drove the decision for unbound, however. First, it performs nearly an order of magnitude better on the same hardware. Second, it does one thing and does it well – recursive queries and caching. Because of that its configuration is much simpler as well.

The deployment today consists of 16 AnyCast endpoints that are servicing an aggregate load of about 80,000 queries per second and could easily support much more than that. Initial performance testing showed that those cheap dual core hosts can support a query load of about 20,000 queries per second each.
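
Putting those figures together (everything here comes from the numbers above), the pool is running at roughly a quarter of its measured capacity:

    # Back-of-the-envelope headroom check using the numbers above.
    endpoints = 16
    per_node_qps = 20_000      # measured capacity of one cheap dual-core node
    aggregate_qps = 80_000     # current aggregate query load

    capacity = endpoints * per_node_qps
    print(f"theoretical capacity: {capacity:,} qps")               # 320,000 qps
    print(f"current utilization: {aggregate_qps / capacity:.0%}")  # 25%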

A nice, clean setup that is simple and cheap. Perfect.

Design Considerations
There are a few things to be aware of when designing a system like this, however.

  • CEF and XOR : Cisco gear has to decide where to route inbound queries when multiple endpoints have identical route distances. In practice this is only pseudo-load balancing because it is not round robin: the switch decides where to route packets by XOR’ing the last two octets of the source and destination IPs. The balance of traffic across a pool of endpoints ends up being pretty close to even, but it’s not perfect, and you have to account for that slight imbalance when capacity planning (see the sketch after this list).
  • Number of endpoints : CEF on Cisco devices currently supports a maximum of 16 endpoints per device. In practice this isn’t a limiting factor, but it’s something to be aware of.
  • More general capacity planning : If you ever lose the route to a switch, however unlikely, AnyCast will fail ALL the traffic destined for those endpoints over to the next-lowest-cost route. If you don’t plan for that, you’ll send too much traffic to the next cluster of nodes, which will DoS it and make its SLA checks fail. AnyCast will then send all the original traffic, plus all the traffic for the second cluster, on to the third cluster and so forth. The result can be a cascading failure of your whole infrastructure.
  • Troubleshooting : It’s somewhat more complicated to know where traffic from a given DNS query is being routed. You have to dig around a lot to figure this out if there is a problem. It’s not impossible… just not as straightforward as designs that have a single cluster with a single virtual IP taking in all the inbound queries.
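
As a rough illustration of the first point, here is a toy model of that XOR-style hashing (addresses are hypothetical, and real CEF hashing varies by platform and IOS version; the behavior to notice is the same: deterministic per source, roughly but not exactly even overall):

    # Toy model of XOR-of-octets endpoint selection, as described above.
    import random
    from collections import Counter

    def cef_bucket(src_ip, dst_ip, n_endpoints):
        """Pick an endpoint from the XOR of the low octets of source and
        destination, so a given client always lands on the same node
        (deterministic, not round robin)."""
        s = [int(o) for o in src_ip.split(".")]
        d = [int(o) for o in dst_ip.split(".")]
        key = ((s[2] ^ d[2]) << 8) | (s[3] ^ d[3])
        return key % n_endpoints

    anycast_ip = "10.0.0.53"   # hypothetical anycast service address
    counts = Counter()
    for _ in range(100_000):
        client = "10.%d.%d.%d" % (random.randint(0, 255),
                                  random.randint(0, 255),
                                  random.randint(1, 254))
        counts[cef_bucket(client, anycast_ip, 4)] += 1

    print(counts)   # roughly even across the 4 buckets, but not a perfect split

With real client populations the source octets are not uniformly distributed, so the skew can be larger than this simulation suggests.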

Beyond those few considerations, though, a setup like this is quite reliable, scales horizontally about as far as you care to take it, and offers a single DNS query target for all hosts across all datacenters.

Pretty nice.

Building scalable metric collection

The Problem

Say you have thousands of hosts and want to collect tens of metrics from each one for analysis and alerting. Tools like cacti and munin simply don’t scale to these levels. The very paradigm they operate under (a centralized database and data pull) is inherently flawed when working with data sets of this size. Furthermore, they are fairly inflexible in the face of the near-daily changing requirements of engineers and developers. Generating customized graphs for one-off views of interesting metrics is difficult at best.

At my employer we currently monitor about 18,000 hosts and the number is constantly growing. Centralized polling systems like cacti and munin are in use but only on subsets of hosts for the very reasons already stated. Try plugging 10,000 devices into cacti and it will die in a fire pretty quickly no matter how good the hardware is. Some modest numbers:

18000 hosts
10 metrics per host
10 second collection interval
13 months retention
------------------------------
606,528,000,000 data points

Approximately six hundred billion data points to store, index, search, and somehow render on demand. Using RRD-style databases (as cacti and other tools do) you can get that number way down if you are willing to sacrifice granularity on older data points. Let’s assume that decreases our data set by two orders of magnitude. That’s still 6,000,000,000 data points. No small challenge.
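
For reference, the figures above assume 30-day months; the arithmetic works out like this:

    # Reproducing the data-point math above (assumes 30-day months).
    hosts = 18_000
    metrics_per_host = 10
    interval_s = 10
    retention_months = 13

    seconds_retained = retention_months * 30 * 24 * 3600
    points = hosts * metrics_per_host * (seconds_retained // interval_s)
    print(f"{points:,}")         # 606,528,000,000

    # RRD-style downsampling of older data cuts roughly two orders of magnitude
    print(f"{points // 100:,}")  # 6,065,280,000 -- still ~6 billion points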

The Goal

The ideal statistics collection system would be completely distributed. Independent collection agents run on each host using data push, or dedicated hosts use data pull where that isn’t possible (SNMP devices, for example). Those agents send metrics into a data store that has no central controller, is distributed with no single point of failure, replicates and mirrors seamlessly, and scales linearly. To generate graphs, the ideal tool would also work in a distributed fashion with no single point of failure and use a robust API to render graphs or serve up other data formats (JSON, CSV, etc.) on demand.

The Solution

There are many, many tools out there in the Open Source world for doing one or more of these things. Some scale, some don’t; they have varying levels of maturity and are written in a wide array of languages. You know the story.

All of the tools you’ll find in the Open Source world fall into one or more of the following five categories:

    Collection
    Transport
    Processing
    Storage
    Presentation

Collection
collectd is particularly well suited to the role of collection. It is written in clean, well-designed C. It uses a data push model and everything is done through plugins. There are hooks for Perl, Python, and other languages. As an added bonus it can push Nagios alerts too.
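
For a sense of what those hooks look like, here is a minimal sketch of a read plugin using collectd’s Python bindings (the plugin name and the metric chosen are just illustrative stand-ins):

    # Minimal collectd read plugin via the python plugin's hooks.
    import collectd

    def read_callback():
        # 1-minute load average as a stand-in for whatever you actually collect
        load1 = float(open("/proc/loadavg").read().split()[0])

        val = collectd.Values(type="gauge")
        val.plugin = "example_loadavg"   # hypothetical plugin name
        val.values = [load1]
        val.dispatch()                   # hands the sample to collectd's write path

    collectd.register_read(read_callback)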

Transport
While collectd is perfectly capable of sending data to multicast addresses or even relational databases, that isn’t a good fit for this problem. The primary concern is that a pool of hosts running carbon-cache may or may not receive any given piece of data, which creates consistency issues across the data store. Pushing the data into, say, Cassandra would be pretty elegant here, but there is nothing readily available to do that, and it’s an open question whether the chosen front end could interface well with a Cassandra data store. A more straightforward solution in this case is to use a message queuing system like RabbitMQ. All the collectd daemons push their data to RabbitMQ, which then decides which carbon-cache hosts to push given metrics to. You can then make your own decisions about replication logic, datacenter boundaries, etc.
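
A sketch of what the publishing side of that transport leg might look like, using the pika client (the broker host, exchange name, and metric path are all hypothetical; the payload is graphite’s plaintext "path value timestamp" format):

    # Publish metric samples into RabbitMQ for the carbon-cache consumers.
    import time
    import pika

    connection = pika.BlockingConnection(
        pika.ConnectionParameters(host="rabbitmq.example.com"))
    channel = connection.channel()
    channel.exchange_declare(exchange="metrics", exchange_type="topic",
                             durable=True)

    def publish_metric(path, value):
        # graphite plaintext format: "metric.path value unix_timestamp"
        body = "%s %f %d" % (path, value, int(time.time()))
        channel.basic_publish(exchange="metrics", routing_key=path, body=body)

    publish_metric("dc1.web01.load.load1", 0.42)
    connection.close()

Routing keys that mirror the metric path make it easy to bind queues per datacenter or per metric subtree, which is where the replication and datacenter-boundary decisions mentioned above get made.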

Storage
There is a great application stack out there for storing and retrieving metrics from collectd called Graphite. The storage part of that stack is handled by carbon and whisper. Carbon collects inbound data and writes it to whisper databases, which are quite similar in structure to RRD. While carbon has mechanisms for replication and sharding built in, using a message queue is a more elegant solution and offers some interesting advantages for data pool consistency.
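
Whatever sits in front of it, what carbon-cache ultimately ingests is one line per sample over its plaintext listener (TCP port 2003 by default). A minimal sketch, with a hypothetical host name:

    # Send a single sample to carbon-cache's plaintext listener.
    import socket
    import time

    line = "dc1.web01.load.load1 0.42 %d\n" % int(time.time())

    with socket.create_connection(("carbon-cache.example.com", 2003),
                                  timeout=5) as sock:
        sock.sendall(line.encode())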

Presentation
Graphite is the obvious choice here. It requires no pre-existing knowledge of the data sets it is going to graph. It has a simple URL API for rendering graphs or retrieving the data as raw text, JSON, CSV, etc. It also allows you to apply useful functions to the rendered data set.
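
For example, pulling 24 hours of a metric back out as hourly averages in JSON through the render API might look like this (host and metric path are hypothetical; summarize() is one of Graphite’s built-in functions):

    # Fetch 24 hours of a metric, summarized to hourly averages, as JSON.
    import json
    import urllib.request

    url = ("http://graphite.example.com/render"
           "?target=summarize(dc1.web01.load.load1,'1hour','avg')"
           "&from=-24hours&format=json")

    with urllib.request.urlopen(url) as resp:
        series = json.load(resp)

    for s in series:
        print(s["target"], len(s["datapoints"]), "datapoints")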

Conclusions

Using the above stack provides a linearly scalable, cross-datacenter solution for collecting and storing very large numbers of metrics, and for fetching them on demand for any operational use. A pilot installation is being turned up as I write this. I will come back with updates and more detailed information if things deviate greatly.

Other Interesting Tools
OpenTSDB
reconnoiter
esper
d3
statsd
hadoop
hbase