DNS load balancing survey results

I'm one step away from implementing network-wide DNS load balancing for drop.io. Like any good researcher, I don't act until I have enough data. Here's my canonical list of references and examples on best practices.

References

Examples

After reading through those, I opened a shell and looked up the DNS records of three very popular companies with prominent URLs.

Google

www.google.com.         434882  IN      CNAME   www.l.google.com.
www.l.google.com.       213     IN      A       64.233.169.104
www.l.google.com.       213     IN      A       64.233.169.147
www.l.google.com.       213     IN      A       64.233.169.99
www.l.google.com.       213     IN      A       64.233.169.103

Google has a single alias with a very long TTL, followed by four A records with a 5 minute TTL (the 213 shown above is the time left in the cache when I looked). Coincidentally, 5 minutes is the length first mentioned in RFC 1794. My assumption about this method is that when a system falls offline, an external notification service can act on that failure within the TTL window, either by putting a new host online at the same address or by updating the DNS record with a fresh IP address. A hot standby would be excellent for this.
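To make the alias-then-addresses structure concrete, here's a minimal sketch that parses the answer lines above and follows the CNAME to its A records. The helper names (parse_answer, resolve) are my own for illustration, and the TTL it reports is whatever remained in the cache at lookup time, not the value configured in the zone.

```python
from collections import namedtuple

Record = namedtuple("Record", "name ttl rclass rtype value")

# Answer lines copied from the Google lookup above.
ANSWER = """\
www.google.com.         434882  IN      CNAME   www.l.google.com.
www.l.google.com.       213     IN      A       64.233.169.104
www.l.google.com.       213     IN      A       64.233.169.147
www.l.google.com.       213     IN      A       64.233.169.99
www.l.google.com.       213     IN      A       64.233.169.103
"""


def parse_answer(text):
    """Turn whitespace-separated answer lines into Record tuples."""
    records = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 5:
            continue  # skip blank or malformed lines
        name, ttl, rclass, rtype, value = fields
        records.append(Record(name, int(ttl), rclass, rtype, value))
    return records


def resolve(records, name, max_hops=8):
    """Follow CNAMEs starting at name; return (addresses, smallest A-record TTL)."""
    for _ in range(max_hops):
        cnames = [r for r in records if r.name == name and r.rtype == "CNAME"]
        if not cnames:
            break
        name = cnames[0].value
    a_records = [r for r in records if r.name == name and r.rtype == "A"]
    ttl = min((r.ttl for r in a_records), default=None)
    return [r.value for r in a_records], ttl


addresses, ttl = resolve(parse_answer(ANSWER), "www.google.com.")
print(addresses, ttl)
```

The TTL that comes back is the longest a resolver holding this answer could keep handing out an address that has since gone bad.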

Yahoo!

www.yahoo.com.          162     IN      CNAME   www.yahoo-ht3.akadns.net.
www.yahoo-ht3.akadns.net. 11    IN      A       69.147.76.15

Yahoo is much more aggressive than Google with their TTLs. They also use an alias, but in this case its TTL is 5 minutes and it points to a name that looks like it could be one of a cluster of similarly named hosts. The A record that follows is even shorter-lived than the alias: 60 seconds. I can imagine that they get good distribution from their cluster of real IP addresses with a maximum outage window of 60 seconds. There's also a margin for error in the alias, since a misconfigured alias can be corrected within 5 minutes. This feels a bit over-engineered to me, but I can imagine two wheels, one 5 times as big as the other, spinning in unison. The big wheel makes one rotation for every 5 of the little one. But in this case the little wheel is swapped out each time it rotates. There can be more than 5 little wheels available to be coupled to the big one. The big wheel is rarely swapped out.
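The two wheels can be sketched as a toy resolver cache with a 300 second alias and a 60 second address record. Everything here is an assumption for illustration: the names are placeholders under example.com and example.net, the addresses come from the 192.0.2.0/24 documentation range, and real resolvers are far more involved. The point is only that an address change shows up within 60 seconds while an alias change can take up to 300.

```python
class TtlCache:
    """Toy resolver cache: an entry is served until its TTL expires."""

    def __init__(self):
        self._entries = {}  # key -> (value, expires_at)

    def get(self, key, now):
        entry = self._entries.get(key)
        if entry is not None and now < entry[1]:
            return entry[0]
        return None

    def put(self, key, value, ttl, now):
        self._entries[key] = (value, now + ttl)


def lookup(cache, zone, name, now):
    """Resolve name through one CNAME hop, consulting the cache before the zone."""
    target = cache.get(("CNAME", name), now)
    if target is None:
        target, ttl = zone["CNAME"][name]
        cache.put(("CNAME", name), target, ttl, now)
    address = cache.get(("A", target), now)
    if address is None:
        address, ttl = zone["A"][target]
        cache.put(("A", target), address, ttl, now)
    return address


# Hypothetical zone data: big wheel = alias (300 s), little wheel = address (60 s).
zone = {
    "CNAME": {"www.example.com.": ("pool-1.example.net.", 300)},
    "A": {
        "pool-1.example.net.": ("192.0.2.10", 60),
        "pool-2.example.net.": ("192.0.2.20", 60),
    },
}
cache = TtlCache()
at_0 = lookup(cache, zone, "www.example.com.", now=0)

# One second later the operator changes both the address and the alias.
zone["A"]["pool-1.example.net."] = ("192.0.2.11", 60)
zone["CNAME"]["www.example.com."] = ("pool-2.example.net.", 300)

at_59 = lookup(cache, zone, "www.example.com.", now=59)    # both still cached
at_60 = lookup(cache, zone, "www.example.com.", now=60)    # address expired, alias cached
at_299 = lookup(cache, zone, "www.example.com.", now=299)  # alias still cached
at_300 = lookup(cache, zone, "www.example.com.", now=300)  # alias expired too
print(at_0, at_59, at_60, at_299, at_300)
```

In this sketch the new address is visible from second 60 onward, but the new alias target only appears at second 300.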

Amazon

www.amazon.com.         12      IN      A       72.21.207.65

Amazon is the simplest and most aggressive of them all. A single A record with a TTL of 60 seconds. Quick, efficient. Not too nice to the name servers that have to answer all those repeat queries, but whatevs: Amazon is in the high bandwidth business and they know that bandwidth is only getting larger, so who cares about a few hundred bytes flowing around every minute? On the backend, Amazon probably has the most elaborate system. I imagine their DNS server tests the availability of each IP that's "on deck" in a queue of IPs, first in, first out. If the next IP proves that it's ready, the DNS server publishes it as the A record, and the pattern repeats every 60 seconds. If an IP is not ready, it gets skipped and the server moves on to the next one. Then there must be some way of flagging an IP as "wasted" and removing it from the queue.
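Here's a rough sketch of that guessed-at rotation, not anything Amazon has documented. The Rotation class, the is_healthy callback, and the max_failures threshold are all hypothetical names of mine, and the addresses are from the 192.0.2.0/24 documentation range. A scheduler would call next_address() once per TTL and publish the result as the A record; the publishing step is left out.

```python
from collections import deque


class Rotation:
    """FIFO rotation of candidate addresses, skipping ones that fail a health check."""

    def __init__(self, addresses, is_healthy, max_failures=3):
        self.queue = deque(addresses)
        self.is_healthy = is_healthy      # callable: address -> bool (assumed)
        self.max_failures = max_failures  # consecutive failures before removal
        self.failures = {addr: 0 for addr in addresses}

    def next_address(self):
        """Return the next healthy address, or None if none is usable this pass."""
        for _ in range(len(self.queue)):
            addr = self.queue.popleft()
            if self.is_healthy(addr):
                self.failures[addr] = 0
                self.queue.append(addr)  # back of the line for a later turn
                return addr
            self.failures[addr] += 1
            if self.failures[addr] < self.max_failures:
                self.queue.append(addr)  # skip for now, retry on a later pass
            # otherwise the address is "wasted" and stays out of the queue
        return None


down = {"192.0.2.2"}  # pretend this host never answers its health check
rotation = Rotation(
    ["192.0.2.1", "192.0.2.2", "192.0.2.3"],
    is_healthy=lambda addr: addr not in down,
    max_failures=2,
)
picks = [rotation.next_address() for _ in range(5)]
print(picks)
print(list(rotation.queue))
```

With one dead host, the picks alternate between the two live addresses, and after two failed checks the dead one is dropped from the queue entirely.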

In conclusion, I'm going to use the Google method. It gives the most "for free" by distributing load among a group of IPs that have decent uptime. If a host goes down and its record is updated, clients holding the stale address are affected for at most 5 minutes.


Written on 2009-01-28 14:49:43 UTC


I am a hacker and systems architect specializing in data analytics and human computer interfaces.


