DNS load balancing survey results
I'm one step away from implementing network-wide DNS load balancing for drop.io. Like any good researcher, I don't act until I have enough data. Here's my canonical list of references and examples on best practices.
After reading through those, I went over to a shell and looked up the DNS records of three very popular companies that have prominent URLs.
Google
www.google.com. 434882 IN CNAME www.l.google.com.
www.l.google.com. 213 IN A 64.233.169.104
www.l.google.com. 213 IN A 64.233.169.147
www.l.google.com. 213 IN A 64.233.169.99
www.l.google.com. 213 IN A 64.233.169.103
Google has a single alias with a very long TTL, followed by four A records with a 5 minute TTL (the 213 shown above is presumably what was left on a cached answer). Coincidentally, 5 minutes is the length first mentioned in RFC 1794. My assumption about this method is that when a system falls offline, an external notification service can act on that failure, either by putting a new host online within the window of the TTL or by updating the DNS record with a fresh IP address. A hot standby would be excellent for this.
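A minimal sketch of that assumption in Python, not Google's actual setup: an external checker decides which addresses to publish for the next TTL window, swapping in a hot standby for any host that fails. The names `refresh_records`, `rotate`, and `is_healthy`, and the standby address `192.0.2.10`, are all made up for illustration.

```python
from typing import Callable, List


def refresh_records(
    active: List[str],
    standbys: List[str],
    is_healthy: Callable[[str], bool],
) -> List[str]:
    """Return the A record set to publish for the next TTL window.

    Unhealthy addresses are replaced by a healthy standby when one is
    available, and simply dropped when none is left.
    """
    spare = [ip for ip in standbys if is_healthy(ip)]
    published = []
    for ip in active:
        if is_healthy(ip):
            published.append(ip)
        elif spare:
            published.append(spare.pop(0))
    return published


def rotate(records: List[str]) -> List[str]:
    """One simple round-robin step: move the first record to the end."""
    return records[1:] + records[:1]


if __name__ == "__main__":
    active = ["64.233.169.104", "64.233.169.147", "64.233.169.99", "64.233.169.103"]
    standbys = ["192.0.2.10"]   # hypothetical hot standby (documentation range)
    down = {"64.233.169.147"}   # pretend this host failed its check
    published = refresh_records(active, standbys, lambda ip: ip not in down)
    print(published)
    print(rotate(published))
```

Running `refresh_records` once per TTL window keeps the published set at full strength as long as a standby is available; `rotate` only changes the order in which the same addresses are handed out.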
Yahoo!
www.yahoo.com. 162 IN CNAME www.yahoo-ht3.akadns.net.
www.yahoo-ht3.akadns.net. 11 IN A 69.147.76.15
Yahoo is much more aggressive than Google with their TTLs. They also use an alias, but in this case its TTL is 5 minutes and it points to a domain that looks like it could be one of a cluster of similarly named hosts. The A record that follows is even shorter-lived than the alias, at 60 seconds. I can imagine that they get good distribution from their cluster of real IP addresses with a maximum outage window of 60 seconds. The alias also leaves a margin for error: if it is misconfigured, a corrected record takes effect within 5 minutes. This feels a bit over-engineered to me, but I can imagine two wheels, one 5 times as big as the other, spinning in unison. The big wheel makes one rotation for every 5 of the little one. But in this case the little wheel is swapped out each time it rotates. There can be more than 5 little wheels available to be coupled to the big one. The big wheel is rarely swapped out.
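The wheel picture can be put into numbers with a small Python sketch. It assumes the TTLs inferred above (300 seconds for the alias, 60 for the A record) and a resolver that re-fetches each record the moment it expires; `refetch_times` is a made-up helper name.

```python
from typing import List

ALIAS_TTL = 300    # seconds, assumed full TTL of the CNAME
ADDRESS_TTL = 60   # seconds, assumed full TTL of the A record


def refetch_times(ttl: int, horizon: int) -> List[int]:
    """Times at which a resolver fetches a record, starting from a cold cache at t=0."""
    return list(range(0, horizon, ttl))


if __name__ == "__main__":
    # One full turn of the big wheel, i.e. one alias lifetime.
    print(refetch_times(ALIAS_TTL, ALIAS_TTL))
    print(refetch_times(ADDRESS_TTL, ALIAS_TTL))
```

Over one alias lifetime the A record is looked up five times, which is the 5:1 wheel ratio. Under the same assumptions, a bad A record lingers for at most `ADDRESS_TTL` seconds and a bad alias for at most `ALIAS_TTL` seconds.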
Amazon
www.amazon.com. 12 IN A 72.21.207.65
Amazon is the simplest and most aggressive of them all: a single A record with a TTL of 60 seconds. Quick, efficient. Not too nice to the resolvers and name servers that have to answer the same query again every minute, but whatevs, Amazon is in the high bandwidth business and they know that bandwidth is only getting larger, so who cares about a few hundred bytes flowing around every minute? On the backend, Amazon probably has the most elaborate system. Their DNS server is presumably testing the availability of each IP that's "on deck" in a queue of IPs, first in, first out. If the next IP proves that it's ready, the DNS server points the A record at it, and the pattern repeats every 60 seconds. If an IP is not ready, skip it and move to the next one. Then there must be some way of flagging an IP as "wasted" and removing it from the queue.
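The guessed-at rotation could look something like the Python sketch below. This is speculation about the pattern, not Amazon's real system. The names `next_record` and `is_ready`, the `MAX_FAILURES` threshold for calling an IP "wasted", and the example addresses are all assumptions.

```python
from collections import deque
from typing import Callable, Deque, Dict, Optional

MAX_FAILURES = 3  # assumption: consecutive failed checks before an IP is "wasted"


def next_record(
    on_deck: Deque[str],
    is_ready: Callable[[str], bool],
    failures: Dict[str, int],
) -> Optional[str]:
    """Pick the IP to publish in the A record for the next 60 second window.

    IPs are taken from the front of the queue (first in, first out).
    A ready IP is returned and sent to the back of the queue.
    An IP that is not ready is skipped; once it has failed MAX_FAILURES
    times in a row it is dropped from the queue for good.
    """
    for _ in range(len(on_deck)):
        ip = on_deck.popleft()
        if is_ready(ip):
            failures[ip] = 0
            on_deck.append(ip)
            return ip
        failures[ip] = failures.get(ip, 0) + 1
        if failures[ip] < MAX_FAILURES:
            on_deck.append(ip)
    return None  # nothing in the queue is ready


if __name__ == "__main__":
    queue = deque(["192.0.2.1", "192.0.2.2", "192.0.2.3"])  # documentation range
    failures: Dict[str, int] = {}
    for _ in range(5):  # five 60 second windows
        print(next_record(queue, lambda ip: ip != "192.0.2.1", failures))
    print(list(queue))
```

In a real deployment something would call `next_record` once per TTL and push the result to the DNS server; here the health check is just a lambda that treats one address as permanently down, so it ends up removed from the queue.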
In conclusion, I'm going to use the Google method. It gives the most "for free" by distributing load among a group of IPs that have decent uptime. A failed host will cause at most about 5 minutes of unavailability for clients that cached its address, which is the time it takes the TTL to expire.
Written on
2009-01-28 14:49:43 UTC