I noticed that Google Talk does an interesting type of DNS load balancing. It goes like this:
talk.google.com 243885 CNAME talk.l.google.com
talk.l.google.com 300 A ip.address
This means that the talk.google.com CNAME takes a very long time to expire from caches (243885 seconds is nearly three days), whereas the talk.l.google.com A record does not: it expires after 300 seconds, or five minutes.
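To make the two TTLs easier to compare, here is a minimal Python sketch that parses the two record lines quoted above and prints each TTL as a duration. The `parse_record` helper and the `RECORDS` string are names made up for this illustration, and the "name ttl type target" layout is simply the one used in this post, not a full zone-file parser.

```python
from datetime import timedelta

# The two records as quoted above: name, TTL in seconds, type, target.
RECORDS = """\
talk.google.com 243885 CNAME talk.l.google.com
talk.l.google.com 300 A ip.address
"""


def parse_record(line):
    """Split one 'name ttl type target' line into a dict (illustrative helper)."""
    name, ttl, rtype, target = line.split()
    return {"name": name, "ttl": int(ttl), "type": rtype, "target": target}


records = [parse_record(line) for line in RECORDS.splitlines()]

for rec in records:
    # timedelta renders the TTL in days and h:mm:ss form.
    print(rec["name"], rec["type"], timedelta(seconds=rec["ttl"]))
```

The long-lived CNAME keeps clients pointed at the same name for days, while the short-lived A record is the only part that needs to change often.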
Now, let's say we have five Erlang nodes, named A, B, C, D, and E, each with its own IP address. I want all users to connect to talk.foo.com and get a decent distribution among the five nodes. If I use the Google Talk method, talk.foo.com is a long-lived CNAME for a.foo.com, and a.foo.com is a short-lived A record whose IP address I change, so connecting users get passed to a different server at five-minute intervals. So: userA -> talk.foo.com -> a.foo.com (ip.address.first). Five minutes later: userB -> talk.foo.com -> a.foo.com (ip.address.second). Brilliant! This requires a script controlling a DNS server that I own, since I'll be updating a.foo.com's IP address every 5 minutes, though a more intelligent algorithm could be used to trigger the IP update: there's no need to load balance if the load isn't there to balance. In the case of Jabber, load is easy to detect, since each user on the system has a persistent connection. They quite literally are "logged in".
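The decision logic of such a script might look roughly like the Python sketch below. Everything in it is an assumption made for illustration: the `Rotator` class, the `min_new_logins` threshold, and the 192.0.2.x addresses (a range reserved for documentation). The step that actually pushes the new address to the DNS server depends on the server you run and is left as a comment.

```python
# Hypothetical node addresses from the documentation range 192.0.2.0/24.
NODES = {
    "A": "192.0.2.1",
    "B": "192.0.2.2",
    "C": "192.0.2.3",
    "D": "192.0.2.4",
    "E": "192.0.2.5",
}


class Rotator:
    """Decides which node IP the short-TTL A record should point at."""

    def __init__(self, ips, interval=300, min_new_logins=1):
        self.ips = list(ips)
        self.interval = interval              # seconds, matches the A record TTL
        self.min_new_logins = min_new_logins  # skip rotation when there is no load
        self.index = 0
        self.last_switch = 0
        self.new_logins = 0

    def current(self):
        return self.ips[self.index]

    def record_login(self):
        """Call this whenever a user logs in to the current node."""
        self.new_logins += 1

    def maybe_rotate(self, now):
        """Advance to the next IP if the interval has passed and there was load."""
        due = now - self.last_switch >= self.interval
        busy = self.new_logins >= self.min_new_logins
        if due and busy:
            self.index = (self.index + 1) % len(self.ips)
            self.last_switch = now
            self.new_logins = 0
            # Here the script would push self.current() to the DNS server.
            return True
        return False


rotator = Rotator(NODES.values())
rotator.record_login()                     # userA logs in to node A
changed = rotator.maybe_rotate(now=300)    # five minutes later: move on to node B
idle = rotator.maybe_rotate(now=600)       # nobody logged in since, so stay on B
```

Passing `now` in explicitly, rather than reading the clock inside the class, keeps the sketch easy to test; a real script would pass `time.time()` from a loop or a cron job.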
So what if node B goes down? Well, all the users connected to that node will get disconnected, no doubt about it. But depending on how far into the 300-second (5-minute) countdown we are, they can reconnect and pick right back up where they left off. Then I can remove ip.address.second (node B) from my distribution script until I fix that node. This pattern is not high availability, but it would be possible to put an HA system on all of these IPs so that a node outage would, at worst, do nothing more than interrupt a persistent connection.
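Dropping a dead node from the distribution could be sketched like this in Python. The check shown is only an assumption: it tries a TCP connection to port 5222, the standard XMPP client port, and treats a failure as "node down". The names `tcp_alive` and `live_nodes` and the 192.0.2.x addresses are made up for the example, and a fake check stands in for the network so the sketch runs anywhere.

```python
import socket

XMPP_CLIENT_PORT = 5222  # standard XMPP (Jabber) client-to-server port

# Hypothetical addresses for nodes A to E, from the documentation range.
ALL_IPS = ["192.0.2.1", "192.0.2.2", "192.0.2.3", "192.0.2.4", "192.0.2.5"]


def tcp_alive(ip, port=XMPP_CLIENT_PORT, timeout=2.0):
    """Return True if a TCP connection to ip:port succeeds within the timeout."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False


def live_nodes(ips, check=tcp_alive):
    """Keep only the IPs that pass the health check, preserving their order."""
    return [ip for ip in ips if check(ip)]


# Simulate node B (192.0.2.2) being down without touching the network.
down = {"192.0.2.2"}
pool = live_nodes(ALL_IPS, check=lambda ip: ip not in down)
print(pool)
```

The distribution script would then rotate over `pool` instead of the full list, and node B rejoins automatically once its check passes again.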
Written on
2009-01-23 19:55:41 UTC