Amazon Route 53
Developer Guide (API Version 2013-04-01)

How Amazon Route 53 Averts Failover Problems

The failover algorithms implemented by Amazon Route 53 are designed not only to route traffic to endpoints that are healthy, but also to avoid making disaster scenarios worse due to misconfigured health checks and applications, endpoint overloads, and partition failures.

How Amazon Route 53 Averts Cascading Failures

As a first defense against cascading failures, each request routing algorithm (weighted, latency, geolocation, and failover) has a mode of last resort. In this special mode, when all records are considered unhealthy, the Amazon Route 53 algorithm reverts to considering all records healthy.

For example, if all instances of an application, on several hosts, are rejecting health check requests, Amazon Route 53 DNS servers will choose an answer anyway and return it rather than returning no DNS answer or returning an NXDOMAIN (non-existent domain) response. An application can respond to users but still fail health checks, so this provides some protection against misconfiguration.

Similarly, if an application is overloaded, and one out of three endpoints fails its health checks, so that it's excluded from Amazon Route 53 DNS responses, Amazon Route 53 distributes responses between the two remaining endpoints. If the remaining endpoints are unable to handle the additional load and they fail, Amazon Route 53 reverts to distributing requests to all three endpoints.
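The mode of last resort described above can be illustrated with a minimal Python sketch. This is not Amazon Route 53's implementation. The Record type, the candidate_records and choose_answer functions, and the use of positive weights are all hypothetical names and assumptions introduced only for illustration.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """Hypothetical stand-in for a DNS record with a health status."""
    name: str
    weight: int      # assumed to be a positive integer in this sketch
    healthy: bool


def candidate_records(records):
    """Return the records eligible to be used in a DNS answer.

    Normally only healthy records are eligible. If every record is
    unhealthy, fall back to treating all records as healthy (the mode
    of last resort) instead of returning nothing.
    """
    healthy = [r for r in records if r.healthy]
    return healthy if healthy else list(records)


def choose_answer(records, rng=random):
    """Pick one eligible record, proportionally to its weight."""
    candidates = candidate_records(records)
    weights = [r.weight for r in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]
```

With three endpoints of which one is unhealthy, candidate_records returns the two healthy ones. If the remaining two also fail, it returns all three again, so an answer is always available.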

How Amazon Route 53 Handles Internet Partitions

Although uncommon, significant internet partitions occasionally occur, meaning that large geographic regions can't communicate with one another over the internet. During these partitions, Amazon Route 53 locations might reach different conclusions about the health status of an endpoint, and those conclusions might differ from the status reported to CloudWatch. Amazon Route 53 health checkers in each AWS region constantly send health check statuses to all Amazon Route 53 locations. During internet partitions, each Amazon Route 53 location might have access only to a partial set of these statuses, usually from its closest regions.

For example, during an internet partition that affects connectivity to and from South America, the Amazon Route 53 DNS servers in the Amazon Route 53 South America (São Paulo) location might have good access to the health check endpoints in the South America (São Paulo) AWS region, but poor access to endpoints elsewhere. At the same time, Amazon Route 53 in US East (Ohio) might have poor access to health check endpoints in the South America (São Paulo) region, and conclude that the corresponding records are unhealthy.

Partitions such as these can give rise to situations where Amazon Route 53 locations reach different conclusions about the health status of endpoints, based on their local visibility of those endpoints. This is why each Amazon Route 53 location considers an endpoint healthy when only a portion of reachable health checkers consider it healthy.
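As a rough illustration of how a single location might reach its own conclusion from a partial set of statuses, consider the following Python sketch. It is not Amazon Route 53's implementation. The location_considers_healthy function, the min_healthy_fraction parameter, the use of None for an unreachable health checker, and the choice to raise an error when no checker is reachable are all assumptions made for this example. The text above does not state the actual threshold.

```python
from typing import Mapping, Optional


def location_considers_healthy(
    reports: Mapping[str, Optional[bool]],
    min_healthy_fraction: float,
) -> bool:
    """Decide endpoint health from one location's point of view.

    reports maps a health checker region to True (healthy), False
    (unhealthy), or None when this location cannot reach that checker.
    Only reachable checkers are counted, so a partition shrinks the
    set of votes rather than turning missing votes into failures.
    """
    reachable = [status for status in reports.values() if status is not None]
    if not reachable:
        # The behavior with no reachable checkers is not specified in
        # the text; this sketch simply refuses to decide.
        raise ValueError("no reachable health checkers")
    healthy_fraction = sum(reachable) / len(reachable)
    # Hypothetical rule: healthy when the fraction meets the threshold.
    return healthy_fraction >= min_healthy_fraction
```

For example, a location that can reach only one nearby health checker, which reports healthy, concludes that the endpoint is healthy, while a location that reaches three other checkers with one healthy report reaches a conclusion that depends on the chosen threshold.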