Best practices for Amazon Route 53 Application Recovery Controller

We recommend the following best practices for recovery readiness and failover preparedness when you set up and use Amazon Route 53 Application Recovery Controller.

Keep purpose-built, long-lived AWS credentials secure and always accessible

In a disaster recovery (DR) scenario, keep system dependencies to a minimum by using a simple approach to access AWS and perform recovery tasks. Create long-lived IAM credentials specifically for DR tasks, and store them securely in an on-premises physical safe or a virtual vault so that you can retrieve them when needed. With IAM, you can centrally manage security credentials, such as access keys, and permissions for access to AWS resources. For non-DR tasks, we recommend that you continue to use federated access through AWS services such as AWS IAM Identity Center (successor to AWS Single Sign-On).

To perform failover tasks in Route 53 ARC with the recovery cluster data plane API, you can attach a Route 53 ARC IAM policy to your IAM user. To learn more, see Identity-based policy examples for Amazon Route 53 Application Recovery Controller.
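
As an illustration of such a policy, the following Python sketch builds a policy document that allows only reading and updating specific routing controls. The function name `build_dr_data_plane_policy` and the sample ARN are hypothetical, and the action list reflects our understanding of the `route53-recovery-cluster` actions; compare it with the identity-based policy examples before you attach it to a user.

```python
import json


def build_dr_data_plane_policy(routing_control_arns):
    """Return an IAM policy document (as a dict) that allows reading and
    updating the given routing controls through the data plane API.

    Assumption: these action names match the route53-recovery-cluster
    actions that your account needs. Verify them before use.
    """
    arns = list(routing_control_arns)
    if not arns:
        raise ValueError("at least one routing control ARN is required")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "route53-recovery-cluster:GetRoutingControlState",
                    "route53-recovery-cluster:UpdateRoutingControlState",
                    "route53-recovery-cluster:UpdateRoutingControlStates",
                ],
                "Resource": arns,
            }
        ],
    }


def policy_as_json(policy):
    """Serialize the policy so that it can be pasted into IAM."""
    return json.dumps(policy, indent=2)
```

Scoping `Resource` to the routing controls used for failover keeps the long-lived DR credentials as narrow as possible. Listing routing controls might need an additional statement, which this sketch leaves out.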

Choose lower TTL values for DNS records involved in failover

For DNS records that you might need to change as part of your failover mechanism, especially records that are health checked, using lower TTL values is appropriate. Setting a TTL of 60 or 120 seconds is a common choice for this scenario.

The DNS TTL (time to live) setting tells DNS resolvers how long to cache a record before requesting a new one. When you choose a TTL, you trade off latency and reliability against responsiveness to change. With a shorter TTL on a record, DNS resolvers notice updates to the record more quickly because they must query for it more frequently.
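
As a minimal sketch of setting a short TTL, the following Python function builds the change batch for a Route 53 `ChangeResourceRecordSets` request that upserts a simple A record. The function name, the record name, and the IP address are hypothetical placeholders. The sketch covers a basic (non-alias) record only; failover or health-checked records need additional fields that are not shown here.

```python
def build_low_ttl_upsert(record_name, ip_addresses, ttl_seconds=60):
    """Build a ChangeBatch that upserts an A record with a short TTL.

    The result can be passed as the ChangeBatch argument of the Route 53
    ChangeResourceRecordSets operation.
    """
    values = list(ip_addresses)
    if not values:
        raise ValueError("at least one IP address is required")
    if ttl_seconds <= 0:
        raise ValueError("ttl_seconds must be positive")
    return {
        "Comment": "Short TTL for a record involved in failover",
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "A",
                    "TTL": ttl_seconds,
                    "ResourceRecords": [{"Value": ip} for ip in values],
                },
            }
        ],
    }
```

With the AWS SDK for Python, you could submit the result with `boto3.client("route53").change_resource_record_sets(HostedZoneId=..., ChangeBatch=batch)`, where the hosted zone ID is your own.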

For more information, see Choosing TTL values for DNS records in Best practices for Amazon Route 53 DNS.

Bookmark or hardcode your five Regional cluster endpoints and routing control ARNs

We recommend that you keep a local copy of your Route 53 ARC Regional cluster endpoints and routing control ARNs, either in bookmarks or hardcoded in the automation code that retries your endpoints. During a failure event, you might not be able to access some API operations, including Route 53 ARC API operations that are not hosted on the extremely reliable data plane cluster. You can list the endpoints for your Route 53 ARC clusters by using the DescribeCluster API operation.
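
One way to capture that local copy ahead of time is sketched below in Python. It assumes the AWS SDK for Python (boto3) and its `route53-recovery-control-config` client, which we understand to be served from US West (Oregon); the helper names are hypothetical. The client can be injected, so the parsing logic also runs without AWS access.

```python
import json


def extract_cluster_endpoints(describe_cluster_response):
    """Pull the Region and endpoint URL pairs out of a DescribeCluster response."""
    endpoints = describe_cluster_response["Cluster"]["ClusterEndpoints"]
    return [{"Region": e["Region"], "Endpoint": e["Endpoint"]} for e in endpoints]


def fetch_cluster_endpoints(cluster_arn, client=None):
    """Call DescribeCluster and return the cluster's Regional endpoints."""
    if client is None:
        # Imported lazily so that the parsing helper works without boto3.
        import boto3

        # Assumption: the control plane API is reachable in us-west-2.
        client = boto3.client(
            "route53-recovery-control-config", region_name="us-west-2"
        )
    response = client.describe_cluster(ClusterArn=cluster_arn)
    return extract_cluster_endpoints(response)


def endpoints_to_json(endpoints):
    """Serialize endpoints for storage in a file, a runbook, or source control."""
    return json.dumps(endpoints, indent=2, sort_keys=True)
```

Run this during normal operations, not during an event, and store the output where your failover automation can read it without calling AWS.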

Use the extremely reliable data plane API to list and update routing control states, not the console

Use the Route 53 ARC data plane API to view your routing controls and their states with the ListRoutingControls operation, and to update routing control states to redirect traffic for failover with the UpdateRoutingControlState operation. You can use the AWS CLI or code that you write using one of the AWS SDKs. The data plane API offers extreme reliability for failing over traffic, so we recommend using it instead of changing routing control states in the AWS Management Console.
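
As a sketch of the listing step, the following Python function collects the state of every routing control in a control panel, following `NextToken` pagination. The function name is hypothetical. The `client` argument is assumed to be a boto3 `route53-recovery-cluster` client created against one of your Regional cluster endpoints, for example with `boto3.client("route53-recovery-cluster", endpoint_url=..., region_name=...)`.

```python
def list_routing_control_states(client, control_panel_arn):
    """Return a dict of routing control name to state ("On" or "Off")
    for one control panel, reading every page of results."""
    states = {}
    kwargs = {"ControlPanelArn": control_panel_arn}
    while True:
        page = client.list_routing_controls(**kwargs)
        for control in page.get("RoutingControls", []):
            states[control["RoutingControlName"]] = control["RoutingControlState"]
        token = page.get("NextToken")
        if not token:
            return states
        # Ask for the next page on the following iteration.
        kwargs["NextToken"] = token
```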

Connect to one of your Regional cluster endpoints for Route 53 ARC to use the data plane API. If the endpoint is unavailable, try connecting to another cluster endpoint.

If a safety rule blocks a routing control state update, you can bypass it to make the update and fail over traffic. For more information, see Overriding safety rules to reroute traffic.
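
The following Python sketch shows how such an override might be expressed as request parameters for the UpdateRoutingControlState operation. The helper name and the sample ARNs are hypothetical; the parameter names follow the operation as exposed by the AWS SDK for Python, to the best of our knowledge.

```python
def build_update_request(routing_control_arn, state, safety_rules_to_override=None):
    """Build keyword arguments for UpdateRoutingControlState.

    safety_rules_to_override is an optional list of safety rule ARNs to
    bypass for this one update. Leave it empty for normal operation.
    """
    if state not in ("On", "Off"):
        raise ValueError('state must be "On" or "Off"')
    request = {
        "RoutingControlArn": routing_control_arn,
        "RoutingControlState": state,
    }
    if safety_rules_to_override:
        request["SafetyRulesToOverride"] = list(safety_rules_to_override)
    return request
```

With a boto3 `route53-recovery-cluster` client, the request could be sent as `client.update_routing_control_state(**request)`. Keeping the override optional makes a bypass a deliberate choice in the calling code, not a default.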

Retry each of the five Regional cluster endpoints for updating routing control states

You can use any of the Regional cluster endpoints to get or update routing control states. We recommend that your systems be prepared to retry with each of the available endpoints. For information about using code examples with the AWS SDK, including examples for trying cluster endpoints in sequence, see Code examples for Application Recovery Controller using AWS SDKs.
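
A minimal sketch of such a retry loop in Python follows. It assumes that `endpoints` is a list of dictionaries with `Region` and `Endpoint` keys (the shape of `ClusterEndpoints` in a DescribeCluster response) and that boto3 is available when no client factory is supplied. The function names are hypothetical, and this is not the code from the SDK examples.

```python
import random


def default_client_factory(endpoint):
    """Create a data plane client bound to one Regional cluster endpoint."""
    import boto3  # imported lazily so that the retry logic is testable offline

    return boto3.client(
        "route53-recovery-cluster",
        endpoint_url=endpoint["Endpoint"],
        region_name=endpoint["Region"],
    )


def update_state_with_retry(endpoints, routing_control_arn, state,
                            client_factory=default_client_factory):
    """Try the endpoints in random order until one accepts the update.

    Returns the endpoint that succeeded. Raises RuntimeError if every
    endpoint fails.
    """
    candidates = list(endpoints)
    random.shuffle(candidates)  # avoid always hitting the same endpoint first
    errors = []
    for endpoint in candidates:
        try:
            client = client_factory(endpoint)
            client.update_routing_control_state(
                RoutingControlArn=routing_control_arn,
                RoutingControlState=state,
            )
            return endpoint
        except Exception as exc:
            # Broad on purpose for this sketch. Production code should
            # probably retry only on connection or endpoint errors.
            errors.append((endpoint["Endpoint"], repr(exc)))
    raise RuntimeError(f"all cluster endpoints failed: {errors}")
```

Note that the broad `except` also retries on errors that another endpoint cannot fix, such as an access denial or a safety rule conflict. Narrow it to match your own error handling.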

Test failover with Route 53 ARC

Test failover regularly with Route 53 ARC routing controls by failing over from your primary application stack to a secondary application stack. It's important to make sure that the Route 53 ARC structures that you've added are aligned with the correct resources in your stack, and that everything works as you expect. Test failover after you set up Route 53 ARC for your environment, and continue to test periodically. That way, your failover environment is prepared before you experience a failure in which you need your secondary system up and running quickly to avoid downtime for your users.
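
As one possible shape for an automated drill, the following Python sketch turns the primary routing control off and the secondary on in a single batch request, and then reads both states back. The function name and the ARNs are hypothetical, and `client` is assumed to be a boto3 `route53-recovery-cluster` client connected to one of your Regional cluster endpoints.

```python
def run_failover_drill(client, primary_arn, secondary_arn):
    """Shift traffic from the primary to the secondary routing control,
    then confirm that both controls report the requested state.

    Returns (observed, mismatches). mismatches is empty when the drill
    left the controls in the expected states.
    """
    desired = {primary_arn: "Off", secondary_arn: "On"}
    # One batch call, so that both changes are evaluated together.
    client.update_routing_control_states(
        UpdateRoutingControlStateEntries=[
            {"RoutingControlArn": arn, "RoutingControlState": state}
            for arn, state in desired.items()
        ]
    )
    observed = {
        arn: client.get_routing_control_state(RoutingControlArn=arn)[
            "RoutingControlState"
        ]
        for arn in desired
    }
    mismatches = {
        arn: {"expected": desired[arn], "observed": observed[arn]}
        for arn in desired
        if observed[arn] != desired[arn]
    }
    return observed, mismatches
```

This checks only the routing control states. A full drill should also confirm that application traffic actually reaches the secondary stack.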

Add notifications for readiness status changes

Set a rule in Amazon EventBridge to send a notification whenever a readiness check status changes, for example, from READY to NOT READY. When you receive a notification, you can investigate and address the issue, to make sure that your application and resources are ready for failover when you expect them to be.

You can set EventBridge rules to send notifications for several readiness check status changes, including for your recovery group (for your application), for a cell (such as an AWS Region), or for a readiness check for a resource set.
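
The following Python sketch builds an event pattern for one of these status changes and creates a rule from it. Treat the `detail-type` strings and the `new-state` / `readiness-status` field names as assumptions recalled from the EventBridge documentation for Route 53 ARC, and verify them before you rely on them. The function and rule names are hypothetical.

```python
import json

# Assumption: these detail-type strings match the events that
# Route 53 ARC readiness check publishes. Verify before use.
DETAIL_TYPES = {
    "recovery_group": "Route 53 Application Recovery Controller recovery group readiness status change",
    "cell": "Route 53 Application Recovery Controller cell readiness status change",
    "readiness_check": "Route 53 Application Recovery Controller readiness check status change",
}


def build_readiness_event_pattern(scope="recovery_group", new_statuses=("NOT_READY",)):
    """Build an EventBridge event pattern that matches a readiness status
    change to one of the given statuses for the chosen scope."""
    if scope not in DETAIL_TYPES:
        raise ValueError(f"scope must be one of {sorted(DETAIL_TYPES)}")
    return {
        "source": ["aws.route53-recovery-readiness"],
        "detail-type": [DETAIL_TYPES[scope]],
        "detail": {"new-state": {"readiness-status": list(new_statuses)}},
    }


def create_readiness_rule(rule_name, pattern, events_client=None):
    """Create or update an EventBridge rule from the pattern."""
    if events_client is None:
        import boto3  # imported lazily so that the pattern builder works offline

        events_client = boto3.client("events")
    return events_client.put_rule(
        Name=rule_name,
        EventPattern=json.dumps(pattern),
        State="ENABLED",
    )
```

A rule sends nothing by itself. To receive notifications, attach a target such as an Amazon SNS topic to the rule, for example with the EventBridge `PutTargets` operation.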

For more information, see Using Route 53 ARC with Amazon EventBridge.