Best practices for Amazon Route 53 Application Recovery Controller

Zonal shift is in preview release for Amazon Route 53 Application Recovery Controller and is subject to change.

To minimize disruption and help provide for operational continuity, follow best practices to plan for and execute disaster recovery with Amazon Route 53 Application Recovery Controller. Review the guidelines in this chapter to learn more.

Best practices for recovery in Route 53 ARC

We recommend the following best practices for recovery and failover preparedness in Amazon Route 53 Application Recovery Controller.

Keep purpose-built, long-lived AWS credentials secure and always accessible

In a disaster recovery (DR) scenario, keep system dependencies to a minimum by using a simple approach to accessing AWS and performing recovery tasks. Create long-lived IAM credentials specifically for DR tasks, and store them securely in an on-premises physical safe or a virtual vault so that you can access them when needed. With IAM, you can centrally manage security credentials, such as access keys, and permissions for access to AWS resources. For non-DR tasks, we recommend that you continue to use federated access, using AWS services such as AWS Single Sign-On.

To perform failover tasks in Route 53 ARC with the recovery cluster data plane API, you can attach a Route 53 ARC IAM policy to your IAM user. To learn more, see Identity-based policy examples for Amazon Route 53 Application Recovery Controller.

Choose lower TTL values for DNS records involved in failover

For DNS records that you might need to change as part of your failover mechanism, especially records that are health checked, using lower TTL values is appropriate. Setting a TTL of 60 or 120 seconds is a common choice for this scenario.

The DNS TTL (time to live) setting tells DNS resolvers how long to cache a record before requesting a new one. When you choose a TTL, you trade off latency and reliability against responsiveness to change. With a shorter TTL on a record, DNS resolvers notice updates to the record more quickly because they must query for it more frequently.
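As a minimal sketch of applying a low TTL, the following Python builds the change batch that the Route 53 ChangeResourceRecordSets operation accepts for a non-alias record. The helper name build_low_ttl_change, the record name, and the IP address are hypothetical. The 60-second default reflects the common choice mentioned above, not a required value.

```python
def build_low_ttl_change(record_name, ip_address, ttl_seconds=60):
    """Return a ChangeBatch dict that upserts an A record with a short TTL."""
    if ttl_seconds <= 0:
        raise ValueError("ttl_seconds must be positive")
    return {
        "Comment": "Lower TTL for a record involved in failover",
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "A",
                    "TTL": ttl_seconds,
                    "ResourceRecords": [{"Value": ip_address}],
                },
            }
        ],
    }


# Usage (assumes boto3 is installed and credentials are configured):
#   import boto3
#   boto3.client("route53").change_resource_record_sets(
#       HostedZoneId="Z0000000EXAMPLE",  # hypothetical hosted zone ID
#       ChangeBatch=build_low_ttl_change("app.example.com.", "192.0.2.10"),
#   )
```

Building the payload separately from the API call makes it possible to review or unit test the change before it is submitted.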

For more information, see Choosing TTL values for DNS records in Best practices for Amazon Route 53 DNS.

Best practices for zonal shifts in Route 53 ARC

We recommend the following best practices for using zonal shifts for multi-AZ recovery in Route 53 ARC. Zonal shifts typically remove capacity from a live application, so it's important to be careful when you use them in production.

Capacity planning and pre-scaling

Ensure that you have planned for, and either pre-scaled or can auto-scale, sufficient capacity to accommodate the extra load imposed on Availability Zones when you start a zonal shift. With a recovery-oriented architecture, a typical recommendation is to pre-scale compute capacity to include enough headroom to serve your peak traffic when one of your (typically) three replicas is offline.

When you start a zonal shift for a single load balancer resource, for example, the capacity of one Availability Zone is temporarily removed from behind the load balancer. Depending on how your load balancers are configured and the zonal shifts that you start, you must make sure that you've carefully planned for managing the increased load on the remaining Availability Zones.
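The headroom arithmetic can be sketched as follows. This is an illustrative calculation only: it assumes that load spreads evenly across the Availability Zones that remain in service, and the function names, the peak load, and the per-instance throughput are hypothetical values chosen for the example.

```python
import math


def load_per_remaining_zone(peak_load, zones=3, zones_shifted=1):
    """Peak load that each remaining zone must absorb during a zonal shift."""
    remaining = zones - zones_shifted
    if remaining < 1:
        raise ValueError("at least one Availability Zone must remain in service")
    # Assumes the load balancer spreads traffic evenly across remaining zones.
    return peak_load / remaining


def instances_per_zone(peak_load, load_per_instance, zones=3, zones_shifted=1):
    """Instances to pre-scale in every zone so that peak load is still served."""
    per_zone = load_per_remaining_zone(peak_load, zones, zones_shifted)
    return math.ceil(per_zone / load_per_instance)


# Example: 9,000 requests per second at peak, 500 per instance, three zones.
with_headroom = instances_per_zone(9000, 500)                      # 9 per zone
without_headroom = instances_per_zone(9000, 500, zones_shifted=0)  # 6 per zone
```

In this example, tolerating the loss of one of three zones means running 9 instances per zone instead of 6, which is 50 percent more capacity than steady-state peak requires.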

Test starting zonal shifts in advance

Regularly test moving traffic away from Availability Zones for your application by starting zonal shifts. Plan for and execute starting zonal shifts, preferably in both test and production environments, as part of regular failover testing for recovering your applications in the event of a disaster. Regular testing is a critical part of ensuring that you're ready for and have the confidence to mitigate issues when an operational event occurs.

Ensure that all Availability Zones are healthy and taking traffic

Zonal shifts work by marking a resource, that is, an application replica, as unhealthy in an Availability Zone. This means that it's critical to ensure that the targets in the load balancers for your applications are generally healthy and actively taking traffic in the Availability Zones in a Region. We recommend that you have dashboards to track this, including, for example, Elastic Load Balancing metrics for unhealthy targets and ProcessedBytes per Availability Zone.

Consider monitoring health of your resources from a second, adjacent Region. Advantages of this approach are that it can be more representative of your end users' experience, and it also reduces the risk of both your application and your monitoring being impacted by the same disaster at the same time ("shared fate").

Use data plane API operations for disaster recovery

When you need to recover an application quickly and with few dependencies, we recommend that you start a zonal shift by using the AWS CLI or API zonal shift actions, with pre-stored credentials if possible. You can also start zonal shifts in the AWS Management Console, for ease of use, but when fast, reliable recovery is critical, data plane operations are a better choice. For more information, see Zonal Shift API Reference Guide.
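As a sketch of starting a zonal shift through the API, the following Python wraps the StartZonalShift operation as exposed by the boto3 "arc-zonal-shift" client. The wrapper name, the default expiration, the comment text, and the ARN and Availability Zone ID in the usage notes are hypothetical. Because zonal shift is in preview, verify the parameters against the Zonal Shift API Reference Guide before relying on them.

```python
def start_zonal_shift(client, resource_arn, away_from_az_id,
                      expires_in="1h", comment="Shifting away from impaired AZ"):
    """Start a zonal shift and return its ID.

    client is expected to behave like boto3.client("arc-zonal-shift").
    away_from_az_id is an Availability Zone ID such as "use1-az1".
    """
    response = client.start_zonal_shift(
        resourceIdentifier=resource_arn,
        awayFrom=away_from_az_id,
        expiresIn=expires_in,  # a zonal shift is temporary and must expire
        comment=comment,
    )
    return response["zonalShiftId"]


# Usage (assumes boto3 is installed and DR credentials are configured):
#   import boto3
#   client = boto3.client("arc-zonal-shift", region_name="us-east-1")
#   shift_id = start_zonal_shift(
#       client,
#       "arn:aws:elasticloadbalancing:...",  # your load balancer ARN
#       "use1-az1",
#   )
```

Passing the client in as an argument keeps the function free of setup dependencies, so it can be exercised in tests with a stand-in client.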

Move traffic with a zonal shift only temporarily

A zonal shift moves traffic away from an Availability Zone on a temporary basis, to mitigate an impairment. Return the application's resource to service in that Availability Zone as soon as you've taken action to correct the problem. This ensures that your overall application is restored to its original fully redundant, resilient state.
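One way to return resources to service is to cancel the shifts that are still active. The sketch below assumes the ListZonalShifts and CancelZonalShift operations as exposed by the boto3 "arc-zonal-shift" client. The helper name is hypothetical, and pagination is deliberately left out to keep the example short.

```python
def cancel_active_zonal_shifts(client):
    """Cancel every active zonal shift and return the IDs that were canceled.

    client is expected to behave like boto3.client("arc-zonal-shift").
    Only the first page of results is processed (no nextToken handling).
    """
    canceled = []
    response = client.list_zonal_shifts(status="ACTIVE")
    for shift in response.get("items", []):
        shift_id = shift["zonalShiftId"]
        client.cancel_zonal_shift(zonalShiftId=shift_id)
        canceled.append(shift_id)
    return canceled


# Usage (assumes boto3 is installed and credentials are configured):
#   import boto3
#   client = boto3.client("arc-zonal-shift", region_name="us-east-1")
#   print(cancel_active_zonal_shifts(client))
```

Run something like this only after you've confirmed that the impairment is resolved, because canceling a shift sends traffic back to the Availability Zone.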

Best practices for readiness checks and routing controls in Route 53 ARC

We recommend the following best practices for recovery readiness and failover preparedness when you set up and use Route 53 ARC with readiness checks and routing control, for example, for Regional failover.

Bookmark or hard code your five Regional cluster endpoints and routing control ARNs

We recommend that you keep a local copy of your Route 53 ARC Regional cluster endpoints, in bookmarks or saved in automation code that you use to retry your endpoints. During a failure event, you might not be able to access some API operations, including Route 53 ARC API operations that are not hosted on the extremely reliable data plane cluster. You can list the endpoints for your Route 53 ARC clusters by using the DescribeCluster API operation.
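A minimal sketch of keeping that local copy follows. It assumes the DescribeCluster operation as exposed by the boto3 "route53-recovery-control-config" client, whose response lists each endpoint with its Region. The helper name, the file path, and the cluster ARN are hypothetical.

```python
import json
from pathlib import Path


def save_cluster_endpoints(config_client, cluster_arn, path):
    """Fetch the cluster's Regional endpoints and write them to a JSON file.

    config_client is expected to behave like
    boto3.client("route53-recovery-control-config").
    """
    cluster = config_client.describe_cluster(ClusterArn=cluster_arn)["Cluster"]
    endpoints = [
        {"Endpoint": item["Endpoint"], "Region": item["Region"]}
        for item in cluster["ClusterEndpoints"]
    ]
    Path(path).write_text(json.dumps(endpoints, indent=2))
    return endpoints


# Usage (run ahead of time, not during a failure event):
#   import boto3
#   client = boto3.client("route53-recovery-control-config",
#                         region_name="us-west-2")
#   save_cluster_endpoints(client, "arn:aws:route53-recovery-control::...",
#                          "arc_endpoints.json")
```

Run this during normal operations and store the resulting file alongside your failover automation, so that recovery does not depend on calling DescribeCluster during an event.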

Use the extremely reliable data plane API to list and update routing control states, not the console

Using the Route 53 ARC data plane API, view your routing controls and states with the ListRoutingControls operation and update routing control states to redirect traffic for failover with the UpdateRoutingControlState operation. You can use the AWS CLI or code that you write using one of the AWS SDKs. Route 53 ARC offers extreme reliability with the API in the data plane to fail over traffic. We recommend using the API instead of changing routing control states in the AWS Management Console.

Connect to one of your Regional cluster endpoints for Route 53 ARC to use the data plane API. If the endpoint is unavailable, try connecting to another cluster endpoint.

If a safety rule blocks a routing control state update, you can bypass it to make the update and fail over traffic. For more information, see Overriding safety rules to reroute traffic.

Retry each of the five Regional cluster endpoints for updating routing control states

You can use any of the Regional cluster endpoints to get or update routing control states. We recommend that your systems be prepared to retry with each of the available endpoints. We recommend that you limit the number of API calls to a cluster endpoint to no more than 3 per second (3 TPS). For information about using code examples with the AWS SDK, including examples for trying cluster endpoints in sequence, see Code examples for Application Recovery Controller using AWS SDKs.
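The retry approach can be sketched as follows. This is an illustration under stated assumptions, not the SDK code examples referenced above: the function name, the make_client factory, and the pause between attempts are hypothetical. The only API call is UpdateRoutingControlState as exposed by the boto3 "route53-recovery-cluster" client.

```python
import time


def update_state_with_retry(endpoints, make_client, routing_control_arn,
                            state, pause_seconds=1 / 3):
    """Try each cluster endpoint in sequence until one accepts the update.

    endpoints is a list of {"Endpoint": ..., "Region": ...} dicts.
    make_client(endpoint_url, region) returns a data plane client.
    Returns the endpoint URL that succeeded.
    """
    if state not in ("On", "Off"):
        raise ValueError("state must be 'On' or 'Off'")
    last_error = None
    for attempt, item in enumerate(endpoints):
        if attempt:
            # Pause between attempts to stay at or below about 3 calls per second.
            time.sleep(pause_seconds)
        try:
            client = make_client(item["Endpoint"], item["Region"])
            client.update_routing_control_state(
                RoutingControlArn=routing_control_arn,
                RoutingControlState=state,
            )
            return item["Endpoint"]
        except Exception as error:  # simplification: narrow this in real code
            last_error = error
    raise RuntimeError("all cluster endpoints failed") from last_error


# Usage (assumes boto3 and a saved list of endpoints):
#   import boto3
#   def make_client(url, region):
#       return boto3.client("route53-recovery-cluster",
#                           endpoint_url=url, region_name=region)
#   update_state_with_retry(saved_endpoints, make_client, control_arn, "On")
```

Catching every exception keeps the sketch short. Production code would typically distinguish connection failures, which justify trying the next endpoint, from errors such as a safety rule rejection, which another endpoint would also return.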

Test failover with Route 53 ARC

Test failover regularly with Route 53 ARC routing control, to fail over from your primary application stack to a secondary application stack. It's important to make sure that the Route 53 ARC structures that you've added are aligned with the correct resources in your stack, and that everything works as you expect it to. Test this after you set up Route 53 ARC for your environment, and continue to test periodically. That way, your failover environment is prepared before a failure occurs in which you need your secondary system to be up and running quickly to avoid downtime for your users.

Add notifications for readiness status changes

Set a rule in Amazon EventBridge to send a notification whenever a readiness check status changes, for example, from READY to NOT_READY. When you receive a notification, you can investigate and address the issue, to make sure that your application and resources are ready for failover when you expect them to be.

You can set EventBridge rules to send notifications for several readiness check status changes, including for your recovery group (for your application), for a cell (such as an AWS Region), or for a readiness check for a resource set.

For more information, see Using Route 53 ARC with Amazon EventBridge.
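A readiness notification rule might be sketched as follows. The PutRule and PutTargets operations are standard EventBridge calls in boto3. The event source value "aws.route53-recovery-readiness", the rule name, the target ID, and the Amazon SNS topic ARN are assumptions for illustration, so confirm the event pattern against Using Route 53 ARC with Amazon EventBridge before you deploy it.

```python
import json


def build_readiness_pattern(detail_type=None):
    """Build an event pattern that matches readiness status change events."""
    # Assumption: readiness events are published with this source value.
    pattern = {"source": ["aws.route53-recovery-readiness"]}
    if detail_type is not None:
        # Optionally narrow to one kind of event (recovery group, cell, or
        # readiness check); take the exact string from the EventBridge guide.
        pattern["detail-type"] = [detail_type]
    return pattern


def create_readiness_rule(events_client, rule_name, topic_arn, pattern):
    """Create an enabled rule and attach a notification target to it.

    events_client is expected to behave like boto3.client("events").
    """
    events_client.put_rule(
        Name=rule_name,
        EventPattern=json.dumps(pattern),
        State="ENABLED",
    )
    events_client.put_targets(
        Rule=rule_name,
        Targets=[{"Id": "readiness-notification", "Arn": topic_arn}],
    )


# Usage (assumes boto3 and an existing SNS topic):
#   import boto3
#   create_readiness_rule(boto3.client("events"), "arc-readiness-change",
#                         "arn:aws:sns:...", build_readiness_pattern())
```

Starting with a source-only pattern matches every readiness event. You can add a detail-type filter later if the notifications are too noisy.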