Control plane-controlled evacuation - Advanced Multi-AZ Resilience Patterns

This whitepaper is for historical reference only. Some content might be outdated and some links might not be available.

Control plane-controlled evacuation

The first pattern uses data plane operations to stop work from being performed in an impacted Availability Zone, which mitigates the impact of an event. However, your architecture might not use load balancers, or configuring a per-host health check might not be feasible. You might also want to prevent new capacity from being deployed into the impacted Availability Zone through Auto Scaling or normal work scheduling.

To address both situations, control plane actions are required to update the configuration of the resource. The pattern will work for any service whose network configuration can be updated, for example, EC2 Auto Scaling, Amazon ECS, Lambda, and more. It requires writing code for each service, but the business logic follows a standard pattern. The code should be executed locally by an operator responding to the event in order to minimize the dependencies required. The basic flow of the script logic is shown in the following figure.

Control plane update to evacuate an Availability Zone

  1. The script lists all of the resources of the specified type, such as Auto Scaling group, ECS service, or Lambda function, and retrieves their subnets from the resource information. The supported resources depend on what the script has been configured to support.

  2. It determines which subnets to remove by mapping each subnet’s Availability Zone name to its Availability Zone ID and comparing that ID to the Availability Zone ID provided as an input parameter.

  3. The network configuration of the resource is updated to remove the identified subnets.

  4. The details of the update are recorded in a DynamoDB table. The Availability Zone ID is stored as the partition key and the resource ARN or name is stored as the sort key. The subnets that were removed are stored as a list of strings. Finally, the resource type is also stored and used as the partition key for a Global Secondary Index (GSI).
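The four steps above can be sketched for Auto Scaling groups as follows. This is a minimal illustration, not the whitepaper's script. It assumes boto3-style clients are passed in (`autoscaling`, `ec2`, `dynamodb`), and the attribute names `AZId`, `ResourceName`, `ResourceType`, and `RemovedSubnets` are hypothetical. As a shortcut, it reads the `AvailabilityZoneId` field that EC2 `DescribeSubnets` returns instead of mapping Availability Zone names to IDs.

```python
from typing import Any


def split_subnets(subnets: list, az_id: str) -> tuple:
    """Step 2: split DescribeSubnets-style records into (keep, remove) subnet IDs."""
    keep, remove = [], []
    for subnet in subnets:
        target = remove if subnet["AvailabilityZoneId"] == az_id else keep
        target.append(subnet["SubnetId"])
    return keep, remove


def evacuation_record(az_id: str, name: str, resource_type: str, removed: list) -> dict:
    """Step 4: DynamoDB item; attribute names are hypothetical."""
    return {
        "AZId": {"S": az_id},                  # table partition key
        "ResourceName": {"S": name},           # table sort key
        "ResourceType": {"S": resource_type},  # GSI partition key
        "RemovedSubnets": {"L": [{"S": s} for s in removed]},
    }


def evacuate_auto_scaling_groups(autoscaling, ec2, dynamodb, table_name: str, az_id: str) -> list:
    """Run steps 1-4 for every Auto Scaling group; return the names that were updated."""
    updated = []
    kwargs: dict = {}
    while True:
        # Step 1: list the resources and read their subnets.
        page = autoscaling.describe_auto_scaling_groups(**kwargs)
        for group in page["AutoScalingGroups"]:
            name = group["AutoScalingGroupName"]
            subnet_ids = [s for s in group.get("VPCZoneIdentifier", "").split(",") if s]
            if not subnet_ids:
                continue
            subnets = ec2.describe_subnets(SubnetIds=subnet_ids)["Subnets"]
            keep, remove = split_subnets(subnets, az_id)
            # Skip groups that are unaffected or that would be left with no subnets.
            if not remove or not keep:
                continue
            # Step 3: update the network configuration.
            autoscaling.update_auto_scaling_group(
                AutoScalingGroupName=name, VPCZoneIdentifier=",".join(keep)
            )
            # Step 4: record what was removed so the change can be reversed.
            dynamodb.put_item(
                TableName=table_name,
                Item=evacuation_record(az_id, name, "AutoScalingGroup", remove),
            )
            updated.append(name)
        token = page.get("NextToken")
        if not token:
            break
        kwargs = {"NextToken": token}
    return updated
```

With boto3 installed, an operator might call it as `evacuate_auto_scaling_groups(boto3.client("autoscaling"), boto3.client("ec2"), boto3.client("dynamodb"), "az-evacuations", "use1-az1")`, where the table name and Availability Zone ID are placeholders.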

Because step four records the updates that were made, this approach also lends itself to being easily reversible when you’re ready to recover, as shown in the following figure.

Control plane update to recover from Availability Zone evacuation

Recovery steps:

  1. Query the GSI to get the subnets removed for each resource of the specified type in the specified Availability Zone (or all Availability Zones if one isn’t specified).

  2. Describe each resource found in the DynamoDB query to get its current network configuration.

  3. Combine the subnets from the current network configuration with those retrieved from the DynamoDB query.

  4. Update the network configuration of the resource with the new subnet set.

  5. Remove the record from the DynamoDB table after the update completes successfully.
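The recovery steps can be sketched the same way for Auto Scaling groups. This is an illustrative outline under stated assumptions: boto3-style clients are passed in, the GSI's partition key is a hypothetical `ResourceType` attribute, and each item carries hypothetical `AZId`, `ResourceName`, and `RemovedSubnets` attributes. Filtering on `AZId` with a `FilterExpression` is one possible choice when the GSI has no sort key.

```python
def merge_subnets(current: list, removed: list) -> list:
    """Step 3: union of current and previously removed subnets, order preserved."""
    merged = list(current)
    for subnet_id in removed:
        if subnet_id not in merged:
            merged.append(subnet_id)
    return merged


def recover_auto_scaling_groups(autoscaling, dynamodb, table_name: str,
                                index_name: str, az_id: str = None) -> list:
    """Run recovery steps 1-5; return the names of the groups that were restored."""
    query = {
        "TableName": table_name,
        "IndexName": index_name,
        "KeyConditionExpression": "ResourceType = :t",
        "ExpressionAttributeValues": {":t": {"S": "AutoScalingGroup"}},
    }
    if az_id:
        # Restrict to one Availability Zone; omit to recover all of them.
        query["FilterExpression"] = "AZId = :az"
        query["ExpressionAttributeValues"][":az"] = {"S": az_id}

    restored = []
    while True:
        # Step 1: query the GSI for the recorded updates.
        page = dynamodb.query(**query)
        for item in page["Items"]:
            name = item["ResourceName"]["S"]
            removed = [value["S"] for value in item["RemovedSubnets"]["L"]]
            # Step 2: describe the resource to get its current configuration.
            groups = autoscaling.describe_auto_scaling_groups(
                AutoScalingGroupNames=[name]
            )["AutoScalingGroups"]
            if not groups:
                continue  # group no longer exists; leave the record for review
            current = [s for s in groups[0].get("VPCZoneIdentifier", "").split(",") if s]
            # Steps 3 and 4: combine the subnet sets and update the resource.
            autoscaling.update_auto_scaling_group(
                AutoScalingGroupName=name,
                VPCZoneIdentifier=",".join(merge_subnets(current, removed)),
            )
            # Step 5: remove the record only after the update succeeded.
            dynamodb.delete_item(
                TableName=table_name,
                Key={"AZId": item["AZId"], "ResourceName": item["ResourceName"]},
            )
            restored.append(name)
        last_key = page.get("LastEvaluatedKey")
        if not last_key:
            break
        query["ExclusiveStartKey"] = last_key
    return restored
```

Merging with the current configuration, rather than restoring a saved snapshot, keeps any subnet changes that were made to the resource while it was evacuated.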

This generalized pattern both prevents routing work to the impacted Availability Zone and prevents new capacity from being deployed there. The following are examples of how this is accomplished for different services.

Each service will react differently to the configuration update. For example, Amazon ECS will follow the service’s deployment configuration after an update and trigger a rolling deployment or blue/green deployment of new tasks.
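For Amazon ECS, the update might look like the sketch below. It assumes a service that uses the `awsvpc` network mode and a boto3-style `ecs` client passed in by the caller. The helper names are hypothetical, and the list of subnets to remove is taken as given. Security groups and the public IP setting are copied unchanged so that only the subnet list differs.

```python
import copy


def awsvpc_config_without(network_configuration: dict, subnets_to_remove: set) -> dict:
    """Return a copy of an ECS networkConfiguration with the given subnets removed."""
    updated = copy.deepcopy(network_configuration)
    awsvpc = updated["awsvpcConfiguration"]
    awsvpc["subnets"] = [s for s in awsvpc["subnets"] if s not in subnets_to_remove]
    if not awsvpc["subnets"]:
        # A service needs at least one subnet, so refuse rather than break it.
        raise ValueError("refusing to remove every subnet from the service")
    return updated


def evacuate_ecs_service(ecs, cluster: str, service_name: str, subnets_to_remove: list) -> dict:
    """Describe the service, drop the impacted subnets, and apply the update."""
    described = ecs.describe_services(cluster=cluster, services=[service_name])
    service = described["services"][0]
    new_config = awsvpc_config_without(
        service["networkConfiguration"], set(subnets_to_remove)
    )
    # ECS then replaces tasks according to the service's deployment configuration.
    ecs.update_service(
        cluster=cluster, service=service_name, networkConfiguration=new_config
    )
    return new_config
```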

These updates may shift work to the healthy Availability Zones too quickly for some workloads. Even if your workload is statically stable to the failure (it has enough capacity pre-provisioned in the remaining Availability Zones to handle the impacted Availability Zone’s work), you may want to phase out capacity from the impacted Availability Zone gradually.
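As a worked example of the static stability condition, the sketch below computes how much capacity each Availability Zone needs so that the remaining zones can absorb the loss of one. It assumes load is spread evenly across zones and that capacity is counted in whole instances. The function name is hypothetical.

```python
import math


def instances_per_az(peak_instances: int, az_count: int) -> int:
    """Instances to pre-provision in each AZ so that losing one AZ still covers peak load."""
    if az_count < 2:
        raise ValueError("static stability to an AZ failure needs at least two AZs")
    # The surviving (az_count - 1) zones must carry the whole peak between them.
    return math.ceil(peak_instances / (az_count - 1))
```

For a workload that needs 6 instances at peak across 3 Availability Zones, `instances_per_az(6, 3)` returns 3, so 9 instances are provisioned in total and 6 remain after one zone is evacuated.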

If your Auto Scaling group is attached to a target group of a load balancer that has cross-zone load balancing disabled, follow this guidance before you update the group’s network configuration.

Auto Scaling reacts to this change using its Availability Zone rebalancing logic. It will launch instances in the other Availability Zones to meet your desired capacity and terminate instances in the Availability Zone you removed. However, the load balancer will continue to split traffic evenly across each Availability Zone, including the one you removed from the Auto Scaling group, while the instances are being terminated. This could lead to a brownout of the remaining capacity in that Availability Zone until all of its instances are terminated. This is the same problem described in Availability Zone independence concerning Availability Zone imbalance when cross-zone load balancing is disabled. To prevent this from occurring, you can either:

  • Always perform your Availability Zone evacuation first so traffic is only being split among the remaining Availability Zones.

  • Specify a minimum healthy target count with DNS failover to match your required minimum target count for that Availability Zone.

This will help ensure traffic is not sent to the Availability Zone you removed after instances start being terminated.
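The second option can be sketched as follows, assuming the setting referred to is the Elastic Load Balancing target group attribute `target_group_health.dns_failover.minimum_healthy_targets.count`. The `elbv2` argument is a boto3-style client supplied by the caller, and the function names are hypothetical.

```python
DNS_FAILOVER_COUNT_KEY = "target_group_health.dns_failover.minimum_healthy_targets.count"


def dns_failover_attributes(min_healthy_count: int) -> list:
    """Build the target group attribute list for a minimum healthy target count."""
    if min_healthy_count < 1:
        raise ValueError("minimum healthy target count must be at least 1")
    # Attribute values are passed to the API as strings.
    return [{"Key": DNS_FAILOVER_COUNT_KEY, "Value": str(min_healthy_count)}]


def set_minimum_healthy_targets(elbv2, target_group_arn: str, min_healthy_count: int):
    """Apply the attribute so a zone below the threshold is removed from DNS."""
    return elbv2.modify_target_group_attributes(
        TargetGroupArn=target_group_arn,
        Attributes=dns_failover_attributes(min_healthy_count),
    )
```

Setting the count to the minimum number of targets an Availability Zone needs to carry its share of traffic means the zone is treated as unhealthy as soon as terminations take it below that number.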

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.