REL10-BP03 Automate recovery for components constrained to a single location

If components of the workload can only run in a single Availability Zone or in an on-premises data center, you must implement the capability to do a complete rebuild of the workload within your defined recovery objectives.

If the best practice to deploy the workload to multiple locations is not possible due to technological constraints, you must implement an alternate path to resiliency. You must automate the ability to recreate necessary infrastructure, redeploy applications, and recreate necessary data for these cases.
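
As a minimal sketch of what automating this could look like, the following Python runs a sequence of rebuild steps in order and reports whether the whole rebuild finished within a recovery time objective. The names `RebuildStep` and `run_rebuild`, and the placeholder step bodies, are assumptions made for illustration; real steps would call your own infrastructure, deployment, and data restore tooling.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class RebuildStep:
    name: str                   # label used in the timing report
    action: Callable[[], None]  # hypothetical callable that performs the step


def run_rebuild(steps: List[RebuildStep], rto_seconds: float) -> Tuple[bool, List[Tuple[str, float]]]:
    """Run the steps in order and report whether the total time met the RTO."""
    timings = []
    started = time.monotonic()
    for step in steps:
        step_started = time.monotonic()
        step.action()  # an exception here aborts the rebuild and surfaces the failure
        timings.append((step.name, time.monotonic() - step_started))
    elapsed = time.monotonic() - started
    return elapsed <= rto_seconds, timings


if __name__ == "__main__":
    # Placeholder actions; replace with calls to your own tooling.
    plan = [
        RebuildStep("recreate infrastructure", lambda: None),
        RebuildStep("redeploy applications", lambda: None),
        RebuildStep("recreate data", lambda: None),
    ]
    met_rto, report = run_rebuild(plan, rto_seconds=4 * 3600)
    print(met_rto, [name for name, _ in report])
```

Running the same plan regularly in a test environment is one way to check that the measured rebuild time stays inside the recovery objective.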

For example, Amazon EMR launches all nodes for a given cluster in the same Availability Zone because running a cluster in the same zone improves the performance of job flows by providing a higher data access rate. If this component is required for workload resilience, then you must have a way to redeploy the cluster and its data. For Amazon EMR, you should also provision redundancy in ways other than Multi-AZ. You can provision multiple nodes and, using EMR File System (EMRFS), store EMR data in Amazon S3, which in turn can be replicated across multiple Availability Zones or AWS Regions.
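
One way to make the cluster redeployable is to keep its definition in code, so the same request can be replayed against a subnet in a different Availability Zone. The sketch below only builds the keyword arguments for boto3's `run_job_flow` call on the EMR client. The function name `build_emr_request`, the release label, the instance types, the Spark application, and the bucket layout are all assumptions chosen for illustration, and the role names are the EMR default roles, which may differ in your account.

```python
from typing import Any, Dict


def build_emr_request(name: str, subnet_id: str, bucket: str, core_nodes: int = 2) -> Dict[str, Any]:
    """Build keyword arguments for boto3's emr.run_job_flow (illustrative values)."""
    return {
        "Name": name,
        "ReleaseLabel": "emr-6.15.0",            # assumed release
        "LogUri": f"s3://{bucket}/emr-logs/",    # logs kept in S3, outside the cluster
        "Applications": [{"Name": "Spark"}],     # assumed application
        "Instances": {
            # The subnet determines the Availability Zone of every node.
            "Ec2SubnetId": subnet_id,
            "KeepJobFlowAliveWhenNoSteps": True,
            "InstanceGroups": [
                {"Name": "primary", "InstanceRole": "MASTER",
                 "InstanceType": "m5.xlarge", "InstanceCount": 1},
                # Multiple core nodes give redundancy inside the single zone.
                {"Name": "core", "InstanceRole": "CORE",
                 "InstanceType": "m5.xlarge", "InstanceCount": core_nodes},
            ],
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",    # default role names; yours may differ
        "ServiceRole": "EMR_DefaultRole",
    }


if __name__ == "__main__":
    request = build_emr_request("etl", "subnet-EXAMPLE", "example-bucket")
    print(request["Instances"]["Ec2SubnetId"])
    # To launch: boto3.client("emr").run_job_flow(**request)
```

Because job input and output live in Amazon S3 through EMRFS in this setup, recovering from a zone failure reduces to calling the builder again with a subnet in a healthy zone.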

Similarly, Amazon Redshift by default provisions your cluster in a randomly selected Availability Zone within the AWS Region that you select, and all the cluster nodes are provisioned in the same zone.
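
A possible recovery path for such a cluster, not described above and offered here only as an assumption, is to restore it from a snapshot into a different Availability Zone. The sketch below picks an alternative zone and builds the keyword arguments for boto3's `restore_from_cluster_snapshot` call on the Redshift client. The helper names `pick_recovery_zone` and `build_redshift_restore_request` and all identifiers in the usage example are hypothetical.

```python
from typing import Any, Dict, Iterable


def pick_recovery_zone(failed_zone: str, candidate_zones: Iterable[str]) -> str:
    """Return the first candidate zone that is not the failed one."""
    for zone in candidate_zones:
        if zone != failed_zone:
            return zone
    raise ValueError("no alternative Availability Zone available")


def build_redshift_restore_request(cluster_id: str, snapshot_id: str, availability_zone: str) -> Dict[str, Any]:
    """Build keyword arguments for boto3's redshift.restore_from_cluster_snapshot."""
    return {
        "ClusterIdentifier": cluster_id,         # identifier of the cluster the restore creates
        "SnapshotIdentifier": snapshot_id,
        "AvailabilityZone": availability_zone,   # target zone for all cluster nodes
    }


if __name__ == "__main__":
    zone = pick_recovery_zone("us-east-1a", ["us-east-1a", "us-east-1b"])
    request = build_redshift_restore_request("analytics-restored", "example-snapshot", zone)
    print(request)
    # To restore: boto3.client("redshift").restore_from_cluster_snapshot(**request)
```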

Level of risk exposed if this best practice is not established: Medium

Implementation guidance

Implement self-healing. Deploy your instances or containers using automatic scaling when possible. If you cannot use automatic scaling, use automatic recovery for EC2 instances or implement self-healing automation based on Amazon EC2 or ECS container lifecycle events.
- Use Auto Scaling groups for instances and container workloads that do not need to keep a single instance ID, private IP address, Elastic IP address, or instance metadata.
  - What Is EC2 Auto Scaling?
  - Service automatic scaling
    
    The launch template user data can be used to implement automation that can self-heal most workloads.
- Use automatic recovery of EC2 instances for workloads that require a single instance ID, private IP address, Elastic IP address, and instance metadata.
  - Recover your instance.
    
    Automatic recovery sends recovery status alerts to an Amazon SNS topic when the instance failure is detected.
- Use EC2 instance lifecycle events or ECS events to automate self-healing where automatic scaling or EC2 recovery cannot be used.
  - EC2 Auto Scaling lifecycle hooks
  - Amazon ECS events
    
    Use the events to invoke automation that will heal your component according to the process logic you require.
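
Two of the options above can be sketched in Python. The first function builds the keyword arguments for boto3's `put_metric_alarm` call on the CloudWatch client, using the EC2 recover action and an Amazon SNS topic as alarm actions. The second inspects an Amazon ECS task state change event, as delivered through Amazon EventBridge, and calls a healing routine when a task has stopped. The function names, the alarm thresholds, and the `heal` callback are assumptions for illustration, not prescribed values.

```python
from typing import Any, Callable, Dict


def build_recovery_alarm(instance_id: str, region: str, sns_topic_arn: str) -> Dict[str, Any]:
    """Build keyword arguments for boto3's cloudwatch.put_metric_alarm."""
    return {
        "AlarmName": f"auto-recover-{instance_id}",   # hypothetical naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": "StatusCheckFailed_System",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Maximum",
        "Period": 60,                 # assumed: evaluate every minute
        "EvaluationPeriods": 2,       # assumed: two failing periods in a row
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": [
            f"arn:aws:automate:{region}:ec2:recover",  # recover the same instance
            sns_topic_arn,                             # send the status alert
        ],
    }


def handle_ecs_event(event: Dict[str, Any], heal: Callable[[str, str], None]) -> bool:
    """Call heal(cluster_arn, task_arn) for a stopped ECS task; return True if called."""
    if event.get("source") != "aws.ecs":
        return False
    if event.get("detail-type") != "ECS Task State Change":
        return False
    detail = event.get("detail", {})
    if detail.get("lastStatus") != "STOPPED":
        return False
    heal(detail["clusterArn"], detail["taskArn"])
    return True


if __name__ == "__main__":
    alarm = build_recovery_alarm(
        "i-EXAMPLE", "us-east-1", "arn:aws:sns:us-east-1:111122223333:example-topic"
    )
    print(alarm["AlarmActions"])
    # To create the alarm: boto3.client("cloudwatch").put_metric_alarm(**alarm)
```

Passing `heal` in as a parameter keeps the event filtering separate from the process logic, so the same handler can be reused with different healing routines.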

Resources

Related documents:
