Stage 2: Design and implement
In the previous stage, you set your resilience objectives. Now, at the design and implement stage, you anticipate failure modes and identify design choices. Use the objectives you set in the previous stage to guide your design choices.
Use AWS fault isolation boundaries
In the AWS Cloud, a fault isolation boundary is a boundary such as an Availability Zone, AWS Region, control plane, or data plane that limits the effect of a failure and helps improve the resilience of workloads. To meet resilience objectives in a single AWS Region, focus on avoiding single points of failure in your architecture. Doing so minimizes the impact if an Availability Zone becomes unavailable.
Startups should use regional services wherever possible. AWS handles high availability by deploying applications and data across multiple Availability Zones for these services. For example, Amazon Simple Storage Service (Amazon S3) is a regional service that spreads requests and data across multiple Availability Zones. It is designed to automatically recover from an Availability Zone failure. You interact only with the regional endpoint of the service.
If you cannot use regional services in your design, use AWS managed zonal services. AWS managed zonal services help you deploy your applications across multiple Availability Zones, providing high availability. AWS provides built-in resilience capabilities with these services, such as a single-click option to deploy applications across multiple Availability Zones, re-routing traffic from impaired Availability Zones, handling underlying hardware failures, and automating data backups. For example, Amazon Relational Database Service (Amazon RDS) is a zonal service that supports a Multi-AZ cluster setup that, in the event of a failure, automatically fails over to a standby database instance in another Availability Zone.
For zonal services that are not fully managed by AWS, use Elastic Load Balancing resources and Amazon EC2 Auto Scaling to deploy your applications across multiple Availability Zones. For production workloads, distribute your applications across two Availability Zones. For customer-facing critical workloads, distribute your application across three Availability Zones for improved availability.
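To illustrate why the number of Availability Zones matters, the following sketch estimates how many instances to run in each zone so that the workload can still serve peak demand after losing one zone. The helper name `instances_per_zone` and the peak figure of 12 instances are illustrative assumptions, not AWS sizing guidance.

```python
import math


def instances_per_zone(peak_instances: int, zone_count: int) -> int:
    """Instances to run in each zone so that losing any one zone still
    leaves enough capacity for peak demand (hypothetical sizing helper)."""
    if zone_count < 2:
        raise ValueError("need at least two zones to survive a zone failure")
    # After one zone fails, the remaining (zone_count - 1) zones carry the load.
    return math.ceil(peak_instances / (zone_count - 1))


for zones in (2, 3):
    per_zone = instances_per_zone(peak_instances=12, zone_count=zones)
    print(f"{zones} zones: {per_zone} per zone, {per_zone * zones} total")
```

Under these assumptions, three zones tolerate the loss of one zone with less spare capacity than two zones do.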
You should also use multiple AWS accounts as additional fault isolation boundaries to separate production resources from development and staging resources. This provides developers and engineers with more agility to modify resources, AWS Identity and Access Management (IAM) permissions, and resource permissions, without accidentally affecting production workloads.
Define a disaster recovery strategy
From a disaster recovery (DR) perspective, a backup and restore with rapid recovery
Define a deployment strategy
Adopting a complete continuous integration and continuous delivery (CI/CD) pipeline adds complexity during the early stages of a startup. Start by building infrastructure as code (IaC) and use pipelines to deploy it. This sets a foundation for adopting a complete CI/CD pipeline as the organization matures.
Use canary or blue/green deployment features to roll out new application versions. A canary deployment releases a version slowly and incrementally. When you are confident, you deploy the new version and replace the current version in its entirety. In a blue/green deployment, you create two separate but identical environments: the current application version runs in one environment (blue) and the new application version runs in the other (green). This approach allows quick rollbacks with minimal impact.
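The canary flow described above can be sketched as a loop that shifts traffic in steps and rolls back when the error rate crosses a threshold. This is a minimal simulation, not a deployment tool: the function name `canary_rollout`, the step percentages, and the 1% threshold are assumptions, and the `error_rate_at` callback stands in for whatever metrics source a real pipeline would query.

```python
from typing import Callable, Sequence


def canary_rollout(
    steps: Sequence[int],
    error_rate_at: Callable[[int], float],
    max_error_rate: float = 0.01,
) -> int:
    """Shift traffic to the new version step by step.

    Returns the final percentage of traffic on the new version: 100 if every
    step stayed healthy, or 0 if a step breached the threshold (rollback).
    """
    for percent in steps:
        # A real pipeline would update routing weights here and wait for
        # metrics; the callback stands in for that observation.
        observed = error_rate_at(percent)
        if observed > max_error_rate:
            return 0  # roll back: all traffic returns to the current version
    return 100  # promote: the new version replaces the current one


healthy = canary_rollout([5, 25, 50], error_rate_at=lambda pct: 0.001)
failing = canary_rollout(
    [5, 25, 50], error_rate_at=lambda pct: 0.05 if pct >= 25 else 0.001
)
print(healthy, failing)
```

In the second run, the simulated error rate jumps at the 25% step, so the rollout stops and no traffic stays on the new version.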
Maintain all application code, configurations, and IaC in a version-controlled, managed repository. When possible, use built-in versioning capabilities when deploying managed services, such as Amazon S3 versioning.
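As one example of a built-in versioning capability, the sketch below enables Amazon S3 versioning on a bucket. It assumes the AWS SDK for Python (boto3) is installed and credentials are configured. The helper name `enable_bucket_versioning` is hypothetical, and the client is passed in as a parameter so that the function can be exercised without calling AWS.

```python
def enable_bucket_versioning(s3_client, bucket_name: str) -> bool:
    """Turn on versioning for a bucket and confirm the resulting status."""
    s3_client.put_bucket_versioning(
        Bucket=bucket_name,
        VersioningConfiguration={"Status": "Enabled"},
    )
    # get_bucket_versioning omits "Status" for buckets never versioned,
    # so use .get() rather than indexing.
    response = s3_client.get_bucket_versioning(Bucket=bucket_name)
    return response.get("Status") == "Enabled"
```

A call might look like `enable_bucket_versioning(boto3.client("s3"), "example-bucket")`, where `example-bucket` is a placeholder name.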
Create soft dependencies
If your application has a third-party dependency, any issues with the dependency can impair your ability to operate or scale as expected. Wherever possible, reduce these hard external dependencies, especially for autoscaling workloads. As you build your product to go to market quickly, you might use various open source and third-party tools and components. For these dependencies, design your software with a default fail-safe mode or a circuit breaker. That way, your application can use alternative resources or operate with reduced functionality instead of failing completely.
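A circuit breaker with a fallback can be sketched in a few lines. The version below is a minimal, single-threaded illustration under stated assumptions: the class name, the threshold of consecutive failures, and the cooldown period are all hypothetical choices, and `flaky_recommendations` stands in for any third-party call.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Stop calling a failing dependency for a cooldown period."""

    def __init__(self, failure_threshold: int = 3, reset_after_seconds: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_after_seconds = reset_after_seconds
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, dependency: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after_seconds:
                return fallback()  # open: skip the dependency entirely
            self.opened_at = None  # cooldown elapsed: allow one trial call
        try:
            result = dependency()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip (or re-trip) the breaker
            return fallback()
        self.failures = 0  # a success closes the breaker again
        return result


def flaky_recommendations() -> list:
    raise ConnectionError("third-party service unavailable")


breaker = CircuitBreaker(failure_threshold=2, reset_after_seconds=60.0)
for _ in range(3):
    # Falls back to an empty list: reduced functionality instead of an error.
    print(breaker.call(flaky_recommendations, fallback=lambda: []))
```

After two consecutive failures, the breaker opens, and the third call returns the fallback without contacting the dependency at all. A production implementation would also need thread safety and more selective exception handling.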