Availability and continuity management
Ensure availability of business-critical information, applications, and services.
Availability and continuity management are grouped as one operational capability, but they have different implications for the operations of your critical workloads. Continuity management deals with how your workload is architected to respond to a one-off event that has a serious negative impact on your business. To maintain business continuity, work backwards from your organization’s objectives to develop a strategy for avoiding data loss and reducing the downtime during which your workload is unusable for customers. Availability focuses on more common, smaller-scale disruptions, such as network issues, software bugs, component failures, and load spikes. Together, availability and disaster recovery (which helps facilitate continuity management) make up the resiliency of your workloads.
Start
When getting started with your cloud adoption journey, you should focus on weaving in availability objectives for each application hosted in the cloud. Build a highly available architecture by deploying Amazon EC2 instances to multiple Availability Zones.
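A rough calculation can show why spreading instances across Availability Zones raises availability. The sketch below assumes that each zone's deployment has the same availability and that zones fail independently, which is an idealization. The function name and the 99% figure are illustrative and are not AWS-published values.

```python
def combined_availability(per_zone_availability: float, zone_count: int) -> float:
    """Availability of a workload that stays up while at least one zone is up.

    Assumes zones fail independently (an idealization used for illustration).
    """
    if not 0.0 <= per_zone_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if zone_count < 1:
        raise ValueError("zone_count must be at least 1")
    # The workload is down only when every zone is down at the same time.
    probability_all_down = (1.0 - per_zone_availability) ** zone_count
    return 1.0 - probability_all_down


if __name__ == "__main__":
    # Hypothetical per-zone availability of 99%.
    for zones in (1, 2, 3):
        print(zones, f"{combined_availability(0.99, zones):.6f}")
```

Under these assumptions, each additional zone multiplies the probability of a full outage by the per-zone failure probability. Real deployments also share dependencies across zones, so treat the result as an upper bound.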
Another aspect of ensuring availability is monitoring the health of your workloads. Ensure monitoring is enabled through Amazon CloudWatch on all existing and newly provisioned instances to collect metrics and logs related to operational KPIs. Establish baseline monitoring to detect early indicators of performance degradation that could eventually affect application availability and business continuity.
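One way to turn a baseline into an early-warning signal is to derive an alarm threshold from historical samples. The sketch below is a minimal illustration, not a prescribed method: the three-sigma rule, the sample values, the instance ID, and the helper names are all assumptions. The dictionary it builds is shaped like the parameters of the CloudWatch PutMetricAlarm API, but the code does not call AWS.

```python
import statistics


def baseline_threshold(samples: list[float], sigmas: float = 3.0) -> float:
    """Return mean + sigmas * sample standard deviation of historical samples."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to estimate a baseline")
    return statistics.mean(samples) + sigmas * statistics.stdev(samples)


def cpu_alarm_params(instance_id: str, threshold: float) -> dict:
    """Build keyword arguments shaped for CloudWatch's PutMetricAlarm API."""
    return {
        "AlarmName": f"cpu-above-baseline-{instance_id}",
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,           # seconds per evaluation window
        "EvaluationPeriods": 3,  # consecutive breaching windows before alarming
        "Threshold": min(threshold, 100.0),  # CPU percent cannot exceed 100
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "breaching",  # treat a silent instance as unhealthy
    }


if __name__ == "__main__":
    history = [40.0, 42.0, 38.0, 41.0, 39.0]  # hypothetical CPU averages
    params = cpu_alarm_params("i-0123456789abcdef0", baseline_threshold(history))
    print(params["AlarmName"], round(params["Threshold"], 2))
```

With boto3 installed and credentials configured, the dictionary could be passed as `cloudwatch.put_metric_alarm(**params)`. Whether a statistical baseline or a fixed threshold suits a given KPI is a judgment call for each workload.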
When getting started with continuity management, focus on ensuring processes are in place to recover from failures. To begin with, enable AWS Backup for all relevant services. Adjust backup plans and retention periods for critical environments to optimize cost. You can back up Amazon EC2 instances used by your workload as AMIs. Set up disaster recovery plans for your most business-critical applications and adapt the plans based on your Recovery Time Objectives (RTOs) and Recovery Point Objectives (RPOs).
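The relationship between backup frequency and RPO can be checked with simple arithmetic: with periodic backups, the worst-case data loss is roughly the interval between backups. The sketch below pairs that check with a dictionary shaped like the `BackupPlan` argument of the AWS Backup CreateBackupPlan API. The plan name, vault name, schedule, and retention value are hypothetical, and the code does not call AWS.

```python
from datetime import timedelta


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """Periodic backups can lose up to one full interval of data."""
    return backup_interval <= rpo


def backup_plan(name: str, vault: str, schedule: str, retention_days: int) -> dict:
    """Build a dictionary shaped like AWS Backup's BackupPlan input."""
    if retention_days < 1:
        raise ValueError("retention_days must be at least 1")
    return {
        "BackupPlanName": name,
        "Rules": [
            {
                "RuleName": f"{name}-rule",
                "TargetBackupVaultName": vault,
                "ScheduleExpression": schedule,
                "Lifecycle": {"DeleteAfterDays": retention_days},
            }
        ],
    }


if __name__ == "__main__":
    # Hypothetical values: a daily backup at 05:00 UTC kept for 35 days.
    plan = backup_plan("critical-daily", "critical-vault", "cron(0 5 ? * * *)", 35)
    print(plan["Rules"][0]["ScheduleExpression"])
    # A daily backup does not satisfy a 4-hour RPO.
    print(meets_rpo(timedelta(hours=24), timedelta(hours=4)))
```

If the check fails for a critical application, a shorter backup interval or continuous replication is needed to meet the stated RPO.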
You can also use services such as AWS Managed Services (AMS) to accelerate the implementation and orchestration of high availability and business continuity in your AWS environment through automation flywheels and operational experience.
Advance
Once your initial availability and continuity management capabilities have been operationalized, you should consider further improving them to better meet your business objectives and lower your business risk. Granular and context-based monitoring helps you make informed decisions about the availability of your workloads. This includes correlating multiple data points from Amazon CloudWatch and AWS Health Dashboard, which helps you understand workload health against business outcomes and declare outages to ensure business continuity objectives are met.
Refine the backups that were set up as part of your cloud foundations, and assess the overall backup strategy to balance business impact and risk; for example, consider using Amazon S3 Glacier Deep Archive to help reduce cost. Refine continuity management procedures by reviewing the disaster recovery strategies for critical applications.
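As an illustration of the archival idea, the sketch below builds a dictionary shaped like an Amazon S3 lifecycle configuration that moves older backup objects to the `DEEP_ARCHIVE` storage class. The prefix and day counts are hypothetical and should come from your own retention requirements. The code only constructs the configuration and does not call AWS.

```python
def archive_lifecycle(prefix: str, archive_after_days: int, expire_after_days: int) -> dict:
    """Build an S3 lifecycle configuration that archives, then expires, objects."""
    if archive_after_days < 0:
        raise ValueError("archive_after_days must not be negative")
    if expire_after_days <= archive_after_days:
        raise ValueError("objects must expire after they are archived")
    return {
        "Rules": [
            {
                "ID": f"archive-{prefix.strip('/') or 'all'}",
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                # Move objects to the lowest-cost storage class once they age.
                "Transitions": [
                    {"Days": archive_after_days, "StorageClass": "DEEP_ARCHIVE"}
                ],
                # Delete objects once the retention requirement has passed.
                "Expiration": {"Days": expire_after_days},
            }
        ]
    }


if __name__ == "__main__":
    # Hypothetical policy: archive after 90 days, delete after roughly 7 years.
    config = archive_lifecycle("backups/", 90, 2555)
    print(config["Rules"][0]["Transitions"])
```

With boto3, a configuration of this shape could be supplied as the `LifecycleConfiguration` argument of `put_bucket_lifecycle_configuration`. Retrieval from Deep Archive takes hours, so this trade-off suits data whose RTO tolerates slow restores.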
You should work with multiple stakeholders in the organization to understand your RTO and RPO requirements for each application and potential impact to end users. For the most stringent RTO requirements, you can implement automated failovers based on health checks.
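The logic behind health-check-based failover can be sketched as a small state machine: traffic moves to the secondary endpoint after a run of consecutive failed checks. This is a simplified model and not how any particular AWS service is implemented. The class name, endpoint names, and threshold are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FailoverController:
    """Switch to the secondary endpoint after repeated failed health checks."""

    primary: str
    secondary: str
    failure_threshold: int = 3
    consecutive_failures: int = 0
    failed_over: bool = False

    @property
    def active(self) -> str:
        return self.secondary if self.failed_over else self.primary

    def record_check(self, healthy: bool) -> str:
        """Record one health check result and return the active endpoint."""
        if healthy:
            # A single success resets the failure streak.
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.failed_over = True
        return self.active


if __name__ == "__main__":
    controller = FailoverController("primary.example.com", "standby.example.com")
    for result in (True, False, False, False):
        print(result, controller.record_check(result))
```

Requiring several consecutive failures avoids failing over on a single transient error. The sketch deliberately does not fail back automatically, because returning to the primary is often a decision that needs verification first.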
Automate your disaster recovery plans to reduce manual burden and margin of error. This includes automating redeployment of infrastructure, configuration, and application code in a recovery Region using AWS CloudFormation and AWS CodePipeline. Manage configuration drift in the recovery Region using AWS Config and Service Quotas to continuously monitor AWS resource configurations. To help drive automation in business continuity procedures, use services such as Amazon Route 53 for global routing, AWS Global Accelerator for traffic routing for multi-Region applications, and Amazon CloudFront for traffic management during recovery. Test your continuity management strategy through operational game days.
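The idea of drift between a primary and a recovery Region can be illustrated with a plain comparison of settings. The sketch below assumes the settings have already been collected into dictionaries (for example, from configuration snapshots or quota listings). The keys and values are hypothetical, and the function is not part of any AWS SDK.

```python
MISSING = "<missing>"  # placeholder for a setting that is absent on one side


def find_drift(primary: dict, recovery: dict) -> dict:
    """Return {setting: (primary_value, recovery_value)} for every mismatch."""
    drift = {}
    for key in sorted(primary.keys() | recovery.keys()):
        primary_value = primary.get(key, MISSING)
        recovery_value = recovery.get(key, MISSING)
        if primary_value != recovery_value:
            drift[key] = (primary_value, recovery_value)
    return drift


if __name__ == "__main__":
    # Hypothetical settings gathered from each Region.
    primary_region = {"instance_type": "m5.large", "vcpu_quota": 256, "min_capacity": 2}
    recovery_region = {"instance_type": "m5.large", "vcpu_quota": 64}
    for setting, (expected, actual) in find_drift(primary_region, recovery_region).items():
        print(f"{setting}: primary={expected} recovery={actual}")
```

A lower quota in the recovery Region, as in this example, is the kind of mismatch that may only surface during a failover unless it is checked continuously or exercised in a game day.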
Excel
As you mature your cloud operations and ensure that your availability and continuity management capabilities are meeting the operational KPIs for the business, you should further fine-tune the overall operating mechanisms that lend themselves to achieving high availability. Adopt the Correction of Error (COE) process.
For your most business-critical workloads, push the limits of operational excellence standards by including chaos engineering.
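A chaos experiment typically states a steady-state hypothesis, injects a fault, and checks whether the hypothesis still holds. The sketch below is a toy, in-process illustration of that loop and not a substitute for a fault injection service. The function names, the failure rate, and the hypothesis ("every request is answered, by the primary path or the fallback") are assumptions.

```python
import random


class InjectedFault(Exception):
    """Raised by the fault injector to simulate a dependency failure."""


def inject_faults(func, failure_rate: float, rng: random.Random):
    """Wrap func so that a fraction of calls raise InjectedFault."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault("simulated dependency failure")
        return func(*args, **kwargs)
    return wrapper


def run_experiment(request_count: int, failure_rate: float, seed: int = 0) -> dict:
    """Send requests through a faulty primary path with a fallback path."""
    rng = random.Random(seed)  # seeded so that a run can be repeated
    primary = inject_faults(lambda request_id: f"primary:{request_id}", failure_rate, rng)
    fallback = lambda request_id: f"fallback:{request_id}"
    served = {"primary": 0, "fallback": 0}
    for request_id in range(request_count):
        try:
            response = primary(request_id)
        except InjectedFault:
            # Steady-state hypothesis: the fallback still answers the request.
            response = fallback(request_id)
        served[response.split(":")[0]] += 1
    return served


if __name__ == "__main__":
    result = run_experiment(request_count=1000, failure_rate=0.2)
    print(result, "all served:", sum(result.values()) == 1000)
```

In practice the fault would be injected into real infrastructure in a controlled way, and the hypothesis would be measured through the same operational KPIs used for monitoring.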