Availability and continuity management - AWS Cloud Adoption Framework: Operations Perspective


Availability and continuity management

Ensure availability of business-critical information, applications, and services.

Availability and continuity management are grouped into one operational capability, but they have different implications for how you operate your critical workloads. Continuity management deals with how your workload is architected to respond to a one-off event that has a serious negative impact on your business. To maintain business continuity, you must work backwards from your organization’s objectives to develop a strategy for avoiding data loss and reducing the downtime during which customers cannot use your workload. Availability focuses on more common, smaller-scale disruptions, such as network issues, software bugs, component failures, and load spikes. Together, availability and disaster recovery (which supports continuity management) make up the resiliency of your workloads.

Start

When getting started with your cloud adoption journey, define availability objectives for each application hosted in the cloud. Build a highly available architecture by deploying Amazon EC2 instances across multiple Availability Zones (AZs) and using AWS Auto Scaling to make the architecture self-healing. Protect your most critical data by copying it to another AWS Region using Amazon S3 Cross-Region Replication (CRR).
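To see why spreading instances across Availability Zones helps, a rough model can be sketched as follows. It assumes that each zone fails independently and that the workload stays up as long as at least one zone is healthy. Both are simplifications, and the 99% per-zone figure is purely illustrative, not an AWS number.

```python
def combined_availability(per_zone_availability: float, zones: int) -> float:
    """Availability of a workload that survives while at least one zone is up.

    Assumes zone failures are independent, which is a simplification.
    """
    if not 0.0 <= per_zone_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if zones < 1:
        raise ValueError("zones must be at least 1")
    # The workload is down only when every zone is down at the same time.
    return 1.0 - (1.0 - per_zone_availability) ** zones


def downtime_minutes_per_year(availability: float) -> float:
    """Expected downtime in minutes over a 365-day year."""
    return (1.0 - availability) * 365 * 24 * 60


if __name__ == "__main__":
    # 0.99 is an illustrative per-zone value, not a published figure.
    for zone_count in (1, 2, 3):
        a = combined_availability(0.99, zone_count)
        print(f"{zone_count} zone(s): {a:.6f}, {downtime_minutes_per_year(a):.1f} min/year")
```

Under these assumptions, each additional zone multiplies the probability of a full outage by the per-zone failure probability, which is why even a second zone makes a large difference.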

Another aspect of ensuring availability is monitoring the health of your workloads. Enable monitoring through Amazon CloudWatch on all existing and newly provisioned instances to collect metrics and logs related to operational KPIs. Establish baseline monitoring so that you can detect early indicators of performance degradation that could eventually affect application availability and business continuity.
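The idea of a baseline can be illustrated with a minimal sketch: compute the mean and standard deviation of a metric during normal operation, then flag values that exceed the mean by more than a chosen number of standard deviations. The function names, the sample latencies, and the default multiplier of 3 are assumptions for illustration. Amazon CloudWatch provides its own alarm and anomaly detection features, which this sketch does not reproduce.

```python
from statistics import mean, stdev


def build_baseline(samples: list[float]) -> tuple[float, float]:
    """Return (mean, standard deviation) of samples taken during normal operation."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to compute a baseline")
    return mean(samples), stdev(samples)


def is_degraded(value: float, baseline: tuple[float, float], k: float = 3.0) -> bool:
    """Flag a value that exceeds the baseline mean by more than k standard deviations."""
    average, deviation = baseline
    return value > average + k * deviation


if __name__ == "__main__":
    # Hypothetical latency samples in milliseconds.
    baseline = build_baseline([100.0, 102.0, 98.0, 101.0, 99.0])
    for latency in (103.0, 120.0):
        print(latency, "degraded" if is_degraded(latency, baseline) else "normal")
```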

When getting started with continuity management, focus on ensuring that processes are in place to recover from failures. To begin with, enable AWS Backup for all relevant services. Adjust backup plans and retention periods for critical environments to optimize cost. You can back up Amazon EC2 instances used by your workload as AMIs. Set up disaster recovery plans for your most business-critical applications and adapt the plans based on your Recovery Time Objectives (RTOs) and Recovery Point Objectives (RPOs).
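One way to reason about whether a backup plan fits your objectives is sketched below. It assumes that, with periodic backups, the worst-case data loss is roughly the interval between backups, and that you have an estimate of how long a restore takes. The class names, fields, and example durations are hypothetical and are not part of AWS Backup.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryObjectives:
    rpo: timedelta  # maximum tolerable data loss, expressed as time
    rto: timedelta  # maximum tolerable time to restore service


@dataclass(frozen=True)
class BackupPlan:
    backup_interval: timedelta         # time between consecutive backups
    estimated_restore_time: timedelta  # measured or estimated restore duration


def find_gaps(plan: BackupPlan, objectives: RecoveryObjectives) -> list[str]:
    """List the ways in which a plan falls short of the stated objectives."""
    gaps = []
    if plan.backup_interval > objectives.rpo:
        gaps.append("backup interval exceeds RPO")
    if plan.estimated_restore_time > objectives.rto:
        gaps.append("estimated restore time exceeds RTO")
    return gaps


if __name__ == "__main__":
    objectives = RecoveryObjectives(rpo=timedelta(hours=4), rto=timedelta(hours=4))
    daily = BackupPlan(timedelta(hours=24), timedelta(hours=2))
    print(find_gaps(daily, objectives))
```

An empty list means the plan meets both objectives under these assumptions. A restore estimate is only as good as the last time you actually tested a restore.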

You can also use services such as AWS Managed Services (AMS) to accelerate the implementation and orchestration of high availability and business continuity in your AWS environment through automation flywheels and operational experience. AMS provides proactive monitoring and incident management (infrastructure and security) in your AWS accounts. AMS also develops a deeper operational understanding of your typical use cases, issues, and top failure and recovery scenarios, and helps you perform failover tests.

Advance

Once your initial availability and continuity management capabilities have been operationalized, you should consider further improving them to optimize your business objectives and lower your business risk. Granular, context-based monitoring helps you make informed decisions about the availability of your workloads. This includes correlating multiple data points from Amazon CloudWatch and AWS Health Dashboard, so that you can understand workload health against business outcomes and declare outages to ensure business continuity objectives are met.

Revisit the backups that were set up as part of your cloud foundations and assess the overall backup strategy so that it balances business impact, risk, and cost; for example, consider using Amazon S3 Glacier Deep Archive to help reduce cost. Refine continuity management procedures by inspecting the disaster recovery strategies for critical applications.
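If your backups are stored as objects in an Amazon S3 bucket, a lifecycle rule is one way to move older copies to S3 Glacier Deep Archive. The sketch below only builds the rule as a Python dictionary in the shape accepted by the boto3 S3 client's put_bucket_lifecycle_configuration call. The rule ID, prefix, and day counts are hypothetical, and the function makes no AWS call itself.

```python
def deep_archive_lifecycle(rule_id: str, prefix: str,
                           days_before_archive: int,
                           days_before_expiry: int) -> dict:
    """Build an S3 lifecycle configuration that archives, then expires, objects.

    All argument values are chosen by the caller; nothing here is an AWS default.
    """
    if days_before_archive < 0:
        raise ValueError("days_before_archive must not be negative")
    if days_before_expiry <= days_before_archive:
        raise ValueError("objects must be archived before they expire")
    return {
        "Rules": [
            {
                "ID": rule_id,
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                "Transitions": [
                    {"Days": days_before_archive, "StorageClass": "DEEP_ARCHIVE"}
                ],
                "Expiration": {"Days": days_before_expiry},
            }
        ]
    }


if __name__ == "__main__":
    config = deep_archive_lifecycle("archive-backups", "backups/", 30, 365)
    print(config)
    # With boto3 installed and credentials configured, the dictionary could be
    # applied like this (the bucket name is hypothetical):
    #   boto3.client("s3").put_bucket_lifecycle_configuration(
    #       Bucket="example-backup-bucket", LifecycleConfiguration=config)
```

Choose the transition and expiry periods from your retention requirements and expected restore patterns, because retrieval from archival storage is slower than from standard storage.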

You should work with multiple stakeholders in the organization to understand the RTO and RPO requirements for each application and the potential impact on end users. For the most stringent RTO requirements, you can implement automated failovers based on health checks. Review these health checks regularly to ensure that false signals are not triggering failover, because an unnecessary failover can be expensive and adds its own availability risks. Based on your requirements and on considerations of cost and complexity, select the most suitable disaster recovery strategy: backup and restore, pilot light, warm standby, or multi-site active/active.
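A common way to reduce failovers caused by false signals is to require several consecutive failed health checks before acting. The sketch below illustrates that idea in plain Python. The class name and the threshold of 3 are assumptions, and the code is not tied to any AWS service, although it is similar in spirit to the failure threshold that Amazon Route 53 health checks let you configure.

```python
class FailoverGuard:
    """Signal failover only after a run of consecutive failed health checks."""

    def __init__(self, failure_threshold: int = 3):
        if failure_threshold < 1:
            raise ValueError("failure_threshold must be at least 1")
        self.failure_threshold = failure_threshold
        self._consecutive_failures = 0

    def record(self, healthy: bool) -> bool:
        """Record one health-check result; return True when failover should start."""
        if healthy:
            # A single healthy result resets the count, so isolated blips are ignored.
            self._consecutive_failures = 0
            return False
        self._consecutive_failures += 1
        return self._consecutive_failures >= self.failure_threshold


if __name__ == "__main__":
    guard = FailoverGuard(failure_threshold=3)
    for result in (False, False, True, False, False, False):
        print(result, guard.record(result))
```

A higher threshold lowers the chance of a false failover but lengthens detection time, which counts against your RTO, so the value is a trade-off to revisit as part of the regular review.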

Automate your disaster recovery plans to reduce manual burden and the risk of error. This includes automating redeployment of infrastructure, configuration, and application code in a recovery Region using AWS CloudFormation and AWS CodePipeline. Manage configuration drift in the recovery Region by using AWS Config and Service Quotas to continuously monitor AWS resource configurations. To help drive automation in business continuity procedures, use services such as Amazon Route 53 for global routing, AWS Global Accelerator for traffic routing for multi-Region applications, and Amazon CloudFront for traffic management during recovery. Test your continuity management strategy through operational game days to ensure that failover orchestration is triggered in a timely manner and that recovery is completed in line with your RTO and RPO requirements. Operational game days are also valuable for testing key stakeholder engagement and escalation processes within the organization.
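Conceptually, detecting configuration drift means comparing the settings of the recovery Region with those of the primary Region and reporting the differences. The sketch below shows that comparison on flat dictionaries. The keys, values, and function name are hypothetical, and the code does not call AWS Config, which evaluates real resources against rules rather than comparing dictionaries.

```python
MISSING = "<missing>"


def find_drift(primary: dict, recovery: dict, ignore_keys=()) -> dict:
    """Return {key: (primary_value, recovery_value)} for every setting that differs."""
    drift = {}
    for key in sorted(set(primary) | set(recovery)):
        if key in ignore_keys:
            continue  # settings that are expected to differ, such as the Region name
        primary_value = primary.get(key, MISSING)
        recovery_value = recovery.get(key, MISSING)
        if primary_value != recovery_value:
            drift[key] = (primary_value, recovery_value)
    return drift


if __name__ == "__main__":
    # Hypothetical settings captured from each Region.
    primary = {"instance_type": "m5.large", "min_size": 2, "region": "us-east-1"}
    recovery = {"instance_type": "m5.large", "min_size": 1, "region": "us-west-2"}
    print(find_drift(primary, recovery, ignore_keys={"region"}))
```

Some differences are intentional. A pilot light or warm standby environment, for example, runs at reduced capacity by design, so the list of ignored keys should reflect your chosen disaster recovery strategy.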

Excel

As you mature your cloud operations and ensure that your availability and continuity management capabilities are meeting the operational KPIs for the business, you should further fine-tune the overall operating mechanisms that lend themselves to achieving high availability. Adopt the Correction of Error (COE) process to improve the quality of operations by documenting and driving problem management of recurring issues. Define a standardized way to document critical root causes and ensure they are reviewed and addressed. Use the mechanism of iterative improvement (flywheel) to increase your operational efficiency and improve the business continuity KPIs.

For your most business-critical workloads, push the limits of operational excellence standards by including chaos engineering. Business owners can use AWS Fault Injection Service (AWS FIS) to simulate complex, deliberately induced failures, which helps teams identify monitoring blind spots and performance bottlenecks. This exercise can help improve the availability of the most complex and distributed applications. Look to mature your continuity management plan, of which disaster recovery is only a subset. Review other aspects of your business that could be impacted by a one-off failure event, and develop mechanisms to periodically undertake business impact analysis and risk assessment to verify that your RTOs and RPOs are aligned with the business objectives of your workload.
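The principle behind fault injection can be illustrated with a small in-process sketch: make a dependency fail on purpose, then verify that the calling code recovers as intended. The class and function names are hypothetical, and the code is a stand-in for the idea only. It does not use AWS FIS, which injects faults into real AWS resources.

```python
class FlakyDependency:
    """Stand-in dependency that fails a fixed number of times before succeeding."""

    def __init__(self, failures_to_inject: int):
        self.remaining_failures = failures_to_inject
        self.calls = 0

    def fetch(self) -> str:
        self.calls += 1
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise ConnectionError("injected fault")
        return "ok"


def call_with_retries(operation, max_attempts: int = 3):
    """Call operation, retrying on ConnectionError up to max_attempts times in total."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for _ in range(max_attempts):
        try:
            return operation()
        except ConnectionError as error:
            last_error = error
    # Every attempt failed: surface the last error to the caller.
    raise last_error


if __name__ == "__main__":
    dependency = FlakyDependency(failures_to_inject=2)
    print(call_with_retries(dependency.fetch), "after", dependency.calls, "calls")
```

Raising the number of injected failures above the retry budget shows the other side of the experiment: the error reaches the caller, which is the kind of behavior your monitoring should be able to detect.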
