Shared Responsibility Model for Resiliency - Reliability Pillar

Resiliency is a shared responsibility between AWS and you. It is important that you understand how disaster recovery (DR) and availability, as part of resiliency, operate under this shared model.

AWS responsibility - Resiliency of the cloud

AWS is responsible for resiliency of the infrastructure that runs all of the services offered in the AWS Cloud. This infrastructure comprises the hardware, software, networking, and facilities that run AWS Cloud services. AWS uses commercially reasonable efforts to make these AWS Cloud services available, ensuring service availability meets or exceeds AWS Service Level Agreements (SLAs).

The AWS Global Cloud Infrastructure is designed to allow customers to build highly resilient workload architectures. Each AWS Region is fully isolated and consists of multiple Availability Zones, which are physically isolated partitions of infrastructure. Availability Zones isolate faults that could impact workload resilience, preventing them from impacting other zones in the Region. At the same time, all zones in an AWS Region are interconnected over fully redundant, dedicated metro fiber that provides high-throughput, low-latency networking between zones. All traffic between zones is encrypted. The network performance is sufficient to accomplish synchronous replication between zones. When an application is partitioned across Availability Zones, it is better isolated and protected from issues such as power outages, lightning strikes, tornadoes, hurricanes, and more.

Customer responsibility - Resiliency in the cloud

Your responsibility is determined by the AWS Cloud services that you select. Your selection determines the amount of configuration work you must perform as part of your resiliency responsibilities. For example, a service such as Amazon Elastic Compute Cloud (Amazon EC2) requires you to perform all of the necessary resiliency configuration and management tasks. If you deploy Amazon EC2 instances, you are responsible for deploying them across multiple locations (such as AWS Availability Zones), implementing self-healing using services like Auto Scaling, and using resilient workload architecture best practices for applications installed on the instances. For managed services, such as Amazon S3 and Amazon DynamoDB, AWS operates the infrastructure layer, the operating system, and platforms, and you access the endpoints to store and retrieve data. You are responsible for managing the resiliency of your data, including backup, versioning, and replication strategies.
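
As a rough illustration of the Amazon EC2 responsibility described above, the following sketch spreads a fleet evenly across Availability Zones and reports how many instances remain if any single zone is lost. It is a planning aid under stated assumptions, not an AWS API call: the function names and the zone labels are hypothetical.

```python
from collections import Counter
from itertools import cycle


def spread_across_zones(instance_count, zones):
    """Assign instances to zones round-robin so the fleet is evenly spread.

    `zones` is a list of hypothetical zone labels, for example
    ["zone-a", "zone-b", "zone-c"].
    """
    if not zones:
        raise ValueError("at least one zone is required")
    placement = Counter({zone: 0 for zone in zones})
    for _, zone in zip(range(instance_count), cycle(zones)):
        placement[zone] += 1
    return dict(placement)


def capacity_after_zone_loss(placement):
    """Worst-case number of instances left if any single zone fails."""
    total = sum(placement.values())
    return total - max(placement.values())
```

With six instances over three zones, each zone holds two instances and four survive the loss of one zone. With a single zone, nothing survives, which is why the placement decision is part of your resiliency responsibility.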

Deploying your workload across multiple Availability Zones in an AWS Region is part of a high availability strategy designed to protect workloads by isolating issues to one Availability Zone while the redundancy of the other Availability Zones is used to continue serving requests. A Multi-AZ architecture is also part of a DR strategy designed to make workloads better isolated and protected from issues such as power outages, lightning strikes, tornadoes, earthquakes, and more. DR strategies may also make use of multiple AWS Regions. For example, in an active/passive configuration, service for the workload fails over from its active Region to its DR Region if the active Region can no longer serve requests.
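
The active/passive behavior described above can be sketched as a small routing decision. This is a minimal illustration that assumes some health signal per Region already exists. The function name and the `health` mapping are hypothetical, and a real failover would normally be driven by DNS or a traffic management service rather than application code.

```python
def choose_serving_region(health, active, dr):
    """Return the Region that should serve requests.

    `health` maps a Region name to True (can serve requests) or False.
    The active Region is preferred; the DR Region is used only when the
    active Region can no longer serve requests.
    """
    if health.get(active, False):
        return active
    if health.get(dr, False):
        return dr
    raise RuntimeError("neither the active nor the DR Region is healthy")
```

For example, `choose_serving_region({"us-east-1": False, "us-west-2": True}, "us-east-1", "us-west-2")` returns `"us-west-2"`. The Region names here are only examples.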

Chart illustrating the shared resiliency model.

Responsibility for resilience in and of the cloud for customers and AWS.

You can use AWS services to achieve your resilience objectives. As a customer, you are responsible for managing the following aspects of your system to achieve resilience in the cloud. For more detail on a particular service, see the AWS documentation.

Networking, quotas, and constraints

  • Best practices for this area of the shared responsibility model are described in detail under Foundations.

  • Plan your architecture with adequate room to scale, and understand the service quotas and constraints of the services you include, taking into account expected increases in request load where applicable.

  • Design your network topology to be highly available, redundant, and scalable.
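
To make the quota planning above concrete, the following sketch estimates the quota a workload would need after an expected increase in load, plus a safety margin. The function names, the percentage parameters, and the 20 percent default margin are assumptions chosen for illustration. Actual quota values come from each service's documentation or the Service Quotas console.

```python
def required_quota(current_usage, growth_percent, margin_percent=20):
    """Quota needed after growth, with headroom, rounded up.

    Integer arithmetic is used so the result is exact:
    current_usage * (1 + growth) * (1 + margin), rounded up.
    """
    numerator = current_usage * (100 + growth_percent) * (100 + margin_percent)
    return -(-numerator // 10_000)  # ceiling division


def quota_increase_needed(quota, current_usage, growth_percent, margin_percent=20):
    """How much the current quota must be raised (0 if it already suffices)."""
    needed = required_quota(current_usage, growth_percent, margin_percent)
    return max(0, needed - quota)
```

With a current usage of 100, an expected 50 percent increase, and the default margin, the required quota is 180. A quota of 150 would therefore need to be raised by 30.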

Change management and operational resilience

  • Change management includes how to introduce and manage change in your environment. Implementing change requires building runbooks and keeping them up to date, and defining deployment strategies for your application and infrastructure.

  • A resilient strategy for monitoring workload resources considers all components, including both technical and business metrics, notifications, automation, and analysis.

  • Workloads in the cloud must adapt to changes in demand, scaling in reaction to impairments or fluctuations in usage.
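
Scaling in reaction to fluctuations in usage can be sketched with a simple proportional rule: size the fleet so that the observed per-instance metric returns to a target value. This is a simplified illustration, not the exact algorithm of any AWS scaling policy. The function name, parameters, and bounds are hypothetical.

```python
import math


def desired_capacity(current, observed_metric, target_metric, minimum=1, maximum=20):
    """Proportional scaling: grow or shrink so the metric approaches the target.

    `observed_metric` and `target_metric` are per-instance averages, such as
    CPU utilization in percent. The result is clamped to [minimum, maximum].
    """
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    proposed = math.ceil(current * observed_metric / target_metric)
    return max(minimum, min(maximum, proposed))
```

For example, four instances averaging 75 percent utilization against a 50 percent target scale out to six, while the same fleet at 30 percent scales in to three. Rounding up keeps the fleet slightly over-provisioned rather than under-provisioned.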

Observability and failure management

Workload architecture

  • Your workload architecture includes how you design services around business domains, apply service-oriented architecture (SOA) and distributed system design to prevent failures, and build in capabilities like throttling, retries, queue management, timeouts, and emergency levers.

  • Rely on proven AWS solutions, the Amazon Builders' Library, and serverless patterns to align with best practices and jump-start implementations.

  • Use continuous improvement to decompose your system into distributed services to scale and innovate faster. Use AWS microservices guidance and managed service options to simplify and accelerate your ability to introduce change and innovate.
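
Of the capabilities listed above, retries are the easiest to show in a few lines. The following is a minimal sketch of retrying with capped exponential backoff and jitter. The function name, default values, and the injectable `sleep` and `rng` parameters are assumptions made so the example is self-contained and testable.

```python
import random
import time


def call_with_retries(operation, max_attempts=4, base_delay=0.1, max_delay=2.0,
                      sleep=time.sleep, rng=random.random):
    """Call `operation`, retrying on any exception with backoff and jitter.

    The wait before retry number n is a random fraction of
    min(max_delay, base_delay * 2**n), so the total work is bounded and
    retries from many clients do not arrive in lockstep.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted: surface the last error
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(rng() * cap)
```

In practice you would usually retry only on errors known to be transient, and pair retries with a timeout on each call so that a slow dependency cannot hold resources indefinitely.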

Continuous testing of critical infrastructure

  • Testing reliability means testing at the functional, performance, and chaos levels, as well as adopting incident analysis and game day practices to build expertise in resolving issues that are not well understood.

  • For both all-in cloud and hybrid applications, knowing how your application behaves when issues arise or components go down allows you to quickly and reliably recover from outages.

  • Create and document repeatable experiments to understand how your system behaves when things don’t work as expected. These tests will prove the effectiveness of your overall resilience and provide a feedback loop for your operational procedures before you face real failure scenarios.
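
One way to keep such experiments repeatable is to describe each one as data: a hypothesis, a steady-state check, a fault to inject, and a way to remove it. The sketch below assumes this structure. The `Experiment` class and `run_experiment` function are hypothetical and are not part of any AWS tool.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Experiment:
    name: str
    hypothesis: str
    steady_state: Callable[[], bool]   # True while the system behaves normally
    inject_fault: Callable[[], None]
    remove_fault: Callable[[], None]


def run_experiment(exp):
    """Run one experiment and return a record suitable for documentation."""
    if not exp.steady_state():
        # Never inject a fault into a system that is already unhealthy.
        return {"name": exp.name, "hypothesis": exp.hypothesis, "result": "aborted"}
    exp.inject_fault()
    try:
        held = exp.steady_state()
    finally:
        exp.remove_fault()  # always restore the system, even on error
    result = "passed" if held else "failed"
    return {"name": exp.name, "hypothesis": exp.hypothesis, "result": result}
```

A toy run against an in-memory system with two zones might define the steady state as "at least one zone is healthy", inject the fault by marking one zone unhealthy, and record whether the hypothesis held. Storing the returned records over time gives the feedback loop described above.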