REL01-BP06 Ensure that a sufficient gap exists between the current quotas and the maximum usage to accommodate failover
When a resource fails or is inaccessible, that resource might still be counted against a quota until it’s successfully terminated. Verify that your quotas cover the overlap of failed or inaccessible resources and their replacements. Consider scenarios such as network failure, Availability Zone failure, or Regional failure when calculating this gap.
Desired outcome: Small or large failures in resources or resource accessibility can be covered within the current service thresholds. Zone failures, network failures, or even Regional failures have been considered in the resource planning.
Common anti-patterns:
- Setting service quotas based on current needs without accounting for failover scenarios.
- Not considering the principles of static stability when calculating the peak quota for a service.
- Not considering the potential for inaccessible resources when calculating the total quota needed for each Region.
- Not considering AWS service fault isolation boundaries for some services and their potential abnormal usage patterns.
Benefits of establishing this best practice: When service disruption events impact application availability, the cloud allows you to implement strategies to mitigate or recover from these events. Such strategies often include creating additional resources to replace failed or inaccessible ones. Your quota strategy should accommodate these failover conditions and not introduce additional degradation due to service limit exhaustion.
Level of risk exposed if this best practice is not established: Medium
Implementation guidance
When evaluating quota limits, consider failover cases that might occur due to some degradation. The following types of failover cases should be considered:
- A VPC that is disrupted or inaccessible.
- A subnet that is inaccessible.
- An Availability Zone that has been degraded sufficiently to impact the accessibility of many resources.
- Networking routes or ingress and egress points that are blocked or changed.
- A Region that has been degraded sufficiently to impact the accessibility of many resources.
- A failure in a Region or an Availability Zone that affects some, but not all, of multiple resources.
Failures like the ones listed could be the reason to initiate a failover event. The decision to fail over is unique for each situation and customer, as the business impact can vary dramatically. However, when you make the operational decision to fail over an application or services, the capacity planning of resources in the failover location and their related quotas must already have been addressed before the event.
Review the quotas for each service, accounting for the higher-than-normal peaks that might occur. These peaks might be related to resources that cannot be reached due to networking or permissions issues but are still active. Unterminated active resources still count against the service quota.
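As a rough illustration of the overlap described above, the following Python sketch estimates the peak count against a quota while unreachable resources are still unterminated. The function names and the numbers are hypothetical, and the sketch assumes that every unreachable resource is replaced one-for-one before the original can be terminated.

```python
def peak_quota_usage(healthy: int, unreachable_but_active: int) -> int:
    """Estimate the peak number of resources counted against a quota.

    Assumption (for illustration only): each unreachable resource is
    replaced one-for-one, and the unreachable originals keep counting
    against the quota until they are successfully terminated.
    """
    if healthy < 0 or unreachable_but_active < 0:
        raise ValueError("resource counts must be non-negative")
    replacements = unreachable_but_active
    return healthy + unreachable_but_active + replacements


def has_sufficient_gap(quota: int, healthy: int, unreachable_but_active: int) -> bool:
    """Return True if the quota covers the failover overlap."""
    return peak_quota_usage(healthy, unreachable_but_active) <= quota


# Hypothetical example: 90 instances spread over three Availability Zones,
# and one zone (30 instances) becomes unreachable but is not yet terminated.
print(peak_quota_usage(healthy=60, unreachable_but_active=30))
print(has_sufficient_gap(quota=100, healthy=60, unreachable_but_active=30))
```

In this hypothetical case a quota of 100 looks comfortable against the normal usage of 90, yet it does not cover the temporary overlap of 120.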
Implementation steps
- Verify that there is enough gap between your service quota and your maximum usage to accommodate a failover or loss of accessibility.
- Determine your service quotas, accounting for your deployment patterns, availability requirements, and consumption growth.
- Request quota increases if necessary. Allow enough time for quota increase requests to be fulfilled.
- Determine your reliability requirements (also known as your number of nines).
- Establish your fault scenarios (for example, loss of a component, an Availability Zone, or a Region).
- Establish your deployment methodology (for example, canary, blue/green, red/black, or rolling).
- Add an appropriate buffer (for example, 15%) to the current limit.
- Include calculations for static stability (Zonal and Regional) where appropriate.
- Plan for consumption growth (for example, monitor your trends in consumption).
- Consider the impact of static stability for your most critical workloads. Assess resources conforming to a statically stable system in all Regions and Availability Zones.
- Consider the use of On-Demand Capacity Reservations to schedule capacity ahead of any failover. This can be a useful strategy during the most critical business schedules to reduce the risk of not obtaining the correct quantity and type of resources during failover.
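The buffer, static stability, and growth steps above can be combined into a single estimate. The following Python sketch is one possible way to do that, not a prescribed formula: the function names, the order in which growth and buffer are applied, and the example figures are all assumptions made for illustration. It uses integer percentages to avoid floating-point rounding surprises.

```python
def ceil_div(numerator: int, denominator: int) -> int:
    """Integer division that rounds up."""
    return -(-numerator // denominator)


def statically_stable_capacity(peak_demand: int, zones: int,
                               zone_failures_tolerated: int = 1) -> int:
    """Capacity to pre-provision so the surviving zones can carry peak demand.

    Assumption: load is spread evenly across zones, and no new resources
    need to be launched after a zone is lost (static stability).
    """
    surviving = zones - zone_failures_tolerated
    if surviving <= 0:
        raise ValueError("at least one zone must survive")
    per_zone = ceil_div(peak_demand, surviving)
    return per_zone * zones


def recommended_quota(peak_demand: int, zones: int,
                      zone_failures_tolerated: int = 1,
                      growth_pct: int = 0, buffer_pct: int = 15) -> int:
    """Estimate a quota from static stability, expected growth, and a buffer."""
    capacity = statically_stable_capacity(peak_demand, zones, zone_failures_tolerated)
    with_growth = ceil_div(capacity * (100 + growth_pct), 100)
    return ceil_div(with_growth * (100 + buffer_pct), 100)


# Hypothetical example: peak demand of 100 instances over 3 Availability
# Zones, tolerating the loss of 1 zone, 20% expected growth, 15% buffer.
print(statically_stable_capacity(100, zones=3))                      # 150
print(recommended_quota(100, zones=3, growth_pct=20, buffer_pct=15))  # 207
```

Comparing the result with the currently applied quota gives a simple signal for whether to request an increase ahead of time.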
Resources
Related documents:
- AWS Well-Architected Framework’s Reliability Pillar: Availability
- AWS Trusted Advisor Best Practice Checks (see the Service Limits section)
- APN Partner: partners that can help with configuration management
- Managing the account lifecycle in account-per-tenant SaaS environments on AWS
- View AWS Trusted Advisor recommendations at scale with AWS Organizations
- Automating Service Limit Increases and Enterprise Support with AWS Control Tower