The reliability pillar encompasses the ability of a system to recover from infrastructure or service disruptions, dynamically acquire computing resources to meet demand, and mitigate disruptions such as misconfigurations or transient network issues.
Best Practices
There are three best practice areas for reliability in the cloud:
- Foundations
- Change management
- Failure management
To achieve reliability, a system must have a well-planned foundation and monitoring in place, with mechanisms for handling changes in demand or requirements. The system should be designed to detect failure and automatically heal itself.
Foundations
HN_REL1: How do you manage AWS service quotas for hybrid networking services? |
---|
AWS sets service quotas (also called service limits) to protect you from accidentally over-provisioning resources. A service quota is an upper limit on the number of each resource your team can request. You will need to have governance and processes in place to monitor and change these quotas to meet your business needs. As you adopt the cloud, plan integration with existing on-premises resources (a hybrid approach). A hybrid model allows the gradual transition to an all-in cloud approach over time. Therefore, it’s important to have a design for how your AWS and on-premises resources will interact as a network topology.
If you are using AWS Direct Connect, each connection has a fixed bandwidth limit. Currently, you can have a dedicated connection of 1 Gbps, 10 Gbps, or 100 Gbps, and if you need more bandwidth you can order a link aggregation group (LAG) of two 100 Gbps connections, for a total of 200 Gbps aggregate bandwidth. If you are using an AWS Site-to-Site VPN connection to access resources in a VPC, you are cumulatively bound by the virtual private gateway throughput of 1.25 Gbps.
Service quotas might need to be increased from their default values to handle the requirements of a large deployment. For supported services, you can proactively manage your quotas by configuring Amazon CloudWatch alarms that monitor usage and alert you to approaching quotas. You can set these up through Service Quotas in the AWS Management Console.
You can also proactively raise quotas if you anticipate exceeding them in your workloads. When raising these quotas, ensure that there is a sufficient gap between your service quota and your maximum usage to accommodate scale.
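The headroom check described above can be sketched as a small calculation. This is a minimal illustration, not an AWS API: the `check_quota` helper, the `QuotaStatus` type, the 80 percent threshold, and the sample numbers are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuotaStatus:
    utilization: float    # peak usage as a fraction of the quota
    headroom: float       # quota units still unused at peak usage
    needs_increase: bool  # True when utilization reaches the threshold


def check_quota(quota: float, peak_usage: float, threshold: float = 0.8) -> QuotaStatus:
    """Compare peak usage with a quota. The 0.8 threshold is an assumed policy."""
    if quota <= 0:
        raise ValueError("quota must be positive")
    utilization = peak_usage / quota
    return QuotaStatus(
        utilization=utilization,
        headroom=quota - peak_usage,
        needs_increase=utilization >= threshold,
    )


# Hypothetical numbers: a quota of 50 units with a peak usage of 45.
status = check_quota(quota=50, peak_usage=45)
print(status)
```

A check like this could run on a schedule against the usage figures your team already collects, so that a quota increase is requested while there is still a gap between usage and the limit.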
Change management
Being aware of how change affects a system enables you to plan proactively, and monitoring enables you to quickly identify trends that could lead to capacity issues or SLA breaches.
HN_REL2: How do you prepare for AWS Direct Connect scheduled maintenance or events? |
---|
HN_REL3: How do you regulate bandwidth usage for Direct Connect connections and executing changes? |
---|
HN_REL4: How do you monitor your Direct Connect connections and Site-to-Site VPN? |
---|
Logs and metrics are powerful tools for gaining insight into the health of your workloads. Monitor CloudWatch logs and metrics for your workload, and send notifications when thresholds are crossed or significant events occur. For example, with AWS Direct Connect, the ConnectionBpsIngress, ConnectionBpsEgress, ConnectionPpsEgress, and ConnectionPpsIngress metrics help track the connection capacity utilization. For a comprehensive list of metrics, refer to the AWS Direct Connect User Guide. For information about VPN metrics, refer to the AWS VPN User Guide.
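As one way to act on these metrics, the sketch below builds the parameters for a CloudWatch alarm that fires when average egress throughput stays above a share of the port speed. The helper name `build_dx_utilization_alarm`, the 80 percent threshold, the five-minute period, and the connection ID are assumptions for illustration. The code only builds a dictionary, so it runs without AWS credentials; passing the result to the boto3 CloudWatch `put_metric_alarm` call is left as a comment.

```python
from typing import Optional

GBPS = 1_000_000_000  # bits per second in one Gbps


def build_dx_utilization_alarm(
    connection_id: str,
    port_speed_gbps: int,
    metric_name: str = "ConnectionBpsEgress",
    utilization_percent: int = 80,        # assumed alerting policy
    sns_topic_arn: Optional[str] = None,  # notification target, if any
) -> dict:
    """Return keyword arguments suitable for CloudWatch put_metric_alarm."""
    if not 0 < utilization_percent <= 100:
        raise ValueError("utilization_percent must be between 1 and 100")
    params = {
        "AlarmName": f"{connection_id}-{metric_name}-above-{utilization_percent}pct",
        "Namespace": "AWS/DX",
        "MetricName": metric_name,
        "Dimensions": [{"Name": "ConnectionId", "Value": connection_id}],
        "Statistic": "Average",
        "Period": 300,           # seconds per evaluation window
        "EvaluationPeriods": 3,  # consecutive windows above the threshold
        "Threshold": port_speed_gbps * GBPS * utilization_percent / 100,
        "ComparisonOperator": "GreaterThanThreshold",
    }
    if sns_topic_arn:
        params["AlarmActions"] = [sns_topic_arn]
    return params


# "dxcon-EXAMPLE" is a placeholder connection ID.
params = build_dx_utilization_alarm("dxcon-EXAMPLE", port_speed_gbps=10)
print(params["AlarmName"], params["Threshold"])

# With boto3 installed and credentials configured, the alarm could be created with:
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```

The same helper can be called with `metric_name="ConnectionBpsIngress"` to cover the other direction. Without an `AlarmActions` entry the alarm changes state but does not notify anyone.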
When an AWS Direct Connect connection is taken down for maintenance, the outage can last from a few minutes to a few hours, depending on the level of maintenance required. To prepare for this downtime, consider one or more of the following options:
Monitor the bandwidth usage on the Direct Connect connection. To increase bandwidth for active traffic, order more Direct Connect connections and aggregate them into a link aggregation group (LAG). To increase bandwidth with minimal downtime by using a LAG and migrating the virtual interface from a single Direct Connect connection to the LAG, refer to the Knowledge Center.
Failure management
HN_REL5: How does your system withstand component failures? |
---|
HN_REL6: How are you testing for resiliency? |
---|
HN_REL7: How are you planning for disaster recovery? |
---|
In any system of reasonable complexity, it’s expected that failures will occur. Know how to become aware of these failures and respond to them automatically, so that your network can withstand them without affecting the workloads that run over it.
Highly resilient network connections are key to a well-architected system. AWS recommends connecting from multiple data centers for physical location redundancy. For more information, refer to AWS Direct Connect resiliency recommendations.
Each Site-to-Site VPN connection has two tunnels, with each tunnel using a unique virtual private gateway public IP address. It’s important to configure both tunnels for redundancy, preferably using dynamic routing in an active/active setup. When one tunnel becomes unavailable (for example, because of maintenance or an unplanned outage), network traffic is automatically routed to the available tunnel for that specific Site-to-Site VPN connection.
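A simple way to confirm that both tunnels are actually available is to inspect the tunnel telemetry for the connection. The sketch below works on a dictionary shaped like one entry of the `VpnConnections` list returned by the EC2 `describe_vpn_connections` call, using its `VgwTelemetry` field as I understand it. The helper names, the three labels, and the sample record (documentation-range IP addresses, placeholder ID) are assumptions for illustration.

```python
def tunnel_states(vpn_connection: dict) -> dict:
    """Map each tunnel's outside IP address to its reported status."""
    return {
        tunnel["OutsideIpAddress"]: tunnel["Status"]
        for tunnel in vpn_connection.get("VgwTelemetry", [])
    }


def redundancy_report(vpn_connection: dict) -> str:
    """Classify a VPN connection as 'redundant', 'degraded', or 'down'."""
    states = tunnel_states(vpn_connection)
    up = [ip for ip, status in states.items() if status == "UP"]
    if len(states) >= 2 and len(up) == len(states):
        return "redundant"  # every tunnel is up
    if up:
        return "degraded"   # traffic flows, but there is no spare tunnel
    return "down"


# Hypothetical record: one tunnel up, one tunnel down.
sample = {
    "VpnConnectionId": "vpn-EXAMPLE",
    "VgwTelemetry": [
        {"OutsideIpAddress": "203.0.113.10", "Status": "UP"},
        {"OutsideIpAddress": "203.0.113.20", "Status": "DOWN"},
    ],
}
print(redundancy_report(sample))
```

A "degraded" result is worth alerting on even though traffic still flows, because the connection can no longer absorb the loss of its remaining tunnel.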
Test the Direct Connect failover scenarios to find latent bugs before they surface in production. Run these tests regularly to confirm that your configurations are appropriate for failover, and verify the impact on your workload during each test. These tests also help validate your recovery procedures. You can use the Resiliency Toolkit failover testing feature to test the resiliency of your Direct Connect connections. The feature lets you disable one or more Border Gateway Protocol (BGP) sessions for a configurable time period using the AWS Management Console, the AWS Command Line Interface, or the AWS Direct Connect API. You can cancel a failover test at any time during the testing period to return to the pre-test configuration. Alternatively, use automation to simulate different failures or to recreate scenarios that led to failures before. This exposes failure pathways that you can test and rectify before a real failure occurs, reducing the risk that untested components fail in production.
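The failover test described above can also be driven from a script. The following is a hedged sketch that assumes the boto3 Direct Connect client exposes `start_bgp_failover_test` and `stop_bgp_failover_test` with the parameter names shown; verify them against the current API reference before relying on them. The wrapper functions, the ten-minute default, the virtual interface ID, and the `RecordingClient` stand-in are hypothetical. The stand-in lets the example run as a dry run without AWS access.

```python
from typing import Optional, Sequence


def start_failover_test(dx_client, virtual_interface_id: str,
                        duration_minutes: int = 10,
                        bgp_peers: Optional[Sequence[str]] = None):
    """Shut down BGP sessions on a virtual interface for a limited time."""
    if duration_minutes <= 0:
        raise ValueError("duration_minutes must be positive")
    request = {
        "virtualInterfaceId": virtual_interface_id,
        "testDurationInMinutes": duration_minutes,
    }
    if bgp_peers:
        # Restrict the test to specific peers when a list is supplied.
        request["bgpPeers"] = list(bgp_peers)
    return dx_client.start_bgp_failover_test(**request)


def cancel_failover_test(dx_client, virtual_interface_id: str):
    """End a running test early and return to the pre-test configuration."""
    return dx_client.stop_bgp_failover_test(virtualInterfaceId=virtual_interface_id)


class RecordingClient:
    """Stand-in for boto3.client('directconnect') that only records calls."""

    def __init__(self):
        self.calls = []

    def start_bgp_failover_test(self, **kwargs):
        self.calls.append(("start_bgp_failover_test", kwargs))
        return {}

    def stop_bgp_failover_test(self, **kwargs):
        self.calls.append(("stop_bgp_failover_test", kwargs))
        return {}


# Dry run with a placeholder virtual interface ID.
client = RecordingClient()
start_failover_test(client, "dxvif-EXAMPLE", duration_minutes=5)
cancel_failover_test(client, "dxvif-EXAMPLE")
print(client.calls)
```

Passing the client in as an argument keeps the wrappers easy to exercise in a test environment, and a real client can be substituted once the procedure has been reviewed.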
Resources
Refer to the following resources to learn more about our best practices related to reliability.