Reliability pillar - Hybrid Networking Lens

The reliability pillar encompasses the ability of a system to recover from infrastructure or service disruptions, dynamically acquire computing resources to meet demand, and mitigate disruptions such as misconfigurations or transient network issues.

Best Practices

There are three best practice areas for reliability in the cloud:

Foundations
Change management
Failure management

To achieve reliability, a system must have a well-planned foundation and monitoring in place, with mechanisms for handling changes in demand or requirements. The system should be designed to detect failure and automatically heal itself.

Foundations

HN_REL1: How do you manage AWS service quotas for hybrid networking services?

AWS sets service quotas (also called service limits) to protect you from accidentally over-provisioning resources. A service quota is an upper limit on the number of each resource your team can request. You will need to have governance and processes in place to monitor and change these quotas to meet your business needs. As you adopt the cloud, plan integration with existing on-premises resources (a hybrid approach). A hybrid model allows the gradual transition to an all-in cloud approach over time. Therefore, it’s important to have a design for how your AWS and on-premises resources will interact as a network topology.

If you are using AWS Direct Connect, there are limits on the bandwidth available on each connection. Currently, you can order a dedicated connection of 1 Gbps, 10 Gbps, or 100 Gbps, and if you need more bandwidth you can order a Link Aggregation Group (LAG) of two 100 Gbps connections, for 200 Gbps of aggregate bandwidth. If you are using AWS Site-to-Site VPN connections to access resources in a VPC, those connections are cumulatively bound by the virtual private gateway throughput of 1.25 Gbps.

For a large deployment, you may need to increase service quotas from their default values to meet your business needs. For supported services, you can view your quotas in the Service Quotas console in the AWS Management Console and manage them proactively by configuring Amazon CloudWatch alarms that monitor usage and alert you when you approach a quota. Contact AWS Support to request an increase for services not currently supported by Service Quotas.

You can also proactively raise quotas if you anticipate exceeding them in your workloads. When raising these quotas, ensure that there is a sufficient gap between your service quota and your maximum usage to accommodate scale.
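The headroom check described above can be sketched in a few lines. This is a minimal illustration, not an AWS API: the quota names, the usage figures, and the 80% threshold are all hypothetical, and in practice the usage and quota values would come from your own monitoring.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuotaStatus:
    """Current usage of one resource against its service quota."""
    name: str
    usage: float
    quota: float

    @property
    def utilization(self) -> float:
        if self.quota <= 0:
            raise ValueError(f"quota for {self.name} must be positive")
        return self.usage / self.quota


def quotas_needing_increase(statuses, max_utilization=0.8):
    """Return the names of quotas whose usage leaves too little headroom.

    max_utilization is an assumed policy value, not an AWS default.
    """
    return [s.name for s in statuses if s.utilization >= max_utilization]


# Hypothetical figures for illustration only.
statuses = [
    QuotaStatus("vpn-connections", usage=42, quota=50),
    QuotaStatus("virtual-interfaces", usage=10, quota=50),
]
print(quotas_needing_increase(statuses))
```

A lower threshold leaves a larger gap between the quota and peak usage, at the cost of requesting increases earlier.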

Change management

Being aware of how change affects a system enables you to plan proactively, and monitoring enables you to quickly identify trends that could lead to capacity issues or SLA breaches.

HN_REL2: How do you prepare for AWS Direct Connect scheduled maintenance or events?

HN_REL3: How do you regulate bandwidth usage for Direct Connect connections and execute changes?

HN_REL4: How do you monitor your Direct Connect connections and Site-to-Site VPN?

Logs and metrics are powerful tools for gaining insight into the health of your workloads. Monitor CloudWatch logs and metrics for your workload, and send notifications when thresholds are crossed or significant events occur. For example, with AWS Direct Connect, the ConnectionBpsIngress, ConnectionBpsEgress, ConnectionPpsEgress, and ConnectionPpsIngress metrics help you track how much of the connection capacity is in use. For a comprehensive list of metrics, refer to the AWS Direct Connect User Guide. For information about VPN metrics, refer to the AWS VPN User Guide.
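As a sketch of a utilization threshold on one of these metrics, the helper below builds the parameters for a CloudWatch alarm on ConnectionBpsEgress. The function name, the connection ID, the 80% threshold, and the period settings are assumptions for illustration; the function only builds a dictionary and does not call AWS.

```python
def utilization_alarm_params(connection_id, capacity_gbps,
                             metric_name="ConnectionBpsEgress",
                             threshold_fraction=0.8):
    """Build CloudWatch alarm parameters for a Direct Connect connection.

    The alarm fires when the average bit rate stays above the chosen
    fraction of the connection's capacity. All defaults are assumptions.
    """
    if not 0 < threshold_fraction <= 1:
        raise ValueError("threshold_fraction must be in (0, 1]")
    capacity_bps = capacity_gbps * 1_000_000_000  # the metric is in bits per second
    return {
        "AlarmName": f"dx-{connection_id}-{metric_name}-high",
        "Namespace": "AWS/DX",
        "MetricName": metric_name,
        "Dimensions": [{"Name": "ConnectionId", "Value": connection_id}],
        "Statistic": "Average",
        "Period": 300,            # seconds per evaluation window (assumed)
        "EvaluationPeriods": 3,   # consecutive windows before alarming (assumed)
        "Threshold": capacity_bps * threshold_fraction,
        "ComparisonOperator": "GreaterThanThreshold",
    }


# "dxcon-EXAMPLE" is a placeholder connection ID.
params = utilization_alarm_params("dxcon-EXAMPLE", capacity_gbps=10)
print(params["AlarmName"], params["Threshold"])
```

Assuming boto3 is installed and credentials are configured, a dictionary like this could be passed to the CloudWatch client as put_metric_alarm(**params), typically together with an AlarmActions entry that names a notification target.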

When an AWS Direct Connect connection is down for maintenance, that connection can be down from a few minutes to a few hours based on the level of maintenance required. To prepare for this downtime, consider one or more of the following options:

Request a redundant Direct Connect connection.

Configure a virtual private network (VPN) connection as a backup.

Monitor the bandwidth usage on the Direct Connect connection, and increase the bandwidth available for active traffic by ordering more Direct Connect connections and aggregating them into a Link Aggregation Group (LAG). To increase bandwidth with minimal downtime by migrating a virtual interface from a single Direct Connect connection to a LAG, refer to the AWS Knowledge Center.

Failure management

HN_REL5: How does your system withstand component failures?

HN_REL6: How are you testing for resiliency?

HN_REL7: How are you planning for disaster recovery?

In any system of reasonable complexity, it’s expected that failures will occur. Know how to detect these failures and respond to them automatically, so that your network can withstand them without affecting the workloads that run over it.

Highly resilient network connections are key to a well-architected system. AWS recommends connecting from multiple data centers for physical location redundancy. For more information, refer to AWS Direct Connect resiliency recommendations. When designing remote connections, consider using redundant hardware and telecommunications providers. Your telecommunications provider should have diverse fiber and path connectivity for your last mile connection, or offer an SLA. Ensure that the physical infrastructure you set up meets the requirements of the AWS Direct Connect SLA and the AWS VPN SLA. Additionally, use dynamically routed, Active/Active connections for automatic load balancing and failover across redundant network connections. Provision sufficient network capacity to ensure that the failure of one network connection does not overwhelm and degrade the remaining redundant connections.
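The capacity guidance above amounts to a simple check: with any one connection down, the remaining connections must still carry peak demand. A minimal sketch, with a hypothetical function name and illustrative figures, using the loss of the largest connection as the worst case:

```python
def survives_single_failure(capacities_gbps, peak_demand_gbps):
    """Return True if peak demand still fits after losing any one connection.

    Losing the largest connection is the worst case, so checking that
    case covers every other single failure.
    """
    if len(capacities_gbps) < 2:
        return False  # a single connection has no redundancy
    remaining = sum(capacities_gbps) - max(capacities_gbps)
    return remaining >= peak_demand_gbps


# Two 10 Gbps connections carrying 12 Gbps at peak: one failure overloads the other.
print(survives_single_failure([10, 10], peak_demand_gbps=12))
# Adding a third 10 Gbps connection leaves 20 Gbps after a failure.
print(survives_single_failure([10, 10, 10], peak_demand_gbps=12))
```

The first case shows why Active/Active load balancing alone is not enough: both links look healthy at 60% utilization until one of them fails.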

Each Site-to-Site VPN connection has two tunnels, with each tunnel using a unique virtual private gateway public IP address. It’s important to configure both tunnels for redundancy, preferably with dynamic routing in an Active/Active setup. When one tunnel becomes unavailable (for example, during maintenance or an unplanned outage), network traffic is automatically routed to the available tunnel for that specific Site-to-Site VPN connection.

Test the Direct Connect failover scenarios to help you find any latent bugs that could surface in production. Exercise these tests regularly to ensure that your configurations are appropriate for failovers and verify the impact on workload during these tests. These tests help in validating your recovery procedures. You can use the Resiliency Toolkit - Failover Testing feature to test the resiliency of the Direct Connect connections. The failover testing feature allows you to test resiliency by disabling one or more Border Gateway Protocol (BGP) sessions using the AWS Management Console, Command Line Interface, or AWS Direct Connect API. It allows you to shut down BGP sessions for a configurable time period. You can also cancel failover tests at any time during the testing period to return to the pre-test configuration. Alternatively, use automation to simulate different failures or to recreate scenarios that led to failures before. This exposes failure pathways that you can test and rectify before a real failure scenario, reducing the risk of components failing that have not been tested before.
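The failover test flow described above (shut down BGP sessions for a set period, check the workload, cancel early if needed) might be automated along the following lines. This is a sketch under stated assumptions: the method names mirror the boto3 Direct Connect client operations start_bgp_failover_test and stop_bgp_failover_test as best understood here, so verify them against the current API reference. The fake client, its return values, the virtual interface ID, and the health check are hypothetical stand-ins so the example runs offline.

```python
class FakeDirectConnectClient:
    """Offline stand-in for a Direct Connect API client (hypothetical)."""

    def __init__(self):
        self.calls = []

    def start_bgp_failover_test(self, **kwargs):
        self.calls.append(("start", kwargs))
        return {"virtualInterfaceTest": {"testId": "hypothetical-test-id"}}

    def stop_bgp_failover_test(self, **kwargs):
        self.calls.append(("stop", kwargs))
        return {"virtualInterfaceTest": {"testId": "hypothetical-test-id"}}


def run_failover_test(client, virtual_interface_id, duration_minutes,
                      workload_is_healthy):
    """Start a BGP failover test and cancel it early if the workload suffers.

    workload_is_healthy is a caller-supplied check (an assumption of this
    sketch), for example a probe sent over the redundant path.
    """
    client.start_bgp_failover_test(
        virtualInterfaceId=virtual_interface_id,
        testDurationInMinutes=duration_minutes,
    )
    healthy = workload_is_healthy()
    if not healthy:
        # Cancelling returns the BGP sessions to the pre-test configuration.
        client.stop_bgp_failover_test(virtualInterfaceId=virtual_interface_id)
    return healthy


client = FakeDirectConnectClient()
ok = run_failover_test(client, "dxvif-EXAMPLE", 10, workload_is_healthy=lambda: False)
print(ok, [name for name, _ in client.calls])
```

Passing the client in as a parameter keeps the procedure testable without touching a live connection; a real run would supply an actual Direct Connect client instead of the fake.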

Resources

Refer to the following resources to learn more about our best practices related to reliability.

Documents

Maximizing resiliency with AWS Direct Connect

AWS Support
