Reliability - Hybrid Connectivity

Reliability

Definition

Reliability refers to the ability of a service or system to perform its expected function when required. The reliability of a system can be measured by the level of its operational quality within a given timeframe. Resiliency, by contrast, refers to the ability of a system to recover from infrastructure or service disruptions dynamically and reliably.

For more details about how availability and resiliency are used to measure reliability, see the AWS Well-Architected Framework, Reliability Pillar.

Impact on the design decision

The following are the primary aspects that can influence the design decision:

  • The required level of availability – Evaluate the availability required of the connection to AWS. This depends on how critical the connected on-premises site is to the business, how critical the resources accessed in AWS are, and the magnitude of the impact that downtime has on the business.

  • The required level of resiliency – Evaluate the impact of the duration of downtime. Some use cases demand minimal downtime, for example financial services and critical infrastructure management services.

Similarly, if the redundant network components (for example, links or network devices) are not reliable enough to provide the expected function on their own, such as the required connection performance, the design has low resiliency to failures. This results in a degraded user experience.

Requirements definition

  • What is the magnitude of the business impact if connectivity to AWS fails?

  • What is the required level of availability (uptime target, for example 99.9%) for each on-premises site that needs to be connected to AWS?

  • From a business point of view, does the cost of a connectivity failure to AWS outweigh the cost of deploying a highly reliable connectivity model to AWS?

  • Is cost in any way the primary limiting factor for the business?
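To make an uptime target such as 99.9% concrete when answering these questions, it can help to convert it into a downtime budget. The following is a minimal sketch of that arithmetic. The function name and the 365-day year are assumptions made for illustration and are not part of any AWS tooling.

```python
# Hypothetical helper: convert an availability target into a downtime budget.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_pct: float,
                             period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Return the downtime (in minutes) permitted by an availability target."""
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability_pct must be between 0 and 100")
    return period_minutes * (1.0 - availability_pct / 100.0)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        hours = allowed_downtime_minutes(target) / 60
        print(f"{target}% uptime allows about {hours:.2f} hours of downtime per year")
```

Comparing this budget with the estimated cost of an outage of that length gives a rough basis for deciding whether a more reliable connectivity model is worth its cost for a given site.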

Technical Solution

Reliability of a hybrid network connection depends on several factors, such as security and operational excellence. However, the primary factor is resiliency. Therefore, resiliency is the primary focus of this section.

First, it is important to note that not every hybrid network connectivity solution or business requires a high level of reliability. We therefore recommend that you first evaluate your requirements to identify where a high level of reliability is needed. In some scenarios, a primary site may require reliable (redundant and resilient) connections because downtime has a higher impact on the business. Regional sites, however, may not require the same level of reliability because a failure event has a lower impact on the business. We recommend that you refer to the AWS Direct Connect Resiliency Recommendations. This resource explains the AWS best practices for ensuring high resiliency in an AWS Direct Connect design.

To achieve a reliable hybrid network connectivity solution in the context of resiliency, the design needs to take into consideration the following aspects.

Redundancy

Redundancy aims to eliminate any single point of failure in the hybrid network connectivity path, including, but not limited to, network connections, edge network devices, Availability Zones, AWS Regions, DX locations, and power sources to the devices. For the purpose and scope of this whitepaper, redundancy focuses on the network connections, edge devices (for example, customer gateway devices), AWS DX locations, and AWS Regions (for multi-Region architectures).
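The effect of removing a single point of failure can be approximated with a simple calculation. The sketch below assumes that the redundant paths fail independently of each other, which is an idealization: paths that share a device, a DX location, or a power source do not fail independently. The function name and the example availability figures are hypothetical.

```python
# Hypothetical helper: availability of N redundant paths, assuming that
# failures are independent (an idealization, see the note above).
def combined_availability(path_availabilities):
    """Probability that at least one of the redundant paths is available."""
    all_paths_down = 1.0
    for availability in path_availabilities:
        if not 0.0 <= availability <= 1.0:
            raise ValueError("each availability must be between 0 and 1")
        all_paths_down *= 1.0 - availability
    return 1.0 - all_paths_down


if __name__ == "__main__":
    single = combined_availability([0.999])
    dual = combined_availability([0.999, 0.999])
    print(f"one path:  {single:.6f}")
    print(f"two paths: {dual:.6f}")
```

The closer the real design comes to independent failure domains (separate links, devices, and locations), the closer the achieved availability comes to this estimate.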

Reliable failover components

In some scenarios, a system might be available (functioning) without performing its functions at the minimum required level. From an operational quality point of view, the system may not be delivering the intended service or performing a function reliably even if it is technically available. This situation is common after a failover to a redundant component (for example, a redundant network link) that cannot carry the load on its own. A reliable redundant component helps to mitigate or eliminate this situation and makes the overall design more resilient.

Failover time

Failover time is another key aspect of resiliency. You might have a redundant component or connection, but if the secondary component takes a long time to take over after the primary fails, or requires a manual configuration update by an administrator, the solution is less resilient even though the design includes redundancy.

As part of hybrid network connectivity, one of the key metrics that a cloud and network architect needs to consider is the failure detection time. This is important because even a routing design or protocol that is tuned to fail over very quickly cannot do so if the failure itself is not reported to the routing protocol fast enough.

There are many techniques to speed up failure detection, depending on factors such as the connectivity type and physical medium. With AWS hybrid connectivity, if you are using VPN, you might need to look into VPN dead peer detection. If you are working with an AWS DX connection, you need to look at Bidirectional Forwarding Detection (BFD), which helps to achieve a faster routing re-convergence time and results in higher solution resiliency. Note that asynchronous BFD is automatically enabled for DX virtual interfaces on the AWS side. However, you must configure your router for asynchronous BFD to enable it for your connection. For more information, see How do I enable BFD for my DX connection?

In addition, you must consider the on-premises network side. If a routing protocol is deployed there and peers with multiple CGWs, this routing protocol needs to converge as well. Each routing protocol has its own convergence speed and recommended timer tuning to achieve fast and stable convergence.
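To see why failure detection dominates failover time, the sketch below compares detection by BFD with detection by BGP keepalives alone. It uses a simplified model in which detection time is the interval multiplied by the number of missed packets tolerated. The timer values (a 300 ms BFD interval with a multiplier of 3, and a 30-second BGP keepalive with a 90-second hold time) and the 1-second convergence time are illustrative assumptions. Confirm the actual values against the current AWS documentation and your router configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DetectionMethod:
    """Simplified model: a failure is declared after `multiplier` missed packets."""
    name: str
    interval_s: float   # seconds between keepalive/control packets
    multiplier: int     # consecutive missed packets before declaring failure

    @property
    def detection_time_s(self) -> float:
        return self.interval_s * self.multiplier


def total_failover_time_s(method: DetectionMethod, convergence_s: float) -> float:
    """Failover time = time to detect the failure + time for routing to converge."""
    return method.detection_time_s + convergence_s


# Illustrative values only (assumptions, not authoritative defaults).
bfd = DetectionMethod("BFD", interval_s=0.3, multiplier=3)
bgp_hold = DetectionMethod("BGP hold timer", interval_s=30.0, multiplier=3)

if __name__ == "__main__":
    assumed_convergence_s = 1.0  # hypothetical routing convergence time
    for method in (bfd, bgp_hold):
        total = total_failover_time_s(method, assumed_convergence_s)
        print(f"{method.name}: detect {method.detection_time_s:.1f}s, "
              f"total failover about {total:.1f}s")
```

Under these assumptions, the same routing design fails over in about two seconds with BFD and in more than a minute and a half without it, which is why detection time deserves as much attention as protocol convergence.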

Traffic engineering

Traffic engineering, in the context of resilient hybrid network connectivity design, aims to address how traffic should flow over multiple available connections in both normal and failure scenarios. Therefore, we recommend that you follow the concept of “design for failure,” where you look at how the solution will operate in different failure scenarios and whether that behavior is acceptable to the business. This section discusses some of the common traffic engineering use cases that aim to enhance the overall resiliency level of the hybrid network connectivity solution.

To design an effective traffic engineering solution, you need a good understanding of how each of the AWS networking components (Amazon VPC and gateways) handles IP routing in terms of route evaluation and selection, as well as the possible mechanisms to influence the route selection. The figure below summarizes how the IP routing decision is made at different levels within AWS, from the VPC to the networking gateway components.

Figure 1 – Routing evaluation within Amazon VPC and associated gateways
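As a rough illustration of the first step in route evaluation, the sketch below selects the most specific matching route (longest prefix match) from a simplified route table. It models only that one step and none of the further tie-breaking rules. The route table contents and target names are hypothetical placeholders, not real resource IDs.

```python
import ipaddress


def select_route(route_table, destination):
    """Return (prefix, target) of the most specific route that matches the
    destination address, or None if no route matches."""
    dest = ipaddress.ip_address(destination)
    best = None
    for prefix, target in route_table.items():
        network = ipaddress.ip_network(prefix)
        # Keep the matching route with the longest prefix length.
        if dest in network and (best is None or network.prefixlen > best[0].prefixlen):
            best = (network, target)
    if best is None:
        return None
    return str(best[0]), best[1]


# Hypothetical route table: prefixes mapped to placeholder target names.
ROUTES = {
    "10.0.0.0/16": "local",
    "192.168.0.0/16": "vgw-example",
    "192.168.10.0/24": "tgw-example",
    "0.0.0.0/0": "igw-example",
}

if __name__ == "__main__":
    for address in ("192.168.10.5", "192.168.20.5", "8.8.8.8"):
        print(address, "->", select_route(ROUTES, address))
```

Because the more specific prefix wins, advertising a more specific or less specific prefix over a given connection is one of the mechanisms available for steering traffic between connections.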

Note

In the VPC route table, you might reference a prefix list that has additional route selection rules. For more information about this use case, see Route priority. AWS Transit Gateway route tables also support prefix lists but, once applied, they get expanded to specific route entries.