Reliability - Hybrid Connectivity

Reliability

Definition

Reliability refers to the ability of a service or system to perform its expected function when required. The reliability of a system can be measured by the level of its operational quality within a given timeframe. Resiliency, by contrast, refers to the ability of a system to recover from infrastructure or service disruptions dynamically and reliably.

For more details on how availability and resiliency are used to measure reliability, refer to the AWS Well-Architected Framework, Reliability Pillar.

Impact on the design decision

The following are the primary aspects that impact the design decision:

The required level of availability: the availability of a connection to AWS needs to be evaluated based on how critical it is to the business, considering the criticality of the connected on-premises site, the criticality of the resources accessed in AWS, and the magnitude of the business impact of downtime.

The required level of resiliency: evaluate the impact of the downtime duration. Some use cases demand minimal downtime; examples include financial services and critical infrastructure management services.

Similarly, if a redundant network component (link, network device, and so on) is not reliable enough to provide the expected function on its own, such as the required connection performance, the design has low resiliency to failures. The consequence is a poor, degraded user experience.

Requirements definition

  • What is the magnitude of the business impact if connectivity to AWS fails?

  • What is the required level of availability (uptime target, for example 99.9%) for each on-premises site that needs to be connected to AWS?

  • From a business point of view, does the cost of a connectivity failure to AWS outweigh the cost of deploying a highly reliable connectivity model to AWS?

  • Is cost in any way the primary limiting factor for the business?
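
The availability and cost questions above can be explored with some simple arithmetic. The following is a minimal sketch, not a sizing method: it converts an uptime target into a yearly downtime budget, estimates the availability of redundant connections under the assumption that they fail independently (which real links sharing a device, location, or provider may not), and compares a hypothetical outage cost with a hypothetical cost of redundancy. All function names and monetary figures are illustrative assumptions.

```python
# Illustrative arithmetic only; every figure used below is hypothetical.
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def allowed_downtime_hours(availability: float) -> float:
    """Yearly downtime budget implied by an availability target in [0, 1]."""
    return (1.0 - availability) * HOURS_PER_YEAR


def combined_availability(availability: float, connections: int) -> float:
    """Availability of N redundant connections.

    Assumes the connections fail independently, so the solution is down
    only when every connection is down at the same time.
    """
    return 1.0 - (1.0 - availability) ** connections


def expected_outage_cost(availability: float, cost_per_hour: float) -> float:
    """Expected yearly outage cost if the whole downtime budget is consumed."""
    return allowed_downtime_hours(availability) * cost_per_hour


if __name__ == "__main__":
    single = 0.999                               # 99.9% target from the text
    dual = combined_availability(single, 2)      # two independent connections
    cost_per_hour = 20_000.0                     # hypothetical business impact
    redundancy_cost = 60_000.0                   # hypothetical yearly cost of a second link

    saving = (expected_outage_cost(single, cost_per_hour)
              - expected_outage_cost(dual, cost_per_hour))
    print(f"single link downtime budget: {allowed_downtime_hours(single):.2f} h/year")
    print(f"dual link downtime budget:   {allowed_downtime_hours(dual):.4f} h/year")
    print(f"redundancy pays off: {saving > redundancy_cost}")
```

With inputs like these, the comparison in the last line is one way to frame the third question in the list: redundancy is justified when the outage cost it avoids exceeds what it costs to deploy.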

Technical Solution

The reliability of a hybrid network connection depends on several factors, such as security and operational excellence; however, the primary factor is resiliency. Therefore, resiliency is the primary focus of this section.

First, it is important to note that not every hybrid network connectivity solution or business requires a high level of reliability. This is why the requirements should be evaluated first, in order to identify where a high level of reliability is needed. In some scenarios, a primary site may require reliable (redundant and resilient) connections because downtime has a higher impact on the business, while regional sites may not require the same level of reliability because a failure event has a lower impact on the business. It is recommended to refer to the AWS Direct Connect Resiliency Recommendations, which explain the AWS best practices for ensuring high resiliency in an AWS Direct Connect design.

To achieve a reliable hybrid network connectivity solution in the context of resiliency, the design needs to take into consideration the following aspects:

  • Redundancy: aims to eliminate any single point of failure in the hybrid network connectivity path, including but not limited to network connections, edge network devices, redundancy across Availability Zones, AWS Regions, and DX locations, as well as the power sources of the devices. For the purpose and scope of this whitepaper, redundancy focuses on the network connections, edge devices (for example, customer gateway devices), AWS DX locations, and AWS Regions (for multi-Region architectures).

  • Reliable failover components: in some scenarios a system might be available (functioning) but not performing its functions at the minimum required level. From an operational quality point of view, the system is not delivering the intended service or performing its function reliably even though it is technically available. This situation is very common following a failover event to a redundant component (for example, a redundant network link). A reliable redundant component can help to mitigate or eliminate this situation and make the overall design more resilient.

  • Failover time: this is another key aspect of resiliency. You may have a redundant component or connection, but if the secondary component takes a long time to take over after the primary fails, or requires a manual configuration update by an administrator, the solution is less resilient even though the design includes redundancy. In hybrid network connectivity, one of the key timers that cloud and network architects need to take into consideration is the failure detection time. You may have a routing design or protocol tuned to fail over very quickly, but the failover will not be triggered quickly if the failure itself is not reported to the routing protocol fast enough. There are many techniques to speed up failure detection, depending on the connectivity type, the physical medium, and so on. With AWS hybrid connectivity, if you are using VPN you might need to look into VPN dead peer detection, and if you are working with an AWS DX connection, you need to look at Bidirectional Forwarding Detection (BFD), which helps to achieve a faster routing re-convergence time and results in higher solution resiliency. In addition, you need to take the on-premises network side into consideration: if a routing protocol is deployed there and peers with multiple CGWs, that routing protocol needs to converge as well (each routing protocol has its own convergence speed and recommended timer tuning to achieve a fast and stable convergence time).

  • Traffic Engineering: traffic engineering, in the context of resilient hybrid network connectivity design, aims to address how traffic should flow over multiple available connections in normal and failure scenarios. It is therefore recommended to follow the concept of ‘design for failure’, where you look at how the solution will operate in different failure scenarios and whether that behavior will be acceptable to the business. This section discusses some of the common traffic engineering use cases that aim to enhance the overall resiliency of the hybrid network connectivity solution. To design an effective traffic engineering solution, you need a good understanding of how each of the AWS networking components (Amazon VPC and gateways) handles IP routing in terms of route evaluation and selection, as well as the possible mechanisms to influence route selection. Figure 12 summarizes how the IP routing decision is made at different levels within AWS, from the VPC to the networking gateway components.
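
To illustrate the failover time consideration in the list above, the following minimal sketch compares how long a failure might go undetected with and without BFD, and adds an assumed re-convergence delay to estimate total failover time. The class and function names are hypothetical, and the timer values (a 300 ms BFD interval with a multiplier of 3, a 90-second BGP hold time, and the re-convergence figures) are example values chosen for illustration; verify the supported and default values against the current AWS Direct Connect and device documentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BfdTimers:
    """BFD liveness settings (example values, not authoritative defaults)."""
    interval_ms: int   # negotiated interval between BFD control packets
    multiplier: int    # consecutive missed packets before declaring failure

    def detection_time_ms(self) -> int:
        # A BFD session is declared down after `multiplier` missed intervals.
        return self.interval_ms * self.multiplier


@dataclass(frozen=True)
class BgpTimers:
    """BGP keepalive and hold timers in seconds (example values)."""
    keepalive_s: int
    hold_s: int

    def detection_time_ms(self) -> int:
        # Without a faster mechanism, a silent peer is detected only when
        # the hold timer expires.
        return self.hold_s * 1000


def estimated_failover_ms(detection_ms: int, reconvergence_ms: int) -> int:
    """Total failover time = failure detection + routing re-convergence."""
    return detection_ms + reconvergence_ms


if __name__ == "__main__":
    bfd = BfdTimers(interval_ms=300, multiplier=3)
    bgp = BgpTimers(keepalive_s=30, hold_s=90)
    reconvergence_ms = 500  # hypothetical on-premises convergence time

    print("BFD detection (ms):", bfd.detection_time_ms())
    print("BGP hold-timer detection (ms):", bgp.detection_time_ms())
    print("Failover with BFD (ms):",
          estimated_failover_ms(bfd.detection_time_ms(), reconvergence_ms))
    print("Failover without BFD (ms):",
          estimated_failover_ms(bgp.detection_time_ms(), reconvergence_ms))
```

The point of the sketch is that detection time, not route computation, can dominate the total: a routing design tuned for fast failover gains little if the failure is only noticed when a hold timer expires.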

Figure 12 – Routing evaluation within Amazon VPC and associated gateways

Note

In the VPC route table, you might reference a prefix list, which has additional route selection rules. Refer to route priority for prefix lists for more information about this use case. AWS Transit Gateway route tables also support prefix lists, but once applied they are expanded into specific route entries.
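
As a rough illustration of route evaluation, the following sketch models only two of the rules involved: the most specific (longest) matching prefix wins, and when a static route and a propagated route have the same prefix, the static route is preferred. It is a deliberately simplified model under those assumptions; it does not cover local routes, prefix lists, or the gateway-specific priorities between Direct Connect and VPN routes. The Route type, the select_route function, and the target names are hypothetical.

```python
import ipaddress
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Route:
    prefix: str               # destination CIDR, for example "10.1.0.0/16"
    target: str               # hypothetical label for the next-hop gateway
    propagated: bool = False  # True if learned dynamically, False if static


def select_route(routes: List[Route], destination: str) -> Optional[Route]:
    """Pick a route for `destination` using a simplified evaluation order.

    1. Keep only routes whose prefix contains the destination.
    2. Prefer the longest prefix (most specific match).
    3. On a tie, prefer a static route over a propagated one.
    """
    dest = ipaddress.ip_address(destination)
    candidates = [r for r in routes if dest in ipaddress.ip_network(r.prefix)]
    if not candidates:
        return None
    return max(
        candidates,
        key=lambda r: (ipaddress.ip_network(r.prefix).prefixlen, not r.propagated),
    )


# Hypothetical route table used only for this example.
EXAMPLE_ROUTES = [
    Route("0.0.0.0/0", "igw-example"),
    Route("10.0.0.0/8", "vgw-example", propagated=True),
    Route("10.1.0.0/16", "vgw-example", propagated=True),
    Route("10.1.0.0/16", "tgw-example"),  # static route with the same prefix
]

if __name__ == "__main__":
    for address in ("10.1.2.3", "10.9.9.9", "8.8.8.8"):
        chosen = select_route(EXAMPLE_ROUTES, address)
        print(address, "->", chosen.target if chosen else "no route")
```

Because the most specific prefix is evaluated first, advertising more specific or less specific prefixes over different connections is one of the mechanisms available to influence which path traffic takes.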