Centralized egress to internet - Building a Scalable and Secure Multi-VPC AWS Network Infrastructure

Centralized egress to internet

As you deploy applications in your Landing Zone, many apps will require outbound-only internet access (for example, to download libraries, patches, or OS updates). The preferred way to provide this is a network address translation (NAT) gateway; alternatively, you can use an EC2 instance configured with source NAT (SNAT) as the next hop for all egress internet access. Internal applications reside in private subnets, while the NAT gateway or EC2 NAT instances reside in a public subnet.

Using the NAT Gateway

Deploying a NAT gateway in every spoke VPC can become expensive because you pay an hourly charge for every NAT gateway you deploy (see Amazon VPC pricing), so centralizing NAT can be a viable option. To centralize, we create an egress VPC in the network services account and use Transit Gateway to route all egress traffic from the spoke VPCs through a NAT gateway in this VPC, as shown in Figure 10.

Note: When you centralize NAT Gateway using Transit Gateway, you pay an extra Transit Gateway data processing charge compared to the decentralized approach of running a NAT gateway in every VPC. In some edge cases, when a VPC sends very large amounts of data through the NAT gateway, keeping the NAT gateway local to that VPC to avoid the Transit Gateway data processing charge might be more cost-effective.
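The trade-off in the note above can be sketched as simple arithmetic. The following is a minimal illustration, not a pricing calculator: every price is a placeholder chosen for readability (see Amazon VPC pricing and Transit Gateway pricing for real values), and the function names, the 730-hour month, and the assumption that the spoke VPCs are already attached to Transit Gateway are ours.

```python
HOURS_PER_MONTH = 730  # assumption: average number of hours in a month

# Placeholder prices in USD. These are NOT real AWS prices.
PRICES = {
    "nat_hourly": 0.05,         # per NAT gateway hour
    "nat_per_gb": 0.05,         # NAT gateway data processing, per GB
    "tgw_attach_hourly": 0.05,  # per Transit Gateway attachment hour
    "tgw_per_gb": 0.02,         # Transit Gateway data processing, per GB
}


def decentralized_cost(vpcs, nats_per_vpc, total_gb, p=PRICES):
    """Monthly cost of running NAT gateways in every spoke VPC."""
    hourly = vpcs * nats_per_vpc * p["nat_hourly"] * HOURS_PER_MONTH
    return hourly + total_gb * p["nat_per_gb"]


def centralized_cost(egress_nats, total_gb, p=PRICES):
    """Monthly cost of NAT gateways in one egress VPC behind Transit Gateway.

    Assumes the spoke VPCs are already attached to Transit Gateway, so only
    the egress VPC attachment and the extra data processing are counted.
    """
    nat = egress_nats * p["nat_hourly"] * HOURS_PER_MONTH + total_gb * p["nat_per_gb"]
    tgw = p["tgw_attach_hourly"] * HOURS_PER_MONTH + total_gb * p["tgw_per_gb"]
    return nat + tgw


if __name__ == "__main__":
    # Many VPCs, modest traffic: the hourly NAT gateway charges dominate.
    print(decentralized_cost(10, 2, 1_000), centralized_cost(2, 1_000))
    # One VPC, very large traffic: Transit Gateway data processing dominates.
    print(decentralized_cost(1, 2, 100_000), centralized_cost(2, 100_000))
```

With these placeholder prices, the centralized design is cheaper in the first scenario and the local NAT gateways are cheaper in the second, which is the edge case the note describes.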

Figure 10 – Centralized NAT Gateway using Transit Gateway (Overview)

Figure 11 – Centralized NAT Gateway using Transit Gateway (Route table design)

In this setup, spoke VPC attachments are associated with Route Table 1 (RT1) and are propagated to Route Table 2 (RT2). We have explicitly added a blackhole route to RT1 to prevent the two VPCs from communicating with each other. If you want to allow inter-VPC communication, you can remove the ‘10.0.0.0/8 -> Blackhole’ route entry from RT1; traffic between the VPCs then follows the static route to the egress VPC and flows via the NAT gateway. Alternatively, you can propagate the spoke VPC attachments to RT1 (or use a single route table and associate and propagate everything to it), which enables direct traffic flow between the VPCs through Transit Gateway.
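The effect of the blackhole route follows from longest-prefix matching: a packet to another spoke VPC matches 10.0.0.0/8 more specifically than the default route. A minimal sketch of that lookup is shown here. The spoke CIDRs (10.1.0.0/16, 10.2.0.0/16), the attachment names, and the lookup helper are assumptions for illustration; they are not taken from Figure 11.

```python
import ipaddress

# RT1: associated with the spoke VPC attachments.
RT1 = {
    "0.0.0.0/0": "egress-vpc-attachment",  # static route to the egress VPC
    "10.0.0.0/8": "blackhole",             # blocks spoke-to-spoke traffic
}

# RT2: receives the propagated spoke routes (hypothetical CIDRs and names).
RT2 = {
    "10.1.0.0/16": "spoke-vpc-a-attachment",
    "10.2.0.0/16": "spoke-vpc-b-attachment",
}


def lookup(route_table, destination):
    """Return the target of the most specific matching route, or None."""
    dest = ipaddress.ip_address(destination)
    matches = []
    for cidr, target in route_table.items():
        network = ipaddress.ip_network(cidr)
        if dest in network:
            matches.append((network.prefixlen, target))
    if not matches:
        return None
    return max(matches)[1]  # highest prefix length wins


if __name__ == "__main__":
    print(lookup(RT1, "10.2.0.5"))      # spoke-to-spoke traffic is dropped
    print(lookup(RT1, "198.51.100.7"))  # internet traffic goes to the egress VPC
    print(lookup(RT2, "10.1.0.7"))      # return traffic finds its spoke VPC
```

Removing the 10.0.0.0/8 entry from RT1 in this sketch makes the first lookup return the egress VPC attachment, which matches the behavior described above.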

We add a static route in RT1 pointing all traffic to the egress VPC. Because of this static route, Transit Gateway sends all internet traffic through its ENIs in the egress VPC. Once in the egress VPC, traffic follows the routes defined in the route table of the subnet where these Transit Gateway ENIs reside. We add a route in this subnet route table pointing all traffic to the NAT gateway. The NAT gateway subnet route table has the internet gateway (IGW) as the next hop. For return traffic to flow back, you must add a static route in the NAT gateway subnet route table pointing all traffic bound for the spoke VPCs to Transit Gateway as the next hop.
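The return route is the piece that is easiest to forget, so it can help to check for it explicitly. The sketch below is one possible check, under stated assumptions: the route targets ("igw", "tgw"), the spoke CIDRs, and the helper name are hypothetical, and the check only looks for a covering route that targets Transit Gateway. It does not model more specific routes that might override it.

```python
import ipaddress


def missing_return_routes(nat_subnet_routes, spoke_cidrs, tgw_target="tgw"):
    """List spoke CIDRs that no Transit Gateway route in the table covers."""
    missing = []
    for cidr in spoke_cidrs:
        spoke = ipaddress.ip_network(cidr)
        covered = any(
            target == tgw_target and spoke.subnet_of(ipaddress.ip_network(route))
            for route, target in nat_subnet_routes.items()
        )
        if not covered:
            missing.append(cidr)
    return missing


if __name__ == "__main__":
    spokes = ["10.1.0.0/16", "10.2.0.0/16"]  # hypothetical spoke VPC CIDRs

    # Only the default route to the internet gateway: replies cannot return.
    print(missing_return_routes({"0.0.0.0/0": "igw"}, spokes))

    # With the static route back to Transit Gateway, nothing is missing.
    print(missing_return_routes({"0.0.0.0/0": "igw", "10.0.0.0/8": "tgw"}, spokes))
```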

High availability

For high availability, you should use two NAT gateways (one in each AZ). Within an AZ, the NAT gateway has an availability SLA of 99.9%. Redundancy against component failure within an AZ is handled by AWS under the SLA. Traffic is dropped during the 0.1% of the time when the NAT gateway may be unavailable in an AZ. If one AZ fails entirely, the Transit Gateway endpoint and the NAT gateway in that AZ fail with it, and all traffic flows via the Transit Gateway and NAT gateway endpoints in the other AZ.
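To put the 0.1% figure in perspective, it can be converted into time. This is plain arithmetic on the number quoted above, assuming a 30-day month; how availability is actually measured is defined by the AWS SLA, not by this sketch, and the function name is ours.

```python
def allowed_downtime_minutes(sla_percent, days=30):
    """Minutes of unavailability permitted by an availability percentage."""
    unavailable_fraction = (100 - sla_percent) / 100
    return unavailable_fraction * days * 24 * 60


if __name__ == "__main__":
    # Roughly 43 minutes per 30-day month at 99.9%.
    print(round(allowed_downtime_minutes(99.9), 1))
```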

Security

For security, you rely on security groups on the source instances, blackhole routes in the Transit Gateway route tables, and the network ACL of the subnet in which the NAT gateway is located.

Scalability

A NAT gateway can support up to 55,000 simultaneous connections to each unique destination. From a throughput standpoint, you are limited by the performance limits of the NAT gateway. Transit Gateway is not a load balancer and will not distribute your traffic evenly across the NAT gateways in multiple AZs. Traffic across the Transit Gateway stays within an AZ, if possible. If the EC2 instance initiating traffic is in AZ 1, traffic flows out of the Transit Gateway elastic network interface in AZ 1 of the egress VPC and then to the next hop defined in the route table of the subnet where that elastic network interface resides. For a complete list of rules, see NAT Gateway rules and limits.
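Because traffic stays in its source AZ, the connection limit applies to each NAT gateway separately rather than to the pair. The sketch below illustrates this with a toy model. The AZ and NAT gateway names, the flow tuples, and both helpers are hypothetical; the destination is treated as an opaque key, and only the 55,000 figure comes from the text above.

```python
from collections import Counter

MAX_CONNECTIONS_PER_DESTINATION = 55_000  # figure quoted in the text
NAT_BY_AZ = {"az1": "nat-az1", "az2": "nat-az2"}  # hypothetical names


def connections_per_nat(flows):
    """Sum connections per (NAT gateway, destination).

    flows: iterable of (source_az, destination, connection_count). Each flow
    is pinned to the NAT gateway in its source AZ, with no load balancing.
    """
    usage = Counter()
    for source_az, destination, count in flows:
        usage[(NAT_BY_AZ[source_az], destination)] += count
    return usage


def over_limit(flows):
    """Return the (NAT gateway, destination) pairs that exceed the limit."""
    usage = connections_per_nat(flows)
    return sorted(k for k, n in usage.items() if n > MAX_CONNECTIONS_PER_DESTINATION)


if __name__ == "__main__":
    flows = [
        ("az1", "203.0.113.10:443", 40_000),
        ("az1", "203.0.113.10:443", 20_000),
        ("az2", "203.0.113.10:443", 5_000),
    ]
    # 65,000 connections in total would fit on two gateways if spread evenly,
    # but AZ affinity puts 60,000 of them on the gateway in az1.
    print(over_limit(flows))
```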

For more information, refer to the Creating a single internet exit point from multiple VPCs Using AWS Transit Gateway blog post.

Using an EC2 instance for centralized outbound

Using a software-based firewall appliance (on EC2) from AWS Marketplace as an egress point is similar to the NAT gateway setup. This option can be used if you want to leverage the layer 7 firewall/Intrusion Prevention/Detection System (IPS/IDS) capabilities of the various vendor offerings.

In Figure 12, we replace the NAT gateway with an EC2 instance (with SNAT enabled on the instance). There are a few key considerations with this option:

High availability

In this setup, you are responsible for monitoring the EC2 instance, detecting failure, and replacing the EC2 instance with a backup/standby instance. Most vendors have pre-built automation for their software deployed in this setup. That automation can do the following:

  • Detect failure of the primary EC2-1 instance 

  • Change the route table ‘Route Table Egx 1’ to point all traffic to the backup EC2-2 instance when the primary instance fails. This must also be done for the subnets in AZ 2.
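The two automation steps above amount to rewriting every route that targets the failed instance. A minimal in-memory sketch of that logic follows. It does not call any AWS API; the table name "Route Table Egx 2" for AZ 2, the dictionary layout, and the fail_over helper are assumptions for illustration. A real implementation would apply the same change through the EC2 API (for example, the ReplaceRoute action) after a health check fails.

```python
def fail_over(route_tables, failed_target, standby_target):
    """Repoint every route that uses failed_target to standby_target.

    route_tables: {table_name: {destination_cidr: target}}.
    Returns the (table_name, destination_cidr) pairs that were changed.
    """
    changed = []
    for name, routes in route_tables.items():
        for cidr, target in routes.items():
            if target == failed_target:
                routes[cidr] = standby_target  # value update only, keys unchanged
                changed.append((name, cidr))
    return changed


if __name__ == "__main__":
    tables = {
        "Route Table Egx 1": {"0.0.0.0/0": "EC2-1"},
        "Route Table Egx 2": {"0.0.0.0/0": "EC2-1"},  # hypothetical AZ 2 table
    }
    print(fail_over(tables, "EC2-1", "EC2-2"))
    print(tables)
```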

Figure 12 – Centralized NAT using EC2 instances and Transit Gateway

Scalability

Transit Gateway is not a load balancer and will not distribute your traffic evenly across instances in the two AZs. The traffic across the Transit Gateway will stay within an AZ, if possible. You are limited by the bandwidth capabilities of a single EC2 instance. You can vertically scale this EC2 instance as usage increases.

If the vendor you choose for egress traffic inspection doesn’t support automation for failure detection, or if you need horizontal scaling, you can use an alternative design. In this design (Figure 13), we don’t create a VPC attachment on the transit gateway for the egress VPC. Instead, we create an IPsec VPN attachment and establish an IPsec VPN from Transit Gateway to the EC2 instances, using BGP to exchange routes.

Advantages

  • Failure detection and re-routing of traffic are handled by BGP. No VPC subnet route table automation is required.

  • BGP ECMP can be used to load balance traffic across multiple EC2 instances, so horizontal scaling is possible.
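Both advantages can be illustrated with a generic ECMP model: each flow is hashed onto one of the next hops currently advertised over BGP, and a withdrawn next hop simply drops out of the candidate set. The sketch below is illustrative only. The hash function, the flow tuple layout, and the instance names are assumptions, and the source does not specify how Transit Gateway actually hashes flows.

```python
import hashlib


def pick_next_hop(flow, next_hops):
    """Deterministically map a flow to one of the advertised next hops.

    flow: (src_ip, src_port, dst_ip, dst_port, protocol).
    next_hops: instances whose routes are currently advertised.
    """
    if not next_hops:
        raise ValueError("no next hops advertised")
    digest = hashlib.sha256(repr(flow).encode()).digest()
    index = int.from_bytes(digest[:8], "big") % len(next_hops)
    return sorted(next_hops)[index]


if __name__ == "__main__":
    flow = ("10.1.0.5", 40001, "198.51.100.7", 443, "tcp")

    # Packets of the same flow always take the same path.
    print(pick_next_hop(flow, ["EC2-1", "EC2-2"]))

    # If EC2-1 fails and its routes are withdrawn, the flow moves to EC2-2
    # without any change to VPC subnet route tables.
    print(pick_next_hop(flow, ["EC2-2"]))
```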

Figure 13 – Centralized NAT using EC2 instances and Transit Gateway VPN

Key Considerations

  • VPN management overhead on EC2 instances

  • Bandwidth at Transit Gateway is limited to 1.25 Gbps per VPN tunnel. With ECMP, Transit Gateway can support up to 50 Gbps total VPN bandwidth. The VPN and packet processing capabilities of the vendor appliance can be a limiting factor.

  • This design assumes that the firewall EC2 instance uses the same elastic network interface for inbound and outbound traffic.

  • If you enable ECMP load balancing of traffic across multiple EC2 instances, you must SNAT traffic on the EC2 instance in order to guarantee return flow symmetry, which means the destination won't know the true source.
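The bandwidth consideration above can be turned into a quick sizing check. The sketch below uses only the two limits quoted in the list (1.25 Gbps per tunnel, 50 Gbps total with ECMP). The function names and the appliance_gbps parameter, which stands for whatever aggregate throughput the vendor appliances can actually process, are assumptions for illustration.

```python
import math

TUNNEL_GBPS = 1.25    # per-tunnel limit quoted in the text
MAX_ECMP_GBPS = 50.0  # total VPN bandwidth with ECMP quoted in the text


def tunnels_needed(target_gbps):
    """Minimum number of VPN tunnels required to reach target_gbps."""
    if target_gbps <= 0:
        raise ValueError("target must be positive")
    if target_gbps > MAX_ECMP_GBPS:
        raise ValueError("target exceeds the total ECMP VPN bandwidth")
    return math.ceil(target_gbps / TUNNEL_GBPS)


def effective_gbps(tunnels, appliance_gbps):
    """Usable bandwidth: the smallest of tunnel, ECMP, and appliance limits."""
    return min(tunnels * TUNNEL_GBPS, MAX_ECMP_GBPS, appliance_gbps)


if __name__ == "__main__":
    print(tunnels_needed(10))      # tunnels for a 10 Gbps target
    print(tunnels_needed(50))      # tunnels to reach the ECMP maximum
    print(effective_gbps(8, 6.0))  # appliances, not tunnels, are the bottleneck
```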