Multi-Region fundamental 3: Understanding your workload dependencies - AWS Multi-Region Fundamentals

A specific workload might have several dependencies in a Region, such as AWS services used, internal dependencies, third-party dependencies, network dependencies, certificates, keys, secrets, and parameters. To ensure operation of the workload during a failure scenario, there should be no dependencies between the primary Region and the standby Region; each should be able to operate independently of one another. To achieve this, all dependencies in the workload must be scrutinized to ensure they are available within each Region. This is required because a failure in the primary Region should not have an impact in the standby Region. In addition, knowledge of how the workload operates when a dependency is in a degraded state or completely unavailable is imperative, so that solutions can be engineered to handle this appropriately.

3a: AWS services 

When designing a multi-Region architecture, you need to understand the specific AWS services that will be used. The first aspect is understanding which features each service provides to support multi-Region operation, and whether a solution must be engineered to accomplish the multi-Region goals. For example, Amazon Aurora and Amazon DynamoDB each have a feature to asynchronously replicate data to a standby Region. Any AWS service dependencies need to be available in all Regions that a workload is going to run from. To confirm that the services you plan to use are available in the desired Regions, review the AWS Regional Services List.
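As a minimal sketch of that availability check, the following Python compares the services a workload requires against what each target Region offers. The Region names and service inventory are hypothetical placeholders, not real availability data; in practice the inventory would be taken from the AWS Regional Services List.

```python
def missing_services(required, available_by_region):
    """Return {region: sorted list of required services not offered there}."""
    gaps = {}
    for region, available in available_by_region.items():
        missing = sorted(set(required) - set(available))
        if missing:
            gaps[region] = missing
    return gaps


# Hypothetical inventory (assumption): replace with data from the
# AWS Regional Services List for the Regions you plan to use.
required = {"dynamodb", "aurora", "secretsmanager"}
available_by_region = {
    "region-a": {"dynamodb", "aurora", "secretsmanager", "lambda"},
    "region-b": {"dynamodb", "lambda"},
}

gaps = missing_services(required, available_by_region)
print(gaps)
```

An empty result would indicate that every required service is offered in every candidate Region; any entry points to a gap that needs either a different Region or an engineered alternative.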

3b: Internal and third-party dependencies

For any internal dependencies that a workload has, ensure they are available from the Regions the workload will operate out of. For example, if the workload is composed of many microservices, identify all of the microservices that comprise a business capability. From there, ensure that all of those microservices are deployed in each Region the workload will operate out of.

Cross-Region calls between microservices within a workload are not advised, and Regional isolation should be maintained. Creating cross-Region dependencies adds risk of correlated failure, which negates the benefits you are trying to achieve with isolated Regional implementations of the workload. On-premises dependencies might be part of the workload as well, so it is imperative to understand how the characteristics of these integrations could change if the primary Region were to change. For example, if the standby Region is located farther from the on-premises environment, the increased latency might negatively affect those integrations.
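One way to audit for Regional isolation is to record which Region each caller and callee runs in and flag any call that crosses a Region boundary. The sketch below assumes such a call inventory already exists; the `Call` type, service names, and Region names are all illustrative.

```python
from typing import NamedTuple


class Call(NamedTuple):
    caller: str
    caller_region: str
    callee: str
    callee_region: str


def cross_region_calls(calls):
    """Return the calls whose caller and callee run in different Regions."""
    return [c for c in calls if c.caller_region != c.callee_region]


# Hypothetical call graph for one business capability (assumption).
calls = [
    Call("orders", "region-a", "inventory", "region-a"),
    Call("orders", "region-a", "pricing", "region-b"),  # crosses Regions
    Call("orders", "region-b", "inventory", "region-b"),
]

violations = cross_region_calls(calls)
for c in violations:
    print(f"{c.caller} ({c.caller_region}) -> {c.callee} ({c.callee_region})")
```

A check like this could run against service configuration in a deployment pipeline, so that a cross-Region endpoint is caught before it reaches production.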

Understanding Software as a Service (SaaS) solutions, software development kits (SDKs), and other third-party product dependencies, and being able to exercise scenarios where these dependencies are either degraded or unavailable, will provide more insight into how the chain of systems operates and behaves under different failure modes. These dependencies can range from how application code manages secrets externally, using AWS Secrets Manager or a third-party vault solution (such as HashiCorp), to authentication systems that depend on IAM Identity Center for federated logins.

Building redundancy into dependencies can increase resilience. It is also possible that a SaaS solution or third-party dependency is using the same primary AWS Region as the workload. If this is the case, work with the vendor to determine whether their resilience posture matches the requirements for the workload.

Additionally, be aware of shared fate between the workload and its dependencies, such as third-party applications. If the dependencies are not available in (or from) a secondary Region after a failover, the workload might not recover fully.
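A shared-fate review can be approximated by listing where each dependency is served from and flagging any that exist only in the primary Region. The following is a sketch under the assumption that this hosting information has been collected from each vendor; the dependency names and Regions are made up.

```python
def shared_fate_with_primary(hosting_regions, primary_region):
    """Return dependencies that are served only from the primary Region.

    A dependency with an empty (unknown) set of Regions is also flagged,
    since its availability after a failover cannot be confirmed.
    """
    return sorted(
        name
        for name, regions in hosting_regions.items()
        if set(regions) <= {primary_region}
    )


# Hypothetical vendor data (assumption), gathered by asking each vendor
# where their service runs.
hosting_regions = {
    "payments-saas": {"region-a"},
    "vault": {"region-a", "region-b"},
    "email-saas": set(),  # hosting location not yet confirmed
}

at_risk = shared_fate_with_primary(hosting_regions, primary_region="region-a")
print(at_risk)
```

Each flagged dependency is a candidate for a conversation with the vendor or for adding a redundant alternative.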

3c: Failover mechanism

The Domain Name System (DNS) is commonly used as a failover mechanism to shift traffic away from the primary Region to a standby Region. Critically review and scrutinize all dependencies that the failover mechanism takes. For example, if your workload uses Amazon Route 53, be aware that the Route 53 control plane is hosted in us-east-1, which means you are taking a dependency on the control plane in that specific Region. Relying on it as part of a failover mechanism is not recommended if the primary Region is also us-east-1. If another failover mechanism is being used, you need a deep understanding of any scenario in which it wouldn’t operate as expected. Once this understanding is established, plan for contingencies or develop a new mechanism if required. Review Creating Disaster Recovery Mechanisms Using Amazon Route 53 to learn about approaches you can use to fail over successfully.

As discussed in the internal dependency section, all microservices that are part of a business capability need to be available in each Region in which the workload is deployed. As part of the failover strategy, the microservices in a business capability need to fail over together to remove the chance of cross-Region calls. If microservices instead fail over independently, some of them might end up making cross-Region calls, which introduces latency and could make the workload unavailable if clients time out.
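The difference between coordinated and independent failover can be illustrated with a small model. This is a sketch only: the `Capability` class, its method names, and the service names are hypothetical, and a real implementation would drive routing or deployment tooling rather than update a dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class Capability:
    """A business capability and the Region each microservice is active in."""
    name: str
    active_region: dict = field(default_factory=dict)  # microservice -> Region

    def fail_over(self, target_region):
        """Move every microservice together so none is left behind."""
        self.active_region = {svc: target_region for svc in self.active_region}

    def is_regionally_isolated(self):
        """True when all microservices are active in the same Region."""
        return len(set(self.active_region.values())) <= 1


checkout = Capability("checkout", {"cart": "region-a", "payment": "region-a"})

# Independent failover of one microservice leaves the capability split
# across Regions, so calls between cart and payment would cross Regions.
checkout.active_region["cart"] = "region-b"
split = checkout.is_regionally_isolated()

# Coordinated failover restores isolation.
checkout.fail_over("region-b")
print(split, checkout.is_regionally_isolated())
```

Treating the capability, rather than the individual microservice, as the unit of failover is what keeps the isolation check true after the switch.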

3d: Configuration dependencies

Certificates, keys, secrets, and parameters are part of the dependency analysis needed when designing for multi-Region. Whenever possible, localize these components within each Region so that they do not share fate between Regions. For certificates, expiration dates should vary between certificates and, if possible, between Regions, to avoid a scenario in which an expiring certificate (even with alarms set to notify in advance) impacts multiple Regions.
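To check that certificate expirations are staggered, one option is to compare expiration dates across Regions and report any pair that falls within a chosen window. The sketch below assumes the expiration dates have already been collected; the 14-day window, Region names, and dates are arbitrary illustrative values.

```python
from datetime import date, timedelta
from itertools import combinations


def clustered_expirations(expiry_by_region, min_gap=timedelta(days=14)):
    """Return pairs of Regions whose certificates expire within min_gap of each other."""
    return [
        (a, b)
        for (a, exp_a), (b, exp_b) in combinations(sorted(expiry_by_region.items()), 2)
        if abs(exp_a - exp_b) < min_gap
    ]


# Hypothetical expiration dates per Region (assumption).
expiry_by_region = {
    "region-a": date(2025, 6, 1),
    "region-b": date(2025, 6, 5),   # only 4 days after region-a
    "region-c": date(2025, 9, 1),
}

too_close = clustered_expirations(expiry_by_region)
print(too_close)
```

Any pair reported here is a case where a single missed renewal cycle could affect more than one Region at roughly the same time.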

Encryption keys and secrets should be Region-specific as well. That way, if there is an error in rotation of a key or secret, the impact is limited to a specific Region.

Lastly, any workload parameters should be stored locally for the workload to retrieve in the specific Region.
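The idea of Region-local configuration can be sketched as a lookup that only ever reads from the store in its own Region and fails loudly instead of falling back to another Region. The `RegionalConfig` class, store layout, and parameter names below are hypothetical; a real workload would read from a Regional parameter or secrets service rather than in-memory dictionaries.

```python
class RegionalConfig:
    """Resolve parameters from the local Region's store only."""

    def __init__(self, region, stores):
        self.region = region
        # Keep a reference to the local store only, so a cross-Region
        # read cannot happen by accident.
        self._local = stores[region]

    def get(self, name):
        try:
            return self._local[name]
        except KeyError:
            raise KeyError(
                f"{name!r} is not defined in {self.region}; no cross-Region fallback"
            ) from None


# Hypothetical per-Region parameter stores (assumption).
stores = {
    "region-a": {"db_endpoint": "db.region-a.example.internal", "feature_flag": "on"},
    "region-b": {"db_endpoint": "db.region-b.example.internal"},
}

config = RegionalConfig("region-b", stores)
print(config.get("db_endpoint"))
```

Failing on a missing parameter, rather than silently reading it from the other Region, surfaces configuration drift during testing instead of during a failover.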

Key guidance

  • A multi-Region architecture benefits from physical and logical separation between Regions. Introducing cross-Region dependencies at the application layer breaks this benefit. Avoid such dependencies.

  • Failover controls should work with no dependencies on the primary Region.

  • Failover needs to be coordinated at the business capability level to remove the possibility of increased latency and dependency on cross-Region calls.