Architecture overview - Disaster Recovery for AWS IoT

Architecture overview

Deploying this solution with the default parameters builds the following environments in the AWS Cloud:

  • A setup to replicate device settings and shadows from the primary Region to the secondary Region.

  • Amazon Route 53 health checks that can be used for failover.

In addition to these deployments, the solution provides a Region-to-Region sync that replicates existing devices.

This solution operates across two AWS Regions, referred to as the primary and secondary Regions. Devices are created in the primary Region. Device settings and shadows in the primary Region are replicated to the secondary Region.

Note

Replication is one-way: devices that you create in the secondary Region are not replicated to the primary Region.

If the primary Region is not available, devices fail over to the secondary Region.

Figure 1: Disaster Recovery for AWS IoT solution architecture

In each Region, AWS CloudFormation templates deploy the AWS IoT rules, AWS Lambda functions, Amazon DynamoDB table, and AWS Identity and Access Management (IAM) roles needed to replicate device settings and sync shadows. In the secondary Region, the templates additionally deploy an AWS Step Functions workflow.

Replication flow

  1. After the AWS CloudFormation templates have created a DynamoDB table in each Region, the two tables are configured as one global table. You must turn on registry events in the primary Region.

  2. The registry publishes event messages when IoT things, thing types, and thing groups are created, updated, or deleted. A topic rule forwards these messages to the DynamoDB table in the primary Region, from which they are automatically replicated to the table in the secondary Region.

  3. DynamoDB Streams captures the data on arrival in the secondary Region and invokes a Lambda function (Dynamo trigger).

  4. The Dynamo trigger Lambda function initiates a Step Functions workflow to forward the related event types to another Lambda function.

  5. The related Lambda function creates, updates, or deletes the corresponding IoT things, thing groups, and thing types.

  6. The Step Functions workflow creates or updates IoT things in the secondary Region. The Step Functions setup also includes retry rules to handle errors.
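
Steps 3 and 4 above can be sketched roughly as follows. This is a minimal illustration, not the solution's actual Lambda code: the `HANDLERS` mapping, the function names, and the shape of the workflow input are assumptions, and the sketch assumes the stream is configured to include new item images. The `eventType` values are the ones AWS IoT uses in registry event messages. The Step Functions client is passed in as a parameter; in a deployed function it would typically be `boto3.client("stepfunctions")`.

```python
import json

# Hypothetical mapping from registry event type to the name of the handler
# that would apply the change in the secondary Region (illustrative only).
HANDLERS = {
    "THING_EVENT": "thing_crud",
    "THING_TYPE_EVENT": "thing_type_crud",
    "THING_GROUP_EVENT": "thing_group_crud",
}


def unwrap(attr):
    """Convert one DynamoDB-typed attribute value into a plain Python value."""
    (type_tag, value), = attr.items()
    if type_tag == "S":
        return value
    if type_tag == "N":
        try:
            return int(value)
        except ValueError:
            return float(value)
    if type_tag == "BOOL":
        return value
    if type_tag == "NULL":
        return None
    if type_tag == "M":
        return {k: unwrap(v) for k, v in value.items()}
    if type_tag == "L":
        return [unwrap(v) for v in value]
    raise ValueError(f"unsupported attribute type: {type_tag}")


def workflow_inputs(stream_event):
    """Yield one workflow input per inserted or modified registry event."""
    for record in stream_event.get("Records", []):
        if record.get("eventName") not in ("INSERT", "MODIFY"):
            continue  # REMOVE records carry no NewImage
        image = record["dynamodb"]["NewImage"]
        item = {k: unwrap(v) for k, v in image.items()}
        handler = HANDLERS.get(item.get("eventType"))
        if handler is not None:
            yield {"handler": handler, "event": item}


def forward(stream_event, sfn_client, state_machine_arn):
    """Start one Step Functions execution per relevant stream record."""
    started = 0
    for payload in workflow_inputs(stream_event):
        sfn_client.start_execution(
            stateMachineArn=state_machine_arn,
            input=json.dumps(payload),
        )
        started += 1
    return started
```

Passing the client in keeps the routing logic testable without AWS credentials. Records whose `eventType` is not in the mapping are skipped rather than treated as errors, which is a design choice of this sketch.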

By default, X.509 certificates for device authentication are provided by AWS IoT Core. If you choose to use your own Certificate Authority (CA), AWS Certificate Manager Private Certificate Authority (ACM PCA) can optionally issue device certificates. ACM PCA is not deployed automatically by the CloudFormation templates. Jupyter notebooks are provided with the solution to deploy ACM PCA in a Region where the service is available and to register the root CA in the primary and secondary Regions. There is no need to deploy ACM PCA in either the primary or secondary Region.

The device shadow sync is also deployed with the AWS CloudFormation templates for the primary and secondary Regions. The shadow sync architecture is based on the architecture for device replication. It uses the same global DynamoDB table and the same Step Functions workflow. Shadow messages that are accepted by AWS IoT Core are published on the shadow update/accepted topic. An IoT rule stores these messages in the global DynamoDB table. From there, the messages are forwarded to the ChoiceEventType Step Functions workflow and routed to a Lambda function, which updates the device shadow in the secondary Region.
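
The final step of the shadow sync could look something like the sketch below. It is an assumption-laden illustration rather than the solution's code: the function names are hypothetical, and forwarding only the `state` document (dropping `metadata` and `version`) is a choice made here so that the secondary Region assigns its own shadow version. The topic filter is the standard AWS IoT topic for accepted classic shadow updates. The data-plane client is injected; in a deployed function it would typically be `boto3.client("iot-data")` created for the secondary Region.

```python
import json

# Standard AWS IoT topic on which accepted classic shadow updates are
# published; "+" matches any thing name.
SHADOW_ACCEPTED_FILTER = "$aws/things/+/shadow/update/accepted"


def thing_name_from_topic(topic):
    """Extract the thing name from a classic shadow update/accepted topic."""
    parts = topic.split("/")
    if (
        len(parts) != 6
        or parts[:2] != ["$aws", "things"]
        or parts[3:] != ["shadow", "update", "accepted"]
    ):
        raise ValueError(f"not a shadow update/accepted topic: {topic}")
    return parts[2]


def replicate_shadow(topic, accepted_message, iot_data_client):
    """Apply an accepted shadow update to the same thing in another Region."""
    thing_name = thing_name_from_topic(topic)
    # Forward only the state document (assumption of this sketch).
    payload = json.dumps({"state": accepted_message["state"]}).encode("utf-8")
    iot_data_client.update_thing_shadow(thingName=thing_name, payload=payload)
    return thing_name
```

For example, an accepted update on `$aws/things/sensor-1/shadow/update/accepted` would result in one `update_thing_shadow` call for `sensor-1` carrying the same desired and reported state.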

Failover flow

A separate set of AWS CloudFormation templates creates health checks that can be used by Amazon Route 53 in the primary and secondary Regions.

  1. Amazon Route 53 with health checks and traffic policies can be used for a Region failover. For more information about failover options, refer to Solution components. Amazon Route 53 currently supports only HTTP(S) and TCP health checks, so it cannot probe MQTT directly. This solution therefore checks the health of the Message Queuing Telemetry Transport (MQTT) message broker in AWS IoT Core indirectly.

  2. CloudFormation templates deploy an Amazon API Gateway resource, which calls a Lambda function. This Lambda function is configured as a device in AWS IoT Core. When invoked, the Lambda function connects to AWS IoT Core, subscribes to a topic, and publishes a configured number of messages. The Lambda function expects to receive the same number of messages on the topic it has subscribed to.

  3. Amazon Route 53 health checks call the API Gateway resource and thereby test the MQTT message broker implicitly. As a layer of security, the Lambda function checks a query string before it connects to the message broker. If the query string does not match the expected value, the Lambda function issues an error message. The expected query string is configurable.
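
The health check logic in steps 2 and 3 might be structured along these lines. This is a sketch under stated assumptions, not the solution's Lambda function: the query parameter name `token`, the status codes 403 and 503, and the `probe` callable are all hypothetical. Here `probe(n)` stands in for the MQTT round trip (connect, subscribe, publish `n` messages) and returns how many messages came back. The event shape is that of an API Gateway Lambda proxy integration.

```python
import hmac
import json


def health_check(event, probe, expected_token, message_count=2):
    """Return an API Gateway proxy response describing broker health.

    probe is a hypothetical callable that publishes message_count messages
    over MQTT and returns the number received on the subscribed topic.
    """
    # queryStringParameters is None when the request has no query string.
    params = event.get("queryStringParameters") or {}
    supplied = params.get("token") or ""  # "token" is an assumed name

    # Constant-time comparison of the supplied and expected values.
    if not hmac.compare_digest(supplied.encode(), expected_token.encode()):
        return {
            "statusCode": 403,
            "body": json.dumps({"error": "query string mismatch"}),
        }

    received = probe(message_count)
    healthy = received == message_count
    return {
        # Route 53 treats 2xx and 3xx responses as healthy.
        "statusCode": 200 if healthy else 503,
        "body": json.dumps({"published": message_count, "received": received}),
    }
```

Rejecting the request before calling `probe` means an unauthenticated caller cannot trigger a broker connection, which matches the order described in step 3.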