Solution components - Disaster Recovery for AWS IoT

Solution components

Provisioning options

This solution covers aspects of the device layer and replicates device settings from a primary to a secondary Region. AWS IoT offers multiple options to provision devices. IoT policies can be attached to device certificates or to IoT thing groups. IoT thing groups can be nested. The solution supports the following provisioning options.

  • Replicating devices which have a certificate attached. An IoT policy is attached to the device certificate.

  • Replicating device deletions from the primary to the secondary Region. If a device shadow exists, it is also deleted.

  • Syncing the device unnamed shadow from the primary to the secondary Region.

  • Region-to-Region sync for existing setups. If you already have devices before you install the solution, you can use the sync to replicate devices from the primary to the secondary Region.

  • Route 53 health checks in the primary and secondary Regions. They check the health of the MQTT message broker.

  • Based on Route 53 health checks, you can create traffic policies for failover for your solution. For more information, refer to Amazon Route 53.

Amazon DynamoDB expiring items

Items stored in DynamoDB are copied from one Region to another. After they have arrived and been processed in the secondary Region, they are no longer needed. To reduce the number of items stored in DynamoDB, they are expired by the DynamoDB Time to Live (TTL) feature. The expires attribute defines when items can be removed. The expires attribute for the global DynamoDB table is defined in the IoT rules by (timestamp()/1000)+172800 as expires. The number 172800 is the item lifetime in seconds, so an item expires 48 hours after it is written. You can change this setting in the CloudFormation templates.
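
The arithmetic behind the expires attribute can be sketched in Python. This is only an illustration of the expression above, not code from the solution; the function name expires_at is hypothetical, and integer division is an assumption made here so that the result is a whole number of epoch seconds.

```python
import time

# 48 hours expressed in seconds: 48 * 60 * 60 = 172800
TTL_SECONDS = 48 * 60 * 60


def expires_at(now_ms: int, ttl_seconds: int = TTL_SECONDS) -> int:
    """Mirror the rule expression (timestamp()/1000)+172800.

    now_ms is a millisecond epoch timestamp, like the value returned by
    the AWS IoT SQL timestamp() function. The result is in epoch seconds,
    which is the unit DynamoDB TTL expects.
    """
    return now_ms // 1000 + ttl_seconds


if __name__ == "__main__":
    now_ms = int(time.time() * 1000)
    print("item may be removed after epoch second", expires_at(now_ms))
```

To keep items for a different period, only the number of seconds changes, for example 24 * 60 * 60 for one day.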

Region syncs

If you already have devices provisioned in your account, you have to replicate them manually after launching this solution. After the solution has been implemented, only newly created devices are synced to the secondary Region. To sync all existing devices to the secondary Region, Region syncs are provided.

These Region syncs and the Lambda function code have been copied to an S3 bucket in the primary Region. You can find the S3 location in the outputs section of your main CloudFormation stack under LambdaS3Url. The Region syncs are located in the subfolder iot-dr-region-syncer.
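
As a minimal sketch, a stack output such as LambdaS3Url can also be read programmatically. The helper below only parses a response shaped like the one returned by the CloudFormation describe_stacks call; the function name and the stack name in the usage comment are hypothetical and not part of the solution.

```python
from typing import Optional


def find_stack_output(describe_stacks_response: dict, key: str) -> Optional[str]:
    """Return the OutputValue for the given OutputKey, or None if absent."""
    for stack in describe_stacks_response.get("Stacks", []):
        for output in stack.get("Outputs", []):
            if output.get("OutputKey") == key:
                return output.get("OutputValue")
    return None


# Possible usage with boto3 (the stack name is a placeholder):
#   import boto3
#   cfn = boto3.client("cloudformation", region_name="eu-central-1")
#   response = cfn.describe_stacks(StackName="my-iot-dr-stack")
#   print(find_stack_output(response, "LambdaS3Url"))
```

The same helper could be used to look up other outputs of the stack by passing a different key.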

There are two types of Region syncs:

  • region-to-region

  • region-to-ddb

The Region syncs get their settings from environment variables.

Environment variables for the region-to-region sync:

  • PRIMARY_REGION: primary Region

  • SECONDARY_REGION: secondary Region

  • SYNC_MODE: defaults to ‘smart’. In smart mode, a device is not synced if its IoT thing name already exists in the secondary Region

  • QUERY_STRING: query string to look up devices, defaults to “thingName:*”

  • MAX_WORKERS: number of parallel threads to run, defaults to 10

Environment variables for the region-to-ddb sync:

  • PRIMARY_REGION: primary Region

  • SECONDARY_REGION: secondary Region

  • SYNC_MODE: defaults to ‘smart’. In smart mode, a device is not synced if its IoT thing name already exists in the secondary Region

  • QUERY_STRING: query string to look up devices, defaults to “thingName:*”

  • DYNAMODB_GLOBAL_TABLE: name of the global DynamoDB table in which to store device information from the primary Region. You can find the name of the global DynamoDB table in the Outputs section of the CloudFormation stack in the primary Region under GlobalDynamoDBTableName
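
The following sketch shows one way a script could read the environment variables listed above and apply the documented defaults. The SyncConfig class and load_config function are hypothetical names used for illustration; the actual syncs may read their settings differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class SyncConfig:
    primary_region: str
    secondary_region: str
    sync_mode: str
    query_string: str
    max_workers: int                      # used by region-to-region
    dynamodb_global_table: Optional[str]  # used by region-to-ddb


def load_config(env: Mapping[str, str] = os.environ) -> SyncConfig:
    """Build the configuration from environment variables.

    PRIMARY_REGION and SECONDARY_REGION have no documented default,
    so a missing value raises KeyError here.
    """
    return SyncConfig(
        primary_region=env["PRIMARY_REGION"],
        secondary_region=env["SECONDARY_REGION"],
        sync_mode=env.get("SYNC_MODE", "smart"),
        query_string=env.get("QUERY_STRING", "thingName:*"),
        max_workers=int(env.get("MAX_WORKERS", "10")),
        dynamodb_global_table=env.get("DYNAMODB_GLOBAL_TABLE"),
    )
```

Passing a plain dictionary instead of os.environ makes the function easy to try out without changing the shell environment.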

Both syncs get the devices from the primary Region from the registry index if indexing is turned on, otherwise by list-things. We recommend using registry indexing. If registry indexing is not turned on, the query string will be ignored.
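
The lookup logic described above could be sketched as follows. This is not the code shipped with the solution; the function names are hypothetical. The iot argument is assumed to be a boto3 AWS IoT client (for example boto3.client("iot", region_name=...)), and only the get_indexing_configuration, search_index, and list_things calls are used.

```python
from typing import Iterator


def indexing_enabled(iot) -> bool:
    """Return True if registry (fleet) indexing is turned on."""
    config = iot.get_indexing_configuration()
    mode = config.get("thingIndexingConfiguration", {}).get("thingIndexingMode", "OFF")
    return mode != "OFF"


def iter_thing_names(iot, query_string: str = "thingName:*") -> Iterator[str]:
    """Yield thing names from the registry index, or from list_things as a fallback.

    The query string is only applied when indexing is on; list_things
    returns every thing and ignores it.
    """
    use_index = indexing_enabled(iot)
    next_token = None
    while True:
        kwargs = {"nextToken": next_token} if next_token else {}
        if use_index:
            page = iot.search_index(queryString=query_string, **kwargs)
        else:
            page = iot.list_things(**kwargs)
        for thing in page.get("things", []):
            yield thing["thingName"]
        next_token = page.get("nextToken")
        if not next_token:
            break
```

Because the client is passed in, the same function can be pointed at the primary or the secondary Region.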

The syncs are implemented in Python and were tested with Python 3.7. They require the boto3, dynamodb-json, and device_replication.py libraries to be available on the system. boto3 and dynamodb-json can be installed with pip. The device_replication.py library can be found in the S3 bucket in the primary Region in the subfolder iot-dr-layer.

Run Region syncs

Region syncs can be run standalone, for example, in an AWS Cloud9 environment or on an EC2 instance. They can also be run as Lambda functions; however, Lambda functions are limited to a runtime of 15 minutes. Syncs can also be run in a Docker container. To run the syncs in a serverless container environment, use AWS Fargate. If you use Fargate, you must provide an IAM role for your task definition. For an example IAM role, refer to IAM Roles.

region-to-region

The region-to-region sync iot-region-to-region-syncer.py gets devices from the primary Region and provisions these devices in the secondary Region.
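
How MAX_WORKERS and SYNC_MODE might interact can be illustrated with a small thread pool sketch. It is not the implementation of iot-region-to-region-syncer.py. The sync_things function and its two callables are hypothetical: already_synced stands in for whatever check smart mode performs, and replicate stands in for the provisioning of one device in the secondary Region.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def sync_things(
    thing_names: Iterable[str],
    already_synced: Callable[[str], bool],
    replicate: Callable[[str], None],
    sync_mode: str = "smart",
    max_workers: int = 10,
) -> Dict[str, int]:
    """Replicate things in parallel and return counts of synced and skipped things."""

    def sync_one(name: str) -> str:
        # In smart mode, things that pass the check are left untouched.
        if sync_mode == "smart" and already_synced(name):
            return "skipped"
        replicate(name)
        return "synced"

    # max_workers bounds the number of threads that run at the same time.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(sync_one, thing_names))

    return {"synced": results.count("synced"), "skipped": results.count("skipped")}
```

Threads suit this workload because each device sync mostly waits on API calls, so a higher MAX_WORKERS value can shorten the run until API rate limits are reached.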

To build a container, you can use the script build-docker-image-r2r.sh. This script builds a Docker image and uploads the image to Amazon Elastic Container Registry (Amazon ECR). Prior to uploading the image to Amazon ECR, you must create a repository and modify build-docker-image-r2r.sh to reflect your settings. You must also change the AWS_REGION and AWS_ACCOUNT to reflect your settings.

region-to-ddb

The region-to-ddb sync gets the devices from the primary Region, converts them into the format of registry event messages, and then stores them in the global DynamoDB table. This data is used to sync devices from the primary to the secondary Region. Devices in the secondary Region are created by the Step Functions workflow.
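
The conversion step can be sketched as follows. The field names follow the general shape of AWS IoT registry THING_EVENT messages, and the input is assumed to look like a describe_thing response. The exact item layout that the solution writes to the global table is not documented here, so treat the function name, the expires handling, and the usage comment as assumptions.

```python
import uuid


def thing_to_registry_event(thing: dict, account_id: str, now_ms: int) -> dict:
    """Build a registry-event-like item for a thing that already exists.

    thing: a dict shaped like an AWS IoT describe_thing response.
    now_ms: millisecond epoch timestamp of the conversion.
    """
    return {
        "eventType": "THING_EVENT",
        "eventId": str(uuid.uuid4()),
        "timestamp": now_ms,
        "operation": "CREATED",
        "accountId": account_id,
        "thingId": thing.get("thingId"),
        "thingName": thing["thingName"],
        "versionNumber": thing.get("version", 1),
        "thingTypeName": thing.get("thingTypeName"),
        "attributes": thing.get("attributes", {}),
        # Assumption: same 48-hour TTL as the IoT rules use for expires.
        "expires": now_ms // 1000 + 172800,
    }


# Possible usage with boto3 (the table name comes from DYNAMODB_GLOBAL_TABLE):
#   import boto3
#   table = boto3.resource("dynamodb", region_name="eu-central-1").Table(table_name)
#   table.put_item(Item=thing_to_registry_event(thing, account_id, now_ms))
```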

The sync has been tested standalone on Cloud9, but should also be able to run in a Docker container or as a Lambda function. To build a container, use the script build-docker-image-r2r.sh and modify it to reflect your settings.

Example test results from sync runs

Region-to-region:

  • Created 2,400 devices on Fargate. Fargate was provisioned in Region eu-west-2, with primary Region eu-central-1 and secondary Region eu-north-1. MAX_WORKERS was set to 20. Replicating the devices took approximately 10 minutes.

Region-to-DDB:

  • Scanned 2,400 devices and stored 1,000 in DynamoDB. This took approximately two minutes.

  • All 1,000 devices were found in the secondary Region. This took approximately three minutes.

Amazon Route 53

Route 53 health checks and DNS failover can be used to fail over from one Region to another. You can create a CNAME record based on a traffic policy. Based on the health checks, the CNAME resolves to the IoT endpoint of a healthy Region. Your devices then connect to the CNAME rather than to a Region-specific endpoint.

Route 53 health checks are created by CloudFormation templates in the primary and secondary Regions. You can find the stacks in each Region; they are named R53HealthCheckerYYYYMMDDhhmmss.

Create a traffic policy

To create a traffic policy with a CNAME, you must have your own domain. Use the following instructions to create a traffic policy for the iot-dr-us.example.com CNAME.

  1. Sign in to the Amazon Route 53 management console.

  2. Select Traffic policies.

  3. Choose Create traffic policy.

  4. In Policy name, use the example: IoTDR-primary-region-secondary-region

  5. Choose Next.

  6. In DNS type, select CNAME: Canonical name

  7. Select + Connect to... and select Failover rule.

  8. Under Primary > Health checks, check the box next to Evaluate target health.

  9. Select + Connect to... and select New endpoint

  10. In the Value field, enter your IoT endpoint from primary Region. For example, 1111111bbbbbbb-ats.iot.us-east-1.amazonaws.com

  11. Under Secondary > Health checks, check the box next to Evaluate target health.

  12. Select + Connect to... and select New endpoint

  13. In the Value field, enter your IoT endpoint from secondary Region. For example, 1111111bbbbbbb-ats.iot.us-west-2.amazonaws.com

  14. Choose Create traffic policy.

  15. Select a hosted zone from the domains you have registered. For example: example.com

  16. For the Policy record DNS name, use the example: iot-dr-us

  17. Choose Create policy records.

Figure 2: Disaster Recovery for AWS IoT Amazon Route 53 traffic policy
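
As an alternative to the console steps, a failover traffic policy can be described as a JSON document. The sketch below builds such a document for a CNAME with a primary and a secondary endpoint. It is an illustration only: the function name, the endpoint labels, and the optional health check ID parameters are hypothetical, and the document should be checked against the Route 53 traffic policy document format reference before use.

```python
import json
from typing import Optional


def failover_policy_document(
    primary_endpoint: str,
    secondary_endpoint: str,
    primary_health_check_id: Optional[str] = None,
    secondary_health_check_id: Optional[str] = None,
) -> str:
    """Return a traffic policy document with one failover rule as a JSON string."""

    def branch(endpoint_ref: str, health_check_id: Optional[str]) -> dict:
        rule = {"EndpointReference": endpoint_ref, "EvaluateTargetHealth": True}
        if health_check_id:
            rule["HealthCheck"] = health_check_id
        return rule

    document = {
        "AWSPolicyFormatVersion": "2015-10-01",
        "RecordType": "CNAME",
        "StartRule": "failover_rule",
        "Endpoints": {
            "primary_endpoint": {"Type": "value", "Value": primary_endpoint},
            "secondary_endpoint": {"Type": "value", "Value": secondary_endpoint},
        },
        "Rules": {
            "failover_rule": {
                "RuleType": "failover",
                "Primary": branch("primary_endpoint", primary_health_check_id),
                "Secondary": branch("secondary_endpoint", secondary_health_check_id),
            }
        },
    }
    return json.dumps(document, indent=2)


# Possible usage with boto3:
#   import boto3
#   route53 = boto3.client("route53")
#   route53.create_traffic_policy(
#       Name="IoTDR-primary-region-secondary-region",
#       Document=failover_policy_document(
#           "1111111bbbbbbb-ats.iot.us-east-1.amazonaws.com",
#           "1111111bbbbbbb-ats.iot.us-west-2.amazonaws.com",
#       ),
#   )
```

The policy record for iot-dr-us in the hosted zone still has to be created afterwards, as in the final console steps.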