Design considerations - Disaster Recovery for AWS IoT


IoT solution layers

A complete IoT setup or solution can consist of several layers. A disaster recovery setup affects every layer.

Device layer

The device layer contains the following resources required to connect devices to AWS IoT Core:

  • DNS (Amazon Route 53)

  • IoT Endpoint

  • Device registry

  • Device certificates

  • IoT policies

  • Device shadows

Device settings required to connect devices to AWS IoT Core are replicated from a primary to a secondary Region. The solution provides two approaches to replicate devices: JITR mode and complete provisioning mode.

JITR-mode

In JITR mode, certificates are registered automatically and a device is provisioned when it connects to an AWS IoT endpoint for the first time. With JITR mode, only device registry settings are replicated. When a device fails over to a secondary Region, the certificate is registered automatically and an IoT policy is attached to the device certificate.
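The JITR registration step described above can be sketched as a small provisioning handler. This is a minimal illustration, not the solution's actual implementation: the policy name `DRDevicePolicy` is a hypothetical placeholder, and the boto3 AWS IoT client is passed in as a parameter for testability.

```python
# Hypothetical policy name used for illustration; the solution's actual
# policy name may differ.
POLICY_NAME = "DRDevicePolicy"

def handle_jitr_event(event, iot):
    """Activate an auto-registered certificate and attach an IoT policy.

    `event` is the certificate registration message AWS IoT publishes
    when a device connects for the first time; `iot` is a boto3 AWS IoT
    client (injected so the function can be tested without AWS access).
    """
    cert_id = event["certificateId"]
    # Look up the certificate ARN so the policy can be attached to it.
    cert_arn = iot.describe_certificate(certificateId=cert_id)[
        "certificateDescription"]["certificateArn"]
    # Mark the certificate as ACTIVE so the device is allowed to connect.
    iot.update_certificate(certificateId=cert_id, newStatus="ACTIVE")
    # Attach the IoT policy that grants the device its permissions.
    iot.attach_policy(policyName=POLICY_NAME, target=cert_arn)
    return cert_arn
```

In the secondary Region, this handler would run when a failed-over device connects for the first time, completing the device's provisioning there.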

Complete provisioning mode

Upon device creation, the IoT thing and the associated certificates and policies will be replicated to the secondary Region. For registering certificates in the secondary Region, the multi-account registration feature (MAR) will be used. Certificates will be retrieved from the primary Region and registered in the secondary Region. Certificates in the primary Region are either issued by AWS IoT Core or by another CA like ACM PCA.
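The certificate replication step above can be sketched with boto3's `register_certificate_without_ca` call, which is how multi-account registration registers a certificate without its CA. This is a simplified sketch assuming one boto3 IoT client per Region; error handling and pagination are omitted.

```python
def replicate_certificate(cert_id, primary_iot, secondary_iot):
    """Copy a device certificate from the primary to the secondary Region
    using multi-account registration (no CA certificate required).

    `primary_iot` and `secondary_iot` are boto3 AWS IoT clients created
    for the respective Regions.
    """
    # Retrieve the certificate PEM from the primary Region.
    pem = primary_iot.describe_certificate(certificateId=cert_id)[
        "certificateDescription"]["certificatePem"]
    # Register the identical PEM in the secondary Region and activate it.
    resp = secondary_iot.register_certificate_without_ca(
        certificatePem=pem, status="ACTIVE")
    return resp["certificateArn"]
```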

Registry events in the primary Region are used to track creation of devices. Registry information for IoT things, thing-groups and thing-types are replicated with a DynamoDB global table to the secondary Region. IoT things, thing-groups, or thing-types are then created in the secondary Region.

Registry events do not publish messages when a policy or certificate is created. To retrieve this information, the certificate attached to an IoT thing and the policy attached to the certificate are retrieved from the primary Region after the device has been duplicated in the secondary Region.

A scheduled script determines whether a certificate and policy have been attached to the device. Devices with an attached certificate and policy are then provisioned.

This script can be executed standalone in a container, for example on AWS Fargate, or as a Lambda function. When provisioning many devices with Lambda, you can adjust the function's timeout to bound each invocation. The script runs until all devices are provisioned.
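The readiness check performed by the scheduled script can be sketched as follows. This is an illustrative sketch, not the solution's code: it assumes one certificate and one policy per device (as the solution replicates them) and takes a boto3 IoT client as a parameter.

```python
def find_cert_and_policy(thing_name, iot):
    """Return (certificate_arn, policy_name) once both a certificate and
    a policy are attached to the thing, else None.

    Intended for a scheduled job (container or Lambda function) that
    retries until every replicated device can be provisioned.
    """
    # A thing's principals are the certificates attached to it.
    principals = iot.list_thing_principals(thingName=thing_name)["principals"]
    for cert_arn in principals:
        policies = iot.list_attached_policies(target=cert_arn)["policies"]
        if policies:
            return cert_arn, policies[0]["policyName"]
    return None  # not ready yet; check again on the next scheduled run
```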

Registry events constraints

  • The merge flag (true|false) for attributes of IoT things or thing groups is not provided as part of registry events

  • Thing type removal information is not provided by registry events

Shadows

Shadows are replicated from the primary to the secondary Region, as shown in Figure 1.
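The shadow replication step can be sketched with the boto3 `iot-data` client: read the classic shadow in the primary Region and write its state to the secondary Region. This is a simplified sketch; the service-managed `metadata` and `version` fields must be stripped before updating, and the data-plane clients are passed in as parameters.

```python
import json

def replicate_shadow(thing_name, primary_data, secondary_data):
    """Copy a classic device shadow from the primary to the secondary
    Region. `primary_data`/`secondary_data` are boto3 'iot-data' clients
    for the respective Regions.
    """
    shadow = json.loads(
        primary_data.get_thing_shadow(thingName=thing_name)["payload"].read())
    # Only 'desired' and 'reported' may be written; 'metadata', 'version',
    # and timestamps are managed by the service and must be stripped.
    state = {"state": {k: v for k, v in shadow.get("state", {}).items()
                       if k in ("desired", "reported")}}
    secondary_data.update_thing_shadow(
        thingName=thing_name, payload=json.dumps(state))
    return state
```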

Infrastructure layer

The infrastructure layer of an IoT solution consists of resources, such as IoT rules, Certificate Authorities, or settings. These resources are automatically deployed when you launch the solution in both Regions.

Storage/analytics layer

IoT data are stored or analyzed in the storage/analytics layer. This layer is automatically created when you launch the solution.

The infrastructure and storage/analytics layers are not replicated automatically because different disaster recovery scenarios are possible. You can operate the same functionality in both Regions, or use the secondary Region only to store your data during a disaster and merge it back to the primary Region later.

Performance testing

Before launching this solution, test the performance to verify that your workload stays within service quotas. If it does not, you can request a quota increase.

Before setting up failover, verify that the number of devices connected to the primary Region can also be supported in the secondary Region. We recommend working with your AWS Support team to have resources available in your primary and secondary Regions.

Device replication has been tested with 30,000 devices and shadow replication has been tested with 20,000 shadows.

Device replication with bulk provisioning

To test replicating device settings from the primary to a secondary Region, bulk registration with 30,000 devices is used. To replicate devices, the solution uses the following Lambda functions:

  • DynamoTrigger function: This function reads messages from the DynamoDB stream and invokes the IoTDRSecondaryTimestamp-SFNDynamoTriggerLambdaFu-UniqueString Step Functions workflow.

  • ThingCrudLambda function: Replicates device settings in the IoTDRSecondaryTimestamp-SFNThingCrudLambdaFuncti-UniqueString Step Functions workflow.

Monitor these Lambda functions for errors and also for concurrent invocations. Errors for the ThingCrud functions are handled by the retry configuration of the Step Functions workflow.

During the performance test, some API calls are throttled because of the parallel invocation of Lambda functions. This does not impact device replication. However, depending on the results of tests in your environment, you might want to consider requesting a quota increase.

For more information about running these Lambda functions, refer to Analyzing Log Data with CloudWatch Logs Insights in the Amazon CloudWatch Logs User Guide. The following list of example filters can be used to get more insights into the ThingCrud function:

  • API throttling:

    filter @message =~ "ThingCrudException: lambda_handler: An error occurred (ThrottlingException)"
  • Runtime and memory used: filter @message =~ "REPORT"

  • Timed out Lambda function: filter @message =~ "Task timed out after"

  • Use this filter for all of the above:

    filter @message =~ "ThingCrudException: lambda_handler: An error occurred (ThrottlingException)" or @message =~ "Task timed out after" or @message =~ "REPORT"
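These filters can also be run programmatically with the boto3 CloudWatch Logs client, for example from a monitoring script. The following is a minimal sketch, assuming a log group name for the ThingCrud function; the client is passed in so the helper can be tested without AWS access.

```python
import time

# One of the example filters from above, as a Logs Insights query string.
THROTTLE_QUERY = (
    'filter @message =~ "ThingCrudException: lambda_handler: '
    'An error occurred (ThrottlingException)"')

def run_insights_query(logs, log_group, query, start, end, delay=1.0):
    """Start a CloudWatch Logs Insights query and poll until it finishes.
    `logs` is a boto3 CloudWatch Logs client; `start`/`end` are epoch
    timestamps bounding the query window."""
    query_id = logs.start_query(logGroupName=log_group, startTime=start,
                                endTime=end, queryString=query)["queryId"]
    while True:
        resp = logs.get_query_results(queryId=query_id)
        if resp["status"] in ("Complete", "Failed", "Cancelled"):
            return resp
        time.sleep(delay)  # query still running; poll again
```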

The AWS IoT bulk provisioning test produced the following results:

  • Number of things for bulk provisioning: 30,000

  • Time to generate keys and CSRs: 11,086 secs.

  • Time for bulk provisioning: 6,146 secs.


Figure 3: Disaster Recovery for AWS IoT monitoring DynamoTrigger Lambda function

Figure 3 displays typical behavior for the DynamoTrigger Lambda function when 30,000 devices are replicated.


Figure 4: Disaster Recovery for AWS IoT monitoring ThingCrud Lambda function

Refer to Figure 4 for an example of monitoring the ThingCrud Lambda function when replicating 30,000 devices. The errors displayed are caused by reaching API limits and are handled by the retry configuration of the Step Functions workflow.

Shadow replication

Shadow replication has been tested with 20,000 device shadows using the tool iot-dr-shadow-cmp.py with 40 parallel threads.

Results:

./iot-dr-shadow-cmp.py --primary-region us-east-1 --secondary-region us-west-2 --num-tests 20000 --max-workers 40 | tee /tmp/sdw20k.log
2020-11-20 12:49:03,908 [INFO]: MainThread-iot-dr-shadow-cmp.py:156-<module>: cmp: start
2020-11-20 13:02:47,887 [INFO]: MainThread-iot-dr-shadow-cmp.py:221-<module>: cmp: stats: NUM_SHADOWS_COMPARED: 20000 NUM_SHADOWS_NOTSYNCED: 0 NUM_ERRORS: 0
2020-11-20 13:02:47,887 [INFO]: MainThread-iot-dr-shadow-cmp.py:223-<module>: cmp: stop

Testing tools

This solution uses testing tools that are copied to an S3 bucket in the primary Region. You can find the S3 location in the outputs section of your main CloudFormation stack under ToolsS3Url.

The tools are implemented as Bash scripts or in Python3 and have been tested in an Amazon Cloud9 environment. You can use them in any environment where Bash and Python3 are available. The Python scripts use the boto3 library or the AWS IoT SDK v2, which must be installed on your system.

If you want to capture the output of scripts not only to the terminal but also to a file, use the tee command. For example: $ ./script-name | tee /tmp/my.log.

These test tools require setting environment variables. Under ToolsS3Url, you will find the toolsrc file with several predefined environment variables. You can customize it to your needs and set the environment variables before using the tools.

The following scripts are provided to create, delete, and test devices.

Note

When you use these scripts to create devices you want replicated in the secondary Region, you must create the device in the primary Region. Only devices created in the primary Region will be replicated to the secondary Region.

  • bulk-bench.sh: Creates a certain number of devices with bulk provisioning.

    • Prerequisites:

      • An IAM role that permits AWS IoT to provision devices on your behalf. The role ARN must be set to the environment variable ARN_IOT_PROVISIONING_ROLE. A role has been created already with CloudFormation during solution launch.

      • An S3 bucket that you can use for bulk provisioning. The bucket name must be set to the environment variable S3_BUCKET.

      • You can find both environment variables in the file toolsrc.

  • $ ./bulk-bench.sh <base_thingname> <num_things>: The script creates a directory to store keys and CSRs. An IoT thing name is composed of the base_thingname and an incrementing number. The name of the directory is base_thingname-%Y-%m-%d_%H-%M-%S.

    • After successful bulk provisioning, the script will download the resulting JSON file containing device certificates. The name of the JSON file is base_thingname-%Y-%m-%d_%H-%M-%S/results.json.

  • bulk-result.py: This script extracts device certificates issued by bulk-bench.sh and writes them to the file system.

    • Run $ bulk-result.py results.json

  • script-name <THING_NAME>: The following scripts create devices. They use the AWS CLI to provision devices. Some of the scripts expect shell variables to be set. The scripts write the device certificate and the private and public keys to $THING_NAME.certificate.pem, $THING_NAME.private.key, and $THING_NAME.public.key. By default, devices are created in the Region of your AWS CLI configuration. To create devices in another Region, set the environment variable AWS_DEFAULT_REGION to the appropriate Region.

    • create-device-attrs-type.sh: Creates a device with attributes and the thing-type dr-type03. You must create the thing-type before using the script.

    • create-device-attrs.sh: Creates a device with attributes.

    • create-device-pca.sh: Creates a device with a device certificate issued by AWS Certificate Manager Private Certificate Authority (ACM PCA). A PCA can be created with the Jupyter notebooks provided by the solution. The environment variable PCA_ARN must be set to the ARN of your private CA.

    • create-device-type.sh: Creates a device with the thing-type dr-type03. You must create the thing-type before using the script.

    • create-device.sh: Creates a device using the IoT policy from the document sample-pol1.json. The script replaces the Region in the sample-pol1.json policy. The environment variable REGION must be set to the AWS Region that you want to use.

    • Always create devices in the primary Region.

  • delete-things.py

    • Prerequisite: Registry indexing must be turned on.

    • Run:

      $ ./delete-things.py --region <your_region_where_devices_should_be_deleted> --query-string "thingName:<device_pattern>"
    • For the IoT thing name, you can provide a single IoT thing name or use the "*" wildcard in combination with part of a device name to match multiple devices.

    • Deletes devices that are matched by the query string. The certificate will be detached from the device. The policy associated with the certificate will only be deleted if no other principals are associated with the policy. The certificate will be deleted if no other things are associated with the certificate.

    • Always delete devices in the primary Region.
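The cleanup order described for delete-things.py (detach the certificate, delete the policy only if no other principal uses it, delete the certificate only if no other thing uses it) can be sketched as follows. This is an illustrative sketch, not the script itself; the boto3 IoT client is passed in for testability.

```python
def delete_thing_with_cleanup(thing_name, iot):
    """Delete a thing, cleaning up its certificate and policy only when
    no other resources still use them. `iot` is a boto3 AWS IoT client."""
    for cert_arn in iot.list_thing_principals(thingName=thing_name)["principals"]:
        iot.detach_thing_principal(thingName=thing_name, principal=cert_arn)
        for policy in iot.list_attached_policies(target=cert_arn)["policies"]:
            name = policy["policyName"]
            iot.detach_policy(policyName=name, target=cert_arn)
            # Keep the policy if another principal is still attached to it.
            if not iot.list_targets_for_policy(policyName=name)["targets"]:
                iot.delete_policy(policyName=name)
        # Keep the certificate if another thing is still attached to it.
        if not iot.list_principal_things(principal=cert_arn)["things"]:
            cert_id = cert_arn.split("/")[-1]
            # A certificate must be INACTIVE before it can be deleted.
            iot.update_certificate(certificateId=cert_id, newStatus="INACTIVE")
            iot.delete_certificate(certificateId=cert_id)
    iot.delete_thing(thingName=thing_name)
```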

  • iot-devices-cmp.py

    • Prerequisite: Registry indexing must be enabled.

    • Run:

      $ ./iot-devices-cmp.py --primary-region <your_primary_region> --secondary-region <your_secondary_region> --query-string "thingName:devicename*"

      The query string is optional and defaults to "thingName:*".

    • Compares devices in the primary and secondary Region to verify that the same certificate and policy are attached in both Regions. The script is meant to work with one replicated certificate attached to the device and one policy attached to that certificate. It is not a general tool covering multiple certificates or policies attached to a device or certificate.

  • iot-dr-pubsub.py: Based on the pubsub.py example from the AWS IoT Device SDK v2 for Python with two additional features.

    • Prerequisites: Optional Amazon Route 53 setup with health checkers and traffic policy. See description below in this document. The Python libraries awsiotsdk and dnspython must be installed.

    • Optional features:

      • --cname: Instead of connecting directly to the provided endpoint, performs a CNAME lookup in DNS and connects to the resulting host name.

      • --dr-mode: Starts a separate thread that performs a CNAME lookup at regular intervals. If the result of the CNAME lookup changes, the script terminates the current connection to the IoT endpoint and reconnects to the new endpoint.

    • Run:

      $ ./iot-dr-pubsub.py --endpoint <endpoint> --root-ca <file> --cert <file> --key <file> <optional_features>
    • Apart from publishing/subscribing, the script can look up the CNAME of an IoT endpoint and connect to the result of the lookup. When the CNAME is created with a traffic policy in Amazon Route 53 and --dr-mode is turned on, the script fails over to another Region if health checks determine that the currently active Region has failed. By using a TXT record in DNS, the script can determine whether it is connected to the primary or secondary Region. To use this feature, create a DNS TXT record which points to _YOUR_CNAME and provide a JSON object in the following format: {"primary": "primary_region", "secondary": "secondary_region"}. A sample lookup will result in the following answer:

      host -t TXT _iot-dr-us.example.com
      _iot-dr-us.example.com descriptive text "{\"primary\": \"us-east-1\", \"secondary\": \"us-west-2\"}"
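The CNAME polling and TXT-record evaluation described for iot-dr-pubsub.py can be sketched as below. This is not the script's actual code: the resolver is passed in as a callable (in practice it could wrap dnspython's resolver, which the script requires) so the logic can be tested without DNS access.

```python
import json
import threading

def watch_cname(resolve_cname, on_change, interval=10.0, stop=None):
    """Poll a CNAME target at regular intervals; call on_change(new_target)
    when it changes. `resolve_cname` is a callable returning the current
    CNAME target; `stop` is a threading.Event that ends the loop."""
    stop = stop or threading.Event()
    current = resolve_cname()
    while not stop.wait(interval):
        target = resolve_cname()
        if target != current:
            current = target
            # In the real script: terminate the MQTT connection and
            # reconnect to the new endpoint.
            on_change(target)

def region_role(txt_json, connected_region):
    """Map the connected Region to 'primary' or 'secondary' using the DNS
    TXT record JSON: {"primary": "...", "secondary": "..."}."""
    mapping = json.loads(txt_json)
    for role in ("primary", "secondary"):
        if mapping.get(role) == connected_region:
            return role
    return "unknown"
```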
  • iot-dr-shadow-cmp.py

    • Run:

      $ ./iot-dr-shadow-cmp.py --primary-region <region> --secondary-region <region> --num-tests <number_of_shadows_to_test> --max-workers <default_10_max_50>
    • Tests shadow replication. Shadows are created in the primary Region and compared with the shadows in the secondary Region to determine whether replication was successful. After testing, the shadows are deleted. The script runs its comparisons in parallel to reduce runtime.

  • iot-search-devices.py

    • Prerequisites: Registry indexing must be turned on.

    • Run $ ./iot-search-devices.py --query-string <query_string>

    • Searches all devices for a given query string. The search is not limited to a fixed number of devices because the script follows the next_token in each answer and continues retrieving devices. Useful to compare the number of devices between Regions. Query string example to find all devices whose names start with iot-dr: "thingName:iot-dr*"
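The next_token pagination used by iot-search-devices.py can be sketched with boto3's `search_index` call. A minimal sketch, assuming registry indexing is enabled and taking the IoT client as a parameter:

```python
def search_all_things(iot, query_string="thingName:*"):
    """Collect every matching thing by following nextToken across pages.
    `iot` is a boto3 AWS IoT client; requires registry indexing."""
    things, token = [], None
    while True:
        kwargs = {"queryString": query_string}
        if token:
            kwargs["nextToken"] = token
        resp = iot.search_index(**kwargs)
        things.extend(resp.get("things", []))
        token = resp.get("nextToken")
        if not token:  # no more pages
            return things
```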

  • list-thing.py

    • Run $ ./list-thing.py <thing_name>

    • Looks up the given device name and the attached certificate and IoT policy.

  • sample-pol1.json, sample-pol2.json

    • Sample IoT policies

  • test-dr-deployment.py

    • Run $ ./test-dr-deployment.py --primary-region <region> --secondary-region <region>

    • An automated end-to-end test that demonstrates the working capabilities of the IoT DR solution. It creates a device in the primary Region and verifies that replication to the secondary Region was successful, then tests publish and subscribe in both Regions, and then tests shadow synchronization. When the tests are finished, it deletes the resources that were created.

Failover

This section discusses possible approaches for Region failover.

Use your own domain with CNAME

To use your own domain as CNAME pointing to AWS IoT endpoints, you can use Route 53 health checks for failover or implement a failover logic on your devices.

Route 53 health checks

Health checks are initiated from several AWS Regions, but they cannot determine whether your devices can reach AWS IoT endpoints. There might be cases where your devices cannot reach a Region even though the health checks still succeed.

Failover logic on devices

You can configure the endpoints for both the primary and secondary Region on your devices. If a device cannot reach the primary Region, it can switch to the secondary Region. This approach verifies reachability from the device's perspective against your actual IoT endpoints.
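Such device-side failover logic can be sketched as below. This is an illustrative sketch, not prescribed by the solution: the `connect` callable stands in for whatever connect routine your device uses (for example, the MQTT connect of the AWS IoT Device SDK), and it is passed in so the ordering logic can be tested in isolation.

```python
def connect_with_failover(endpoints, connect, attempts_per_endpoint=3):
    """Try each endpoint in order (primary first) and fail over to the
    next when connecting fails. `connect` is a callable taking an
    endpoint and raising on failure; returns (endpoint, connection)."""
    last_error = None
    for endpoint in endpoints:
        for _ in range(attempts_per_endpoint):
            try:
                return endpoint, connect(endpoint)
            except Exception as err:  # sketch: retry on any connect error
                last_error = err
    raise ConnectionError(f"all endpoints failed: {last_error}")
```

A device would call this with the primary endpoint listed first, so the secondary Region is only used when the primary is unreachable.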

Failback

Devices using a permanent MQTT-based connection must implement a failback strategy. Without one, they stay connected to the failover Region even after the primary Region is available again. If your devices can detect that they are connected to the secondary Region, they can test at regular intervals whether the primary Region is reachable again and fail back.

The sample iot-dr-pubsub.py implements a strategy that fails over as soon as an endpoint change is detected. If your devices have any mechanism to determine which Region they are connected to, you can build your reconnection strategy on top of it. For more information, refer to Testing tools.

Customizing this solution

This solution is mainly built on AWS IoT registry events to capture and replicate settings related to IoT things, thing groups, and thing types. It does not cover certificates attached to IoT thing groups, nested thing groups, or jobs. Also, registry events cannot track changes to device certificates or IoT policies.

To capture certificate and policy changes, use Amazon EventBridge to replicate them.
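A possible sketch of such an EventBridge rule is shown below, matching AWS IoT certificate and policy API calls recorded by CloudTrail and routing them to a replication target. The rule name, the chosen event names, and the target are illustrative assumptions, not part of the solution.

```python
import json

# Illustrative pattern: AWS IoT control-plane calls for certificates and
# policies, as recorded by CloudTrail and delivered via EventBridge.
CERT_POLICY_PATTERN = {
    "source": ["aws.iot"],
    "detail-type": ["AWS API Call via CloudTrail"],
    "detail": {
        "eventSource": ["iot.amazonaws.com"],
        "eventName": ["CreateCertificate", "RegisterCertificate",
                      "CreatePolicy", "AttachPolicy", "DetachPolicy"],
    },
}

def create_replication_rule(events, target_arn):
    """Create an EventBridge rule that routes the matched events to a
    replication target, e.g. a Lambda function that copies the
    certificate or policy to the secondary Region.
    `events` is a boto3 EventBridge client."""
    events.put_rule(Name="iot-dr-cert-policy-changes",
                    EventPattern=json.dumps(CERT_POLICY_PATTERN))
    events.put_targets(Rule="iot-dr-cert-policy-changes",
                       Targets=[{"Id": "replicator", "Arn": target_arn}])
```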

Replicating jobs requires further investigation to determine if and how they can be replicated and handled in case of failover.

Regional deployments

This solution uses AWS IoT Core and Amazon DynamoDB. Both services must be available in the Regions where you deploy the solution. The CloudFormation template allows you to choose only Regions where these services are available.

If your AWS IoT environment also uses other services, you need to select an AWS Region where all of your services are available. For the most current availability by Region, refer to the AWS Service Region Table.