Foundations - IoT Lens

IoT devices must continue to operate in some capacity in the face of network or cloud errors. Design device firmware to handle intermittent connectivity or loss in connectivity in a way that is sensitive to memory and power constraints. IoT cloud applications must also be designed to handle remote devices that frequently transition between being online and offline to maintain data coherency and scale horizontally over time. Monitor overall IoT utilization and create a mechanism to automatically increase capacity to ensure that your application can manage peak IoT traffic.

To prevent devices from creating unnecessary peak traffic, implement device firmware that keeps the entire fleet from attempting the same operation at the same time. For example, if an IoT application is composed of alarm systems and every alarm system sends an activation event at 9 AM local time, the application is inundated with an immediate spike from the entire fleet. Instead, incorporate a randomization factor into scheduled activities, such as timed events and exponential backoff, so that devices distribute their peak traffic more evenly within a window of time.
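As a minimal sketch of the randomization factor described above, the following Python shows two ways a device could shift a scheduled event inside a window. The function names, the 15-minute window, and the device ID are hypothetical; real firmware would typically implement the same idea in its own language and scheduler.

```python
import hashlib
import random
from typing import Optional


def stable_offset_seconds(device_id: str, window_seconds: int) -> int:
    """Map a device ID to a fixed offset in [0, window_seconds).

    Hashing the ID gives each device its own repeatable slot, so the
    fleet spreads out without any coordination between devices.
    """
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    digest = hashlib.sha256(device_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % window_seconds


def random_offset_seconds(window_seconds: float,
                          rng: Optional[random.Random] = None) -> float:
    """Pick a fresh random offset between 0 and window_seconds."""
    if window_seconds < 0:
        raise ValueError("window_seconds must be non-negative")
    rng = rng or random.Random()
    return rng.uniform(0.0, window_seconds)


# Hypothetical usage: spread a 9 AM report across the following 15 minutes.
delay = stable_offset_seconds("alarm-001", 15 * 60)
```

A stable offset keeps each device's reporting time predictable from day to day, while a fresh random offset avoids any accidental clustering that a fixed mapping might produce.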

IOTREL 01. How do you handle AWS service limits for peaks in your IoT application?

AWS IoT provides a set of soft and hard limits for different dimensions of usage. AWS IoT outlines all of the data plane limits on the IoT limits page. Data plane operations (for example, MQTT Connect, MQTT Publish, and MQTT Subscribe) are the primary driver of your device connectivity. Therefore, it's important to review the IoT limits and ensure that your application adheres to any soft limits related to the data plane, while not exceeding any hard limits that are imposed by the data plane.

The most important part of your IoT scaling approach is to ensure that you architect around any hard limits because exceeding limits that are not adjustable results in application errors, such as throttling and client errors. Hard limits are related to throughput on a single IoT connection. If you find your application exceeds a hard limit, we recommend redesigning your application to avoid those scenarios. This can be done in several ways, such as restructuring your MQTT topics, or implementing cloud-side logic to aggregate or filter messages before delivering the messages to the interested devices.

Soft limits in AWS IoT traditionally correlate to account-level limits that are independent of a single device. For any account-level limit, calculate your IoT usage for a single device and then multiply that usage by the number of devices to determine the base IoT limits that your application requires for your initial product launch. AWS recommends that you have a ramp-up period where your limit increases align closely to your current production peak usage with an additional buffer. To ensure that the IoT application is not under-provisioned:

  • Consult published AWS IoT CloudWatch metrics for all of the limits.

  • Monitor CloudWatch metrics in AWS IoT Core.

  • Alert on CloudWatch throttle metrics, which would signal if you need a limit increase.

  • Set alarms for all thresholds in IoT, including MQTT connect, publish, subscribe, receive, and rule engine actions.

  • Ensure that you request a limit increase in a timely fashion, before reaching 100% capacity.
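The sizing arithmetic above (per-device usage multiplied by fleet size, plus a buffer) can be sketched as follows. The 25% buffer and the 80% alert threshold are illustrative assumptions, not AWS-published values, and the function names are hypothetical.

```python
import math


def required_account_limit(per_device_rate: float,
                           device_count: int,
                           buffer_fraction: float = 0.25) -> int:
    """Estimate an account-level limit from single-device usage.

    per_device_rate is the peak rate for one device (for example,
    publishes per second); the buffer adds headroom above that peak.
    """
    base = per_device_rate * device_count
    return math.ceil(base * (1.0 + buffer_fraction))


def needs_limit_increase(observed_peak: float,
                         current_limit: float,
                         threshold: float = 0.8) -> bool:
    """Flag a limit once observed peak usage crosses the threshold,
    so the increase is requested well before 100% capacity."""
    return observed_peak / current_limit >= threshold


# Hypothetical fleet: 10,000 devices, each peaking at 0.5 publishes/second.
limit = required_account_limit(0.5, 10_000)   # 5,000 base + 25% buffer
```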

In addition to data plane limits, the AWS IoT service has a control plane for administrative APIs. The control plane manages the process of creating and storing IoT policies and principals, creating the thing in the registry, and associating IoT principals including certificates and Amazon Cognito federated identities. Because bootstrapping and device registration are critical to the overall process, it's important to plan control plane operations and limits. Control plane API calls are based on throughput measured in requests per second, and their limits are normally on the order of tens of requests per second. It’s important for you to work backward from peak expected registration usage to determine if any limit increases for control plane operations are needed. Plan for sustained ramp-up periods for onboarding devices so that the IoT limit increases align with regular day-to-day data plane usage.

To protect against a burst in control plane requests, your architecture should limit the access to these APIs to only authorized users or internal applications. Implement back-off and retry logic, and queue inbound requests to control data rates to these APIs.
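One common way to queue inbound requests and control the rate at which they reach an API is a token bucket in front of a queue. The sketch below is a generic illustration under that assumption; the class and function names are hypothetical, and the rate and capacity values would need to be chosen from your actual control plane limits.

```python
import time
from collections import deque


class TokenBucket:
    """Allow at most rate_per_second calls on average, with bursts up to capacity."""

    def __init__(self, rate_per_second: float, capacity: int, clock=time.monotonic):
        self.rate = rate_per_second
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)
        self._last = clock()

    def try_acquire(self) -> bool:
        """Refill tokens for the elapsed time, then spend one if available."""
        now = self._clock()
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False


def drain(pending: deque, bucket: TokenBucket, call) -> int:
    """Send queued requests while tokens remain; the rest stay queued."""
    sent = 0
    while pending and bucket.try_acquire():
        call(pending.popleft())
        sent += 1
    return sent
```

Requests that do not get a token stay in the queue for the next drain cycle, so a registration burst is smoothed out rather than passed straight through to the API.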

IOTREL 02. What is the strategy for managing ingestion and processing throughput of IoT data to other applications?

Although some communication in an IoT application is routed only between devices, other messages are processed and stored in your application. In these cases, the rest of your IoT application must be prepared to respond to incoming data. All internal services that depend on that data need a way to seamlessly scale the ingestion and processing of the data. In a well-architected IoT application, internal systems are decoupled from the connectivity layer of the IoT platform through the ingestion layer. The ingestion layer is composed of queues and streams that enable durable short-term storage while allowing compute resources to process data independent of the rate of ingestion.

To optimize throughput, use AWS IoT rules to route inbound device data to services such as Amazon Kinesis Data Streams, Amazon Data Firehose, or Amazon Simple Queue Service before performing any compute operations. Ensure that all the intermediate streaming points are provisioned to handle peak capacity. This approach creates the queueing layer necessary for downstream applications to process data resiliently.
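As a rough illustration of provisioning an intermediate stream for peak capacity, the sketch below estimates a shard count for a Kinesis data stream in provisioned mode. It assumes per-shard write limits of 1,000 records per second and 1 MB per second; verify these against the current Kinesis Data Streams quotas before relying on them. The function name and the 25% headroom are hypothetical.

```python
import math

# Assumed per-shard write limits (provisioned mode); confirm against current quotas.
SHARD_RECORDS_PER_SEC = 1_000
SHARD_BYTES_PER_SEC = 1_000_000


def shards_needed(peak_messages_per_sec: float,
                  avg_message_bytes: float,
                  headroom: float = 1.25) -> int:
    """Return the shard count that covers both the record-rate and
    byte-rate limits at peak, with a multiplicative headroom."""
    by_records = peak_messages_per_sec * headroom / SHARD_RECORDS_PER_SEC
    by_bytes = peak_messages_per_sec * avg_message_bytes * headroom / SHARD_BYTES_PER_SEC
    return max(1, math.ceil(max(by_records, by_bytes)))


# Hypothetical peak: 4,000 messages/second of 200 bytes each.
shards = shards_needed(4_000, 200)
```

Small, frequent IoT messages are usually bound by the record-rate limit rather than the byte-rate limit, which is why the sketch takes the larger of the two estimates.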

IOTREL 03. How do you implement your IoT workload to withstand component and system faults?

Understanding and predicting the fault scenarios in the system helps you to architect for failure conditions and use service features to handle them. Therefore, the handling of such predicted system faults and recovering from them should be architected into the system.

Best practice IOTREL_3.1 – Use the services provided by your vendors for integration and error handling to withstand component failure

An IoT design consists of device software, connectivity and control services, and analytics services. Test the entire IoT environment for resiliency, starting with device firmware and continuing through data flow, the cloud services used, and error handling. Vendors integrate their services with each other to simplify integration and fault handling.

Recommendation IOTREL_3.1.1 – Understand and apply the standard libraries available to manage your device firmware

  • Devices can be built on FreeRTOS, which provides connectivity, messaging, power management and device management libraries that are tested for reliability and designed for ease of use.

Recommendation IOTREL_3.1.2 – Use log levels appropriate to the lifecycle stage of your workload

  • AWS IoT logs can be set up per Region and per account. Set the logging level to DEBUG during the product development phase to gain insight into data flow and the resources used. This data can be used to improve the security and performance of the IoT system.

  • AWS IoT Secure Tunneling can be used to test and debug devices that are behind a restrictive firewall in the field.

IOTREL 04. How do you ensure that all IoT messages are processed?

Data sent from devices should be processed and stored without excessive loss. Use services that queue and deliver IoT data to compute and database services to ensure that the data is processed. IoT devices send large volumes of small messages that can arrive out of order, and the cloud application should be able to handle this.
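One simple way for a cloud application to tolerate out-of-order and duplicate messages is to compare a device-side timestamp before overwriting state. The sketch below assumes each message carries hypothetical `device_id` and `timestamp` fields; it illustrates a last-write-wins approach and is not the only way to handle ordering.

```python
def apply_reading(latest: dict, reading: dict) -> bool:
    """Keep only the newest reading per device.

    Returns True if the stored state changed, or False if the reading
    was stale or a duplicate and was ignored.
    """
    device_id = reading["device_id"]
    current = latest.get(device_id)
    if current is not None and reading["timestamp"] <= current["timestamp"]:
        return False
    latest[device_id] = reading
    return True


# Hypothetical usage: a late message with an older timestamp is ignored.
state = {}
apply_reading(state, {"device_id": "d1", "timestamp": 10, "temp": 20.0})
apply_reading(state, {"device_id": "d1", "timestamp": 5, "temp": 19.0})
```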

Best practice IOTREL_4.1 – Dynamically scale cloud resources with utilization

The elastic nature of the cloud can be used to increase and decrease resources on demand. Scale cloud resources up and down based on data volume, the number and size of messages, and the number of devices.

Recommendation IOTREL_4.1.1 – Know the mechanisms that can be used to monitor cloud resource usage and methods to scale the resources

  • Use Amazon CloudWatch Logs to monitor the rate of data flow and trigger automatic scaling of cloud resources as needed.

  • Use AWS IoT Rules engine error actions to provision additional cloud resources and message retries as needed.

  • Examine IoT logs for errors in communicating to resources and provision resources based on that data.

  • Use AWS Lambda to automatically scale your application by running code in response to each event.

  • Use automatic scaling where possible. Kinesis Data Streams and Amazon DynamoDB are two services that provide automatic scaling.

IOTREL 05. How do you ensure that your IoT device operates with intermittent connectivity to the cloud?

IoT solution reliability must also encompass the device itself. Devices are deployed in remote locations and deal with intermittent connectivity, or loss in connectivity, due to a variety of external factors that are out of your IoT application’s control. For example, if an ISP is interrupted for several hours, how will the device behave and respond to these long periods of potential network outage? Implement a minimum set of embedded operations on the device to make it more resilient to the nuances of managing connectivity and communication to AWS IoT Core.

Your IoT device must be able to operate without internet connectivity. You must implement robust operations in your firmware to provide the following capabilities:

  • Store important messages durably offline and, once reconnected, send those messages to AWS IoT Core.

  • Implement exponential retry and back-off logic when connection attempts fail.

  • If necessary, have a separate failover network channel to deliver critical messages to AWS IoT. This can include failing over from Wi-Fi to standby cellular network, or failing over to a wireless personal area network protocol (such as Bluetooth LE) to send messages to a connected device or gateway.

  • Have a method to set the current time using an NTP client or low-drift real-time clock. A device should wait until it has synchronized its time before attempting a connection with AWS IoT Core. If this isn’t possible, the system should provide a way for a user to set the device’s time so that subsequent connections can succeed.

  • Send error codes and overall diagnostics messages to AWS IoT Core.

  • Configure an AWS IoT Greengrass group to write logs to the local file system and to CloudWatch Logs.

Connection to the cloud can be intermittent and devices should be designed to handle this. Choose devices with firmware designed for intermittent cloud connection and that have the ability to store data on the device if you cannot afford to lose the data.

Best practice IOTREL_5.1 – Synchronize device states upon connection to the cloud

IoT devices are not always connected to the cloud. Design a mechanism to synchronize device states every time the device has access to the cloud. Synchronizing the device state to the cloud allows the application to get and update device state easily, as the application doesn’t have to wait for the device to come online.

Recommendation IOTREL_5.1.1 – Use a digital representation of device state to synchronize device state, using the following capabilities:

  • AWS provides device shadow capabilities that can be used to synchronize device state when the device connects to the cloud. The AWS IoT Device Shadow service maintains a shadow for each device that you connect to AWS IoT and is supported by the AWS IoT Device SDK, AWS IoT Greengrass core, and FreeRTOS.

  • Synchronizing device shadows – Device SDKs and AWS IoT Core take care of synchronizing property values between the connected device and its device shadow in AWS IoT Core.

  • AWS IoT Greengrass – AWS IoT Greengrass core software provides local shadow synchronization of devices, and these shadows can be configured to sync with the cloud.

  • FreeRTOS – The FreeRTOS device shadow API operations define functions to create, update, and delete AWS IoT Device Shadows.
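To illustrate what synchronizing state means in practice, the sketch below compares a shadow's desired state with the state a device last reported and returns the properties the device still needs to apply. This is a simplified illustration only: the Device Shadow service computes this difference itself, and the function name and sample properties here are hypothetical.

```python
def shadow_delta(desired: dict, reported: dict) -> dict:
    """Return the desired properties that are missing from, or differ
    from, the reported state. Nested objects are compared recursively."""
    delta = {}
    for key, want in desired.items():
        have = reported.get(key)
        if isinstance(want, dict) and isinstance(have, dict):
            nested = shadow_delta(want, have)
            if nested:
                delta[key] = nested
        elif key not in reported or want != have:
            delta[key] = want
    return delta


# Hypothetical state: the application asked for the LED to be on while
# the device was offline; the device last reported it as off.
desired = {"led": "on", "interval": 30, "cfg": {"a": 1, "b": 2}}
reported = {"led": "off", "interval": 30, "cfg": {"a": 1, "b": 3}}
pending = shadow_delta(desired, reported)
```

When the device reconnects, it applies the pending properties and reports its new state, at which point the difference becomes empty.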

Best practice IOTREL_5.2 – Use device hardware with sufficient capacity to meet your data retention requirements while disconnected

Store important messages durably offline and, once reconnected, send those messages to the cloud. Device hardware should have capabilities to store data locally for a finite period of time to prevent any loss of information.

Recommendation IOTREL_5.2.1 – You can leverage the device edge software capabilities for storing data locally.
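A minimal store-and-forward sketch, assuming the device can run Python and SQLite, is shown below. The `OfflineQueue` class, the `outbox` table, and the `publish` callback are hypothetical names; the cap on stored messages stands in for the finite storage capacity of real hardware.

```python
import json
import sqlite3


class OfflineQueue:
    """Durable FIFO outbox that drops the oldest messages when full."""

    def __init__(self, path: str = ":memory:", max_messages: int = 1000):
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS outbox ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT NOT NULL)"
        )
        self._db.commit()
        self._max = max_messages

    def enqueue(self, message: dict) -> None:
        """Persist a message, then trim anything beyond the capacity."""
        self._db.execute("INSERT INTO outbox (payload) VALUES (?)",
                         (json.dumps(message),))
        self._db.execute(
            "DELETE FROM outbox WHERE id NOT IN "
            "(SELECT id FROM outbox ORDER BY id DESC LIMIT ?)",
            (self._max,),
        )
        self._db.commit()

    def flush(self, publish) -> int:
        """Send stored messages oldest first; stop at the first failure
        so unsent messages remain queued for the next reconnect."""
        sent = 0
        rows = self._db.execute(
            "SELECT id, payload FROM outbox ORDER BY id").fetchall()
        for row_id, payload in rows:
            if not publish(json.loads(payload)):
                break
            self._db.execute("DELETE FROM outbox WHERE id = ?", (row_id,))
            self._db.commit()
            sent += 1
        return sent

    def __len__(self) -> int:
        return self._db.execute("SELECT COUNT(*) FROM outbox").fetchone()[0]
```

Passing a file path instead of `:memory:` keeps the outbox across reboots. Whether to drop the oldest or the newest messages when storage fills up depends on which data matters more to your workload.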

Best practice IOTREL_5.3 – Down sample data to reduce storage requirements and network utilization

Data should be down sampled where possible to reduce storage on the device, lower transmission costs, and reduce network pressure.

Recommendation IOTREL_5.3.1 – Use device edge software capabilities for down sampling

  • Use AWS IoT Greengrass as device software to down sample data.

    • Local Lambda functions can be used on AWS IoT Greengrass to down sample the data before sending it to the cloud.

  • ETL with AWS IoT – The Extract, Transform, Load with AWS IoT Greengrass Solution Accelerator helps you quickly set up an edge device with AWS IoT Greengrass to perform extract, transform, and load (ETL) functions on data gathered from local devices before it is sent to AWS.
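As a small illustration of down sampling at the edge, the sketch below replaces each group of consecutive samples with its mean. Averaging is only one possible reduction (minimum, maximum, or last value are alternatives), and the function name and factor are hypothetical.

```python
from statistics import fmean
from typing import List, Sequence


def downsample_mean(samples: Sequence[float], factor: int) -> List[float]:
    """Reduce a series by averaging each run of `factor` samples.

    A final partial group is averaged over the samples it contains.
    """
    if factor < 1:
        raise ValueError("factor must be at least 1")
    return [fmean(samples[i:i + factor])
            for i in range(0, len(samples), factor)]


# Seven raw readings become three values to store or transmit.
reduced = downsample_mean([1, 2, 3, 4, 5, 6, 7], 3)   # [2.0, 5.0, 7.0]
```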

Best practice IOTREL_5.4 – Use an exponential backoff with jitter and retry logic to connect remote devices to the cloud

Consider implementing a retry mechanism for IoT device software. The retry mechanism should have exponential backoff with a randomization factor built in to avoid retries from multiple devices occurring simultaneously. Implementing retry logic with exponential backoff with jitter allows the IoT devices to more evenly distribute their traffic and prevent them from creating unnecessary peak traffic.
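The retry mechanism described above can be sketched as follows, using the "full jitter" variant in which each delay is drawn uniformly between zero and an exponentially growing ceiling. The function names, the base delay, the cap, and the use of `ConnectionError` to signal a failed attempt are assumptions for illustration; the `connect` callback stands in for whatever connection call your device software uses.

```python
import random
import time


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 300.0,
                  rng=random) -> float:
    """Return a random delay between 0 and min(cap, base * 2**attempt)."""
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


def connect_with_retry(connect, max_attempts: int = 6, base: float = 1.0,
                       cap: float = 300.0, sleep=time.sleep, rng=random) -> bool:
    """Call connect() until it succeeds or attempts run out.

    connect is expected to raise ConnectionError on failure. No delay
    is added after the final failed attempt.
    """
    for attempt in range(max_attempts):
        try:
            connect()
            return True
        except ConnectionError:
            if attempt == max_attempts - 1:
                break
            sleep(backoff_delay(attempt, base, cap, rng))
    return False
```

Because every device draws its own random delay, a fleet that loses connectivity at the same moment does not reconnect at the same moment, and the cap keeps long outages from pushing delays beyond a bounded value.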

Recommendation IOTREL_5.4.1 – Implement logic in the cloud to notify the device operator if a device has not connected for an extended period of time
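A minimal sketch of such cloud-side logic, assuming the application already records a last-seen timestamp per device, is shown below. The function name, the sample device IDs, and the 24-hour threshold are hypothetical; how the resulting list is turned into an operator notification is left to the application.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List


def stale_devices(last_seen: Dict[str, datetime], now: datetime,
                  max_silence: timedelta) -> List[str]:
    """Return the IDs of devices whose last connection is older than max_silence."""
    return sorted(device_id for device_id, seen in last_seen.items()
                  if now - seen > max_silence)


# Hypothetical check run on a schedule.
now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
seen = {
    "pump-a": now - timedelta(hours=30),
    "pump-b": now - timedelta(hours=1),
}
overdue = stale_devices(seen, now, timedelta(hours=24))
```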

Recommendation IOTREL_5.4.2 – Use device edge software and the SDK to leverage built-in exponential backoff logic

Recommendation IOTREL_5.4.3 – Establish alternate network channels to meet requirements

  • Have a separate failover network channel to deliver critical messages to AWS IoT. Failover channels can include Wi-Fi, cellular networks, or a wireless personal area network.

IOTREL 06. How do you control the frequency of message delivery to the device?

Devices can be restricted in message processing capacity and messages from the cloud might need to be throttled. The cloud-side message delivery rate might need to be architected based on the type of devices that are connected.

Best practice IOTREL_6.1 – Target messages to relevant devices

Devices receive information from shadow updates, or from messages published to topics they subscribe to. Some data are relevant only to specific devices. In those cases, design your workload to send messages to relevant devices only, and to remove any data that is not relevant to those devices.

Recommendation IOTREL_6.1.1 – Preprocess data to support the specific needs of the device

  • Use AWS Lambda to preprocess the data and home in on the attributes and variables that the device needs to act upon.
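The preprocessing step can be as simple as projecting a message down to the attributes a device acts on. The sketch below shows that projection as a plain function that could be called from a Lambda handler; the function name and the sample attributes are hypothetical.

```python
from typing import Dict, Iterable


def project_for_device(message: Dict[str, object],
                       wanted: Iterable[str]) -> Dict[str, object]:
    """Return a copy of the message containing only the wanted attributes."""
    wanted_set = set(wanted)
    return {key: value for key, value in message.items() if key in wanted_set}


# Hypothetical message: the thermostat only needs the setpoint and mode.
full = {"setpoint": 21.5, "mode": "heat", "billing_tier": "gold", "history": [20, 21]}
slim = project_for_device(full, ["setpoint", "mode"])
```

Sending only the attributes a device needs reduces both the payload size on constrained links and the parsing work on the device.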

Best practice IOTREL_6.2 – Implement retry and backoff logic to support throttling by device type

Retry and backoff logic should be implemented in a controlled manner so that you can easily alter throttling settings per device type when you need to. Keeping pending messages in a data store of your choice gives you flexibility over what data is published to the device and when.

Recommendation IOTREL_6.2.1 – Use storage mechanisms that enable retry mechanisms

  • Using DynamoDB, you can hold data in key value format where device ID is the key. Retry logic can be applied to only certain device IDs.

  • Using Amazon Relational Database Service (Amazon RDS), you have the flexibility to use a variety of database engines. The retry messages can have new real-time data augmented with historic data from previous device interactions stored in Amazon RDS.

  • AWS IoT Events provides state machines with built-in timers to hold back data and retry based on timers.
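To show how throttling settings can be kept adjustable per device type, the sketch below looks up a retry policy by device type and computes the next delay from it. The device types, policy values, and names are hypothetical; in a real workload the policies could live in a data store such as those listed above so that they can be changed without redeploying code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RetryPolicy:
    max_attempts: int
    base_delay_s: float
    max_delay_s: float


# Hypothetical per-device-type settings.
POLICIES = {
    "sensor": RetryPolicy(max_attempts=3, base_delay_s=5.0, max_delay_s=60.0),
    "gateway": RetryPolicy(max_attempts=8, base_delay_s=1.0, max_delay_s=30.0),
}
DEFAULT_POLICY = RetryPolicy(max_attempts=5, base_delay_s=2.0, max_delay_s=60.0)


def next_retry_delay(device_type: str, attempt: int) -> Optional[float]:
    """Return the delay before retry number `attempt` (starting at 0),
    or None once the device type's attempts are exhausted."""
    policy = POLICIES.get(device_type, DEFAULT_POLICY)
    if attempt >= policy.max_attempts:
        return None
    return min(policy.max_delay_s, policy.base_delay_s * (2 ** attempt))
```

Changing the entry for one device type alters how aggressively messages are retried for that type without affecting the rest of the fleet.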