Failure management - IoT Lens

Failure management

IOTREL 08. How do you implement cloud-side mechanisms to control and modify the message frequency to the device?

Because IoT is an event-driven workload, your application code must handle known and unknown errors that can occur as events propagate through your application. A well-architected IoT application can log and retry errors in data processing. It also archives all data in its raw format. By archiving all data, valid and invalid, an architecture can more accurately restore data to a given point in time.

With the IoT rules engine, an application can enable an IoT error action. If a problem occurs when invoking an action, the rules engine invokes the error action. This allows you to capture, monitor, alert on, and eventually retry messages that could not be delivered to their primary IoT action. We recommend configuring the IoT error action with a different AWS service from the primary action. Use durable storage for error actions, such as Amazon SQS or Amazon Kinesis.
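The error-action recommendation above can be sketched as a rule payload. This is a minimal illustration, not a definitive configuration: the topic filter, stream name, queue URL, and role ARN are placeholder assumptions, and the primary action (Firehose) and error action (Amazon SQS) are deliberately different services.

```python
import json


def build_rule_payload(topic_filter, firehose_stream, error_queue_url, role_arn):
    """Build a topic rule payload whose error action uses a different
    service (Amazon SQS) than the primary action (Firehose)."""
    return {
        "sql": f"SELECT * FROM '{topic_filter}'",
        "awsIotSqlVersion": "2016-03-23",
        "ruleDisabled": False,
        "actions": [
            {
                "firehose": {
                    "roleArn": role_arn,
                    "deliveryStreamName": firehose_stream,
                    "separator": "\n",
                }
            }
        ],
        # Invoked by the rules engine only if the primary action fails.
        "errorAction": {
            "sqs": {
                "roleArn": role_arn,
                "queueUrl": error_queue_url,
                "useBase64": False,
            }
        },
    }


# All identifiers below are hypothetical placeholders.
payload = build_rule_payload(
    topic_filter="devices/+/telemetry",
    firehose_stream="telemetry-archive",
    error_queue_url="https://sqs.us-east-1.amazonaws.com/123456789012/iot-rule-errors",
    role_arn="arn:aws:iam::123456789012:role/iot-rule-role",
)
print(json.dumps(payload, indent=2))
```

With boto3 installed and credentials configured, a payload of this shape could be passed as `topicRulePayload` to `boto3.client("iot").create_topic_rule(...)`; the rule name and IAM permissions are left to your environment.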

Beginning with the rules engine, your application logic should initially process messages from a queue and validate that the schema of that message is correct. Your application logic should catch and log any known errors and optionally move those messages to their own dead letter queue (DLQ) for further analysis. Have a catch-all IoT rule that uses Amazon Data Firehose and AWS IoT Analytics channels to transfer all raw and unformatted messages into long-term storage in Amazon S3, AWS IoT Analytics data stores, and Amazon Redshift for data warehousing.
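The validate-then-dead-letter flow described above might look like the following sketch. The schema (`device_id`, `timestamp`, `temperature`) is a hypothetical example, and a plain Python list stands in for the dead letter queue; in a real deployment the rejected messages would be sent to a durable queue instead.

```python
import json
import logging

logger = logging.getLogger("ingest")

# Hypothetical schema: field name -> accepted Python type(s).
REQUIRED_FIELDS = {
    "device_id": str,
    "timestamp": (int, float),
    "temperature": (int, float),
}


def validate(raw):
    """Parse a raw message and check it against the expected schema."""
    message = json.loads(raw)  # JSONDecodeError is a subclass of ValueError
    if not isinstance(message, dict):
        raise ValueError("payload is not a JSON object")
    for field, expected in REQUIRED_FIELDS.items():
        if field not in message:
            raise ValueError(f"missing field: {field}")
        value = message[field]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            raise ValueError(f"wrong type for field: {field}")
    return message


def process_batch(raw_messages, handle, dead_letters):
    """Handle valid messages; log invalid ones and move them to the DLQ."""
    processed = 0
    for raw in raw_messages:
        try:
            handle(validate(raw))
            processed += 1
        except ValueError as err:
            logger.warning("rejected message: %s", err)
            dead_letters.append({"raw": raw, "error": str(err)})
    return processed
```

The raw payload is kept alongside the error text so that rejected messages can be analyzed or replayed later.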

IoT implementations must allow for multiple types of failure at the device level. Failures can be due to hardware, software, connectivity, or unexpected adverse conditions. One way to plan for thing failure is to deploy devices in pairs, if possible, or to deploy dual sensors across a fleet of devices deployed over the same coverage area (meshing).

Regardless of the underlying cause of a device failure, if the device can still communicate with your cloud application, it should send diagnostic information about the failure to AWS IoT Core using a diagnostics topic. If the device loses connectivity because of the failure, use Fleet Indexing with connectivity status to track the change in connectivity status. If the device is offline for an extended period of time, trigger an alert that the device may require remediation.
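As a rough sketch of the offline alert described above, the function below filters device records for those that have been disconnected longer than a threshold. It assumes each record carries `thingName` and a `connectivity` object with `connected` and a millisecond `timestamp`, which is the shape fleet indexing search results are expected to have; the threshold and sample data are invented for illustration.

```python
# Fleet indexing query string that would select disconnected devices.
OFFLINE_QUERY = "connectivity.connected:false"


def devices_needing_attention(things, now_ms, max_offline_ms):
    """Return names of things that have been offline for at least max_offline_ms."""
    stale = []
    for thing in things:
        connectivity = thing.get("connectivity", {})
        if connectivity.get("connected"):
            continue
        last_change = connectivity.get("timestamp")
        if last_change is not None and now_ms - last_change >= max_offline_ms:
            stale.append(thing["thingName"])
    return stale


# Hypothetical search results.
NOW_MS = 1_700_000_000_000
HOUR_MS = 60 * 60 * 1000
sample = [
    {"thingName": "sensor-a", "connectivity": {"connected": False, "timestamp": NOW_MS - 2 * HOUR_MS}},
    {"thingName": "sensor-b", "connectivity": {"connected": True, "timestamp": NOW_MS - 5 * HOUR_MS}},
    {"thingName": "sensor-c", "connectivity": {"connected": False, "timestamp": NOW_MS - 5 * 60 * 1000}},
]
print(devices_needing_attention(sample, NOW_MS, HOUR_MS))
```

The returned names could then feed whatever alerting mechanism you already use.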

Devices can have limited message processing capacity, so messages from the cloud might need to be throttled. Design the cloud-side message delivery rate around the types of devices that are connected, so that you control how often messages are delivered to each device.
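One common way to throttle cloud-to-device messages is a token bucket per device or device type. The sketch below is an assumption about how such throttling could be done, not a mechanism prescribed by AWS IoT; the rates and capacities are illustrative.

```python
import time


class TokenBucket:
    """Allow at most `rate_per_sec` messages per second on average,
    with bursts of up to `capacity` messages."""

    def __init__(self, rate_per_sec, capacity, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def try_send(self):
        """Return True and consume a token if a message may be sent now."""
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# Hypothetical limits per device type: (messages per second, burst size).
LIMITS = {"constrained-sensor": (0.2, 1), "gateway": (10.0, 20)}


def bucket_for(device_type):
    rate, capacity = LIMITS[device_type]
    return TokenBucket(rate, capacity)
```

When `try_send()` returns False, the caller can queue the message or drop it, depending on how important it is for the device to receive every command.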

IOTREL 09. How do you plan for disaster recovery (DR) in your IoT workloads?

When companies run their core production operations and cybersecurity functions in the cloud, it is important to design resilience at both the edge and the cloud in IoT systems. IoT implementations must allow for loss of internet connectivity and provide local data storage and processing.

Best practice IOTREL_9.1 – Design server software to initiate communication only with devices that are online

The server should initiate communication with devices that are online, rather than relying on client-initiated requests. To support this, design client software to accept commands from the server.

Recommendation IOTREL_9.1.1 – Design client software to accept commands from the server

  • FreeRTOS provides pub/sub and shadow library to connected devices.

  • AWS IoT Core provides device shadow capability to persist device states.

  • AWS IoT Device Registry contains a list of devices connected to AWS IoT Core. AWS IoT Device Registry lets you manage devices by grouping them.

Best practice IOTREL_9.2 – Implement multi-region support for IoT applications and devices

Cloud service providers offer the same service in multiple regions. This architecture enables you to divert device data to a regional endpoint that is not down. Data consumers should be enabled in every region that receives the diverted device data.

Recommendation IOTREL_9.2.1 – Architect device software to reach multiple regions in case one is not available

  • AWS IoT is available in multiple Regions with different endpoints. If an endpoint is not available, divert device traffic to a different endpoint.

  • AWS IoT configurable endpoints can be used with Amazon Route 53 to divert IoT traffic to a new regional endpoint.

  • AWS IoT Configurable Endpoints
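Device-side failover across regional endpoints can be sketched as below. The `connect` callable is a stand-in for whatever MQTT client your device uses, and the endpoint hostnames are placeholders; this is an illustration under those assumptions rather than a complete reconnection strategy.

```python
def connect_with_failover(endpoints, connect):
    """Try each regional endpoint in order and return the first that connects.

    `connect` is any callable that takes an endpoint and either returns a
    session object or raises OSError (for example, a timeout or refusal).
    """
    errors = {}
    for endpoint in endpoints:
        try:
            return endpoint, connect(endpoint)
        except OSError as err:
            errors[endpoint] = err
    raise ConnectionError(f"no endpoint reachable: {errors}")


# Placeholder endpoints, ordered by preference.
ENDPOINTS = [
    "example-ats.iot.us-east-1.amazonaws.com",
    "example-ats.iot.us-west-2.amazonaws.com",
]
```

A production client would typically add backoff between attempts and periodically try to return to the preferred Region.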

Recommendation IOTREL_9.2.2 – Enable device authentication certificates in multiple regions

  • AWS IoT provides devices with authentication certificates to verify on connection. Deploy the device certificates in the Regions where the device will connect.

  • Set up the cloud-side IoT data consumers to accept and process data in multiple regions.

  • AWS IoT device registration

Recommendation IOTREL_9.2.3 – Use device services in all the regions the device connects to

  • AWS IoT Rules Engine can route device data to multiple services. Set up AWS IoT Rules Engine in the respective Regions to divert traffic to the appropriate services.

  • Rules for AWS IoT

Best practice IOTREL_9.3 – Use edge devices to store and analyze data

Edge storage can provide additional storage for device data. Data can be stored at the edge during large-scale network events and streamed later, when the network is available.

Recommendation IOTREL_9.3.1 – Use an edge device as a connection point to store and analyze data

  • AWS IoT Greengrass can be used for local processing for serverless functions, containers, messaging, storage, and machine learning inference.

  • Data can be stored in AWS IoT Greengrass and sent to the network when it’s available.

  • AWS IoT Greengrass Features
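The store-and-forward behavior described above can be illustrated with a small in-memory buffer. This is a generic sketch, not the AWS IoT Greengrass API: the `publish` callable and the buffer size are assumptions, and a real edge device would usually persist the buffer to disk.

```python
from collections import deque


class StoreAndForwardBuffer:
    """Hold messages while the network is down and send them later, oldest first."""

    def __init__(self, max_messages):
        # When the buffer is full, the oldest message is discarded.
        self.pending = deque(maxlen=max_messages)

    def record(self, message):
        self.pending.append(message)

    def flush(self, publish):
        """Send pending messages in order; stop at the first failed publish.

        `publish` returns True on success and False if the network is
        still unavailable. Returns the number of messages sent.
        """
        sent = 0
        while self.pending:
            if not publish(self.pending[0]):
                break
            self.pending.popleft()
            sent += 1
        return sent
```

A message is removed from the buffer only after `publish` reports success, so a failed flush leaves the remaining data intact for the next attempt.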

IOTREL 10. How do you provision reliable storage for IoT data that has been sent to the cloud?

IoT devices send a lot of small messages with no guarantee of delivery order. This data might not be immediately useful, but the data volume is typically low enough to economically store against a future need. Storing the data is beneficial because it can then be processed in order. Stored data can also be reprocessed as new requirements are developed.
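Because messages can arrive out of order, stored data is typically reordered before processing. The sketch below groups stored messages by device and sorts them by a device-supplied timestamp; the field names (`device_id`, `timestamp`, `sequence`) are hypothetical.

```python
from collections import defaultdict


def order_by_device(messages):
    """Group stored messages per device and sort each group by timestamp,
    using an optional sequence number to break ties."""
    per_device = defaultdict(list)
    for message in messages:
        per_device[message["device_id"]].append(message)
    for device_messages in per_device.values():
        device_messages.sort(key=lambda m: (m["timestamp"], m.get("sequence", 0)))
    return dict(per_device)


# Messages as they might arrive: out of order.
arrived = [
    {"device_id": "d1", "timestamp": 30, "value": 3},
    {"device_id": "d2", "timestamp": 10, "value": 7},
    {"device_id": "d1", "timestamp": 10, "value": 1},
    {"device_id": "d1", "timestamp": 20, "value": 2},
]
ordered = order_by_device(arrived)
print([m["value"] for m in ordered["d1"]])
```

This only works if each message carries a trustworthy device-side timestamp or sequence number.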

Best practice IOTREL_10.1 – Store data before processing

Ensure that the data from the devices is stored before processing. As new requirements and capabilities are added, stored data can be analyzed to meet the new requirements.

Recommendation IOTREL_10.1.1 – Use IoT Core Rules Engine to send data to Firehose to batch and store data on Amazon Simple Storage Service (Amazon S3)

  • IoT Rules Engine can send data to Firehose to batch and store data on Amazon Simple Storage Service (Amazon S3). S3 Intelligent-Tiering can be enabled on Amazon S3 to reduce storage costs.

  • Understand the latency to access data and choose the Region to store the data in based on device location.

  • If data will be processed in Amazon EC2 instances, consider using the highly available and low-latency Amazon Elastic Block Store (Amazon EBS).

  • NoSQL data can be stored in Amazon DynamoDB, which is a key-value and document database that delivers single-digit millisecond performance at any scale.

Best practice IOTREL_10.2 – Have mechanisms in place to compensate when the primary storage location is unavailable

There should be recovery plans for failures in storing and accessing device data in the cloud. Understand the recovery point objective (RPO) and recovery time objective (RTO) needed by your application to access data to be used for analysis.

Recommendation IOTREL_10.2.1 – Know how to monitor and take action on cloud storage failures for IoT data

  • AWS Health Dashboard provides notification and remediation guidance when AWS is experiencing events that might impact you. Storage and access of data can be modified based on the notification.

  • Use Amazon CloudWatch Logs to trigger on data write and read events and take the appropriate error handling action.

  • Use AWS IoT rules engine error actions to route data to other storage locations if primary storage is unavailable.

IOTREL 11. How do you ensure that your device accurately determines UTC?

A secure device should have a valid certificate. IoT devices use a server certificate to communicate with the cloud, and the validity of the presented certificate depends on the current time. Reliable and accurate time is therefore required to validate certificates. Because IoT data is not ordered, including an accurate timestamp with the data also enhances your analytic capabilities.

Best practice IOTREL_11.1 – Use NTP to maintain time synchronization on devices

IoT devices need a way to keep track of time: either a real-time clock (RTC) alone, or a Network Time Protocol (NTP) client that sets the RTC on boot. Failure to provide accurate time to an IoT device could prevent it from connecting to the cloud.

Recommendation IOTREL_11.1.1 – Prefer NTP to RTC when NTP synchronization is available

Many computers have an RTC peripheral that helps in keeping time. Consider that RTC is prone to clock drift of about one second a day, which can result in the device going offline because of certificate invalidity.
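To make the drift concern concrete, the sketch below models a device clock that runs slow by a constant amount per day and checks it against a certificate validity window. The constant one-second-per-day drift comes from the text above; the dates and the freshly issued certificate are invented for illustration.

```python
from datetime import datetime, timedelta, timezone


def drifted_clock(true_time, days_since_sync, drift_seconds_per_day=1.0):
    """Reading of a device clock that runs slow by a constant daily drift."""
    return true_time - timedelta(seconds=days_since_sync * drift_seconds_per_day)


def within_validity(clock_time, not_before, not_after):
    """True if the device's clock falls inside the certificate validity window."""
    return not_before <= clock_time <= not_after


# Hypothetical certificate that became valid 30 seconds before "now".
not_before = datetime(2024, 1, 1, 0, 0, 0, tzinfo=timezone.utc)
not_after = not_before + timedelta(days=365)
true_now = not_before + timedelta(seconds=30)

synced = drifted_clock(true_now, days_since_sync=0)
unsynced = drifted_clock(true_now, days_since_sync=60)  # 60 seconds slow
print(within_validity(synced, not_before, not_after))
print(within_validity(unsynced, not_before, not_after))
```

In this example, the device that has not synchronized for 60 days reads a time before the certificate's start of validity, so it would reject a certificate that is in fact valid.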

Recommendation IOTREL_11.1.2 – Use Network Time Protocol for connected applications

  • Select a safe, reliable NTP pool to use, and one that addresses your security design.

  • Many operating systems include an NTP client to sync with an NTP server.

  • If the IoT device is using GNU/Linux, it’s likely to include the ntpd daemon.

  • You can import an NTP client to your platform if using FreeRTOS.

  • The device’s software needs to include an NTP client and should wait until it has synchronized with an NTP server before attempting a connection with AWS IoT Core.

  • The system should provide a way for a user to set the device’s time so that subsequent connections can succeed.

  • Use NTP to synchronize the RTC on the device to prevent the device from deviating from UTC.

  • Chrony is a different implementation of NTP than what ntpd uses and it’s able to synchronize the system clock faster and with better accuracy than ntpd. Chrony can be set up as a client and server.
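The "synchronize before connecting" step in the list above can be sketched with a minimal SNTP exchange using only the standard library. This is a simplified illustration, not a replacement for ntpd or Chrony: it reads only the server's transmit timestamp, ignores network delay, and the host you pass in and the one-second tolerance are assumptions.

```python
import socket
import struct

# Seconds between the NTP epoch (1900-01-01) and the Unix epoch (1970-01-01).
NTP_UNIX_EPOCH_DELTA = 2_208_988_800


def build_sntp_request():
    """48-byte client request: LI=0, version=3, mode=3 (client) -> first byte 0x1B."""
    return b"\x1b" + 47 * b"\x00"


def parse_transmit_time(response):
    """Extract the server transmit timestamp (bytes 40-47) as Unix seconds."""
    if len(response) < 48:
        raise ValueError("short NTP packet")
    seconds, fraction = struct.unpack("!II", response[40:48])
    return seconds - NTP_UNIX_EPOCH_DELTA + fraction / 2**32


def query_ntp_time(host, port=123, timeout=2.0):
    """Ask an NTP server for the current time over UDP."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_sntp_request(), (host, port))
        response, _ = sock.recvfrom(512)
    return parse_transmit_time(response)


def clock_is_synchronized(local_unix_time, server_unix_time, tolerance_seconds=1.0):
    """True if the local clock is close enough to the server to attempt a TLS connection."""
    return abs(local_unix_time - server_unix_time) <= tolerance_seconds
```

A device could call `query_ntp_time` against its chosen NTP server at boot, compare the result with `time.time()`, and delay the connection to AWS IoT Core until `clock_is_synchronized` returns True.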

Best practice IOTREL_11.2 – Provide devices access to NTP servers

An NTP server should be available for clients to use for local time. NTP clients require NTP servers to synchronize device time and function properly.

Recommendation IOTREL_11.2.2 – Provide access to NTP services

  • ntp.org – can be used to synchronize your computer clocks.

  • Amazon Time Sync Service – a time synchronization service delivered over NTP, which uses a fleet of redundant satellite-connected and atomic clocks in each Region to deliver a highly accurate reference clock. It is natively accessible from Amazon EC2 instances and can be pushed to edge devices.

  • Chrony is a different implementation of NTP than what ntpd uses and it’s able to synchronize the system clock faster and with better accuracy than ntpd. Chrony can be set up as a server and client.