Evolve

IOTOPS 3. How do you evolve your IoT application with minimum impact to downstream IoT devices?

IoT solutions frequently involve a combination of low-power devices, remote locations, low bandwidth, and intermittent network connectivity. Each of those factors poses communications challenges, including upgrading firmware and edge applications. Therefore, it's important for you to incorporate and implement an IoT update process that minimizes the impact to downstream devices and operations. In addition to reducing downstream impact, devices must be resilient to common challenges that exist in local environments, such as intermittent network connectivity and power loss. Use a combination of grouping IoT devices for deployment and staggering firmware upgrades over a period of time. Monitor the behavior of devices as they are updated in the field, and proceed only after a percentage of devices have upgraded successfully.
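
The staggered approach described above can be sketched in a few lines. This is a minimal illustration rather than an AWS API: the names `WaveResult`, `plan_waves`, and `should_proceed`, and the 95% default threshold, are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class WaveResult:
    """Outcome counts for one deployment wave (a group of devices)."""
    succeeded: int
    failed: int
    pending: int


def plan_waves(device_ids: list[str], wave_sizes: list[int]) -> list[list[str]]:
    """Split the fleet into staggered waves; the last wave takes the remainder."""
    waves, start = [], 0
    for size in wave_sizes:
        waves.append(device_ids[start:start + size])
        start += size
    if start < len(device_ids):
        waves.append(device_ids[start:])
    return [wave for wave in waves if wave]


def should_proceed(result: WaveResult, min_success_ratio: float = 0.95) -> bool:
    """Return True only when the wave has finished and enough devices succeeded."""
    finished = result.succeeded + result.failed
    if result.pending > 0 or finished == 0:
        return False  # wave still in progress, or nothing was attempted
    return result.succeeded / finished >= min_success_ratio
```

A deployment loop would call `should_proceed` after each wave and stop the rollout as soon as it returns `False`, so that a faulty firmware image reaches only a small part of the fleet.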

Use AWS IoT Device Management for creating deployment groups of devices and delivering over-the-air (OTA) updates to specific device groups. During upgrades, continue to collect all of the CloudWatch Logs, telemetry, and IoT device job messages and combine that information with the KPIs used to measure overall application health and the performance of any long-running canaries.

Before and after firmware updates, perform a retrospective analysis of operations metrics with participants spanning the business to determine opportunities and methods for improvement. Services such as AWS IoT Analytics and AWS IoT Device Defender are used to track anomalies in overall device behavior, and to measure deviations in performance that may indicate an issue in the updated firmware.

IOTOPS 4. How do you ensure that you are ready to support the operations of devices in your IoT workload?

Operating IoT workloads at scale is different from testing and running prototypes. You need to ensure that your team is prepared and trained to operate a widely distributed IoT data collection application. IoT workloads require your teams to learn new skills and competencies to deliver edge-to-cloud outcomes. Your team needs to be able to pinpoint key operational thresholds that indicate a high level of readiness.

Best practice IOTOPS_4.1 – Train team members supporting your IoT workloads on the lifecycle of IoT applications and your business objectives

Key team members responsible for IoT workloads should be trained on major IoT lifecycle events: onboarding, command and control, security, data ingestion, integration, and analytics services. Team members should be able to identify key operational metrics and know how to apply incident response measures. Training team members on the basics of IoT lifecycles, and on how these align with business objectives, provides actionable context for recognizing failure scenarios, applying mitigation strategies, and defining lasting processes that lead to fewer operational events and less severe impact when events occur.

Recommendation IOTOPS_4.1.1 – Build IoT operational expertise by having team members and architects complete reviews of common IoT architectural patterns, best practices, and educational courses

Introduce new team members to IoT lifecycles with onboarding checklists that include at least one educational course.
Introduce new team members with onboarding checklists that include a step to review, validate, and submit updates to your IoT application architecture documentation and operational monitoring plan.

Recommendation IOTOPS_4.1.2 – Author runbooks for each component of the architecture and train team members on their use

Include guidance for a response procedure for remote devices that are no longer online.
Apply recovery commands for troubleshooting remote devices that are faulty but still online.

IOTOPS 5. How do you assess whether your IoT application meets your operational goals?

Evaluating your operational goals enables you to fine-tune and identify improvements throughout the lifecycle of your IoT application. Measuring and extracting operational and business value from your IoT application allows you to effectively drive high-value initiatives.

Best practice IOTOPS_5.1 – Enable appropriate responses to events

Key operational data elements are those data points that convey some notion of operational health of your application by classifying events. Detecting operational events early can uncover unforeseen risks in your application and give your operations team a head start to prevent or reduce significant business interruption. By defining a minimum set of logs, metrics, and alarms, your operations team can provide a first line of defense against significant business interruption.

Recommendation IOTOPS_5.1.1 – Configure logging to capture and store at least error-level events

Use AWS IoT service logging options to capture error events in CloudWatch Logs.

Recommendation IOTOPS_5.1.2 – Create a dashboard for your responders to use in investigations of operational events to rapidly pinpoint the period of time when errors are logged

Group clusters of error events into buckets of time to quickly identify when surges of errors were captured.
Drill down into clusters of errors to identify patterns that should trigger a triage response.
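
The time-bucketing idea can be sketched as follows. This is an illustrative helper only; the function names, the 5-minute bucket width, and the surge threshold are assumptions, and a real dashboard would normally read the timestamps from your log store.

```python
from collections import Counter
from datetime import datetime, timezone


def bucket_errors(timestamps: list[datetime], bucket_seconds: int = 300) -> Counter:
    """Count events per fixed-width time bucket, keyed by bucket start (UTC)."""
    counts = Counter()
    for ts in timestamps:
        epoch = int(ts.timestamp())
        start = epoch - (epoch % bucket_seconds)  # round down to the bucket start
        counts[datetime.fromtimestamp(start, tz=timezone.utc)] += 1
    return counts


def surges(counts: Counter, threshold: int) -> list[datetime]:
    """Return bucket start times whose error count meets the threshold, in time order."""
    return sorted(start for start, n in counts.items() if n >= threshold)
```

Responders can then drill into only the buckets returned by `surges` instead of scanning the full log stream.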

Recommendation IOTOPS_5.1.3 – Review the default metrics emitted by your IoT services and configure alarms for metrics that might indicate business interruption

For example:
- Your business deploys a thousand sensors across manufacturing plants and your operations team wants to be alerted if sensors can no longer connect to the cloud and send telemetry.
- Your IT team administering the AWS account reviews the AWS IoT Core metrics and notes the following metrics to monitor: Connect.AuthError, Connect.ClientError, Connect.ClientIDThrottle, Connect.ServerError, Connect.Throttle. Activity in any of these metrics warrants alerting and investigation.
- Your IT team uses CloudWatch to configure an alarm on each of these metrics that fires when the Sum statistic of the metric is greater than zero in any evaluation period.
- Your IT team configures an Amazon SNS topic to notify their paging tool when any of the new CloudWatch alarms changes status.
For more information:
- Monitor AWS IoT alarms and metrics using Amazon CloudWatch
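
The alarm configuration in the example above might look like the following sketch. It only builds the request parameters and does not call AWS. The alarm naming scheme, the 5-minute period, the `Protocol` dimension value, and the choice to treat missing data as not breaching are assumptions for illustration.

```python
CONNECT_METRICS = [
    "Connect.AuthError",
    "Connect.ClientError",
    "Connect.ClientIDThrottle",
    "Connect.ServerError",
    "Connect.Throttle",
]


def build_alarm_requests(sns_topic_arn: str, period_seconds: int = 300) -> list[dict]:
    """Build one CloudWatch alarm definition per connect-failure metric."""
    requests = []
    for metric in CONNECT_METRICS:
        requests.append({
            "AlarmName": f"iot-{metric.replace('.', '-').lower()}",  # assumed naming scheme
            "Namespace": "AWS/IoT",
            "MetricName": metric,
            "Dimensions": [{"Name": "Protocol", "Value": "MQTT"}],
            "Statistic": "Sum",
            "Period": period_seconds,
            "EvaluationPeriods": 1,
            "Threshold": 0,
            "ComparisonOperator": "GreaterThanThreshold",  # Sum > 0
            "TreatMissingData": "notBreaching",  # no data points means no errors
            "AlarmActions": [sns_topic_arn],  # notify the paging tool through SNS
        })
    return requests
```

Each dictionary is shaped to be passed as keyword arguments to the CloudWatch `put_metric_alarm` call in the AWS SDK for Python, for example `cloudwatch.put_metric_alarm(**request)`.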

Recommendation IOTOPS_5.1.4 – Configure an automated monitoring and alerting tool to detect common symptoms and warnings of operational impact

For example:
- Configure AWS IoT Device Defender to run a daily audit on at least the high and critical checks.
- Configure an Amazon SNS topic to notify a team email list, paging tool, or operations channel when AWS IoT Device Defender reports non-compliant resources in an audit.
For more information:
- AWS IoT Device Defender Audit
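
As a rough sketch of the daily audit above, the following helper selects checks by severity and builds the parameters for a scheduled audit. The severity labels in `ASSUMED_CHECK_SEVERITY` are assumptions for this example and cover only a few checks; confirm the current list and severities in the AWS IoT Device Defender documentation before relying on them.

```python
# Assumed severity labels for a subset of audit checks; verify before use.
ASSUMED_CHECK_SEVERITY = {
    "DEVICE_CERTIFICATE_SHARED_CHECK": "CRITICAL",
    "IOT_POLICY_OVERLY_PERMISSIVE_CHECK": "CRITICAL",
    "REVOKED_CA_CERTIFICATE_STILL_ACTIVE_CHECK": "CRITICAL",
    "CONFLICTING_CLIENT_IDS_CHECK": "HIGH",
    "LOGGING_DISABLED_CHECK": "LOW",
}


def build_daily_audit(name: str, severities: tuple = ("HIGH", "CRITICAL")) -> dict:
    """Build scheduled-audit parameters covering checks at the given severities."""
    checks = sorted(
        check for check, severity in ASSUMED_CHECK_SEVERITY.items()
        if severity in severities
    )
    return {
        "scheduledAuditName": name,
        "frequency": "DAILY",
        "targetCheckNames": checks,
    }
```

The resulting dictionary matches the keyword arguments of the AWS IoT `create_scheduled_audit` call in the AWS SDK for Python. The checks must also be enabled in the account's audit configuration before they can be scheduled.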

IOTOPS 6. How do you govern the device fleet provisioning process?

IoT solutions can scale to millions of devices and this requires device fleets to be well planned from the perspectives of provisioning processes and metadata organization. Defining how devices are provisioned must include how the devices are manufactured and how they are registered. Maintain a full chain of security controls over who or what processes can start device provisioning to decrease the likelihood of inviting unintended, or rogue, devices into your fleet.

Best practice IOTOPS_6.1 – Document how devices join your fleet from manufacturing to provisioning

Document the whole device provisioning process to clearly define the responsibilities of different actors at different stages. The end-to-end device provisioning process involves multiple stages owned by different actors. Documenting the plan and processes by which devices onboard and join the fleet affords the appropriate amount of review for potential gaps.

Recommendation IOTOPS_6.1.1 – Document each step (manual and programmatic) of all the stages for the corresponding actors of that stage and clearly define the sequence

Identify the steps at each stage and the corresponding actors.
- Device assembly by hardware manufacturer.
- Device registration by service and solution provider.
- Device activation by the end user of the service or solution provider.
Clearly define and document the dependencies and specific steps for each actor from device manufacturer to the end user.
Document whether devices can self-provision or are user-provisioned and how you can ensure that newly provisioned devices are yours.

Recommendation IOTOPS_6.1.2 – Assign device metadata to enable easy grouping and classification of devices in a fleet

The metadata can be used to organize devices into groups that you can search and to which you can apply common actions and behaviors.
For example, you can assign the following metadata at the time of manufacturing:
- Unique ID
- Manufacturer details
- Model number
- Version or generation
- Manufacturing date
If a particular model of a device requires a security patch, then you can easily target the patch to all the devices that are part of the corresponding model number group. Similarly, you can apply the patches to devices manufactured in a specific time frame or belonging to a particular version or generation.
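
The targeting described above can be illustrated with plain data structures. The record layout (`id`, `model`, `manufactured`) and the model names are hypothetical; in practice this metadata would live in your device registry.

```python
from collections import defaultdict
from datetime import date


def group_by(devices: list[dict], key: str) -> dict:
    """Map each distinct metadata value to the IDs of the devices that carry it."""
    groups = defaultdict(list)
    for device in devices:
        groups[device[key]].append(device["id"])
    return dict(groups)


def manufactured_between(devices: list[dict], start: date, end: date) -> list[str]:
    """Return IDs of devices manufactured in the inclusive date range."""
    return [d["id"] for d in devices if start <= d["manufactured"] <= end]


# Hypothetical fleet records assigned at manufacturing time.
fleet = [
    {"id": "a1", "model": "TH-100", "manufactured": date(2023, 3, 1)},
    {"id": "a2", "model": "TH-100", "manufactured": date(2023, 9, 15)},
    {"id": "b1", "model": "TH-200", "manufactured": date(2023, 9, 20)},
]
patch_targets = group_by(fleet, "model")["TH-100"]  # devices that need the model patch
```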

Best practice IOTOPS_6.2 – Use programmatic techniques to provision devices at scale

Scaling the onboarding and provisioning of a large device fleet can be a bottleneck if there is even one manual step per device. Programmatic techniques define patterns of behavior for automating the provisioning process such that authenticated and authorized devices can onboard at any time. This practice ensures a well-documented, reliable, and programmatic provisioning mechanism that is consistent across all devices and free of human error.

Recommendation IOTOPS_6.2.1 – Embed provisioning claims into the devices that are mapped to approval authorities recognized by the provisioning service

Generate a provisioning claim and embed it into the device at the time of manufacturing.
AWS IoT Core can generate and securely deliver certificates and private keys to your devices when they connect to AWS IoT for the first time, using AWS IoT fleet provisioning.

Recommendation IOTOPS_6.2.2 – Use programmatic bootstrapping mechanisms if you are bringing your own certificates

Determine if you will or won’t have device information beforehand.
If you don’t have device information beforehand, use just-in-time provisioning (JITP).
- Enable automatic registration and associate a provisioning template with the CA certificate used to sign the device certificate.
- For example, when a device attempts to connect to AWS IoT by using a certificate signed by a registered CA certificate, AWS IoT loads the template associated with that CA certificate and initiates the JITP workflow.
If you have device information beforehand, use bulk registration.
- Specify a list of single-thing provisioning template values that are stored in a file in an S3 bucket.
- Run the start-thing-registration-task command to register things in bulk. Provide provisioning template, S3 bucket name, a key name, and a role ARN to the command.
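
The bulk registration inputs can be prepared along these lines. This is a sketch under stated assumptions: the parameter names `ThingName` and `SerialNumber` are hypothetical and must match whatever parameters your own provisioning template declares. The code only builds the file contents and the request parameters; uploading to S3 and starting the task are left to the caller.

```python
import json

# Hypothetical template parameters; use the names declared in your template.
REQUIRED_PARAMETERS = ("ThingName", "SerialNumber")


def build_input_file(devices: list[dict]) -> str:
    """Return newline-delimited JSON, one object of template parameter values per device."""
    lines = []
    for device in devices:
        missing = [p for p in REQUIRED_PARAMETERS if p not in device]
        if missing:
            raise ValueError(f"device entry is missing parameters: {missing}")
        lines.append(json.dumps(device, sort_keys=True))
    return "\n".join(lines)


def build_task_request(template_body: str, bucket: str, key: str, role_arn: str) -> dict:
    """Assemble the arguments for a bulk registration task."""
    return {
        "templateBody": template_body,
        "inputFileBucket": bucket,
        "inputFileKey": key,
        "roleArn": role_arn,
    }
```

The dictionary returned by `build_task_request` corresponds to the template, bucket name, key name, and role ARN that the start-thing-registration-task command expects.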

Best practice IOTOPS_6.3 – Use device level features to enable re-provisioning

A birth or bootstrap certificate is a low-privilege unique certificate that is associated with each device during the manufacturing process. The certificate should have a policy that restricts the device to connecting only to the specific endpoints needed to initiate the provisioning process and fetch the final certificate. Before a device is provisioned, it should be limited in functionality to prevent its misuse. Only after a provisioning process is invoked and approved should the device be allowed to operate on the system as designed.

Recommendation IOTOPS_6.3.1 – Use a certificate bootstrapping process to establish processes for device assembly, registration, and activation

For example, AWS IoT Core offers a fleet provisioning interface to devices for upgrading a birth certificate to long-lived credentials that enable normal runtime operations.
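
A restrictive policy for a birth certificate might be generated as in the sketch below. It assumes the device provisions through AWS IoT fleet provisioning and therefore needs only the reserved certificate-creation and template-provisioning topics. The function name and its arguments are illustrative, and the policy should be reviewed against your own security requirements.

```python
def build_claim_policy(region: str, account_id: str, template_name: str) -> dict:
    """Build an IoT policy that permits only the fleet provisioning exchange."""
    prefix = f"arn:aws:iot:{region}:{account_id}"
    topics = [
        "$aws/certificates/create/*",
        f"$aws/provisioning-templates/{template_name}/provision/*",
    ]
    return {
        "Version": "2012-10-17",
        "Statement": [
            # The device may connect, but only to run the provisioning exchange.
            {"Effect": "Allow", "Action": ["iot:Connect"], "Resource": "*"},
            {
                "Effect": "Allow",
                "Action": ["iot:Publish", "iot:Receive"],
                "Resource": [f"{prefix}:topic/{t}" for t in topics],
            },
            {
                "Effect": "Allow",
                "Action": ["iot:Subscribe"],
                "Resource": [f"{prefix}:topicfilter/{t}" for t in topics],
            },
        ],
    }
```

Because the policy names no application topics, a device holding only the birth certificate cannot publish telemetry or receive commands until it has been issued its long-lived credentials.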

Recommendation IOTOPS_6.3.2 – Obtain a list of allowed devices from the device manufacturer

Check the allow list file to validate that the device has been fully vetted by the supplier.
Ensure that the list is encrypted, securely stored, and can only be accessed by necessary services and users. Even if the list changes, keep the original list securely stored.
Ensure that this list is securely transferred from the manufacturer to you, is encrypted, and is not publicly accessible.
Ensure that any bootstrap certificate used is signed by a certificate authority (CA) you own or trust.

Best practice IOTOPS_6.4 – Use data-driven auditing metrics to detect if any of your IoT devices might have been broadly accessed

Monitor and detect the abnormal usage patterns and possible misuse of devices and automate the quarantine steps. Programmatic methods to detect and quarantine devices from interacting with cloud resources enable teams to operate a fleet in a scalable way while minimizing a dependency on active human monitoring.

Recommendation IOTOPS_6.4.1 – Use monitoring and logging services to detect anomalous behavior

Once you detect a compromised device, run programmatic actions to quarantine it.

Disable the certificate for further investigation and revoke the certificate to prevent the device from any future use.
Use AWS IoT CloudWatch metrics and logs to monitor for indications of misuse. If you detect misuse, quarantine the device so it does not impact the rest of the platform.
Use AWS IoT Device Defender to identify security issues and deviations from best practices.
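
The quarantine steps could be automated roughly as follows. The functions accept any client object that exposes `update_certificate` and `add_thing_to_thing_group` with the keyword arguments shown, which is how the AWS IoT client in the AWS SDK for Python names them. The `RecordingClient` class and the `Quarantine` group name are assumptions added so that the flow can be dry-run without contacting AWS.

```python
class RecordingClient:
    """Dry-run stand-in that records calls instead of contacting AWS."""

    def __init__(self):
        self.calls = []

    def update_certificate(self, **kwargs):
        self.calls.append(("update_certificate", kwargs))

    def add_thing_to_thing_group(self, **kwargs):
        self.calls.append(("add_thing_to_thing_group", kwargs))


def quarantine_device(iot_client, thing_name: str, certificate_id: str,
                      quarantine_group: str = "Quarantine") -> None:
    """Deactivate the device certificate and move the thing into a quarantine group."""
    # INACTIVE blocks new connections while keeping the certificate for investigation.
    iot_client.update_certificate(certificateId=certificate_id, newStatus="INACTIVE")
    iot_client.add_thing_to_thing_group(
        thingGroupName=quarantine_group, thingName=thing_name
    )


def revoke_after_investigation(iot_client, certificate_id: str) -> None:
    """Permanently revoke the certificate once misuse is confirmed."""
    iot_client.update_certificate(certificateId=certificate_id, newStatus="REVOKED")
```

Separating the two functions mirrors the two-step guidance: deactivate first so that the device can be investigated, and revoke only when the device must never be used again.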

IOTOPS 7. Do you organize the fleet to quickly identify devices?

The ability to quickly identify and interact with specific devices gives you the agility to troubleshoot and potentially isolate devices in case you encounter operational challenges. When operating large-scale device fleets, you need to deploy ways to organize, index, and categorize them. This is useful when targeting new device software with updates and when you need to identify why some devices in your fleet behave differently than others.

Best practice IOTOPS_7.1 – Use static and dynamic device hierarchies to support fleet operations

Using a software registry, devices can be categorized into static groups based on their fixed attributes (such as version or manufacturer) and into dynamic groups based on their changing attributes (such as battery percentage or firmware version). Operationalizing devices in groups can help you manage, control, and search for devices more efficiently.

Recommendation IOTOPS_7.1.1 – Manage several devices at once by categorizing them into static groups and hierarchy of groups

Build a hierarchy of static groups for efficient categorization and indexing of your devices.
Use provisioning templates to assign devices to static groups as they are provisioned for the first time.
For example, categorize all sensors of a car under a car group and all the cars under a vehicle group. Child groups inherit policies and permissions attached to their respective parent groups.

Recommendation IOTOPS_7.1.2 – Build a device index to efficiently search for devices, and aggregate registry data, runtime data, and device connectivity data

Use a fleet indexing service to index device and group data.
Use a device index to search registry metadata, stateful metadata, and device connectivity status metadata.
Use a group index to search for groups based on group name, description, attributes, and all parent group names.
For example, if you want to send over-the-air (OTA) updates only to devices that are sufficiently charged, then define a dynamic group for devices with more than 90% battery. Devices will dynamically be added to or removed from the group as their battery percentage crosses the threshold. Send OTA updates to all things in this dynamic group.
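
The battery example above might translate into a dynamic group definition like the following sketch. The shadow field name `batteryLevel` and the group name are hypothetical, and the query string follows the fleet indexing comparison syntax as an assumption that should be checked against the current AWS IoT fleet indexing documentation. The `is_member` helper mirrors the same rule locally so that the threshold logic can be tested without AWS.

```python
def battery_group_request(group_name: str, min_battery: int) -> dict:
    """Build parameters for a dynamic thing group of sufficiently charged devices."""
    return {
        "thingGroupName": group_name,
        # Assumed field name and query syntax; verify before use.
        "queryString": f"shadow.reported.batteryLevel>{min_battery}",
    }


def is_member(shadow_reported: dict, min_battery: int) -> bool:
    """Local mirror of the group rule: strictly more than min_battery percent."""
    return shadow_reported.get("batteryLevel", 0) > min_battery
```

The dictionary uses the `thingGroupName` and `queryString` keyword arguments of the AWS IoT `create_dynamic_thing_group` call in the AWS SDK for Python.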

Best practice IOTOPS_7.2 – Use index and search services to enable rapid identification of target devices

A large IoT deployment can have millions of sensors sending data to the cloud. A separate indexing and search service can make it easy to index and organize the device data, and search for any device by any attribute. Ingesting device data to a search service, for example, Amazon OpenSearch Service (OpenSearch Service), makes it easy to use powerful search, visualization, and analytics capabilities to organize and search for devices. You can ingest your device data and the state to OpenSearch Service seamlessly.

Recommendation IOTOPS_7.2.1 – Use an indexed data store to get, update, or delete device state

Use messaging topics to enable applications and things to get, update, or delete the state information for a Thing (Thing Shadow).
Ingest the shadow data to Firehose through the AWS IoT Core rules engine.
Ingest the data from Firehose to OpenSearch Service through built-in destination options.
Configure search and visualizations on the data directly or through the OpenSearch Dashboards console.
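
The messaging topics mentioned in the first step follow a fixed naming pattern for the classic (unnamed) device shadow. The helpers below are a small sketch that builds those topic names and an update payload; the function names and the `battery` field are illustrative.

```python
import json

SHADOW_OPERATIONS = ("get", "update", "delete")


def shadow_topic(thing_name: str, operation: str) -> str:
    """Return the classic shadow topic for a thing and operation."""
    if operation not in SHADOW_OPERATIONS:
        raise ValueError(f"unsupported shadow operation: {operation}")
    return f"$aws/things/{thing_name}/shadow/{operation}"


def reported_state_payload(**reported) -> str:
    """Serialize a shadow update document that carries the device's reported state."""
    return json.dumps({"state": {"reported": reported}}, sort_keys=True)
```

A device would publish `reported_state_payload(battery=87)` to `shadow_topic("pump-7", "update")`, and a rule in the AWS IoT Core rules engine can then forward the accepted updates to Firehose.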

IOTOPS 8. How do you monitor the status of your IoT devices?

You need to be able to track the status of your devices. This includes operational parameters and connectivity. You need to know whether devices have disconnected intentionally or not. Monitoring the status of your device fleet is important in helping troubleshoot device software operation and connectivity disruptions.

Best practice IOTOPS_8.1 – Collect lifecycle events from the device fleet

Design a state machine for the device connectivity states, from initialization and first connection, to frequent communication (like keep-alive messages) and final state before going offline. Lifecycle events, such as connection and disconnection, can be used to observe and analyze device behavior over a period of time. Additionally, periodic keep-alive messages can track device connectivity status.

Recommendation IOTOPS_8.1.1 – Subscribe to lifecycle events and monitor the connections using keep-alive messages

Capture messages from the IoT message broker whenever a device connects or disconnects. Being able to tell the difference between a clean and dirty disconnect is useful in many scenarios where the devices don’t maintain a constant connection to the broker.
Based on the use case and device constraints, have the device send periodic keep-alive messages to AWS IoT Core, and monitor and analyze those messages for anomalies.
Ensure that the frequency of keep-alive messages does not cause network storms (for example, by introducing some jitter) or trigger rate limits.
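
Jitter can be added to a keep-alive schedule with a few lines of code. The sketch below is illustrative; the function name and the 20% default spread are assumptions, and the right values depend on fleet size and service limits.

```python
import random


def next_keepalive_delay(base_seconds: float, jitter_fraction: float = 0.2,
                         rng=None) -> float:
    """Return the base interval shifted by a random offset of up to +/- jitter_fraction."""
    rng = rng or random  # allow a seeded random.Random for reproducible tests
    spread = base_seconds * jitter_fraction
    return base_seconds + rng.uniform(-spread, spread)
```

Because each device draws its own offset, devices that powered on at the same moment drift apart instead of sending their keep-alive messages in synchronized bursts.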

Best practice IOTOPS_8.2 – Configure your devices to communicate their status periodically

Implement Last Will and Testament (LWT) messages and periodic device keep-alive messages.

Recommendation IOTOPS_8.2.1 – Implement a well-designed device connectivity state machine

Ensure that the device communicates when it first comes online and just prior to going offline.
Implement a wait state for lifecycle events. When a disconnect message is received, wait a period of time and verify that the device is still offline before taking action.
Specify the interval for which each connection should be kept open if no messages are received. AWS IoT drops the connection after that interval unless the device sends a message or a ping.
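
The wait state for lifecycle events can be sketched as follows. The event dictionaries use the `clientId`, `eventType`, `timestamp` (milliseconds), and `clientInitiatedDisconnect` fields carried by AWS IoT lifecycle events. The function names and the 60-second grace period are assumptions for the example.

```python
def is_clean_disconnect(event: dict) -> bool:
    """True when the device itself closed the connection."""
    return (event.get("eventType") == "disconnected"
            and bool(event.get("clientInitiatedDisconnect")))


def confirmed_offline(events: list[dict], now_ms: int,
                      grace_ms: int = 60_000) -> set:
    """Return client IDs whose latest event is a disconnect older than the grace period."""
    latest = {}
    for event in events:
        client_id = event["clientId"]
        if client_id not in latest or event["timestamp"] >= latest[client_id]["timestamp"]:
            latest[client_id] = event  # keep only the most recent event per client
    return {
        client_id for client_id, event in latest.items()
        if event["eventType"] == "disconnected"
        and now_ms - event["timestamp"] >= grace_ms
    }
```

A device that reconnects within the grace period never appears in the result, which avoids raising alerts for brief network interruptions.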

Recommendation IOTOPS_8.2.2 – Use device connection and disconnection status for anomaly detection

Use connectivity data patterns from devices to detect anomalous behavior in their communication patterns.

Best practice IOTOPS_8.3 – Use device state management services to detect status and connectivity patterns

Edge and cloud-side management services monitor and analyze parameters, such as device connectivity status and latency, to help in providing diagnostics and predicting anomalies.

Recommendation IOTOPS_8.3.1 – Monitor device state and connectivity patterns

Use (or develop as needed) device, gateway, edge, and cloud management tools that allow monitoring the fleet of devices.
Use logging and monitoring features at all processing points (device, gateway, edge, and cloud) to get a complete picture of device connectivity patterns.

For more information:
- AWS IoT Core - Managing thing indexing

IOTOPS 9. How do you segment your device operations in your IoT application?

You need to segment your device fleet to pinpoint operational challenges and direct incident response to the appropriate responder. Device fleet segmentation enables you to identify conditions under which devices operate suboptimally and minimize response time to security events.

Best practice IOTOPS_9.1 – Use static and dynamic device attributes to identify devices with anomalous behavior

Anomalies in fleet operations might only surface when analyzing metrics that aggregate across the boundaries of your static and dynamic groups or attributes. For example, an issue might affect only devices that are running firmware version 2.0.10 and currently have a battery level over 50%. Static and dynamic groups allow for identifying and pinpointing devices in unique ways to monitor, analyze, and take corrective actions on device behavior.

Recommendation IOTOPS_9.1.1 – Pinpoint devices with unusual communication patterns

Use fleet indexing with a combination of static and dynamic device groups to group devices and identify behavioral patterns, such as connectivity status and message transmission.
Use lifecycle events, device connectivity, and data transmission patterns to detect anomalies and pinpoint unusual behavior using techniques such as statistical anomaly detection (for large fleets of devices).
Once abnormal behavior has been identified, move rogue and abnormal devices into a different group so that remedial policies can be assigned and implemented on them.
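
One simple form of statistical anomaly detection is a z-score over per-device message counts, sketched below. This is a stand-in technique chosen for illustration and is not the method used by any particular AWS service. The function name and the threshold of 3 standard deviations are assumptions, and the approach is meaningful only for fleets large enough to give a stable mean and deviation.

```python
from statistics import mean, pstdev


def unusual_devices(message_counts: dict, z_threshold: float = 3.0) -> list:
    """Return device IDs whose message count deviates strongly from the fleet mean."""
    values = list(message_counts.values())
    if len(values) < 2:
        return []
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []  # every device behaves identically
    return sorted(
        device for device, count in message_counts.items()
        if abs(count - mu) / sigma > z_threshold
    )
```

Devices returned by `unusual_devices` are candidates for the separate remediation group described above.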

For more information:
- AWS IoT Core - Authorization
- AWS IoT - Device Defender
