Best Practice 23.1 – Enable appropriate responses to events - IoT Lens Checklist

Key operational data elements are the data points that indicate the operational health of your application by classifying events. Detecting operational events early can uncover unforeseen risks in your application and give your operations team a head start in preventing or reducing significant business interruption. A minimum set of logs, metrics, and alarms gives your operations team a first line of defense.

Recommendation 23.1.1 – Configure logging to capture and store at least error-level events
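
As one way to apply this recommendation to AWS IoT Core, the sketch below builds the request for the boto3 `set_v2_logging_options` call with a default log level of ERROR. The helper names, the role ARN, and the choice of AWS IoT Core as the service are assumptions made for illustration; the role must already allow AWS IoT to write to CloudWatch Logs.

```python
# Log levels that still capture error-level events (DISABLED does not).
LEVELS_CAPTURING_ERRORS = ("ERROR", "WARN", "INFO", "DEBUG")


def build_logging_options(role_arn, level="ERROR"):
    """Return keyword arguments for the AWS IoT set_v2_logging_options call."""
    if level not in LEVELS_CAPTURING_ERRORS:
        raise ValueError(f"{level!r} would not capture error-level events")
    return {
        "roleArn": role_arn,          # hypothetical logging role supplied by the caller
        "defaultLogLevel": level,
        "disableAllLogs": False,
    }


def enable_error_logging(iot_client, role_arn, level="ERROR"):
    """Apply the options using a boto3 IoT client, e.g. boto3.client("iot")."""
    options = build_logging_options(role_arn, level)
    iot_client.set_v2_logging_options(**options)
    return options
```

Keeping the request builder separate from the API call lets you review or unit test the settings without AWS credentials.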

Recommendation 23.1.2 – Create a dashboard that responders can use when investigating operational events to rapidly pinpoint the period of time when errors were logged

  • Group error events into buckets of time to quickly identify when surges of errors occurred.

  • Drill down into clusters of errors to identify patterns that signal the need for a triage response.

  • For example:

    • Create a dashboard in CloudWatch with a widget that runs a CloudWatch Logs Insights query

    • For AWS IoT Core logs, set the log group to AWSIotLogsV2 and the query to: filter logLevel="ERROR" | stats count(*) as errorCount by bin(5m) | sort errorCount desc

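
The dashboard example above can be sketched in code. This minimal example builds a CloudWatch dashboard body containing a single Logs Insights widget that runs the query from the example against the AWSIotLogsV2 log group. The function names, widget title, widget size, and the table view are assumptions chosen for illustration.

```python
import json

LOG_GROUP = "AWSIotLogsV2"
ERROR_QUERY = (
    'filter logLevel="ERROR" '
    "| stats count(*) as errorCount by bin(5m) "
    "| sort errorCount desc"
)


def build_dashboard_body(region, log_group=LOG_GROUP, query=ERROR_QUERY):
    """Return the JSON dashboard body with one Logs Insights widget."""
    widget = {
        "type": "log",
        "x": 0,
        "y": 0,
        "width": 24,
        "height": 6,
        "properties": {
            # Logs Insights widgets name their log group with a SOURCE prefix.
            "query": f"SOURCE '{log_group}' | {query}",
            "region": region,
            "title": "AWS IoT Core errors per 5 minutes",  # illustrative title
            "view": "table",
        },
    }
    return json.dumps({"widgets": [widget]})


def publish_dashboard(cloudwatch_client, name, region):
    """Create or update the dashboard using a boto3 CloudWatch client."""
    body = build_dashboard_body(region)
    cloudwatch_client.put_dashboard(DashboardName=name, DashboardBody=body)
    return body
```

Sorting by errorCount puts the busiest five-minute buckets at the top of the table, which gives responders a starting point for drilling down.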

Recommendation 23.1.3 – Review the default metrics emitted by your IoT services and configure alarms for metrics that might indicate business interruption

  • For example:

    • Your business deploys a thousand sensors across manufacturing plants and your operations team wants to be alerted if sensors can no longer connect to the cloud and send telemetry.

    • Your IT team administering the AWS account reviews the AWS IoT Core metrics and notes the following metrics to monitor: Connect.AuthError, Connect.ClientError, Connect.ClientIDThrottle, Connect.ServerError, Connect.Throttle. Activity in any of these metrics warrants alerting and investigation.

    • Your IT team uses CloudWatch to configure a new alarm on each of these metrics that triggers when the metric's Sum statistic is greater than zero in any period.

    • Your IT team configures an Amazon SNS topic to notify their paging tool when any of the new CloudWatch alarms changes state.

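
The alarm setup in this example might look like the following sketch, which builds one CloudWatch alarm per connection metric and sends every state change to an SNS topic. The alarm names, the five-minute period, the MQTT value for the Protocol dimension, and treating missing data as not breaching are assumptions; adjust them to your deployment.

```python
CONNECT_METRICS = [
    "Connect.AuthError",
    "Connect.ClientError",
    "Connect.ClientIDThrottle",
    "Connect.ServerError",
    "Connect.Throttle",
]


def build_alarm(metric_name, sns_topic_arn, period_seconds=300, protocol="MQTT"):
    """Return keyword arguments for the CloudWatch put_metric_alarm call."""
    return {
        "AlarmName": f"iot-{metric_name}",  # hypothetical naming scheme
        "Namespace": "AWS/IoT",
        "MetricName": metric_name,
        "Dimensions": [{"Name": "Protocol", "Value": protocol}],  # assumed dimension value
        "Statistic": "Sum",
        "Period": period_seconds,
        "EvaluationPeriods": 1,
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        # Assumption: no data points means no connection errors were reported.
        "TreatMissingData": "notBreaching",
        # Notify on both transitions so the paging tool sees every state change.
        "AlarmActions": [sns_topic_arn],
        "OKActions": [sns_topic_arn],
    }


def create_alarms(cloudwatch_client, sns_topic_arn):
    """Create one alarm per metric using a boto3 CloudWatch client."""
    names = []
    for metric_name in CONNECT_METRICS:
        params = build_alarm(metric_name, sns_topic_arn)
        cloudwatch_client.put_metric_alarm(**params)
        names.append(params["AlarmName"])
    return names
```

A single evaluation period with a threshold of zero matches the example's rule that any activity in these metrics raises an alarm; a noisier fleet may need a higher threshold.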

Recommendation 23.1.4 – Configure an automated monitoring and alerting tool to detect common symptoms and warnings of operational impact

  • For example:

    • Configure AWS IoT Device Defender to run a daily audit on at least the high and critical checks.

    • Configure an Amazon SNS topic to notify a team email list, paging tool, or operations channel when AWS IoT Device Defender reports non-compliant resources in an audit.

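
The Device Defender example above could be sketched as follows. It enables a set of audit checks with an SNS notification target, then schedules a daily audit on those checks. The check list shown is only an illustrative subset and its severities are an assumption to verify against the current AWS IoT Device Defender documentation; the function names, audit name, and ARNs are also hypothetical.

```python
# Illustrative subset only: confirm which checks are rated high or critical
# in the AWS IoT Device Defender documentation before relying on this list.
EXAMPLE_CHECKS = [
    "DEVICE_CERTIFICATE_SHARED_CHECK",
    "IOT_POLICY_OVERLY_PERMISSIVE_CHECK",
    "CONFLICTING_CLIENT_IDS_CHECK",
]


def build_audit_requests(check_names, audit_role_arn, sns_topic_arn, sns_role_arn,
                         audit_name="daily-high-critical-audit"):
    """Return (account_config, schedule) keyword arguments for the IoT API calls."""
    account_config = {
        "roleArn": audit_role_arn,
        "auditNotificationTargetConfigurations": {
            "SNS": {
                "targetArn": sns_topic_arn,
                "roleArn": sns_role_arn,
                "enabled": True,
            }
        },
        "auditCheckConfigurations": {name: {"enabled": True} for name in check_names},
    }
    schedule = {
        "frequency": "DAILY",
        "targetCheckNames": list(check_names),
        "scheduledAuditName": audit_name,
    }
    return account_config, schedule


def configure_daily_audit(iot_client, check_names, audit_role_arn,
                          sns_topic_arn, sns_role_arn):
    """Enable the checks first, then schedule the audit, using a boto3 IoT client."""
    account_config, schedule = build_audit_requests(
        check_names, audit_role_arn, sns_topic_arn, sns_role_arn
    )
    iot_client.update_account_audit_configuration(**account_config)
    iot_client.create_scheduled_audit(**schedule)
    return schedule["scheduledAuditName"]
```

The checks are enabled in the account audit configuration before the schedule is created because a scheduled audit can only target checks that are enabled for the account.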