Monitoring

Dashboard

DeepRacer on AWS automatically provisions an Amazon CloudWatch dashboard that surfaces the graphs, metrics, and alarms most relevant to operating the solution and helps you identify potential issues. To open it, sign in to the AWS Management Console, go to the Amazon CloudWatch console, and choose Dashboards in the left sidebar.
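The same dashboard can also be located programmatically. The sketch below pages through the CloudWatch `list_dashboards` API and keeps dashboards whose name contains a given fragment. The helper name `find_dashboards` is hypothetical, `cloudwatch` is assumed to be a boto3 CloudWatch client (for example `boto3.client("cloudwatch")`), and the name fragment is an assumption, since this page does not state the dashboard's exact name.

```python
def find_dashboards(cloudwatch, name_fragment):
    """Return names of CloudWatch dashboards whose name contains name_fragment.

    `cloudwatch` is assumed to be a boto3 CloudWatch client. The match is
    case-insensitive because the exact dashboard name is not documented here.
    """
    names = []
    kwargs = {}
    while True:
        page = cloudwatch.list_dashboards(**kwargs)
        for entry in page.get("DashboardEntries", []):
            if name_fragment.lower() in entry["DashboardName"].lower():
                names.append(entry["DashboardName"])
        token = page.get("NextToken")
        if not token:
            return names
        # Request the next page of results.
        kwargs = {"NextToken": token}
```

Calling `find_dashboards(boto3.client("cloudwatch"), "deepracer")` in the deployment's Region should list candidate dashboards; adjust the fragment if your deployment names its dashboard differently.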

Figure: DeepRacer on AWS dashboard (top half)

The upper half of this dashboard shows system alarm states (see Alarms), followed by:

  • Training instance usage metrics, indicating the number of training instances that are currently in use. This graph helps identify usage patterns for training and evaluation jobs, and shows at a glance when usage is approaching your current service quota.

  • Training job outcomes, indicating the number of training and evaluation jobs that have completed or failed.

  • Queue metrics, indicating the number of training and evaluation jobs that have been waiting in the queue over time.

Figure: DeepRacer on AWS dashboard (bottom half)

The bottom half of this dashboard shows additional graphs related to:

  • API performance, indicating the number of requests, their latency, and the number of 4XX and 5XX errors over time.

  • Database performance, indicating the number of read and write capacity units being consumed, and the number of user and system errors over time.

Alarms

DeepRacer on AWS automatically provisions CloudWatch alarms to monitor critical system components. These alarms help detect issues early and maintain the health of the deployment.

The deployment includes 9 CloudWatch alarms organized into three monitoring categories:

  • API import workflow monitoring - 6 alarms that monitor model import processes

  • User authentication monitoring - 2 alarms that monitor user signup functions

  • Asset processing monitoring - 1 alarm that monitors asset packaging workflows

API import workflow alarms

These alarms monitor the various stages of importing models into the DeepRacer environment:

| Alarm Name | Purpose | Threshold |
|---|---|---|
| ApiimportWorkflowCompletionErrorAlarm | Monitors completion errors in the API import workflow | ≥ 10 errors in 5 minutes |
| ApiimportWorkflowRewardValidationErrorAlarm | Monitors reward validation errors during import processing | ≥ 10 errors in 5 minutes |
| ApiimportWorkflowDlqProcessorErrorAlarm | Monitors dead letter queue (DLQ) processor errors | ≥ 10 errors in 5 minutes |
| ApiimportWorkflowImportAssetsErrorAlarm | Monitors asset import errors within the workflow | ≥ 10 errors in 5 minutes |
| ApiimportWorkflowModelValidationErrorAlarm | Monitors model validation errors during import | ≥ 10 errors in 5 minutes |
| ApiimportWorkflowImportModelLambdaErrorsAlarm | Composite alarm for Lambda function errors in the import workflow | Any associated alarm in ALARM state |
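The composite alarm's condition, "any associated alarm in ALARM state", corresponds to a CloudWatch composite alarm rule that joins `ALARM("...")` terms with `OR`. The sketch below builds such a rule string. The helper name `any_in_alarm_rule` is hypothetical, and which child alarms the solution actually attaches to its composite alarm is not stated on this page, so the names passed in are illustrative only.

```python
def any_in_alarm_rule(alarm_names):
    """Build a composite-alarm rule that is true when any named alarm is in ALARM.

    The names must match the deployed alarm names exactly; the ones used in
    examples here are placeholders, not the solution's confirmed child alarms.
    """
    if not alarm_names:
        raise ValueError("at least one alarm name is required")
    return " OR ".join(f'ALARM("{name}")' for name in alarm_names)
```

For example, `any_in_alarm_rule(["child-a", "child-b"])` produces `ALARM("child-a") OR ALARM("child-b")`, which is the form CloudWatch accepts as the `AlarmRule` of a composite alarm.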

User authentication alarms

These alarms monitor Amazon Cognito user pool operations and use much lower thresholds than the workflow alarms:

| Alarm Name | Purpose | Threshold |
|---|---|---|
| UserPoolPreSignUpErrorAlarm | Monitors errors in the pre-signup Lambda trigger function | ≥ 1 error in 1 minute |
| UserPoolPostSignUpErrorAlarm | Monitors errors in the post-signup Lambda trigger function | ≥ 1 error in 1 minute |

Important

User authentication alarms trigger on a single error within one minute, so prioritize them when you set up notifications.

Asset processing alarms

This alarm monitors the asset packaging workflow:

| Alarm Name | Purpose | Threshold |
|---|---|---|
| ApiAssetPackagingDLQAlarm | Monitors dead letter queue for asset packaging operations | ≥ 5 messages visible in 5 minutes |
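To compare the sensitivity of the three alarm categories, the thresholds from the tables can be written down as data. This is a simplified model only: the `Threshold` class and constant names are hypothetical, and a real CloudWatch alarm also depends on its statistic, number of evaluation periods, and missing-data handling, none of which are specified on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    minimum: int         # alarm condition is met when the count is >= minimum
    window_seconds: int  # length of the window the count is taken over

    def breached(self, count):
        """Return True when `count` events in one window meet the threshold."""
        return count >= self.minimum


# Values copied from the threshold columns of the alarm tables.
IMPORT_WORKFLOW = Threshold(minimum=10, window_seconds=300)
USER_SIGNUP = Threshold(minimum=1, window_seconds=60)
ASSET_PACKAGING_DLQ = Threshold(minimum=5, window_seconds=300)
```

Under this model a single signup error already satisfies `USER_SIGNUP.breached(1)`, while nine import errors in five minutes do not satisfy `IMPORT_WORKFLOW.breached(9)`.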

Configuring alarm actions

By default, the solution creates alarms without automated actions. To receive notifications when alarms trigger:

  1. Go to the CloudWatch console

  2. Select Alarms from the left navigation

  3. Choose an alarm from the list

  4. Select Actions, then Edit

  5. Add notification actions such as:

    • SNS topic for email alerts

    • Auto Scaling actions

    • EC2 actions
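The console steps above can also be approximated with the AWS SDK. The sketch below attaches an SNS topic to an existing metric alarm. Because `put_metric_alarm` overwrites an alarm's whole configuration rather than appending to it, the sketch first reads the alarm with `describe_alarms` and resubmits its settings with the extra action. The helper names `with_alarm_action` and `add_sns_action` are hypothetical, `cloudwatch` is assumed to be a boto3 CloudWatch client, and the topic ARN is supplied by you. Composite alarms are not handled here, and changes made this way may be reverted if the solution's stack is later updated.

```python
# Settings that describe_alarms returns for a metric alarm and that
# put_metric_alarm accepts as input. Read-only fields such as AlarmArn
# and StateValue are deliberately left out.
_PUT_FIELDS = (
    "AlarmName", "AlarmDescription", "ActionsEnabled", "OKActions",
    "AlarmActions", "InsufficientDataActions", "MetricName", "Namespace",
    "Statistic", "ExtendedStatistic", "Dimensions", "Period", "Unit",
    "EvaluationPeriods", "DatapointsToAlarm", "Threshold",
    "ComparisonOperator", "TreatMissingData",
    "EvaluateLowSampleCountPercentile", "Metrics", "ThresholdMetricId",
)


def with_alarm_action(alarm, action_arn):
    """Copy an alarm's writable settings and add action_arn to AlarmActions."""
    params = {key: alarm[key] for key in _PUT_FIELDS if key in alarm}
    actions = list(params.get("AlarmActions", []))
    if action_arn not in actions:
        actions.append(action_arn)
    params["AlarmActions"] = actions
    return params


def add_sns_action(cloudwatch, alarm_name, topic_arn):
    """Attach an SNS topic to one metric alarm; return the submitted settings."""
    found = cloudwatch.describe_alarms(
        AlarmNames=[alarm_name], AlarmTypes=["MetricAlarm"]
    )
    alarms = found.get("MetricAlarms", [])
    if not alarms:
        raise LookupError(f"no metric alarm named {alarm_name!r}")
    params = with_alarm_action(alarms[0], topic_arn)
    cloudwatch.put_metric_alarm(**params)
    return params
```

Review the returned settings against the alarm shown in the console before relying on this in production, since the sketch assumes every setting of interest is among the fields listed in `_PUT_FIELDS`.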

Viewing alarm status

To check the status of an alarm:

  1. Open the CloudWatch console

  2. Select Alarms, then All alarms

  3. Filter by the alarm name prefix deepracer-on-aws-

  4. Review the State column for any alarms in ALARM status

When an alarm enters the ALARM state, investigate the associated service logs and metrics to identify the root cause.
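The status check above can be scripted as well. The sketch below lists every alarm with the `deepracer-on-aws-` name prefix that is currently in the ALARM state, together with the state reason CloudWatch reports. The helper name `alarms_in_alarm_state` is hypothetical, and `cloudwatch` is assumed to be a boto3 CloudWatch client created in the deployment's Region.

```python
ALARM_PREFIX = "deepracer-on-aws-"  # prefix named in the console steps


def alarms_in_alarm_state(cloudwatch, prefix=ALARM_PREFIX):
    """Return sorted (name, reason) pairs for prefixed alarms in ALARM state."""
    results = []
    paginator = cloudwatch.get_paginator("describe_alarms")
    pages = paginator.paginate(
        AlarmNamePrefix=prefix,
        StateValue="ALARM",
        AlarmTypes=["MetricAlarm", "CompositeAlarm"],
    )
    for page in pages:
        # Metric and composite alarms come back under separate keys.
        for key in ("MetricAlarms", "CompositeAlarms"):
            for alarm in page.get(key, []):
                results.append((alarm["AlarmName"], alarm.get("StateReason", "")))
    return sorted(results)
```

An empty list means no matching alarm is currently in ALARM; otherwise the state reason is a reasonable starting point before turning to the associated service logs and metrics.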