Monitoring

Dashboard

DeepRacer on AWS automatically provisions an Amazon CloudWatch dashboard that surfaces the graphs, metrics, and alarms most relevant to operating the solution and to identifying potential issues. To open the dashboard, go to the Amazon CloudWatch console in the AWS Management Console and choose Dashboards in the left sidebar.
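As an alternative to browsing the console, the dashboards in an account can also be listed programmatically. The following is a minimal sketch, assuming the AWS SDK for Python (boto3) and a CloudWatch client created with credentials for the deployment account and Region. The function name and the `name_contains` filter are hypothetical, and the actual name of the provisioned dashboard is not stated here, so the filter value is something you would choose after inspecting the list.

```python
def list_dashboard_names(cloudwatch, name_contains=""):
    """Return sorted CloudWatch dashboard names visible to the given client.

    `cloudwatch` is expected to be a boto3 CloudWatch client.
    `name_contains` is an optional, case-insensitive substring filter that is
    applied on the client side (a hypothetical convenience, not an API feature).
    """
    names = []
    # ListDashboards is paginated, so walk every page of results.
    paginator = cloudwatch.get_paginator("list_dashboards")
    for page in paginator.paginate():
        for entry in page.get("DashboardEntries", []):
            name = entry["DashboardName"]
            if name_contains.lower() in name.lower():
                names.append(name)
    return sorted(names)


# Example usage (requires boto3 and AWS credentials):
#   import boto3
#   print(list_dashboard_names(boto3.client("cloudwatch")))
```

Passing the client in as a parameter keeps the function independent of how credentials and the Region are configured.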

The upper half of this dashboard shows system alarm states (see Alarms), followed by:

Training instance usage metrics, indicating the number of training instances that are currently in use. This graph helps identify usage patterns for training and evaluation jobs, and can show visually when usage is approaching your current service quota.
Training job outcomes, indicating the number of training and evaluation jobs that have completed or failed.
Queue metrics, indicating the number of training and evaluation jobs that have been waiting in the queue over time.

The bottom half of this dashboard shows additional graphs related to:

API performance, indicating the number of requests and latency, as well as the number of 4XX and 5XX errors over time.
Database performance, indicating the number of read and write capacity units being consumed, and the number of user and system errors over time.

Alarms

DeepRacer on AWS automatically provisions CloudWatch alarms to monitor critical system components. These alarms help detect issues early and maintain the health of the deployment.

The deployment includes 9 CloudWatch alarms organized into three monitoring categories:

API import workflow monitoring - 6 alarms that monitor model import processes
User authentication monitoring - 2 alarms that monitor user signup functions
Asset processing monitoring - 1 alarm that monitors asset packaging workflows

API import workflow alarms

These alarms monitor the various stages of importing models into the DeepRacer environment:

Alarm Name	Purpose	Threshold
`ApiimportWorkflowCompletionErrorAlarm`	Monitors completion errors in the API import workflow	≥ 10 errors in 5 minutes
`ApiimportWorkflowRewardValidationErrorAlarm`	Monitors reward validation errors during import processing	≥ 10 errors in 5 minutes
`ApiimportWorkflowDlqProcessorErrorAlarm`	Monitors dead letter queue (DLQ) processor errors	≥ 10 errors in 5 minutes
`ApiimportWorkflowImportAssetsErrorAlarm`	Monitors asset import errors within the workflow	≥ 10 errors in 5 minutes
`ApiimportWorkflowModelValidationErrorAlarm`	Monitors model validation errors during import	≥ 10 errors in 5 minutes
`ApiimportWorkflowImportModelLambdaErrorsAlarm`	Composite alarm for Lambda function errors in the import workflow	Any associated alarm in ALARM state

User authentication alarms

These alarms monitor Amazon Cognito user pool operations and are more sensitive than the workflow alarms:

Alarm Name	Purpose	Threshold
`UserPoolPreSignUpErrorAlarm`	Monitors errors in the pre-signup Lambda trigger function	≥ 1 error in 1 minute
`UserPoolPostSignUpErrorAlarm`	Monitors errors in the post-signup Lambda trigger function	≥ 1 error in 1 minute

Important

User authentication alarms have very sensitive thresholds (1 error) and should be prioritized for notification setup.

Asset processing alarms

This alarm monitors the asset packaging workflow:

Alarm Name	Purpose	Threshold
`ApiAssetPackagingDLQAlarm`	Monitors dead letter queue for asset packaging operations	≥ 5 messages visible in 5 minutes
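When routing notifications, it can help to group alarms by the three monitoring categories described above. The following is an illustrative sketch that classifies an alarm name by substrings taken from the alarm names in the tables. The function names are hypothetical, and matching on substrings is an assumption made because deployed alarm names may carry a prefix or a generated suffix.

```python
# Substrings drawn from the alarm names listed in the tables above.
# Matching is case-insensitive because of the mixed casing in names
# such as "ApiimportWorkflow...".
CATEGORY_MARKERS = {
    "API import workflow": "importWorkflow",
    "User authentication": "UserPool",
    "Asset processing": "AssetPackaging",
}


def categorize_alarm(alarm_name):
    """Return the monitoring category for an alarm name, or 'Uncategorized'."""
    lowered = alarm_name.lower()
    for category, marker in CATEGORY_MARKERS.items():
        if marker.lower() in lowered:
            return category
    return "Uncategorized"


def group_alarms(alarm_names):
    """Group alarm names into a dict keyed by monitoring category."""
    groups = {}
    for name in alarm_names:
        groups.setdefault(categorize_alarm(name), []).append(name)
    return groups
```

For example, `group_alarms` could be used to send the user authentication alarms to a higher-priority notification channel than the import workflow alarms.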

Configuring alarm actions

By default, the solution creates alarms without automated actions. To receive notifications when an alarm changes state, add an action to it:

Go to the CloudWatch console
Select Alarms from the left navigation
Choose an alarm from the list
Select Actions → Edit
Add one or more actions, such as:
- An Amazon SNS topic for email notifications
- Auto Scaling actions
- EC2 actions
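The console steps above can also be approximated with the AWS SDK for Python (boto3). The following is a sketch under stated assumptions, not the solution's own tooling. It assumes an existing SNS topic ARN and a metric alarm (the composite alarm would need `put_composite_alarm` instead). Because `put_metric_alarm` replaces the whole alarm configuration, the sketch copies the existing settings returned by `describe_alarms` and appends the new action. The helper names are hypothetical. Alarms created by the deployment may be reset to their original configuration by a later stack update.

```python
# Fields returned by DescribeAlarms that PutMetricAlarm also accepts.
# Read-only fields such as AlarmArn and StateValue are deliberately left out.
PUT_METRIC_ALARM_FIELDS = (
    "AlarmName", "AlarmDescription", "ActionsEnabled", "OKActions",
    "AlarmActions", "InsufficientDataActions", "MetricName", "Namespace",
    "Statistic", "ExtendedStatistic", "Dimensions", "Period", "Unit",
    "EvaluationPeriods", "DatapointsToAlarm", "Threshold",
    "ComparisonOperator", "TreatMissingData",
    "EvaluateLowSampleCountPercentile", "Metrics", "ThresholdMetricId",
)


def with_alarm_action(alarm, action_arn):
    """Build PutMetricAlarm arguments from an existing alarm plus one action."""
    kwargs = {key: alarm[key] for key in PUT_METRIC_ALARM_FIELDS if key in alarm}
    actions = list(kwargs.get("AlarmActions", []))
    if action_arn not in actions:
        actions.append(action_arn)
    kwargs["AlarmActions"] = actions
    # Assumption: the caller wants actions to run, so enable them.
    kwargs["ActionsEnabled"] = True
    return kwargs


def add_alarm_action(cloudwatch, alarm_name, action_arn):
    """Append `action_arn` to the ALARM-state actions of a metric alarm."""
    response = cloudwatch.describe_alarms(
        AlarmNames=[alarm_name], AlarmTypes=["MetricAlarm"]
    )
    alarms = response.get("MetricAlarms", [])
    if not alarms:
        raise LookupError(f"No metric alarm named {alarm_name!r}")
    cloudwatch.put_metric_alarm(**with_alarm_action(alarms[0], action_arn))


# Example usage (requires boto3 and AWS credentials; both values are placeholders):
#   import boto3
#   add_alarm_action(boto3.client("cloudwatch"), "<alarm name>", "<SNS topic ARN>")
```

Copying only an explicit list of fields avoids sending read-only state attributes back to the API, which would otherwise be rejected as unknown parameters.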

Viewing alarm status

To check the status of an alarm:

Open the CloudWatch console
Select Alarms → All alarms
Filter by the alarm name prefix deepracer-on-aws-
Review the State column for any alarms in ALARM status

When an alarm enters the ALARM state, investigate the associated service logs and metrics to identify the root cause.
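The status check above can be sketched with the AWS SDK for Python (boto3) as follows. This is an illustrative example that assumes a CloudWatch client for the deployment account and Region and uses the `deepracer-on-aws-` name prefix mentioned in the steps. The function name is hypothetical. `AlarmTypes` is set explicitly because `describe_alarms` returns only metric alarms when it is omitted, which would hide the composite alarm.

```python
def alarms_in_alarm_state(cloudwatch, prefix="deepracer-on-aws-"):
    """Return sorted (alarm name, state reason) pairs for alarms in ALARM state.

    `cloudwatch` is expected to be a boto3 CloudWatch client.
    """
    found = []
    paginator = cloudwatch.get_paginator("describe_alarms")
    pages = paginator.paginate(
        AlarmNamePrefix=prefix,
        StateValue="ALARM",
        AlarmTypes=["MetricAlarm", "CompositeAlarm"],
    )
    for page in pages:
        # Metric and composite alarms are returned under separate keys.
        for key in ("MetricAlarms", "CompositeAlarms"):
            for alarm in page.get(key, []):
                found.append((alarm["AlarmName"], alarm.get("StateReason", "")))
    return sorted(found)


# Example usage (requires boto3 and AWS credentials):
#   import boto3
#   for name, reason in alarms_in_alarm_state(boto3.client("cloudwatch")):
#       print(name, "-", reason)
```

The state reason returned with each alarm is often a useful starting point before opening the associated service logs and metrics.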
