Monitoring
Dashboard
DeepRacer on AWS automatically provisions an Amazon CloudWatch dashboard that surfaces the graphs, metrics, and alarms relevant to operating the solution and helps identify potential issues. To access the dashboard, open the Amazon CloudWatch console in the AWS Management Console and choose Dashboards in the left sidebar.
The upper half of this dashboard shows system alarm states (see Alarms), followed by:
- Training instance usage metrics, indicating the number of training jobs currently in use. This graph helps identify usage patterns for training and evaluation jobs, and can show visually when you are reaching your current service quota.
- Training job outcomes, indicating the number of training and evaluation jobs that have completed or failed.
- Queue metrics, indicating the number of training and evaluation jobs that have been waiting in the queue over time.
The bottom half of this dashboard shows additional graphs related to:
- API performance, indicating the number of requests and latency, as well as the number of 4XX and 5XX errors over time.
- Database performance, indicating the number of read and write capacity units consumed, and the number of user and system errors over time.
Alarms
DeepRacer on AWS automatically provisions CloudWatch alarms to monitor critical system components. These alarms help detect issues early and maintain the health of the deployment.
The deployment includes 9 CloudWatch alarms organized into three monitoring categories:
- API import workflow monitoring: 6 alarms that monitor model import processes
- User authentication monitoring: 2 alarms that monitor user signup functions
- Asset processing monitoring: 1 alarm that monitors asset packaging workflows
API import workflow alarms
These alarms monitor the various stages of importing models into the DeepRacer environment:
| Alarm Name | Purpose | Threshold |
|---|---|---|
| | Monitors completion errors in the API import workflow | ≥ 10 errors in 5 minutes |
| | Monitors reward validation errors during import processing | ≥ 10 errors in 5 minutes |
| | Monitors dead letter queue (DLQ) processor errors | ≥ 10 errors in 5 minutes |
| | Monitors asset import errors within the workflow | ≥ 10 errors in 5 minutes |
| | Monitors model validation errors during import | ≥ 10 errors in 5 minutes |
| | Composite alarm for Lambda function errors in the import workflow | Any associated alarm in ALARM state |
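To make the threshold semantics concrete, the following minimal Python sketch models a "≥ 10 errors in 5 minutes" check and an "any associated alarm in ALARM state" composite. It is an illustration only: it assumes errors are simply counted over a single 5-minute window, which may not match the statistic, evaluation periods, or missing-data handling configured on the deployed alarms. The function names and the child alarm names in the usage note are hypothetical.

```python
from typing import Iterable, Mapping

WINDOW_SECONDS = 5 * 60   # "in 5 minutes" from the table
ERROR_THRESHOLD = 10      # ">= 10 errors" from the table


def breaches_threshold(
    error_times: Iterable[float],
    window_end: float,
    threshold: int = ERROR_THRESHOLD,
    window: int = WINDOW_SECONDS,
) -> bool:
    """Return True when at least `threshold` errors fall inside the window.

    `error_times` are timestamps in seconds. The window covers
    (window_end - window, window_end]. Assumption: a plain count over one
    window, not the exact CloudWatch evaluation logic.
    """
    window_start = window_end - window
    count = sum(1 for t in error_times if window_start < t <= window_end)
    return count >= threshold


def composite_state(child_states: Mapping[str, str]) -> str:
    """Model "any associated alarm in ALARM state" as a logical OR.

    Assumption: children that are OK or INSUFFICIENT_DATA do not trip the
    composite; the real rule expression may differ.
    """
    return "ALARM" if "ALARM" in child_states.values() else "OK"
```

For example, `composite_state({"child-a": "OK", "child-b": "ALARM"})` evaluates to `"ALARM"`, because one child is in the ALARM state.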
User authentication alarms
These alarms monitor Amazon Cognito user pool operations and use lower thresholds than the workflow alarms:
| Alarm Name | Purpose | Threshold |
|---|---|---|
| | Monitors errors in the pre-signup Lambda trigger function | ≥ 1 error in 1 minute |
| | Monitors errors in the post-signup Lambda trigger function | ≥ 1 error in 1 minute |
Important
User authentication alarms have very sensitive thresholds (1 error) and should be prioritized for notification setup.
Asset processing alarms
This alarm monitors the asset packaging workflow:
| Alarm Name | Purpose | Threshold |
|---|---|---|
| | Monitors dead letter queue for asset packaging operations | ≥ 5 messages visible in 5 minutes |
Configuring alarm actions
By default, the solution creates alarms without automated actions. To receive notifications when alarms trigger:
- Go to the CloudWatch console
- Select Alarms from the left navigation
- Choose an alarm from the list
- Select Actions → Edit
- Add notification actions such as:
  - SNS topic for email alerts
  - Auto Scaling actions
  - EC2 actions
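The same change can be scripted instead of made in the console. The sketch below is one possible approach using boto3, not part of the solution itself: it reads an existing metric alarm with `describe_alarms`, keeps only the fields that `put_metric_alarm` accepts, appends an SNS topic ARN to `AlarmActions`, and writes the alarm back. The helper names are hypothetical, and the alarm name and topic ARN are values you would supply. Composite alarms are not covered here; they are updated through a separate API (`put_composite_alarm`).

```python
from typing import Any, Dict

# Parameters accepted by PutMetricAlarm. describe_alarms also returns
# read-only fields (ARN, state, timestamps) that must be dropped first.
PUT_METRIC_ALARM_KEYS = {
    "AlarmName", "AlarmDescription", "ActionsEnabled", "OKActions",
    "AlarmActions", "InsufficientDataActions", "MetricName", "Namespace",
    "Statistic", "ExtendedStatistic", "Dimensions", "Period", "Unit",
    "EvaluationPeriods", "DatapointsToAlarm", "Threshold",
    "ComparisonOperator", "TreatMissingData",
    "EvaluateLowSampleCountPercentile", "Metrics", "ThresholdMetricId",
}


def with_alarm_action(alarm: Dict[str, Any], action_arn: str) -> Dict[str, Any]:
    """Build put_metric_alarm arguments from a describe_alarms entry.

    Returns a new dict; the input alarm is left unchanged. The action ARN
    is appended only if it is not already present.
    """
    params = {k: v for k, v in alarm.items() if k in PUT_METRIC_ALARM_KEYS}
    actions = list(params.get("AlarmActions", []))
    if action_arn not in actions:
        actions.append(action_arn)
    params["AlarmActions"] = actions
    params["ActionsEnabled"] = True
    return params


def add_sns_action(alarm_name: str, topic_arn: str) -> None:
    """Attach an SNS topic to one metric alarm (requires AWS credentials)."""
    import boto3  # imported lazily so with_alarm_action works without boto3

    cloudwatch = boto3.client("cloudwatch")
    response = cloudwatch.describe_alarms(
        AlarmNames=[alarm_name], AlarmTypes=["MetricAlarm"]
    )
    for alarm in response["MetricAlarms"]:
        cloudwatch.put_metric_alarm(**with_alarm_action(alarm, topic_arn))
```

Because `put_metric_alarm` replaces the whole alarm definition, the sketch copies the existing configuration rather than passing only the new action.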
Viewing alarm status
To check the status of an alarm:
- Open the CloudWatch console
- Select Alarms → All alarms
- Filter by the alarm name prefix deepracer-on-aws-
- Review the State column for any alarms in the ALARM state
When an alarm enters the ALARM state, investigate the associated service logs and metrics to identify the root cause.
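The console steps above can also be approximated from a script. The following sketch, assuming boto3 and configured AWS credentials, pages through `describe_alarms` for the `deepracer-on-aws-` prefix and collects the names of metric and composite alarms whose state is ALARM. The function names are hypothetical; the filtering logic is kept separate from the AWS call so it can be checked against sample data.

```python
from typing import Any, Dict, Iterable, List

ALARM_PREFIX = "deepracer-on-aws-"  # prefix named in the steps above


def alarm_names_in_alarm(
    pages: Iterable[Dict[str, Any]], prefix: str = ALARM_PREFIX
) -> List[str]:
    """Return sorted names of alarms with the prefix that are in ALARM.

    `pages` are describe_alarms response pages, each of which may contain
    "MetricAlarms" and "CompositeAlarms" lists.
    """
    names = []
    for page in pages:
        for key in ("MetricAlarms", "CompositeAlarms"):
            for alarm in page.get(key, []):
                in_scope = alarm["AlarmName"].startswith(prefix)
                if in_scope and alarm["StateValue"] == "ALARM":
                    names.append(alarm["AlarmName"])
    return sorted(names)


def fetch_alarm_pages(prefix: str = ALARM_PREFIX):
    """Page through describe_alarms (requires AWS credentials)."""
    import boto3  # imported lazily so the filter above works without boto3

    paginator = boto3.client("cloudwatch").get_paginator("describe_alarms")
    # Composite alarms are only returned when requested explicitly.
    return paginator.paginate(
        AlarmNamePrefix=prefix,
        AlarmTypes=["MetricAlarm", "CompositeAlarm"],
    )
```

With credentials in place, `alarm_names_in_alarm(fetch_alarm_pages())` yields the alarms to investigate first.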