Monitor and Analyze Training Jobs Using Metrics
An Amazon SageMaker training job is an iterative process that teaches a model to make predictions by presenting examples from a training dataset. Typically, a training algorithm computes several metrics, such as training error and prediction accuracy. These metrics help diagnose whether the model is learning well and will generalize well for making predictions on unseen data. The training algorithm writes the values of these metrics to logs, which SageMaker monitors and sends to Amazon CloudWatch in real time. To analyze the performance of your training job, you can view graphs of these metrics in CloudWatch. When a training job has completed, you can also get a list of the metric values that it computes in its final iteration by calling the DescribeTrainingJob operation.
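For example, the following minimal sketch uses the AWS SDK for Python (Boto3) to retrieve the final metric values for a completed job; the training job name is a placeholder for your own job's name:

import boto3

sagemaker_client = boto3.client("sagemaker")

# "example-training-job" is a placeholder; use your own training job's name.
response = sagemaker_client.describe_training_job(
    TrainingJobName="example-training-job"
)

# FinalMetricDataList holds the last value the job emitted for each metric.
for metric in response.get("FinalMetricDataList", []):
    print(metric["MetricName"], metric["Value"])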
Training Metrics Sample Notebooks
The following sample notebooks show how to view and plot training metrics: object2vec_sentence_similarity.ipynb and xgboost_abalone.ipynb, both in the Introduction to Amazon algorithms section. For instructions on how to create and access Jupyter notebook instances that you can use to run the examples in SageMaker, see Example Notebooks. To see a list of all the SageMaker samples, create and open a notebook instance, then choose the SageMaker Examples tab. To open a notebook, choose its Use tab, then choose Create copy.
Defining Training Metrics
SageMaker automatically parses the logs for metrics that built-in algorithms emit and sends those metrics to CloudWatch. If you want SageMaker to do the same for a custom algorithm, you have to specify the metrics when you configure the training job: for each metric, you specify a name and a regular expression that SageMaker uses to parse the logs that your algorithm emits and find that metric's values.
You can specify the metrics that you want to track with the SageMaker console, the SageMaker Python SDK (https://github.com/aws/sagemaker-python-sdk), or the low-level SageMaker API.
Defining Regular Expressions for Metrics
To find a metric, SageMaker searches the logs that your algorithm emits and finds logs that match the regular expression that you specify for that metric. If you are using your own algorithm, do the following:
- Make sure that the algorithm writes the metrics that you want to capture to logs.
- Define a regular expression that accurately searches the logs to capture the values of the metrics that you want to send to CloudWatch.
For example, suppose your algorithm emits metrics for training error and validation error by writing logs similar to the following to stdout or stderr:

Train_error=0.138318; Valid_error=0.324557;
If you want to monitor both of those metrics in CloudWatch, your AlgorithmSpecification would look like the following:

"AlgorithmSpecification": {
    "TrainingImage": ContainerName,
    "TrainingInputMode": "File",
    "MetricDefinitions": [
        {
            "Name": "train:error",
            "Regex": "Train_error=(.*?);"
        },
        {
            "Name": "validation:error",
            "Regex": "Valid_error=(.*?);"
        }
    ]
}
In the regex for the train:error metric defined above, the first part of the regex finds the exact text "Train_error=", and the expression (.*?); captures zero or more of any character until the first semicolon character. In this expression, the parentheses tell the regex to capture what is inside them, . means any character, * means zero or more, and ? makes the match non-greedy, so it captures only up to the first instance of the ; character.
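Before you configure a training job, you can check that a regex captures what you expect by testing it locally with Python's re module. The following sketch applies the two metric definitions above to the sample log line:

import re

# The sample log line that the algorithm writes to stdout or stderr.
log_line = "Train_error=0.138318; Valid_error=0.324557;"

# The same metric definitions used in the AlgorithmSpecification above.
metric_definitions = [
    {"Name": "train:error", "Regex": r"Train_error=(.*?);"},
    {"Name": "validation:error", "Regex": r"Valid_error=(.*?);"},
]

for definition in metric_definitions:
    match = re.search(definition["Regex"], log_line)
    if match:
        # Group 1 is the value captured by the parentheses in the regex.
        print(definition["Name"], float(match.group(1)))

This prints train:error 0.138318 and validation:error 0.324557, confirming that each regex captures the numeric value and nothing else.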
Defining Training Metrics (Low-level SageMaker API)
Define the metrics that you want to send to CloudWatch by specifying a list of metric names and regular expressions in the MetricDefinitions field of the AlgorithmSpecification input parameter that you pass to the CreateTrainingJob operation. For example, if you want to monitor both the train:error and validation:error metrics in CloudWatch, your AlgorithmSpecification would look like the following:
"AlgorithmSpecification": { "TrainingImage":
ContainerName
, "TrainingInputMode": "File", "MetricDefinitions" : [ { "Name": "train:error", "Regex": "Train_error=(.*?);" }, { "Name": "validation:error", "Regex": "Valid_error=(.*?);" } ]}
For more information about defining and running a training job by using the low-level SageMaker API, see CreateTrainingJob.
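As a concrete illustration, the following minimal Boto3 sketch passes the MetricDefinitions above to CreateTrainingJob. The image URI, role ARN, S3 paths, and job name are placeholders, and the other required parameters are kept to minimal example values:

import boto3

sagemaker_client = boto3.client("sagemaker")

# All names, ARNs, and S3 paths below are placeholders for illustration.
sagemaker_client.create_training_job(
    TrainingJobName="example-training-job",
    AlgorithmSpecification={
        "TrainingImage": "123456789012.dkr.ecr.us-east-1.amazonaws.com/example-image:latest",
        "TrainingInputMode": "File",
        "MetricDefinitions": [
            {"Name": "train:error", "Regex": "Train_error=(.*?);"},
            {"Name": "validation:error", "Regex": "Valid_error=(.*?);"},
        ],
    },
    RoleArn="arn:aws:iam::123456789012:role/SageMakerRole",
    OutputDataConfig={"S3OutputPath": "s3://example-bucket/output"},
    ResourceConfig={
        "InstanceType": "ml.c4.xlarge",
        "InstanceCount": 1,
        "VolumeSizeInGB": 10,
    },
    StoppingCondition={"MaxRuntimeInSeconds": 3600},
)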
Defining Training Metrics (SageMaker Python SDK)
Define the metrics that you want to send to CloudWatch by specifying a list of metric names and regular expressions as the metric_definitions argument when you initialize an Estimator object. For example, if you want to monitor both the train:error and validation:error metrics in CloudWatch, your Estimator initialization would look like the following:
estimator = Estimator(
    image_name=ImageName,
    role='SageMakerRole',
    train_instance_count=1,
    train_instance_type='ml.c4.xlarge',
    k=10,
    sagemaker_session=sagemaker_session,
    metric_definitions=[
        {'Name': 'train:error', 'Regex': 'Train_error=(.*?);'},
        {'Name': 'validation:error', 'Regex': 'Valid_error=(.*?);'}
    ]
)
For more information about training by using the Amazon SageMaker Python SDK, see the SageMaker Python SDK (https://github.com/aws/sagemaker-python-sdk).
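Starting the job then works as usual; SageMaker parses the job's logs with the regular expressions above and publishes the matched values to CloudWatch. A sketch, with a placeholder S3 path for the training channel:

# The S3 path is a placeholder for your own training data location.
estimator.fit({'train': 's3://example-bucket/train'})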
Defining Training Metrics (Console)
You can define metrics for a custom algorithm in the console when you create a training job by providing the name and regular expression (regex) for Metrics.
For example, if you want to monitor both the train:error and validation:error metrics in CloudWatch, your metric definitions would look like the following:
[ { "Name": "train:error", "Regex": "Train_error=(.*?);" }, { "Name": "validation:error", "Regex": "Valid_error=(.*?);" } ]
Monitoring Training Job Metrics (CloudWatch Console)
You can monitor the metrics that a training job emits in real time in the CloudWatch console.
To monitor training job metrics (CloudWatch console)
- Open the CloudWatch console at https://console.aws.amazon.com/cloudwatch/.
- Choose Metrics, then choose /aws/sagemaker/TrainingJobs.
- Choose TrainingJobName.
- On the All metrics tab, choose the names of the training metrics that you want to monitor.
- On the Graphed metrics tab, configure the graph options. For more information about using CloudWatch graphs, see Graph Metrics in the Amazon CloudWatch User Guide.
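If you'd rather pull the data points than view them in the console, you can query the same namespace programmatically. A minimal Boto3 sketch, assuming a placeholder job name and the train:error metric defined earlier:

import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client("cloudwatch")

# "example-training-job" is a placeholder; use your own training job's name.
response = cloudwatch.get_metric_statistics(
    Namespace="/aws/sagemaker/TrainingJobs",
    MetricName="train:error",
    Dimensions=[{"Name": "TrainingJobName", "Value": "example-training-job"}],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=60,
    Statistics=["Average"],
)

# Data points are returned unordered, so sort by timestamp before reading them.
for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"])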
Monitoring Training Job Metrics (SageMaker Console)
You can monitor the metrics that a training job emits in real time by using the SageMaker console.
To monitor training job metrics (SageMaker console)
- Open the SageMaker console at https://console.aws.amazon.com/sagemaker/.
- Choose Training jobs, then choose the training job whose metrics you want to see.
- Choose TrainingJobName.
- In the Monitor section, you can review the graphs of instance utilization and algorithm metrics.
Example: Viewing a Training and Validation Curve
Typically, you split the data that you train your model on into training and validation datasets. You use the training set to learn the model parameters, and then you test how well the model makes predictions by calculating predictions for the validation set. To analyze the performance of a training job, you commonly plot a training curve against a validation curve.
Viewing a graph that shows the accuracy for both the training and validation sets over time can help you to improve the performance of your model. For example, if training accuracy continues to increase over time, but, at some point, validation accuracy starts to decrease, you are likely overfitting your model. To address this, you can make adjustments to your model, such as increasing regularization.
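The following sketch plots two such curves with made-up accuracy values, purely to illustrate the overfitting pattern just described; in practice, the values would come from your job's train:accuracy and validation:accuracy metrics:

import matplotlib.pyplot as plt

# Illustrative values only, not real training output.
epochs = list(range(1, 11))
train_accuracy = [0.61, 0.70, 0.76, 0.81, 0.85, 0.88, 0.91, 0.93, 0.95, 0.96]
validation_accuracy = [0.60, 0.68, 0.73, 0.77, 0.79, 0.80, 0.80, 0.79, 0.78, 0.77]

plt.plot(epochs, train_accuracy, label="train:accuracy")
plt.plot(epochs, validation_accuracy, label="validation:accuracy")
plt.xlabel("Epoch")
plt.ylabel("Accuracy")
plt.legend()
plt.show()

Training accuracy keeps rising while validation accuracy peaks and then falls, which is the overfitting signature described above.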
For this example, you can use the Image-classification-full-training example that is in the Example notebooks section of your SageMaker notebook instance. If you don't have a SageMaker notebook instance, create one by following the instructions at Step 1: Create an Amazon SageMaker Notebook Instance. If you prefer, you can follow along with the End-to-End Multiclass Image Classification Example.
To view training and validation error curves
- Open the SageMaker console at https://console.aws.amazon.com/sagemaker/.
- Choose Notebooks, and then choose Notebook instances.
- Choose the notebook instance that you want to use, and then choose Open.
- On the dashboard for your notebook instance, choose SageMaker Examples.
- Expand the Introduction to Amazon Algorithms section, and then choose Use next to Image-classification-fulltraining.ipynb.
- Choose Create copy. SageMaker creates an editable copy of the Image-classification-fulltraining.ipynb notebook in your notebook instance.
- Run all of the cells in the notebook up to the Inference section. You don't need to deploy an endpoint or get inferences for this example.
- After the training job starts, open the CloudWatch console at https://console.aws.amazon.com/cloudwatch/.
- Choose Metrics, then choose /aws/sagemaker/TrainingJobs.
- Choose TrainingJobName.
- On the All metrics tab, choose the train:accuracy and validation:accuracy metrics for the training job that you created in the notebook.
- To zoom in, choose an area on the graph that contains the metric's values. The graph shows the training and validation accuracy curves over the course of the training job.