Autopilot Model Insights

The Amazon SageMaker Model Monitor report provides insights and quality information for the model candidates that an AutoML job generates in the leaderboard of an Autopilot experiment. The report provides model insights charts only for the best classification model candidate. These charts help you understand false positives and false negatives, tradeoffs between true positives and false positives, and tradeoffs between precision and recall.

Autopilot also provides scalar metrics that measure the predictive quality of all of your candidate models. The leaderboard view includes these metrics by default. The metrics that are automatically calculated for a candidate model are determined by the type of problem being addressed.

  • Regression: MSE

  • Binary classification: Accuracy, F1, AUC

  • Multiclass classification: Accuracy, F1macro

You can sort your model candidates by the relevant metric to help you select and deploy the model that best addresses your business needs. For definitions of these metrics, see the Autopilot candidate metrics topic.
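
For example, the following minimal sketch, which assumes the AWS SDK for Python (Boto3) is configured and uses the hypothetical job name my-autopilot-job, lists the candidates of an AutoML job sorted by their final objective metric value.

import boto3

# Assumes AWS credentials and a default Region are configured.
sm_client = boto3.client("sagemaker")

response = sm_client.list_candidates_for_auto_ml_job(
    AutoMLJobName="my-autopilot-job",      # hypothetical job name
    SortBy="FinalObjectiveMetricValue",    # sort candidates by the objective metric
    SortOrder="Descending",
)

for candidate in response["Candidates"]:
    metric = candidate.get("FinalAutoMLJobObjectiveMetric", {})
    print(candidate["CandidateName"], metric.get("MetricName"), metric.get("Value"))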

The SageMaker Model Monitor report contains details that characterize the Autopilot job, a metrics table, and several model insights. These include model charts that are relevant to the type of classification problem. You can access these reports in SageMaker Studio from the Performance tab on the page that opens to confirm that your AutoML job has completed. For instructions on how to create and run an AutoML job in SageMaker Studio, see Create an Amazon SageMaker Autopilot experiment.
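
If you prefer to locate the report artifacts programmatically instead of through Studio, a sketch like the following, again assuming Boto3 and the hypothetical job name my-autopilot-job, reads the S3 location of the best candidate's model insights report from the DescribeAutoMLJob response.

import boto3

sm_client = boto3.client("sagemaker")

# Hypothetical job name; the response describes the job and its best candidate.
job = sm_client.describe_auto_ml_job(AutoMLJobName="my-autopilot-job")
best = job["BestCandidate"]
artifacts = best["CandidateProperties"]["CandidateArtifactLocations"]

print(best["CandidateName"])
print(artifacts["ModelInsights"])   # S3 prefix of the model insights report artifacts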

Model details and metrics tables

Model details include the following information.

  • Autopilot Candidate Name

  • Autopilot Job Name

  • Problem Type

  • Objective Metric

  • Optimization Direction

The model quality information is generated by the prebuilt SageMaker Model Monitor container. The contents of the generated report depend on the problem type addressed: regression, binary classification, or multiclass classification. The report specifies the number of rows that were included in the evaluation dataset and the time at which the evaluation occurred.
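
As an informal illustration of how these scalar metrics are defined, the following sketch computes them with scikit-learn on small made-up label arrays; in an Autopilot experiment, the prebuilt Model Monitor container computes them for you.

from sklearn.metrics import accuracy_score, f1_score, mean_squared_error, roc_auc_score

# Regression: MSE
y_true_reg = [3.1, 0.5, 2.0, 7.4]
y_pred_reg = [2.9, 0.7, 2.2, 7.0]
mse = mean_squared_error(y_true_reg, y_pred_reg)

# Binary classification: Accuracy, F1, AUC
y_true_bin = [0, 1, 1, 0, 1]
y_score_bin = [0.2, 0.8, 0.6, 0.4, 0.9]            # predicted probabilities
y_pred_bin = [int(s >= 0.5) for s in y_score_bin]  # labels at a 0.5 threshold
accuracy = accuracy_score(y_true_bin, y_pred_bin)
f1 = f1_score(y_true_bin, y_pred_bin)
auc_score = roc_auc_score(y_true_bin, y_score_bin)

# Multiclass classification: Accuracy, F1 macro
y_true_mc = [0, 1, 2, 2, 1]
y_pred_mc = [0, 2, 2, 2, 1]
accuracy_mc = accuracy_score(y_true_mc, y_pred_mc)
f1_macro = f1_score(y_true_mc, y_pred_mc, average="macro")

print(mse, accuracy, f1, auc_score, accuracy_mc, f1_macro)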

Here is an example of a metrics table in a Model Monitor report generated by an AutoML job for a regression problem.


[Figure: Amazon SageMaker Autopilot model insights regression metrics report example]

Here is an example of a metrics table in a Model Monitor report generated by an AutoML job for a binary classification problem.


[Figure: Amazon SageMaker Autopilot model insights binary classification metrics report example]

Here is an example of a metrics table in a Model Monitor report generated by an AutoML job for a multiclass classification problem.


[Figure: Amazon SageMaker Autopilot model insights multiclass metrics report example]

Confusion matrix

The confusion matrix provides a way to visualize the accuracy of the predictions made by binary and multiclass classification models for different classes. The confusion matrix is a table that contains the percentages of correct and incorrect predictions for the actual labels. Each row in the confusion matrix indicates how the examples with a given actual label were classified across the labels predicted by the model. The percentages of accurate predictions are on the diagonal, from the upper-left to the lower-right corner. The off-diagonal percentages indicate the types of misclassification that the model makes. These incorrect predictions are the confusion values.
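
As a rough illustration of how such a table is built, the following sketch uses scikit-learn on made-up labels to normalize a confusion matrix by the actual labels, so that each row sums to 100 percent.

from sklearn.metrics import confusion_matrix

y_true = ["bird", "cat", "cat", "dog", "dog", "dog"]
y_pred = ["bird", "cat", "dog", "dog", "dog", "cat"]
labels = ["bird", "cat", "dog"]

# normalize="true" divides each row by the number of examples with that actual label,
# so the diagonal holds the fraction of correct predictions per actual label.
cm = confusion_matrix(y_true, y_pred, labels=labels, normalize="true")
print(cm * 100)   # percentages of correct and incorrect predictions for each actual label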

Here is an example of a confusion matrix for a binary classification problem.


[Figure: Amazon SageMaker Autopilot binary confusion matrix example]

Here is an example of a confusion matrix for a multiclass classification problem.


[Figure: Amazon SageMaker Autopilot multiclass confusion matrix example]

This report provides a confusion matrix that can accommodate a maximum of 15 labels for multiclass classification problem types. The labels are listed in order, from those predicted least accurately to those predicted most accurately. If a row shows NaN, it means that the validation dataset doesn't have a row for that label.

The area under the receiver operating characteristic curve

The area under the receiver operating characteristic curve (AUC ROC curve) represents the tradeoff between true positive and false positive rates. The AUC ROC curve is an industry-standard accuracy metric used for binary classification models. AUC measures the ability of the model to predict a higher score for positive examples than for negative examples. The AUC metric provides an aggregated measure of the model performance across all possible classification thresholds.

The AUC metric returns a decimal value from zero (0) to one (1). AUC values near 1 indicate an ML model that is highly accurate. Values near 0.5 indicate an ML model that is no better than guessing at random. Values near 0 are unusual to see, and these typically indicate a problem with the data. Essentially, an AUC near 0 says that the ML model has learned the correct patterns, but is using them to make predictions that are as inaccurate as possible. For example, 0s are predicted as 1s, and 1s as 0s. For more information about the AUC metric, see the Receiver operating characteristic article on Wikipedia.

A binary model whose classifications are no better than random guessing, with equal rates of true and false positives, has an AUC score of 0.5. The curve representing a random binary classifier appears as a diagonal dotted red line in a receiver operating characteristic graph. The curves of more accurate classification models lie above this random baseline, where the rate of true positives exceeds the rate of false positives.


[Figure: Amazon SageMaker Autopilot receiver operating characteristic curve example]
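
A minimal sketch of how such a curve and its AUC value could be computed with scikit-learn on made-up scores follows; the report generates this chart for the best candidate automatically.

from sklearn.metrics import roc_auc_score, roc_curve

y_true = [0, 0, 1, 1, 0, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.65, 0.5]   # predicted probabilities

# roc_curve returns one (FPR, TPR) point per classification threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print("AUC:", roc_auc_score(y_true, y_score))
for f, t in zip(fpr, tpr):
    print(f"FPR={f:.2f}  TPR={t:.2f}")
# A random classifier traces the diagonal FPR == TPR and has an AUC of 0.5.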

The false positive rate (FPR) measures the false alarm rate or the fraction of actual negatives that were falsely predicted as positives. The range is 0 to 1. A smaller value indicates better predictive accuracy.

  • FPR = FP/(FP+TN)

The true positive rate (TPR) measures the fraction of actual positives that were predicted as positives. The range is 0 to 1. A larger value (1 being the largest) indicates better predictive accuracy.

  • TPR = TP/(TP+FN)

Where these rates are defined as follows.

  • Correct predictions

    • True positive (TP): The predicted value is 1, and the true value is 1.

    • True negative (TN): The predicted value is 0, and the true value is 0.

  • Erroneous predictions

    • False positive (FP): The predicted value is 1, but the true value is 0.

    • False negative (FN): The predicted value is 0, but the true value is 1.
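
To make these definitions concrete, here is a small worked sketch with made-up labels that counts the four outcomes and derives TPR and FPR from them.

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

# Count each outcome by comparing predicted and true values.
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)   # 3
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)   # 1
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)   # 1
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)   # 3

tpr = tp / (tp + fn)   # 3 / 4 = 0.75
fpr = fp / (fp + tn)   # 1 / 4 = 0.25
print(tpr, fpr)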

Precision-recall curve

The precision-recall curve represents the tradeoff between precision and recall for different thresholds used in a binary classification problem. The objective of a binary classification problem is to correctly classify as many of the relevant elements that are labeled positive in a training dataset as possible. A system with high recall but low precision returns many relevant results, but a high percentage of its predicted labels are incorrect when compared to the training labels. A system with high precision but low recall returns fewer relevant results, but a high percentage of its predicted labels are correct when compared to the training labels. A perfect system that has both high precision and high recall returns many correctly labeled results. For more information, see the Precision and recall article on Wikipedia.

Precision measures the fraction of actual positives that are predicted as positive out of all those predicted as positive. The range is 0 to 1. A larger value indicates better accuracy in the values predicted.

  • Precision = TP/(TP+FP)

Recall measures the fraction of actual positives that are predicted as positive out of all of the actual positives in the sample. This is also known as the sensitivity and as the true positive rate. The range is 0 to 1. A larger value indicates better detection of positive values from the sample.

  • Recall = TP/(TP+FN)

Amazon SageMaker Autopilot reports the area under the precision-recall curve (AUPRC). The AUPRC metric provides an aggregated measure of the model performance across all possible classification thresholds.
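
As an informal sketch, the curve and its AUPRC value can be computed with scikit-learn (an assumption for illustration, not the mechanism Autopilot documents) as follows.

from sklearn.metrics import auc, precision_recall_curve

y_true = [0, 1, 1, 0, 1, 1, 0, 0]
y_score = [0.15, 0.9, 0.7, 0.3, 0.6, 0.85, 0.45, 0.2]   # predicted probabilities

# precision_recall_curve returns one (precision, recall) point per threshold.
precision, recall, thresholds = precision_recall_curve(y_true, y_score)

# Area under the precision-recall curve, computed with the trapezoidal rule.
auprc = auc(recall, precision)
print("AUPRC:", auprc)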

Here is an example that compares the precision-recall curves and their AUPRC values from four different models trained on the same dataset.


[Figure: Amazon SageMaker Autopilot precision-recall curve example]