Metrics for evaluating your model

After your model is trained, Amazon Rekognition Custom Labels returns metrics from model testing, which you can use to evaluate the performance of your model. This topic describes the metrics available to you, and how to understand if your trained model is performing well.

The Amazon Rekognition Custom Labels console provides the following metrics as a summary of the training results and as metrics for each label:

Precision
Recall
F1

Each of these is a commonly used metric for evaluating the performance of a machine learning model. Amazon Rekognition Custom Labels returns metrics for the results of testing across the entire test dataset, along with metrics for each custom label. You can also review the performance of your trained custom model for each image in your test dataset. For more information, see Accessing evaluation metrics (Console).

Evaluating model performance

During testing, Amazon Rekognition Custom Labels predicts if a test image contains a custom label. The confidence score is a value that quantifies the certainty of the model’s prediction.

If the confidence score for a custom label exceeds the threshold value, the model output will include this label. Predictions can be categorized in the following ways:

True positive – The Amazon Rekognition Custom Labels model correctly predicts the presence of the custom label in the test image. That is, the predicted label is also a "ground truth" label for that image. For example, Amazon Rekognition Custom Labels correctly returns a soccer ball label when a soccer ball is present in an image.
False positive – The Amazon Rekognition Custom Labels model incorrectly predicts the presence of a custom label in a test image. That is, the predicted label isn’t a ground truth label for the image. For example, Amazon Rekognition Custom Labels returns a soccer ball label, but there is no soccer ball label in the ground truth for that image.
False negative – The Amazon Rekognition Custom Labels model doesn't predict that a custom label is present in the image, but the "ground truth" for that image includes this label. For example, Amazon Rekognition Custom Labels doesn't return a soccer ball label for an image that contains a soccer ball.
True negative – The Amazon Rekognition Custom Labels model correctly predicts that a custom label isn't present in the test image. For example, Amazon Rekognition Custom Labels doesn’t return a soccer ball label for an image that doesn’t contain a soccer ball.
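
The four categories above can be illustrated with a minimal sketch. The function name `categorize` and the label names are hypothetical, and the sketch assumes the predicted labels have already been filtered by the confidence threshold. It works at the image level only and is not how the service itself is implemented.

```python
def categorize(predicted_labels, ground_truth_labels, all_labels):
    """Classify each label for one test image as TP, FP, FN, or TN."""
    predicted = set(predicted_labels)
    truth = set(ground_truth_labels)
    outcome = {}
    for label in all_labels:
        if label in predicted and label in truth:
            outcome[label] = "true positive"
        elif label in predicted:
            outcome[label] = "false positive"
        elif label in truth:
            outcome[label] = "false negative"
        else:
            outcome[label] = "true negative"
    return outcome


# Hypothetical image: the model returns two labels, the ground truth has two.
result = categorize(
    predicted_labels=["soccer ball", "rock"],
    ground_truth_labels=["soccer ball", "goal"],
    all_labels=["soccer ball", "rock", "goal", "whistle"],
)
for label, category in result.items():
    print(f"{label}: {category}")
```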

The console provides access to true positive, false positive, and false negative values for each image in your test dataset. For more information, see Accessing evaluation metrics (Console).

These prediction results are used to calculate the following metrics for each label, and an aggregate for your entire test set. The same definitions apply to predictions made by the model at the bounding box level, with the distinction that all metrics are calculated over each bounding box (prediction or ground truth) in each test image.

Intersection over Union (IoU) and object detection

Intersection over Union (IoU) measures how much two object bounding boxes overlap: it is the area of their intersection divided by the area of their union. The range is 0 (no overlap) to 1 (complete overlap). During testing, a predicted bounding box is correct when the IoU of the ground truth bounding box and the predicted bounding box is at least 0.5.
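
As a rough illustration of the IoU calculation, the following sketch assumes each box is a (left, top, width, height) tuple in any consistent unit. The function names `iou` and `is_correct_box` and the sample boxes are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (left, top, width, height) boxes."""
    a_left, a_top, a_width, a_height = box_a
    b_left, b_top, b_width, b_height = box_b
    a_right, a_bottom = a_left + a_width, a_top + a_height
    b_right, b_bottom = b_left + b_width, b_top + b_height

    # Width and height of the overlapping rectangle (0 if the boxes are disjoint).
    inter_w = max(0.0, min(a_right, b_right) - max(a_left, b_left))
    inter_h = max(0.0, min(a_bottom, b_bottom) - max(a_top, b_top))
    intersection = inter_w * inter_h

    union = a_width * a_height + b_width * b_height - intersection
    return intersection / union if union > 0 else 0.0


def is_correct_box(ground_truth_box, predicted_box, min_iou=0.5):
    """A predicted box counts as correct when its IoU is at least min_iou."""
    return iou(ground_truth_box, predicted_box) >= min_iou


ground_truth = (10, 10, 40, 40)
close_prediction = (10, 15, 40, 40)   # shifted down by 5 units
loose_prediction = (20, 20, 40, 40)   # shifted by 10 units on both axes

print(is_correct_box(ground_truth, close_prediction))  # True  (IoU = 1400/1800)
print(is_correct_box(ground_truth, loose_prediction))  # False (IoU = 900/2300)
```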

Assumed threshold

Amazon Rekognition Custom Labels automatically calculates an assumed threshold value (0-1) for each of your custom labels. You can't set the assumed threshold value for a custom label. The assumed threshold for a label is the confidence value above which a prediction is counted as a true or false positive. It is calculated from your test dataset, based on the best F1 score achieved on that dataset during model training.
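
The exact procedure the service uses is not described here. As a hedged sketch of the general idea only, the following code tries a grid of candidate thresholds for one label and keeps the one with the best F1 score. The function name `best_f1_threshold`, the candidate grid, the tie-breaking rule, and the sample scores are all assumptions made for illustration.

```python
def best_f1_threshold(scored_predictions):
    """Pick the threshold with the highest F1 for one label.

    scored_predictions: list of (confidence in 0-1, label_is_in_ground_truth).
    Returns (threshold, f1). Ties keep the lowest threshold (arbitrary choice).
    """
    total_positives = sum(1 for _, is_true in scored_predictions if is_true)
    candidates = [i / 20 for i in range(21)]  # 0.0, 0.05, ..., 1.0 (assumed grid)

    best_threshold, best_f1 = 0.0, -1.0
    for threshold in candidates:
        # A prediction counts only when its confidence exceeds the threshold.
        kept = [is_true for score, is_true in scored_predictions if score > threshold]
        tp = sum(kept)
        fp = len(kept) - tp
        fn = total_positives - tp
        denominator = 2 * tp + fp + fn
        f1 = 2 * tp / denominator if denominator else 0.0
        if f1 > best_f1:
            best_threshold, best_f1 = threshold, f1
    return best_threshold, best_f1


# Hypothetical test-set scores for a single label.
scores = [(0.92, True), (0.81, True), (0.77, False),
          (0.64, True), (0.43, False), (0.31, False)]
threshold, f1 = best_f1_threshold(scores)
print(threshold, round(f1, 3))
```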

You can get the value of the assumed threshold for a label from the model's training results. For more information, see Accessing evaluation metrics (Console).

Changes to threshold values are typically used to improve the precision and recall of a model. For more information, see Improving an Amazon Rekognition Custom Labels model. Because you can't set a model's assumed threshold for a label, you can achieve the same result by analyzing an image with DetectCustomLabels and specifying the MinConfidence input parameter. For more information, see Analyzing an image with a trained model.
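
A minimal sketch of passing MinConfidence to DetectCustomLabels with the AWS SDK for Python (Boto3) might look like the following. The helper name `detect_labels`, the bucket, the object key, and the ARN placeholder are hypothetical. The client is passed in as an argument so the helper itself has no import-time dependency on Boto3.

```python
def detect_labels(client, project_version_arn, bucket, key, min_confidence):
    """Return (name, confidence) pairs for labels at or above min_confidence.

    client: a Rekognition client, for example boto3.client("rekognition").
    min_confidence: a value from 0 to 100.
    """
    response = client.detect_custom_labels(
        ProjectVersionArn=project_version_arn,
        Image={"S3Object": {"Bucket": bucket, "Name": key}},
        MinConfidence=min_confidence,
    )
    return [(label["Name"], label["Confidence"])
            for label in response["CustomLabels"]]


# Example usage (requires boto3, AWS credentials, and a running model version):
#   import boto3
#   client = boto3.client("rekognition")
#   labels = detect_labels(client, "<your-project-version-arn>",
#                          "my-bucket", "images/test.jpg", 70)
#   for name, confidence in labels:
#       print(f"{name}: {confidence:.1f}")
```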

Precision

Amazon Rekognition Custom Labels provides precision metrics for each label and an average precision metric for the entire test dataset.

Precision is the fraction of correct predictions (true positives) over all model predictions (true and false positives) at the assumed threshold for an individual label. As the threshold is increased, the model might make fewer predictions. In general, however, it will have a higher ratio of true positives over false positives compared to a lower threshold. Possible values for precision range from 0–1, and higher values indicate higher precision.

For example, when the model predicts that a soccer ball is present in an image, how often is that prediction correct? Suppose there's an image with 8 soccer balls and 5 rocks. If the model predicts 9 soccer balls (8 correctly predicted and 1 false positive), then the precision for this example is 8/9, or about 0.89. However, if the model predicted 13 soccer balls in the image with 8 correct predictions and 5 incorrect, then the resulting precision is lower: 8/13, or about 0.62.
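
The arithmetic in this example can be checked with a short sketch. The function name `precision` is hypothetical, and returning 0.0 when the model makes no predictions is a convention assumed here, because precision is otherwise undefined in that case.

```python
def precision(true_positives, false_positives):
    """Fraction of the model's predictions that are correct."""
    predicted = true_positives + false_positives
    return true_positives / predicted if predicted else 0.0


# 9 predicted soccer balls: 8 correct, 1 false positive.
print(round(precision(8, 1), 2))  # 0.89
# 13 predicted soccer balls: 8 correct, 5 false positives.
print(round(precision(8, 5), 2))  # 0.62
```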

For more information, see Precision and recall.

Recall

Amazon Rekognition Custom Labels provides recall metrics for each label and an average recall metric for the entire test dataset.

Recall is the fraction of your test set labels that were predicted correctly above the assumed threshold. It is a measure of how often the model can predict a custom label correctly when it's actually present in the images of your test set. The range for recall is 0–1. Higher values indicate a higher recall.

For example, if an image contains 8 soccer balls, how many of them are detected correctly? In this example where an image has 8 soccer balls and 5 rocks, if the model detects 5 of the soccer balls, the recall value is 5/8, or about 0.62. If, after retraining, the new model detects 9 soccer balls, including all 8 that were present in the image, then the recall value is 1.0. The extra detection is a false positive, which lowers precision but doesn't affect recall.
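
The same example can be expressed as a short sketch. The function name `recall` is hypothetical, and returning 0.0 when there are no ground truth labels is an assumed convention.

```python
def recall(true_positives, false_negatives):
    """Fraction of ground truth labels that the model predicted correctly."""
    actual = true_positives + false_negatives
    return true_positives / actual if actual else 0.0


# 8 soccer balls in the image, 5 detected and 3 missed.
print(recall(5, 3))  # 0.625, reported as 0.62 in the text
# After retraining, all 8 are detected. The extra ninth detection is a
# false positive, so it does not enter the recall calculation.
print(recall(8, 0))  # 1.0
```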

For more information, see Precision and recall.

F1

Amazon Rekognition Custom Labels uses the F1 score metric to measure the average model performance of each label and the average model performance of the entire test dataset.

Model performance is an aggregate measure that takes into account both precision and recall over all labels (for example, F1 score or average precision). The model performance score is a value between 0 and 1. The higher the value, the better the model is performing for both recall and precision. Specifically, model performance for classification tasks is commonly measured by F1 score. That score is the harmonic mean of the precision and recall scores at the assumed threshold. For example, for a model with a precision of 0.9 and a recall of 1.0, the F1 score is 0.947.

A high value for F1 score indicates that the model is performing well for both precision and recall. If the model isn't performing well, for example, with a low precision of 0.30 and a high recall of 1.0, the F1 score is 0.46. Similarly, if the precision is high (0.95) and the recall is low (0.20), the F1 score is 0.33. In both cases, the F1 score is low and indicates problems with the model.
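
The F1 values quoted in this section follow from the harmonic mean formula, F1 = 2 × precision × recall / (precision + recall). A minimal sketch that reproduces them is shown below. The function name `f1_score` is hypothetical, and returning 0.0 when both inputs are 0 is an assumed convention.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


print(round(f1_score(0.9, 1.0), 3))    # 0.947
print(round(f1_score(0.30, 1.0), 2))   # 0.46
print(round(f1_score(0.95, 0.20), 2))  # 0.33
```

Because the harmonic mean is pulled toward the smaller of the two values, a model can't reach a high F1 score by doing well on only one of precision or recall.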

For more information, see F1 score.

Using metrics

For a given trained model, and depending on your application, you can make a trade-off between precision and recall by using the MinConfidence input parameter to DetectCustomLabels. At a higher MinConfidence value, you generally get higher precision (a larger share of the soccer ball predictions are correct), but lower recall (more actual soccer balls will be missed). At a lower MinConfidence value, you get higher recall (more actual soccer balls correctly predicted), but lower precision (more of those predictions will be wrong). For more information, see Analyzing an image with a trained model.
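
To make the trade-off concrete, the following sketch filters a hypothetical list of soccer ball detections at three MinConfidence values and reports precision and recall for each. The function name `precision_recall_at` and the detection data are invented for illustration. Confidence values use the 0 to 100 scale, and a detection is kept when its confidence is at least the minimum.

```python
def precision_recall_at(detections, actual_count, min_confidence):
    """Precision and recall after dropping detections below min_confidence.

    detections: list of (confidence 0-100, detection_is_correct).
    actual_count: number of objects really present (ground truth).
    """
    kept = [is_correct for confidence, is_correct in detections
            if confidence >= min_confidence]
    true_positives = sum(kept)
    precision = true_positives / len(kept) if kept else 0.0
    recall = true_positives / actual_count if actual_count else 0.0
    return precision, recall


# Hypothetical detections for an image that contains 6 soccer balls.
detections = [(95, True), (90, True), (85, True), (70, False), (65, True),
              (55, True), (40, False), (35, True), (30, False)]

for min_confidence in (80, 50, 20):
    p, r = precision_recall_at(detections, 6, min_confidence)
    print(f"MinConfidence={min_confidence}: precision={p:.2f}, recall={r:.2f}")
```

With this data, raising the minimum from 20 to 80 moves precision from about 0.67 to 1.00 while recall drops from 1.00 to 0.50.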

The metrics also inform you on the steps you might take to improve model performance if needed. For more information, see Improving an Amazon Rekognition Custom Labels model.

Note

DetectCustomLabels returns confidence values ranging from 0 to 100, which correspond to the metric range of 0-1. For example, a confidence of 85 corresponds to 0.85.
