Debugging a Failed Model Training - Rekognition

You might encounter errors during model training. Amazon Rekognition Custom Labels reports training errors in the console and in the response from DescribeProjectVersions.

Errors are either terminal (training can't continue) or non-terminal (training can continue). For errors that relate to the contents of the training and testing datasets, you can download the validation results (a manifest summary and training and testing validation manifests). Use the error codes in the validation results to find further information in this section. This section also provides information for manifest file errors (terminal errors that happen before the manifest file contents are validated).

Note

A manifest is the file used to store the contents of a dataset.

You can fix some errors by using the Amazon Rekognition Custom Labels console. Other errors might require you to update the training or testing manifest files, or to make other changes, such as updating IAM permissions. For more information, see the documentation for individual errors.

Terminal Errors

Terminal errors stop the training of a model. There are three categories of terminal training errors: service errors, manifest file errors, and manifest content errors.

In the console, Amazon Rekognition Custom Labels shows terminal errors for a model in the Status message column of the projects page.

If you're using the AWS SDK, you can find out whether a terminal manifest file error or a terminal manifest content error has occurred by checking the response from DescribeProjectVersions. In this case, the Status value is TRAINING_FAILED and the StatusMessage field contains the error.
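As a minimal sketch of that check, the following Python function scans a DescribeProjectVersions response for model versions whose Status is TRAINING_FAILED and collects each StatusMessage. The function name failed_trainings, the ARNs, and the sample response are illustrative assumptions, not real service output.

```python
def failed_trainings(response):
    """Return (version ARN, status message) pairs for failed trainings."""
    failures = []
    for version in response.get("ProjectVersionDescriptions", []):
        if version.get("Status") == "TRAINING_FAILED":
            failures.append(
                (version.get("ProjectVersionArn"), version.get("StatusMessage", ""))
            )
    return failures


# Hand-written sample response for illustration only (not real service output).
sample_response = {
    "ProjectVersionDescriptions": [
        {
            "ProjectVersionArn": "arn:example:version/1",
            "Status": "TRAINING_FAILED",
            "StatusMessage": "The manifest file is empty.",
        },
        {
            "ProjectVersionArn": "arn:example:version/2",
            "Status": "TRAINING_COMPLETED",
        },
    ]
}

for arn, message in failed_trainings(sample_response):
    print(f"{arn}: {message}")
```

With the AWS SDK for Python (Boto3), the response would typically come from a call such as boto3.client("rekognition").describe_project_versions(ProjectArn=project_arn), where project_arn is your own project's ARN.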

Service Errors

Terminal service errors occur when Amazon Rekognition experiences a service issue and can't continue training, for example, when another service that Amazon Rekognition Custom Labels depends on fails. Amazon Rekognition Custom Labels reports service errors in the console as "Amazon Rekognition experienced a service issue". If you use the AWS SDK, service errors that occur during training are raised as an InternalServerError exception by CreateProjectVersion and DescribeProjectVersions.

If a service error occurs, retry training of the model. If training continues to fail, contact AWS Support and include any error information reported with the service error.
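The retry advice above can be sketched as a small helper. This is one possible approach under stated assumptions, not a prescribed pattern: the name retry_on_service_error and its parameters are hypothetical, and you supply both the callable that starts training and the exception class that represents a service error.

```python
import time


def retry_on_service_error(start_training, service_error, attempts=3, delay_seconds=0.0):
    """Call start_training(), retrying when service_error is raised.

    The last error is re-raised once all attempts are used, so that its
    details remain available to include in a support request.
    """
    for attempt in range(1, attempts + 1):
        try:
            return start_training()
        except service_error:
            if attempt == attempts:
                raise
            # Wait before the next attempt (hypothetical fixed delay).
            time.sleep(delay_seconds)
```

With Boto3, service_error would likely be client.exceptions.InternalServerError, and start_training a function that wraps your own client.create_project_version call. Keep the number of attempts small, because each successful call starts a new training.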

Terminal Manifest File Errors

Manifest file errors are terminal errors in the training and testing datasets that happen at the file level or across multiple files. Manifest file errors are detected before the contents of the training and testing datasets are validated, so they prevent the reporting of non-terminal validation errors. For example, an empty training manifest file generates a "The manifest file is empty" error. Because the file is empty, no non-terminal JSON Line validation errors can be reported. The manifest summary is also not created.
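To illustrate the ordering described above, the following sketch runs a file-level check first and looks at individual JSON Lines only if that check passes. It is a local sanity check under simple assumptions, not the validation that Amazon Rekognition Custom Labels performs; the function name preflight_manifest and its messages are hypothetical.

```python
import json


def preflight_manifest(text):
    """Return a list of problems found in manifest text (JSON Lines format)."""
    # File-level check: an empty manifest stops all further checks,
    # so no line-level problems are reported for it.
    if not text.strip():
        return ["manifest is empty"]

    # Line-level checks: each non-blank line should parse as JSON.
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        try:
            json.loads(line)
        except json.JSONDecodeError as error:
            problems.append(f"line {number}: not valid JSON ({error.msg})")
    return problems
```

Running a check like this before you start training can catch an empty or malformed file early, but a manifest that passes it can still fail service-side validation.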

You must fix manifest file errors before you can train your model.

The following lists the manifest file errors.

Terminal Manifest Content Errors

Manifest content errors are terminal errors that relate to the content within a manifest. For example, if you get the error "The manifest file contains insufficient labeled images per label to perform auto-split", training can't finish because there aren't enough labeled images in the training dataset to create a testing dataset.

As well as being reported in the console and in the response from DescribeProjectVersions, the error is reported in the manifest summary along with any other terminal manifest content errors. For more information, see Understanding the Manifest Summary.

Non-terminal JSON Line errors are also reported in separate training and testing validation results manifests. The non-terminal JSON Line errors found by Amazon Rekognition Custom Labels are not necessarily related to the manifest content errors that stop training. For more information, see Understanding Training and Testing Validation Result Manifests.
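The following sketch shows one way to locate the manifest summary and the validation result manifests from a single model version description returned by DescribeProjectVersions. The field names used here (ManifestSummary, TrainingDataResult, TestingDataResult, Validation, Assets, GroundTruthManifest, S3Object) are assumed from the shape of that response and should be confirmed against the API reference. The function names are hypothetical.

```python
def _s3_uri(s3_object):
    """Format an S3Object dictionary as an s3:// URI."""
    return f"s3://{s3_object['Bucket']}/{s3_object['Name']}"


def validation_result_locations(version):
    """Collect S3 locations of the manifest summary and validation manifests.

    version is one entry from ProjectVersionDescriptions. The nested field
    names are assumptions; missing fields are treated as "no result".
    """
    locations = {}

    summary = version.get("ManifestSummary", {}).get("S3Object")
    if summary:
        locations["manifest_summary"] = _s3_uri(summary)

    for label, key in (("training", "TrainingDataResult"), ("testing", "TestingDataResult")):
        assets = version.get(key, {}).get("Validation", {}).get("Assets", [])
        locations[label] = [
            _s3_uri(asset["GroundTruthManifest"]["S3Object"])
            for asset in assets
            if "S3Object" in asset.get("GroundTruthManifest", {})
        ]
    return locations
```

You can then download each location with your usual Amazon S3 tooling. If a terminal manifest file error occurred, expect these locations to be absent, because the manifest summary is not created in that case.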

You must fix manifest content errors before you can train your model.

The following are the error messages for manifest content errors.

Non-Terminal JSON Line Validation Errors

JSON Line validation errors are non-terminal errors that don't require Amazon Rekognition Custom Labels to stop training a model.

JSON Line validation errors are not shown in the console.

In the training and testing datasets, a JSON Line represents the training or testing information for a single image. Validation errors in a JSON Line, such as an invalid image, are reported in the training and testing validation manifests. Amazon Rekognition Custom Labels completes training using the other, valid, JSON Lines that are in the manifest. For more information, see Understanding Training and Testing Validation Result Manifests. For information about validation rules, see Validation Rules for Manifest Files.
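As a rough sketch of how you might review a downloaded validation manifest, the following function walks the JSON Lines and yields those that carry error information. The error field name is passed as a parameter because it is an assumption here ("validation-errors" is used as the default); check a real validation manifest for the exact field names and nesting. The function name lines_with_errors is hypothetical.

```python
import json


def lines_with_errors(manifest_text, error_key="validation-errors"):
    """Yield (line number, source-ref, errors) for JSON Lines carrying errors.

    error_key is an assumed top-level field name. Every non-blank line is
    expected to be valid JSON; json.loads raises an error otherwise.
    """
    for number, line in enumerate(manifest_text.splitlines(), start=1):
        if not line.strip():
            continue
        record = json.loads(line)
        errors = record.get(error_key)
        if errors:
            yield number, record.get("source-ref"), errors
```

Each yielded item identifies one image whose JSON Line was skipped or partly rejected, which gives you a list of lines to fix before retraining.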

Note

Training fails if there are too many JSON Line errors.

We recommend that you also fix non-terminal JSON Line errors, because they can cause future errors or affect your model training.

Amazon Rekognition Custom Labels can generate the following non-terminal JSON Line validation errors.