Amazon Comprehend
Developer Guide

Creating a Custom Classifier Using the Console

To identify custom categories within a set of documents, you first need to create and train a classifier. The classifier is a custom machine learning model that classifies documents into your custom categories or labels. During training, the learning algorithm behind the custom model finds patterns in the training data and uses them to determine which textual features correspond to the categories that you've selected. The end result of the training process is a custom model, which you can then use to make inferences (predictions) about the categories within documents.

To train the classifier, you need a set of training documents. You label these documents with the categories that you want the document classifier to recognize. For more information on these training documents, see Custom Classification.
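As a rough illustration of preparing labeled training documents, the sketch below writes one document per line. It assumes a two-column CSV layout (label first, then document text). The console version described on this page may instead expect labels at a separate location, so treat the layout as an assumption. The helper name `build_training_csv` and the sample labels are hypothetical.

```python
import csv
import io

MAX_LABELS = 250  # label limit stated in the procedure on this page


def build_training_csv(examples):
    """Return CSV text with one 'label,document' row per training document.

    Assumption: a two-column layout (label, text); verify against the
    Custom Classification topic before uploading to Amazon S3.
    """
    labels = {label for label, _ in examples}
    if len(labels) > MAX_LABELS:
        raise ValueError(f"{len(labels)} labels exceeds the limit of {MAX_LABELS}")

    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for label, text in examples:
        # Collapse internal whitespace and newlines so each document
        # stays on a single line of the file.
        writer.writerow([label, " ".join(text.split())])
    return buffer.getvalue()


# Hypothetical sample data, for illustration only.
examples = [
    ("BILLING", "I was charged twice\nfor the same order."),
    ("SHIPPING", "My package has not arrived yet."),
]
training_csv = build_training_csv(examples)
print(training_csv)
```

The resulting text can be saved to a file and uploaded to the S3 location that you specify when you train the classifier.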

To train a document classifier

  1. Sign in to the AWS Management Console and open the Amazon Comprehend console.

  2. From the left menu, choose Customization and then choose Custom Classification.

  3. Choose Train classifier.

  4. Give the classifier a name. The name must be unique within the AWS Region and account.

  5. Select the language of the training documents. You can train a document classifier using any of the languages that work with Amazon Comprehend: English, Spanish, German, Italian, French, or Portuguese. However, you can train a given classifier in only one language.

  6. If you choose to encrypt your training job, choose Job encryption and then choose whether to use a KMS key associated with the current account, or one from another account.

    • If you are using a key associated with the current account, for KMS key ID choose the key ID.

    • If you are using a key associated with a different account, for KMS key ARN enter the ARN of the key.

    Note

    For more information on creating and using KMS keys and the associated encryption, see Key Management Service (KMS).

  7. Under S3 data location, search for or enter the location of the Amazon S3 bucket that contains the training documents you want to use to train your classifier. The bucket must be in the same AWS Region as the API that you are calling. The total size of the training documents must be less than 5 GB, and you can provide up to 250 classification labels.

  8. Under Input format, select the format of the training data. Currently, the only supported format is one document per line in a file.

  9. Under Input labels S3 location, search for or enter the location of the Amazon S3 bucket that contains the input labels.

  10. (Optional) If you choose to encrypt the output result from your job, choose Encryption and then choose whether to use a KMS key associated with the current account, or one from another account.

    • If you are using a key associated with the current account, for KMS key ID choose the key alias.

    • If you are using a key associated with a different account, for KMS key ID enter the ARN for the key alias or ID.

  11. In the Choose an IAM role section, either select an existing IAM role or create a new one.

    • Choose an existing IAM role – Choose this option if you already have an IAM role with permissions to access the input and output Amazon S3 buckets.

    • Create a new IAM role – Choose this option when you want to create a new IAM role with the proper permissions for Amazon Comprehend to access the input and output buckets.

      Note

      If the input documents are encrypted, the IAM role used must have kms:Decrypt permission. For more information, see Permissions Required to Use KMS Encryption.

  12. (Optional) To launch your resources into Amazon Comprehend from a VPC, enter the VPC ID under VPC or choose the ID from the drop-down list.

    1. Choose the subnet under Subnet(s). After you select the first subnet, you can choose additional ones.

    2. Under Security Group(s), choose the security group to use if you specified one. After you select the first security group, you can choose additional ones.

    Note

    When you use a VPC with your classification job, the DataAccessRole used for the Create and Start operations must have permissions to the VPC from which the input documents and the output bucket are accessed.

  13. (Optional) To add a tag to the custom classifier, enter a key-value pair under Tags. Choose Add tag. To remove this pair before creating the classifier, choose Remove tag.

  14. Choose Create classifier.
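For readers who script the same setup, the console steps above correspond roughly to a single `CreateDocumentClassifier` request. The following is a minimal sketch, not the console's actual implementation. The helper names, the role ARN, and the bucket paths are hypothetical placeholders. Mapping the Job encryption setting to `VolumeKmsKeyId` is an assumption. The request is built as a plain dictionary so it can be inspected before anything is sent. The final call uses the boto3 `create_document_classifier` method and requires boto3 and AWS credentials.

```python
# Language codes for the languages listed in step 5.
SUPPORTED_LANGUAGES = {"en", "es", "de", "it", "fr", "pt"}


def build_classifier_request(name, language_code, role_arn, training_s3_uri,
                             output_s3_uri=None, volume_kms_key_id=None,
                             output_kms_key_id=None, subnets=None,
                             security_group_ids=None, tags=None):
    """Build keyword arguments for create_document_classifier (sketch)."""
    if language_code not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language code: {language_code}")

    request = {
        "DocumentClassifierName": name,               # step 4
        "LanguageCode": language_code,                # step 5
        "DataAccessRoleArn": role_arn,                # step 11
        "InputDataConfig": {"S3Uri": training_s3_uri},  # step 7
    }
    if volume_kms_key_id:
        # Assumption: job encryption (step 6) maps to VolumeKmsKeyId.
        request["VolumeKmsKeyId"] = volume_kms_key_id
    if output_s3_uri:
        output = {"S3Uri": output_s3_uri}
        if output_kms_key_id:
            output["KmsKeyId"] = output_kms_key_id    # step 10
        request["OutputDataConfig"] = output
    if subnets and security_group_ids:
        request["VpcConfig"] = {                      # step 12
            "Subnets": list(subnets),
            "SecurityGroupIds": list(security_group_ids),
        }
    if tags:
        request["Tags"] = [{"Key": k, "Value": v} for k, v in tags.items()]  # step 13
    return request


def create_classifier(request):
    """Send the request; needs boto3 installed and AWS credentials configured."""
    import boto3  # third-party; imported lazily so the builder runs without it
    response = boto3.client("comprehend").create_document_classifier(**request)
    return response["DocumentClassifierArn"]


# Hypothetical placeholder values, for illustration only.
request = build_classifier_request(
    name="example-classifier",
    language_code="en",
    role_arn="arn:aws:iam::111122223333:role/ExampleComprehendRole",
    training_s3_uri="s3://amzn-s3-demo-bucket/training/",
    output_s3_uri="s3://amzn-s3-demo-bucket/output/",
    tags={"project": "demo"},
)
print(sorted(request))
```

Keeping the request construction separate from the network call makes it easier to review or log the parameters before starting a training job.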

The new classifier appears in the list, and the status field shows its current state: TRAINING for a classifier that is processing training documents, TRAINED for a classifier that is ready to use, or IN_ERROR for a classifier that has an error. Choose a classifier in the list to get more information about it, including any error messages.
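The same status values can be checked programmatically. The sketch below polls until the classifier reaches TRAINED or IN_ERROR. It assumes the response shape of the boto3 `describe_document_classifier` call (a `DocumentClassifierProperties` dictionary with `Status` and, on failure, `Message`). The function name `wait_for_classifier`, the polling interval, and the example ARN are hypothetical. The describe call is passed in as a parameter so the loop can be exercised with a stand-in, as shown here.

```python
import time


def wait_for_classifier(describe, classifier_arn, poll_seconds=60, max_polls=120):
    """Poll until the classifier is TRAINED; raise if it ends up IN_ERROR.

    `describe` is any callable with the keyword signature of boto3's
    comprehend.describe_document_classifier.
    """
    for _ in range(max_polls):
        response = describe(DocumentClassifierArn=classifier_arn)
        properties = response["DocumentClassifierProperties"]
        status = properties["Status"]
        if status == "TRAINED":
            return properties
        if status == "IN_ERROR":
            raise RuntimeError(properties.get("Message", "classifier training failed"))
        # Any other status (for example TRAINING) means keep waiting.
        time.sleep(poll_seconds)
    raise TimeoutError(f"classifier not trained after {max_polls} polls")


# Stand-in for the real API call so the example runs without AWS access.
statuses = iter(["TRAINING", "TRAINING", "TRAINED"])


def fake_describe(DocumentClassifierArn):
    return {"DocumentClassifierProperties": {"Status": next(statuses)}}


result = wait_for_classifier(
    fake_describe,
    "arn:aws:comprehend:us-east-1:111122223333:document-classifier/example-classifier",
    poll_seconds=0,
)
print(result["Status"])
```

With real credentials, `boto3.client("comprehend").describe_document_classifier` could be passed in place of `fake_describe`.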