Plain-text annotation files - Amazon Comprehend

Plain-text annotation files

For plain-text annotations, you create a comma-separated value (CSV) file that contains a list of annotations. The CSV file must contain the following columns if your training file input format is one document per line.

  • File: The name of the file containing the document. For example, if one of the document files is located at s3://my-S3-bucket/test-files/documents.txt, the value in the File column will be documents.txt. You must include the file extension (in this case '.txt') as part of the file name.

  • Line: The line number containing the entity. Omit this column if your input format is one document per file.

  • Begin offset: The character offset in the input text (relative to the beginning of the line) that shows where the entity begins. The first character is at position 0.

  • End offset: The character offset in the input text that shows where the entity ends.

  • Type: The customer-defined entity type. Each entity type must be an uppercase, underscore-separated string. We recommend using descriptive entity types such as MANAGER, SENIOR_MANAGER, or PRODUCT_CODE. Up to 25 entity types can be trained per model.
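The naming rule and the 25-type limit can be checked before a file is uploaded. The following is a minimal sketch, not part of the service: the function name check_entity_types is hypothetical, and the regular expression is one reading of "uppercase, underscore-separated" (runs of A to Z joined by single underscores). The service itself may accept other characters.

```python
import re

# Assumption: "uppercase, underscore-separated" is read here as runs of A-Z
# joined by single underscores; the service may accept other characters.
ENTITY_TYPE_PATTERN = re.compile(r"[A-Z]+(?:_[A-Z]+)*")
MAX_ENTITY_TYPES = 25  # limit stated in the column description


def check_entity_types(entity_types):
    """Return a list of problems found in a collection of entity type names."""
    problems = []
    unique_types = set(entity_types)
    for name in sorted(unique_types):
        if not ENTITY_TYPE_PATTERN.fullmatch(name):
            problems.append(f"invalid entity type: {name!r}")
    if len(unique_types) > MAX_ENTITY_TYPES:
        problems.append(
            f"{len(unique_types)} entity types exceed the limit of {MAX_ENTITY_TYPES}"
        )
    return problems


print(check_entity_types(["MANAGER", "SENIOR_MANAGER", "product code"]))
```

An empty list means that no problems were found under these assumptions.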

If your training file input format is one document per file, you omit the line number column and the Begin offset and End offset values are the offsets of the entity from the start of the document.
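To illustrate the difference between the two formats, the following sketch converts line-relative offsets into offsets from the start of the document. The helper name to_document_offsets is hypothetical, and the code assumes that lines are separated by a single newline character and that the end offset is exclusive.

```python
def to_document_offsets(document_text, line, begin, end):
    """Convert line-relative offsets to offsets from the start of the document.

    Assumption: lines are separated by a single newline character.
    """
    lines = document_text.split("\n")
    # Characters before the target line: each earlier line plus its newline.
    line_start = sum(len(earlier) + 1 for earlier in lines[:line])
    return line_start + begin, line_start + end


text = (
    "Diego Ramirez is an engineer in the high tech industry.\n"
    "Emilio Johnson has been an engineer for 14 years."
)
doc_begin, doc_end = to_document_offsets(text, line=1, begin=0, end=14)
print(doc_begin, doc_end, text[doc_begin:doc_end])
```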

The following example is for one document per line. The file documents.txt contains four lines (rows 0, 1, 2, and 3):

    Diego Ramirez is an engineer in the high tech industry.
    Emilio Johnson has been an engineer for 14 years.
    J Doe is a judge on the Washington Supreme Court.
    Our latest new employee, Mateo Jackson, has been a manager in the industry for 4 years.

The CSV file with the list of annotations is as follows:

    File, Line, Begin Offset, End Offset, Type
    documents.txt, 0, 0, 13, ENGINEER
    documents.txt, 1, 0, 14, ENGINEER
    documents.txt, 3, 25, 38, MANAGER

In the annotations file, line numbers are zero-based, so the first line of a document is line 0. In this example, the CSV file contains no entry for line 2 because there is no entity in line 2 of documents.txt.
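One way to sanity-check a set of annotations is to slice each document line with its offsets and confirm that the result is the intended entity. The sketch below does this for the example data. The helper name annotated_text is hypothetical, and treating the end offset as exclusive is an inference from the example values rather than a documented guarantee.

```python
documents = [
    "Diego Ramirez is an engineer in the high tech industry.",
    "Emilio Johnson has been an engineer for 14 years.",
    "J Doe is a judge on the Washington Supreme Court.",
    "Our latest new employee, Mateo Jackson, has been a manager in the industry for 4 years.",
]

# (line, begin offset, end offset, type), copied from the example annotations.
annotations = [
    (0, 0, 13, "ENGINEER"),
    (1, 0, 14, "ENGINEER"),
    (3, 25, 38, "MANAGER"),
]


def annotated_text(lines, line, begin, end):
    # Assumption: the end offset is exclusive, as the example values suggest.
    return lines[line][begin:end]


spans = [
    annotated_text(documents, line, begin, end)
    for line, begin, end, _entity_type in annotations
]
print(spans)
```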

Creating your data files

It's important to put your annotations in a properly configured CSV file to reduce the risk of errors. To manually configure your CSV file, the following must be true:

  • UTF-8 encoding must be explicitly specified, even if it's used as a default in most cases.

  • The first line contains the column headers: File, Line (optional), Begin Offset, End Offset, Type.
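The two requirements above can be checked locally before upload. The following is a rough sketch using Python's standard csv module. The function name read_annotations is hypothetical, and the check covers only the encoding and the header row, not everything the service validates.

```python
import csv
import io

HEADERS_WITH_LINE = ["File", "Line", "Begin Offset", "End Offset", "Type"]
HEADERS_WITHOUT_LINE = ["File", "Begin Offset", "End Offset", "Type"]


def read_annotations(raw_bytes):
    """Decode CSV bytes as UTF-8, check the header row, and return the data rows."""
    text = raw_bytes.decode("utf-8")  # raises UnicodeDecodeError on invalid UTF-8
    reader = csv.reader(io.StringIO(text, newline=""), skipinitialspace=True)
    header = [name.strip() for name in next(reader, [])]
    if header not in (HEADERS_WITH_LINE, HEADERS_WITHOUT_LINE):
        raise ValueError(f"unexpected header row: {header}")
    # Map each non-empty data row onto the header names.
    return [
        dict(zip(header, (cell.strip() for cell in row)))
        for row in reader
        if row
    ]


sample = (
    b"File, Line, Begin Offset, End Offset, Type\n"
    b"documents.txt, 0, 0, 13, ENGINEER\n"
)
rows = read_annotations(sample)
print(rows)
```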

We highly recommend that you generate the CSV input files programmatically to avoid potential issues.

The following example uses Python to generate a CSV for the annotations shown earlier:

import csv

# Assumes the ./annotations directory already exists.
# newline="" lets the csv module control line endings.
with open("./annotations/annotations.csv", "w", encoding="utf-8", newline="") as csv_file:
    csv_writer = csv.writer(csv_file)
    # Header row, followed by one row per annotated entity.
    csv_writer.writerow(["File", "Line", "Begin Offset", "End Offset", "Type"])
    csv_writer.writerow(["documents.txt", 0, 0, 13, "ENGINEER"])
    csv_writer.writerow(["documents.txt", 1, 0, 14, "ENGINEER"])
    csv_writer.writerow(["documents.txt", 3, 25, 38, "MANAGER"])
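Hard-coded offsets are easy to get wrong, so you may prefer to derive them from the document text. The following sketch shows one possible approach. The helper build_rows is hypothetical, it locates only the first occurrence of each entity on its line, and it treats the end offset as exclusive, which matches the example annotations but is an assumption. The rows are written to an in-memory buffer here; a real script would write to a file opened with encoding="utf-8".

```python
import csv
import io


def build_rows(file_name, lines, entities):
    """Yield one annotation row per (line number, entity text, type) triple.

    Hypothetical helper: finds the first occurrence of the entity text on the
    given line and treats the end offset as exclusive.
    """
    for line_number, entity_text, entity_type in entities:
        begin = lines[line_number].find(entity_text)
        if begin == -1:
            raise ValueError(f"{entity_text!r} not found on line {line_number}")
        yield [file_name, line_number, begin, begin + len(entity_text), entity_type]


lines = [
    "Diego Ramirez is an engineer in the high tech industry.",
    "Emilio Johnson has been an engineer for 14 years.",
    "J Doe is a judge on the Washington Supreme Court.",
    "Our latest new employee, Mateo Jackson, has been a manager in the industry for 4 years.",
]
entities = [
    (0, "Diego Ramirez", "ENGINEER"),
    (1, "Emilio Johnson", "ENGINEER"),
    (3, "Mateo Jackson", "MANAGER"),
]

buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(["File", "Line", "Begin Offset", "End Offset", "Type"])
writer.writerows(build_rows("documents.txt", lines, entities))
print(buffer.getvalue())
```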