CSV files (plain text only)
When using comma-separated value (CSV) files for your annotations, your CSV files must have the following columns:
File | Line | Begin offset | End offset | Type |
---|---|---|---|---|
The name of the file containing the document. For example, if one of the document files is located at | The line number containing the entity, starting with line 0. | The character offset in the input text (relative to the beginning of the line) that shows where the entity begins. The first character is at position 0. | The character offset in the input text that shows where the entity ends. | The customer-defined entity type. An entity type must be an uppercase, underscore-separated string. We recommend using descriptive entity types such as |
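
The column rules in the table can be checked mechanically before a file is submitted. The following is a minimal sketch rather than official tooling: `validate_row` and `ENTITY_TYPE_PATTERN` are hypothetical names, and the sketch assumes that an entity type consists only of uppercase letters separated by single underscores and that the end offset must be greater than the begin offset.

```python
import re

# Assumption: only uppercase ASCII letters, separated by single underscores.
ENTITY_TYPE_PATTERN = re.compile(r"[A-Z]+(?:_[A-Z]+)*")


def validate_row(file_name, line, begin_offset, end_offset, entity_type):
    """Return a list of problems found in one annotation row (empty if none)."""
    problems = []
    if not file_name:
        problems.append("File must not be empty")
    if line < 0:
        problems.append("Line must be 0 or greater")
    if begin_offset < 0:
        problems.append("Begin offset must be 0 or greater")
    # Assumption: an entity spans at least one character.
    if end_offset <= begin_offset:
        problems.append("End offset must be greater than begin offset")
    if not ENTITY_TYPE_PATTERN.fullmatch(entity_type):
        problems.append("Type must be an uppercase, underscore-separated string")
    return problems
```

A row that follows the rules yields an empty list. A row with a lowercase type such as `engineer` yields a single message about the Type column.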
Here's an example:
The file documents.txt contains four lines (rows 0, 1, 2, and 3):

    Diego Ramirez is an engineer in the high tech industry.
    Emilio Johnson has been an engineer for 14 years.
    J Doe is a judge on the Washington Supreme Court.
    Our latest new employee, Mateo Jackson, has been a manager in the industry for 4 years.

The CSV file with the list of annotations is as follows:

    File, Line, Begin Offset, End Offset, Type
    documents.txt, 0, 0, 13, ENGINEER
    documents.txt, 1, 0, 15, ENGINEER
    documents.txt, 3, 25, 38, MANAGER

In the annotations file, line numbers start at 0. In this example, line 2 does not appear in the CSV file because there is no entity in line 2 of documents.txt.
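
To see which lines of a document have no annotation rows, the annotations can be grouped by line number. This is a minimal sketch under stated assumptions: `lines_without_annotations` is a hypothetical helper, the CSV content is passed in as a string instead of being read from disk, and `skipinitialspace=True` is used because the example file places a space after each comma.

```python
import csv
import io


def lines_without_annotations(csv_text, file_name, line_count):
    """Return the 0-based line numbers of file_name that have no annotation rows."""
    reader = csv.DictReader(io.StringIO(csv_text), skipinitialspace=True)
    annotated = {int(row["Line"]) for row in reader if row["File"] == file_name}
    return [n for n in range(line_count) if n not in annotated]


ANNOTATIONS = """File, Line, Begin Offset, End Offset, Type
documents.txt, 0, 0, 13, ENGINEER
documents.txt, 1, 0, 15, ENGINEER
documents.txt, 3, 25, 38, MANAGER
"""

print(lines_without_annotations(ANNOTATIONS, "documents.txt", 4))  # prints [2]
```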
Creating your data files
It's important to put your annotations in a properly configured CSV file to reduce the risk of errors. If you configure your CSV file manually, the following must be true:
- UTF-8 encoding must be explicitly specified, even if it's used as the default in most cases.
- The first line must contain the column headers: File, Line, Begin Offset, End Offset, Type.
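
Both requirements can be checked before a file is uploaded. The sketch below is illustrative only: `has_expected_header` and `EXPECTED_HEADER` are hypothetical names, and the sketch assumes that whitespace around header names is not significant. Opening the file with `encoding="utf-8"` makes Python raise `UnicodeDecodeError` for content that is not valid UTF-8 instead of falling back to a platform default.

```python
import csv

EXPECTED_HEADER = ["File", "Line", "Begin Offset", "End Offset", "Type"]


def has_expected_header(path):
    """Return True if the file decodes as UTF-8 and starts with the expected header."""
    try:
        # newline="" is what the csv module recommends for files it reads.
        with open(path, encoding="utf-8", newline="") as csv_file:
            # Read every row so that invalid bytes anywhere in the file are detected.
            rows = list(csv.reader(csv_file))
    except UnicodeDecodeError:
        return False
    if not rows:
        return False
    # Assumption: whitespace around header names is not significant.
    return [name.strip() for name in rows[0]] == EXPECTED_HEADER
```

The function reads the whole file into memory, which is reasonable for small annotation files. A very large file might call for a streaming check instead.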
We highly recommend that you generate CSV input files programmatically to avoid potential issues.
The following example uses Python to generate a CSV for the annotations shown above:
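
    import csv
    import os

    # Create the output directory if it does not already exist.
    os.makedirs("./annotations", exist_ok=True)

    # newline="" lets the csv module control line endings on every platform.
    with open("./annotations/annotations.csv", "w", encoding="utf-8", newline="") as csv_file:
        csv_writer = csv.writer(csv_file)
        # Header row, followed by one row per annotation.
        csv_writer.writerow(["File", "Line", "Begin Offset", "End Offset", "Type"])
        csv_writer.writerow(["documents.txt", 0, 0, 13, "ENGINEER"])
        csv_writer.writerow(["documents.txt", 1, 0, 15, "ENGINEER"])
        csv_writer.writerow(["documents.txt", 3, 25, 38, "MANAGER"])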