Outputs for asynchronous analysis jobs - Amazon Comprehend

Outputs for asynchronous analysis jobs

After an analysis job completes, it stores the results in the S3 bucket that you specified in the request.

Outputs for text inputs

For either format of text input documents (multi-class or multi-label), the job output consists of a single file named output.tar.gz. It's a compressed archive file that contains a text file with the output.

Multi-class output

When you use a classifier trained in multi-class mode, your results show classes. Each of these classes is the class used to create the set of categories when training your classifier.

For more details about these output fields, see ClassifyDocument in the Amazon Comprehend API Reference.

The following examples use the following mutually exclusive classes.

DOCUMENTARY SCIENCE_FICTION ROMANTIC_COMEDY SERIOUS_DRAMA OTHER

If your input data format is one document per line, the output file contains one line for each line in the input. Each line includes the file name, the zero-based line number of the input line, and the class or classes found in the document. It ends with the confidence that Amazon Comprehend has that the individual instance was correctly classified.

For example:

{"File": "file1.txt", "Line": "0", "Classes": [{"Name": "Documentary", "Score": 0.8642}, {"Name": "Other", "Score": 0.0381}, {"Name": "Serious_Drama", "Score": 0.0372}]} {"File": "file1.txt", "Line": "1", "Classes": [{"Name": "Science_Fiction", "Score": 0.5}, {"Name": "Science_Fiction", "Score": 0.0381}, {"Name": "Science_Fiction", "Score": 0.0372}]} {"File": "file2.txt", "Line": "2", "Classes": [{"Name": "Documentary", "Score": 0.1}, {"Name": "Documentary", "Score": 0.0381}, {"Name": "Documentary", "Score": 0.0372}]} {"File": "file2.txt", "Line": "3", "Classes": [{"Name": "Serious_Drama", "Score": 0.3141}, {"Name": "Other", "Score": 0.0381}, {"Name": "Other", "Score": 0.0372}]}

If your input data format is one document per file, the output file contains one line for each document. Each line has the name of the file and the class or classes found in the document. It ends with the confidence that Amazon Comprehend classified the individual instance accurately.

For example:

{"File": "file0.txt", "Classes": [{"Name": "Documentary", "Score": 0.8642}, {"Name": "Other", "Score": 0.0381}, {"Name": "Serious_Drama", "Score": 0.0372}]} {"File": "file1.txt", "Classes": [{"Name": "Science_Fiction", "Score": 0.5}, {"Name": "Science_Fiction", "Score": 0.0381}, {"Name": "Science_Fiction", "Score": 0.0372}]} {"File": "file2.txt", "Classes": [{"Name": "Documentary", "Score": 0.1}, {"Name": "Documentary", "Score": 0.0381}, {"Name": "Domentary", "Score": 0.0372}]} {"File": "file3.txt", "Classes": [{"Name": "Serious_Drama", "Score": 0.3141}, {"Name": "Other", "Score": 0.0381}, {"Name": "Other", "Score": 0.0372}]}

Multi-label output

When you use a classifier trained in multi-label mode, your results show labels. Each of these labels is the label used to create the set of categories when training your classifier.

The following examples use these unique labels.

SCIENCE_FICTION ACTION DRAMA COMEDY ROMANCE

If your input data format is one document per line, the output file contains one line for each line in the input. Each line includes the file name, the zero-based line number of the input line, and the class or classes found in the document. It ends with the confidence that Amazon Comprehend has that the individual instance was correctly classified.

For example:

{"File": "file1.txt", "Line": "0", "Labels": [{"Name": "Action", "Score": 0.8642}, {"Name": "Drama", "Score": 0.650}, {"Name": "Science Fiction", "Score": 0.0372}]} {"File": "file1.txt", "Line": "1", "Labels": [{"Name": "Comedy", "Score": 0.5}, {"Name": "Action", "Score": 0.0381}, {"Name": "Drama", "Score": 0.0372}]} {"File": "file1.txt", "Line": "2", "Labels": [{"Name": "Action", "Score": 0.9934}, {"Name": "Drama", "Score": 0.0381}, {"Name": "Action", "Score": 0.0372}]} {"File": "file1.txt", "Line": "3", "Labels": [{"Name": "Romance", "Score": 0.9845}, {"Name": "Comedy", "Score": 0.8756}, {"Name": "Drama", "Score": 0.7723}, {"Name": "Science_Fiction", "Score": 0.6157}]}

If your input data format is one document per file, the output file contains one line for each document. Each line has the name of the file and the class or classes found in the document. It ends with the confidence that Amazon Comprehend classified the individual instance accurately.

For example:

{"File": "file0.txt", "Labels": [{"Name": "Action", "Score": 0.8642}, {"Name": "Drama", "Score": 0.650}, {"Name": "Science Fiction", "Score": 0.0372}]} {"File": "file1.txt", "Labels": [{"Name": "Comedy", "Score": 0.5}, {"Name": "Action", "Score": 0.0381}, {"Name": "Drama", "Score": 0.0372}]} {"File": "file2.txt", "Labels": [{"Name": "Action", "Score": 0.9934}, {"Name": "Drama", "Score": 0.0381}, {"Name": "Action", "Score": 0.0372}]} {"File": "file3.txt”, "Labels": [{"Name": "Romance", "Score": 0.9845}, {"Name": "Comedy", "Score": 0.8756}, {"Name": "Drama", "Score": 0.7723}, {"Name": "Science_Fiction", "Score": 0.6157}]}

Outputs for semi-structured input documents

For semi-structured input documents, the output can include the following additional fields:

  • DocumentMetadata – Extraction information about the document. The metadata includes a list of pages in the document, with the number of characters extracted from each page. This field is present in the response if the request included the Byte parameter.

  • DocumentType – The document type for each page in the input document. This field is present in the response if the request included the Byte parameter.

  • Errors – Page-level errors that the system detected while processing the input document. The field is empty if the system encountered no errors.

For more details about these output fields, see ClassifyDocument in the Amazon Comprehend API Reference.

The following example shows output for a two-page scanned PDF file.

[{ #First page output "Classes": [ { "Name": "__label__2 ", "Score": 0.9993996620178223 }, { "Name": "__label__3 ", "Score": 0.0004330444789957255 } ], "DocumentMetadata": { "PageNumber": 1, "Pages": 2 }, "DocumentType": "ScannedPDF", "File": "file.pdf", "Version": "VERSION_NUMBER" }, #Second page output { "Classes": [ { "Name": "__label__2 ", "Score": 0.9993996620178223 }, { "Name": "__label__3 ", "Score": 0.0004330444789957255 } ], "DocumentMetadata": { "PageNumber": 2, "Pages": 2 }, "DocumentType": "ScannedPDF", "File": "file.pdf", "Version": "VERSION_NUMBER" }]