Inputs for real-time custom analysis - Amazon Comprehend

Inputs for real-time custom analysis

Real-time analysis using custom models takes a single document as input. The following topics describe the input document types that you can use.

Plain text documents

Provide the input document as UTF-8-encoded text.

Semi-structured documents

Semi-structured documents include native PDF documents and Word documents.

By default, real-time custom analysis uses the Amazon Comprehend parser to extract the text from Word files and digital PDF files. For PDF files, you can override this default and use Amazon Textract to extract the text. See Setting text extraction options.

Image files and scanned PDF files

Supported image types include JPEG, PNG, and TIFF. An input file can contain at most one page and no more than 10,000 characters.

By default, custom entity recognition uses the Amazon Textract DetectDocumentText API operation to extract the text from image files and scanned PDF files. You can override this default to use the AnalyzeDocument API operation instead. See Setting text extraction options.
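The text extraction overrides described for PDF and image inputs can be sketched as a small helper that builds the DocumentReaderConfig structure passed to ClassifyDocument or DetectEntities. This is a minimal sketch, not a definitive implementation: the function name and its parameters are hypothetical, and the default of requesting both TABLES and FORMS for AnalyzeDocument is an arbitrary choice made for illustration.

```python
from typing import Any, Dict, List, Optional


def build_reader_config(
    use_analyze_document: bool = False,
    force_textract_for_digital_pdf: bool = False,
    feature_types: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Build a DocumentReaderConfig dict (hypothetical helper).

    use_analyze_document: use AnalyzeDocument instead of the default
        DetectDocumentText for images and scanned PDF files.
    force_textract_for_digital_pdf: use Amazon Textract for digital PDF
        files too, instead of the Amazon Comprehend parser.
    """
    config: Dict[str, Any] = {
        "DocumentReadAction": (
            "TEXTRACT_ANALYZE_DOCUMENT"
            if use_analyze_document
            else "TEXTRACT_DETECT_DOCUMENT_TEXT"
        ),
        "DocumentReadMode": (
            "FORCE_DOCUMENT_READ_ACTION"
            if force_textract_for_digital_pdf
            else "SERVICE_DEFAULT"
        ),
    }
    if use_analyze_document:
        # Assumption: request both feature types unless the caller narrows it.
        config["FeatureTypes"] = feature_types or ["TABLES", "FORMS"]
    return config
```

With boto3, the returned dict would typically be supplied as the DocumentReaderConfig argument of classify_document or detect_entities, alongside Bytes and EndpointArn.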

Amazon Textract output

You can provide the JSON output from the Amazon Textract DetectDocumentText API or AnalyzeDocument API as input to the real-time API operations for custom classification and custom entity recognition. Amazon Comprehend supports this input type for the real-time API operations, but not for the console.
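As a rough illustration of preparing Amazon Textract output for a real-time request, the sketch below serializes a Textract response that is already in memory into bytes. The helper name is hypothetical. Two assumptions are made here and are not confirmed by this page: the SDK-added ResponseMetadata key is dropped before serialization, and 1 MB is treated as 1,048,576 bytes.

```python
import json
from typing import Any, Dict

# Assumption: the 1 MB limit for Textract output files means 1,048,576 bytes.
MAX_TEXTRACT_OUTPUT_BYTES = 1024 * 1024


def textract_output_to_bytes(textract_response: Dict[str, Any]) -> bytes:
    """Serialize a Textract response dict to UTF-8 JSON bytes (hypothetical helper)."""
    # Assumption: drop SDK bookkeeping that is not part of the Textract output itself.
    body = {k: v for k, v in textract_response.items() if k != "ResponseMetadata"}
    payload = json.dumps(body).encode("utf-8")
    if len(payload) > MAX_TEXTRACT_OUTPUT_BYTES:
        raise ValueError(
            "Textract output is %d bytes; the limit is %d bytes"
            % (len(payload), MAX_TEXTRACT_OUTPUT_BYTES)
        )
    return payload
```

The resulting bytes would then be sent as the document content of the ClassifyDocument or DetectEntities request.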

Maximum document sizes

For all input document types, an input file can contain at most one page and no more than 10,000 characters.

The following table shows the maximum file sizes for input documents.

File type               Maximum size (API)    Maximum size (console)
UTF-8 text documents    10 KB                 10 KB
PDF documents           10 MB                 5 MB
Word documents          10 MB                 5 MB
Image files             10 MB                 5 MB
Textract output files   1 MB                  n/a
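A client-side check against the API limits in the table can catch oversized files before a request is sent. The following is a minimal sketch under stated assumptions: the mapping from file extensions to file types is hypothetical (in particular, treating .json as Textract output), and 1 KB is taken as 1,024 bytes, which this page does not specify.

```python
from pathlib import Path

# Assumption: binary units (1 KB = 1,024 bytes, 1 MB = 1,024 KB).
KB = 1024
MB = 1024 * KB

# API limits from the table above, keyed by file type.
MAX_API_BYTES = {
    "text": 10 * KB,
    "pdf": 10 * MB,
    "word": 10 * MB,
    "image": 10 * MB,
    "textract": 1 * MB,
}

# Hypothetical mapping from file extension to file type.
EXTENSION_TO_TYPE = {
    ".txt": "text",
    ".pdf": "pdf",
    ".docx": "word",
    ".jpg": "image",
    ".jpeg": "image",
    ".png": "image",
    ".tif": "image",
    ".tiff": "image",
    ".json": "textract",
}


def api_size_limit(filename: str) -> int:
    """Return the API size limit in bytes for a file, based on its extension."""
    suffix = Path(filename).suffix.lower()
    if suffix not in EXTENSION_TO_TYPE:
        raise ValueError("Unrecognized file extension: %r" % suffix)
    return MAX_API_BYTES[EXTENSION_TO_TYPE[suffix]]


def fits_api_limit(filename: str, size_in_bytes: int) -> bool:
    """Return True if a file of the given size is within the API limit."""
    return size_in_bytes <= api_size_limit(filename)
```

This check covers file size only; the one-page and 10,000-character limits still apply and are enforced by the service.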

Errors in semi-structured documents

The ClassifyDocument or DetectEntities API operation can encounter document-level or page-level errors while extracting text from a semi-structured document or an image file.

Page-level errors

If the ClassifyDocument or DetectEntities API operation encounters errors while processing a page in the input document, the API response includes an entry in the Errors list for each error.

The ErrorCode in the error list entry contains one of the following values:

  • TEXTRACT_BAD_PAGE – Amazon Textract cannot read the page. For more information about page limits in Amazon Textract, see Page Quotas in Amazon Textract.

  • TEXTRACT_PROVISIONED_THROUGHPUT_EXCEEDED – The number of requests exceeded your throughput limit. For more information about throughput quotas in Amazon Textract, see Default quotas in Amazon Textract.

  • PAGE_CHARACTERS_EXCEEDED – Too many text characters on the page (10,000 characters maximum).

  • PAGE_SIZE_EXCEEDED – The page exceeds the maximum page size of 10 MB.

  • INTERNAL_SERVER_ERROR – The request encountered a service issue. Try the API request again.
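One way to act on these page-level error codes is to separate the entries that may succeed on a retry from those that require changing the input. The sketch below assumes the response has already been parsed into a dict whose Errors entries carry Page, ErrorCode, and ErrorMessage keys. The helper name is hypothetical, and treating TEXTRACT_PROVISIONED_THROUGHPUT_EXCEEDED as retryable (after a delay) is an assumption; this page only recommends a retry for INTERNAL_SERVER_ERROR.

```python
from typing import Any, Dict, List, Tuple

# INTERNAL_SERVER_ERROR: the text above says to try the request again.
# TEXTRACT_PROVISIONED_THROUGHPUT_EXCEEDED: assumed retryable after backing off.
RETRYABLE_PAGE_ERRORS = {
    "INTERNAL_SERVER_ERROR",
    "TEXTRACT_PROVISIONED_THROUGHPUT_EXCEEDED",
}


def split_page_errors(
    response: Dict[str, Any],
) -> Tuple[List[Dict[str, Any]], List[Dict[str, Any]]]:
    """Split the Errors list into (retryable, needs_input_change) entries."""
    retryable: List[Dict[str, Any]] = []
    needs_input_change: List[Dict[str, Any]] = []
    for entry in response.get("Errors", []):
        if entry.get("ErrorCode") in RETRYABLE_PAGE_ERRORS:
            retryable.append(entry)
        else:
            needs_input_change.append(entry)
    return retryable, needs_input_change
```

A caller might resubmit the request when only retryable entries are present, and otherwise report the page number and error code so that the input document can be corrected.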

Document-level errors

If the ClassifyDocument or DetectEntities API operation detects a document-level error in your input document, the API returns an InvalidRequestException error response.

In the error response, the Reason field contains the value INVALID_DOCUMENT.

The Detail field contains one of the following values:

  • DOCUMENT_SIZE_EXCEEDED – Document size is too large. Check the size of your file and resubmit the request.

  • UNSUPPORTED_DOC_TYPE – Document type is not supported. Check the file type and resubmit the request.

  • PAGE_LIMIT_EXCEEDED – Too many pages in the document. Check the number of pages in your file and resubmit the request.

  • TEXTRACT_ACCESS_DENIED_EXCEPTION – Access denied to Amazon Textract. Verify that your account has permission to use the Amazon Textract DetectDocumentText and AnalyzeDocument API operations and resubmit the request.
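The document-level values above can be mapped to the corrective action suggested for each. This is a hedged sketch: it assumes the InvalidRequestException payload has been parsed into a dict with Reason and Detail keys (for example, the response attached to an SDK exception), and because the exact shape of Detail is not shown on this page, it accepts either a plain string or a nested mapping with its own Reason key. The helper name is hypothetical.

```python
from typing import Any, Dict, Optional

# Corrective actions taken from the list of Detail values above.
DETAIL_ADVICE = {
    "DOCUMENT_SIZE_EXCEEDED": "Check the size of your file and resubmit the request.",
    "UNSUPPORTED_DOC_TYPE": "Check the file type and resubmit the request.",
    "PAGE_LIMIT_EXCEEDED": "Check the number of pages in your file and resubmit the request.",
    "TEXTRACT_ACCESS_DENIED_EXCEPTION": (
        "Verify that your account has permission to use the Amazon Textract "
        "DetectDocumentText and AnalyzeDocument API operations and resubmit the request."
    ),
}


def document_error_advice(error_response: Dict[str, Any]) -> Optional[str]:
    """Return advice for an INVALID_DOCUMENT error, or None for other errors."""
    if error_response.get("Reason") != "INVALID_DOCUMENT":
        return None
    detail = error_response.get("Detail")
    # Assumption: Detail may be a nested mapping that carries its own Reason.
    if isinstance(detail, dict):
        detail = detail.get("Reason")
    return DETAIL_ADVICE.get(detail, "Unrecognized detail value: %r" % (detail,))
```

Returning None for anything other than INVALID_DOCUMENT leaves unrelated request errors to the caller's general error handling.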