Document processing

Amazon Comprehend supports one-step document processing for custom classification and custom entity recognition. For example, you can input a mix of plain text documents and semi-structured documents (such as PDF documents, Microsoft Word documents, and images) to a custom analysis job.

For input files that require text extraction, Amazon Comprehend automatically performs the text extraction before running the analysis. To extract the text content, Amazon Comprehend uses an internal parser for native semi-structured documents and uses Amazon Textract APIs for images and scanned documents.
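As a minimal sketch of how text extraction might be configured for an asynchronous custom classification job, the helper below builds the request arguments for the boto3 `start_document_classification_job` call. The helper name, the `force_textract` flag, and all S3 URIs and ARNs are hypothetical placeholders. The sketch assumes one document per input file and leaves the choice of parser to the service unless the flag is set.

```python
def build_classification_job_request(
    input_s3_uri, output_s3_uri, classifier_arn, role_arn, force_textract=False
):
    """Build kwargs for start_document_classification_job (hypothetical helper)."""
    reader_config = {
        # Textract action to use when Textract is needed for extraction.
        "DocumentReadAction": "TEXTRACT_DETECT_DOCUMENT_TEXT",
        # SERVICE_DEFAULT lets Amazon Comprehend pick the parser per file;
        # FORCE_DOCUMENT_READ_ACTION always applies the action above.
        "DocumentReadMode": (
            "FORCE_DOCUMENT_READ_ACTION" if force_textract else "SERVICE_DEFAULT"
        ),
    }
    return {
        "DocumentClassifierArn": classifier_arn,
        "DataAccessRoleArn": role_arn,
        "InputDataConfig": {
            "S3Uri": input_s3_uri,
            # Assumption: each file in the S3 prefix is a single document.
            "InputFormat": "ONE_DOC_PER_FILE",
            "DocumentReaderConfig": reader_config,
        },
        "OutputDataConfig": {"S3Uri": output_s3_uri},
    }


# Placeholder values for illustration only; substitute your own resources.
request = build_classification_job_request(
    input_s3_uri="s3://example-bucket/input/",
    output_s3_uri="s3://example-bucket/output/",
    classifier_arn="arn:aws:comprehend:us-east-1:111122223333:document-classifier/example",
    role_arn="arn:aws:iam::111122223333:role/ExampleComprehendRole",
)
```

With boto3 installed and credentials configured, the request could then be submitted with `boto3.client("comprehend").start_document_classification_job(**request)`. Building the arguments separately keeps the extraction settings easy to inspect before a job is started.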

Amazon Comprehend document processing is available in each of the Amazon Comprehend Supported Regions, with one exception: Asia Pacific (Tokyo) and AWS GovCloud (US-West) support only plain-text models for custom classification.

The following topics provide details about the input document types that Amazon Comprehend supports for custom analysis.

Topics
