Text extraction workflow - Enhanced Document Understanding on AWS

Text extraction workflow

The text extraction workflow extracts text from uploaded documents (images or .pdf files) using Amazon Textract.

Text extraction serves as the basis for:

  • The entity detection workflow, which uses the extracted text both to detect entities and to map them to physical locations on the page.

  • The redaction workflow, which depends on the entity locations to redact entities on the page.

Important

The text extraction workflow is required for all use cases, and you must run it before the entity detection workflow or redaction workflow.

This workflow uses Step Functions, Textract, and other AWS services to extract text.

The process flow for the text extraction workflow is as follows:

  1. An EventBridge custom event bus invokes a Step Functions state machine.

  2. Based on the content of the event, the state machine determines whether the workflow should process each document.

  3. For each eligible document, a message is placed on an Amazon Simple Queue Service (Amazon SQS) queue. The message carries metadata about the document (for example, the document location in Amazon S3 or the AWS API to use for analysis).

  4. A Lambda function consumes the messages from the Amazon SQS queue.

  5. The Lambda function retrieves the original document from the Documents S3 bucket, using the metadata information in the queue’s message. If the document is a multi-page .pdf file, the solution splits it into individual files for each page, then saves those files in the S3 bucket alongside the original document.

  6. For each page, the solution calls one or more Amazon Textract APIs:

    1. The DetectDocumentText API runs for every page. This API performs optical character recognition (OCR) on the provided document page and returns all detected text along with its location in the document.

    2. If you set the RunTextractAnalyzeAction parameter to true in the configuration file, the solution also runs an analysis action based on the DocumentType property of the current document. This analysis action is one of AnalyzeDocument, AnalyzeID, or AnalyzeExpense. These actions provide more domain-specific information about the extracted text. See How the solution works for more information.

  7. The Lambda function uploads the results from all Amazon Textract API calls for all document pages to the ML inferences S3 bucket.

  8. The Lambda function notifies the calling Step Functions state machine of success or failure.

    1. If the text extraction succeeds, this step is complete.

    2. If the text extraction fails, the solution publishes the event to an Amazon SQS dead-letter queue, configured with a default retention period of four days.

  9. The Step Functions state machine publishes the success or failure event to the custom event bus, which invokes the workflow orchestrator Lambda function to create the next workflow event.
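
The per-page logic in step 6 can be sketched in Python as follows. This is an illustrative sketch, not the solution's actual Lambda code: the function names, the DocumentType values ("passport", "invoice"), and their mapping to analysis actions are assumptions made for the example. Only the API names and the general shape of a DetectDocumentText response (a list of Blocks, each with a BlockType, Text, and Geometry.BoundingBox) come from Amazon Textract itself.

```python
from typing import Any

# Hypothetical mapping from a document's DocumentType to a Textract analysis
# action. The type names on the left are assumptions made for this sketch.
ANALYZE_ACTION_BY_DOCUMENT_TYPE = {
    "passport": "AnalyzeID",
    "invoice": "AnalyzeExpense",
}
DEFAULT_ANALYZE_ACTION = "AnalyzeDocument"


def textract_actions_for_page(document_type: str, run_analyze_action: bool) -> list[str]:
    """Return the Textract APIs to call for one page, in order."""
    actions = ["DetectDocumentText"]  # step 6.1: runs for every page
    if run_analyze_action:  # step 6.2: only when the analyze flag is true
        actions.append(
            ANALYZE_ACTION_BY_DOCUMENT_TYPE.get(
                document_type.lower(), DEFAULT_ANALYZE_ACTION
            )
        )
    return actions


def lines_with_locations(response: dict[str, Any]) -> list[tuple[str, dict[str, float]]]:
    """Pair each detected line of text with its bounding box on the page."""
    return [
        (block["Text"], block["Geometry"]["BoundingBox"])
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    ]


# Hand-written fragment in the shape of a DetectDocumentText response.
# Bounding box values are ratios of the page width and height.
sample_response = {
    "Blocks": [
        {
            "BlockType": "PAGE",
            "Geometry": {"BoundingBox": {"Left": 0.0, "Top": 0.0, "Width": 1.0, "Height": 1.0}},
        },
        {
            "BlockType": "LINE",
            "Text": "Invoice 1234",
            "Geometry": {"BoundingBox": {"Left": 0.1, "Top": 0.05, "Width": 0.3, "Height": 0.02}},
        },
        {
            "BlockType": "WORD",
            "Text": "Invoice",
            "Geometry": {"BoundingBox": {"Left": 0.1, "Top": 0.05, "Width": 0.18, "Height": 0.02}},
        },
    ]
}

if __name__ == "__main__":
    print(textract_actions_for_page("invoice", run_analyze_action=True))
    for text, box in lines_with_locations(sample_response):
        print(text, box)
```

The bounding boxes returned alongside each line are what allow the later entity detection and redaction workflows to map text back to physical locations on the page.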