Text extraction workflow
The text extraction workflow extracts text from uploaded documents (images or .pdf files) using Amazon Textract.
Text extraction serves as the basis for:
-
The entity detection workflow, which uses the extracted text both to detect entities and to map them to physical locations on the page.
-
The redaction workflow, which depends on the entity locations to redact entities on the page.
Important
The text extraction workflow is required for all use cases, and you must run it before the entity detection workflow or redaction workflow.

The process flow for the text extraction workflow is as follows:
-
An EventBridge custom event bus invokes a Step Functions state machine.
-
Based on the content of the event, the state machine determines whether the workflow should process each document.
-
An Amazon Simple Queue Service (Amazon SQS) queue receives a message with metadata information for each eligible document (for example, the document location in Amazon S3 or the AWS API to use for analysis).
-
A Lambda function consumes the messages from the Amazon SQS queue.
-
The Lambda function retrieves the original document from the Documents S3 bucket, using the metadata information in the queue’s message. If the document is a multi-page .pdf file, the solution splits it into individual files for each page, then saves those files in the S3 bucket alongside the original document.
-
For each page, the solution calls Amazon Textract with one or more APIs:
-
The DetectDocumentText API runs for every page. This API performs optical character recognition (OCR) on the provided document page and returns all detected text along with its location on the page.
-
If you set the RunTextractAnalyzeAction parameter to true in the configuration file, the solution also runs an analysis action based on the DocumentType property of the current document. This analysis action is AnalyzeDocument, AnalyzeID, or AnalyzeExpense. These analysis actions provide more domain-specific information about the extracted text of the documents. See How the solution works for more information.
-
The Lambda function uploads the results from all Amazon Textract API calls for all document pages to the ML inferences S3 bucket.
-
The Lambda function notifies the calling Step Functions state machine of success or failure.
-
If the text extraction succeeds, this step is complete.
-
If the text extraction fails, the solution publishes the event to an Amazon SQS dead-letter queue, configured with a default retention period of four days.
-
The Step Functions state machine publishes the success or failure event to the custom event bus, which invokes the workflow orchestrator Lambda function to create the next workflow event.
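The per-page Amazon Textract calls described above can be sketched roughly as follows. This is a minimal illustration, not the solution's actual code. The document type names in the mapping and the helper names (select_textract_calls, build_request, extract_page) are assumptions made for this example. The method names and request shapes follow the boto3 Textract client. The FeatureTypes chosen for AnalyzeDocument are also an assumption.

```python
# Hypothetical mapping from a DocumentType value to a Textract analysis action.
# The keys are illustrative assumptions; the solution's real values may differ.
ANALYZE_ACTION_BY_TYPE = {
    "generic": "analyze_document",
    "id": "analyze_id",
    "expense": "analyze_expense",
}


def select_textract_calls(document_type, run_analyze_action):
    """Return the Textract client method names to call for one page."""
    calls = ["detect_document_text"]  # OCR runs for every page
    if run_analyze_action:
        action = ANALYZE_ACTION_BY_TYPE.get(document_type.lower())
        if action is not None:
            calls.append(action)
    return calls


def build_request(method, bucket, key):
    """Build keyword arguments for one Textract call on a page stored in S3."""
    s3_object = {"S3Object": {"Bucket": bucket, "Name": key}}
    if method == "analyze_id":
        # AnalyzeID takes a list of pages instead of a single document.
        return {"DocumentPages": [s3_object]}
    request = {"Document": s3_object}
    if method == "analyze_document":
        # AnalyzeDocument requires FeatureTypes; this choice is an assumption.
        request["FeatureTypes"] = ["TABLES", "FORMS"]
    return request


def extract_page(textract, bucket, key, document_type, run_analyze_action):
    """Call each selected API for one page and collect the raw responses."""
    results = {}
    for method in select_textract_calls(document_type, run_analyze_action):
        request = build_request(method, bucket, key)
        results[method] = getattr(textract, method)(**request)
    return results
```

In a real Lambda function, the textract argument would typically be boto3.client("textract"), and the collected responses would then be uploaded to the ML inferences S3 bucket.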