Redaction workflow - Enhanced Document Understanding on AWS

Redaction workflow

The redaction workflow irreversibly redacts text contained in processed documents (shown as black boxes in the UI).

The redaction workflow differs from the other workflows in the following ways:

  • Includes two separate Lambda functions with shared backing code:

    • One is invoked by the Step Functions workflow as part of a sequence of workflows defined in the workflow configuration.

    • One is manually invoked on processed documents in a case through a REST API.

    These Lambda functions are implemented with the Java runtime.

  • Uploads a redacted document to Amazon S3 rather than storing an inference.

  • Doesn’t interact with the case management store in DynamoDB.

Note

You can start this workflow from the UI application to redact specific content, where each request is limited to a specific entity or phrase. You can also invoke it as a standalone API call, with no human interaction or UI. A standalone API invocation supports phrase redaction as well as redaction of entities from multiple entity types and entity detection inferences, such as PII and PHI, in a single request.

UI workflow

This workflow uses Lambda and other AWS services to redact text using the UI.

UI redaction workflow

The process flow for the redaction workflow within the UI is as follows:

  1. An EventBridge custom event bus invokes a Step Functions state machine.

  2. Based on the content of the event, the state machine determines whether the workflow should process each document.

  3. A message containing the eligible documents and metadata (for example, the document location in Amazon S3 or the AWS API to use for analysis) is placed on an Amazon SQS queue.

  4. A Lambda function consumes the messages from the Amazon SQS queue.

  5. The Lambda function retrieves all entity detection inferences available for the given document from the ML inferences S3 bucket.

  6. The Lambda function retrieves the original document from the Documents S3 bucket.

  7. The Lambda function irreversibly redacts all entities contained in the retrieved inferences files from the document (shown as black boxes over the text).

  8. The Lambda function uploads both the redacted document and original document to the Documents S3 bucket. The redacted document includes -redacted appended to the filename.

  9. The Lambda function notifies the calling Step Functions of success or failure.

    1. If the redaction succeeds, this step is complete.

    2. If the redaction fails, the Step Functions state machine publishes the event to an Amazon SQS dead-letter queue, configured with a default retention period of four days.

  10. The solution publishes the success or failure event to the custom event bus.
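
Steps 7 and 8 above can be illustrated with a minimal Python sketch. It is not the solution's code (the solution's Lambda functions use the Java runtime), and both helper names are hypothetical. It assumes that -redacted is placed before the file extension, and that entity locations arrive as normalized bounding boxes of the kind Amazon Textract returns (Left, Top, Width, and Height as fractions of the page size).

```python
from pathlib import PurePosixPath


def redacted_key(original_key: str, suffix: str = "-redacted") -> str:
    """Return the S3 key for the redacted copy of original_key.

    Assumption: the suffix goes on the file stem, before the extension.
    """
    path = PurePosixPath(original_key)
    return str(path.with_name(f"{path.stem}{suffix}{path.suffix}"))


def to_pixel_box(bbox: dict, page_width: int, page_height: int) -> tuple:
    """Convert a normalized bounding box to (left, top, right, bottom) pixels.

    Assumption: bbox holds Left, Top, Width, Height as fractions (0 to 1)
    of the page dimensions. The returned rectangle is the area that a
    black box would cover.
    """
    left = round(bbox["Left"] * page_width)
    top = round(bbox["Top"] * page_height)
    right = round((bbox["Left"] + bbox["Width"]) * page_width)
    bottom = round((bbox["Top"] + bbox["Height"]) * page_height)
    return (left, top, right, bottom)
```

For example, under these assumptions redacted_key("case-1/report.pdf") yields "case-1/report-redacted.pdf", so the redacted copy sits next to the original in the same bucket prefix.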

API workflow

This workflow uses Lambda and other AWS services to redact text using APIs.

API redaction workflow

The API-based redaction workflow offers finer control than the UI workflow. You can use the provided API to:

  • Redact specified entities, on specified pages, from the available entity detection inferences.

  • Redact specific phrases, on specified pages, that are not part of any entity detection inference.

Important

You must run the entity detection workflow before running the API redaction workflow.

The process flow for the API redaction workflow is as follows:

  1. An API Gateway request invokes a Lambda function.

  2. If the API Gateway request specifies entities to redact, the Lambda function retrieves the specified entity detection location inferences for the document from the ML inferences S3 bucket. If the API Gateway request specifies phrases to redact, the Lambda function retrieves the text extraction inference for the document from the ML inferences S3 bucket.

  3. The Lambda function retrieves the original document from the Documents S3 bucket.

  4. The Lambda function irreversibly redacts the entities or phrases specified in the API Gateway request from the document (shown as black boxes over the text).

  5. The Lambda function uploads both the redacted document and the original document to the Documents S3 bucket. The redacted document includes -redacted appended to the filename.

  6. The Lambda function sends an HTTP response to API Gateway based on the outcome.

For details about the expected API request body format, see API reference.
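
The branching in step 2 can be sketched in Python as follows. This is an illustration only: the request keys "entities" and "phrases" and the returned labels are hypothetical placeholders, not the solution's actual request format, which is defined in the API reference.

```python
def inferences_to_fetch(request: dict) -> list:
    """Return which inference types a redaction request needs.

    Hypothetical request shape: an "entities" key and/or a "phrases" key.
    Entities need the entity detection location inferences; phrases need
    the text extraction inference so the phrase can be located on the page.
    """
    needed = []
    if request.get("entities"):
        needed.append("entity-detection-locations")
    if request.get("phrases"):
        needed.append("text-extraction")
    if not needed:
        raise ValueError("request must specify entities or phrases to redact")
    return needed
```

A request that names both entities and phrases would need both inference types, which is consistent with the requirement that the entity detection workflow has already run for the document.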

Note

The solution can store only one redacted version of a document at a time. Each later run of the redaction workflow overwrites the previously redacted document. If you want to retain previous redactions, enable versioning in the S3 bucket.
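
One way to enable versioning programmatically is through the S3 API. The sketch below is an assumption about how you might do this, not part of the solution. It takes an S3 client such as the one returned by boto3.client("s3") and uses the get_bucket_versioning and put_bucket_versioning calls. The function name and the bucket name you pass in are placeholders.

```python
def enable_versioning(s3_client, bucket_name: str) -> bool:
    """Enable versioning on bucket_name unless it is already enabled.

    s3_client is expected to behave like a boto3 S3 client.
    Returns True if versioning was switched on by this call.
    """
    current = s3_client.get_bucket_versioning(Bucket=bucket_name)
    # The response has no "Status" key when versioning was never configured.
    if current.get("Status") == "Enabled":
        return False
    s3_client.put_bucket_versioning(
        Bucket=bucket_name,
        VersioningConfiguration={"Status": "Enabled"},
    )
    return True
```

With boto3 installed and credentials configured, this could be called as enable_versioning(boto3.client("s3"), "<documents-bucket-name>"). With versioning enabled, an overwritten redacted document is kept as a noncurrent version of the object.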