How the solution works
You can customize the way this solution processes documents. The solution orchestrates ML inferences, using AWS AI services and their pre-trained models, to extract content and automate processing. The solution’s JSON-based configuration lets you customize the orchestration workflow for your use case.
Workflows
As shown in the architecture diagram, the solution deploys three workflows: text extraction, entity detection, and redaction. The workflow orchestrator Lambda function orchestrates the order and method of processing uploaded documents using any of these workflows.
Workflow configurations
The solution stores its application workflow configurations in the workflow-config DynamoDB table. The solution uses these workflow configurations to:
- Set the number and details of documents required to be uploaded to start processing
- Set the workflows that each document needs to be processed with
The solution creates these DynamoDB table records when the application is deployed. To create the records, the solution uses the configuration JSON files available in the workflow-config directory at the root of the application source code.
To add a new configuration, you can clone an existing record in the workflow-config DynamoDB table. You can add these records after deployment by signing in to the DynamoDB console. The format of this configuration file is described below.
You can select which workflow configuration to use by setting the WorkflowConfigName parameter during deployment. The workflow orchestrator Lambda function uses this parameter input as the key to retrieve the desired configuration from the workflow-config DynamoDB table. This parameter has a default value of default.
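The lookup described above can be sketched in a few lines of Python. This is a minimal illustration, not the solution's actual Lambda code: the function name `get_workflow_config` is hypothetical, and `table` is assumed to be any object exposing a boto3-style `get_item(Key=...)` method, with `Name` as the table key.

```python
def get_workflow_config(table, config_name="default"):
    """Fetch one workflow configuration record by its Name key.

    `table` is assumed to behave like a boto3 DynamoDB Table resource.
    `config_name` mirrors the WorkflowConfigName deployment parameter,
    including its default value of "default".
    """
    response = table.get_item(Key={"Name": config_name})
    item = response.get("Item")  # "Item" is absent when no record matches
    if item is None:
        raise LookupError(f"No workflow configuration named {config_name!r}")
    return item
```

With boto3 installed, a table object could be obtained with `boto3.resource("dynamodb").Table(...)`, passing the physical name of the deployed workflow-config table, which depends on your stack.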
The following JSON object shows a sample workflow configuration file. During deployment, the solution serializes configuration files such as this one into DynamoDB record data and adds them to the workflow-config table. The key of the table corresponds to the Name parameter of the configuration file.
    {
      "Name": "textractToEntity",
      "WorkflowSequence": ["textract", "entity-standard"],
      "MinRequiredDocuments": [
        {
          "DocumentType": "generic",
          "FileTypes": [".pdf", ".png", ".jpeg", ".jpg"],
          "RunTextractAnalyzeAction": false,
          "MaxSize": "5",
          "WorkflowsToProcess": ["textract"]
        },
        {
          "DocumentType": "receipt",
          "FileTypes": [".pdf", ".png", ".jpeg", ".jpg"],
          "RunTextractAnalyzeAction": true,
          "AnalyzeDocFeatureType": ["TABLES", "FORMS", "SIGNATURES"],
          "MaxSize": "5",
          "WorkflowsToProcess": ["entity-standard"]
        }
      ]
    }
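To make the serialization step more concrete, the following Python sketch reads configuration files from a directory and writes each one as a record keyed by its Name. It is an illustration under stated assumptions, not the solution's deployment code: the helper names `read_config_files` and `seed_table` are hypothetical, and `table` is assumed to expose a boto3-style `put_item(Item=...)` method.

```python
import json
from pathlib import Path


def read_config_files(config_dir):
    """Parse every *.json file in config_dir and index the results by "Name"."""
    configs = {}
    for path in sorted(Path(config_dir).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            config = json.load(handle)
        configs[config["Name"]] = config  # Name becomes the record key
    return configs


def seed_table(table, configs):
    """Write one record per configuration (table is a boto3-style Table)."""
    for config in configs.values():
        table.put_item(Item=config)
```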
The following table describes the details of the configuration.
Parameter | Type | Description | Supported values |
---|---|---|---|
Name | String | Name of the workflow configuration. This corresponds to the WorkflowConfigName CloudFormation parameter required during deployment. | Any |
WorkflowSequence | Array<String> | The sequence of the document processing workflows to run on an uploaded document, in the order described. Note: The solution follows the order of items in this list. | |
MinRequiredDocuments | Array<Map> | This list of map objects describes the types of documents, along with their specifications, required to run this workflow. The number of items in this list indicates the number of documents required. The details of each map are described in the following section. | See the following table |
The MinRequiredDocuments parameter in the configuration file lists the documents required to start a workflow. Each item in this list corresponds to the configuration of a single document. Workflow processing starts only after all of the required types of documents are uploaded to a case.
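The start condition can be illustrated with a short Python sketch. It assumes that each item in MinRequiredDocuments counts as one required document of its DocumentType and that the caller already knows the document types uploaded to a case. The function names are hypothetical and this is not the orchestrator's actual logic.

```python
from collections import Counter


def missing_documents(config, uploaded_types):
    """Return {DocumentType: count} for required documents not yet uploaded."""
    required = Counter(doc["DocumentType"] for doc in config["MinRequiredDocuments"])
    uploaded = Counter(uploaded_types)
    # Counter subtraction keeps only the types that still have a shortfall.
    return dict(required - uploaded)


def ready_to_process(config, uploaded_types):
    """True once every required document type has been uploaded."""
    return not missing_documents(config, uploaded_types)
```

For the sample configuration, a case holding only a generic document would still be missing one receipt, so processing would not start yet.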
Parameter | Type | Description | Supported values |
---|---|---|---|
DocumentType | String | The user-ascertained type of the uploaded document. Based on the document type, the solution runs the corresponding Amazon Textract analyze action. There are three Amazon Textract actions to analyze a document: | This solution supports the following types of documents |
FileTypes | Array<string> | File type of the uploaded document | |
MaxSize | Integer | Maximum file size in megabytes (MB) of a single uploaded document. Note: The solution has a maximum page limit of 15 pages. This limit is set to ensure that synchronous Amazon Textract operations can run reliably. Based on your use case, you may choose to customize the limit. Doing so may affect the system’s ability to reliably handle larger files or additional pages. | Up to |
WorkflowsToProcess | Array<string> | This list is a subset of the WorkflowSequence parameter described in the previous table. It indicates the type of processing to run on a specific type of document. You can use this parameter to have fine-grained control of the orchestration process. | |
RunTextractAnalyzeAction | (Optional) Boolean | If you set this parameter to true, Amazon Textract AnalyzeDocument and DetectDocumentText run for the document. | |
AnalyzeDocFeatureType | (Optional) Array<string> | The types of features to detect when Amazon Textract analyzes the document. For more information, see AnalyzeDocument in the Amazon Textract Developer Guide. | List containing any of: |
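The per-document parameters in the table lend themselves to simple checks. The Python sketch below is illustrative only and the function names are hypothetical. It assumes that FileTypes holds lowercase extensions with a leading dot, that MaxSize may arrive as a string (as in the sample configuration), and that 1 MB means 1024 × 1024 bytes.

```python
from pathlib import Path


def check_upload(spec, file_name, size_bytes):
    """Return a list of problems for one file against one document spec."""
    problems = []
    extension = Path(file_name).suffix.lower()
    if extension not in spec["FileTypes"]:
        problems.append(f"file type {extension!r} is not allowed")
    # Assumption: MaxSize is in MB, with 1 MB taken as 1024 * 1024 bytes.
    max_bytes = int(spec["MaxSize"]) * 1024 * 1024
    if size_bytes > max_bytes:
        problems.append(f"file exceeds {spec['MaxSize']} MB")
    return problems


def unknown_workflows(config):
    """Map each DocumentType to WorkflowsToProcess entries that are
    missing from WorkflowSequence (an empty dict means the subset rule holds)."""
    allowed = set(config["WorkflowSequence"])
    result = {}
    for doc in config["MinRequiredDocuments"]:
        extra = sorted(set(doc["WorkflowsToProcess"]) - allowed)
        if extra:
            result[doc["DocumentType"]] = extra
    return result
```

Running `unknown_workflows` on a configuration before adding it to the table is one way to catch a WorkflowsToProcess entry that does not appear in WorkflowSequence.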