Using the Analyze Lending Workflow
To detect text in or analyze multipage lending documents with the Analyze Lending
workflow, do the following:
-
Create the Amazon SNS topic and the Amazon SQS queue.
-
Subscribe the queue to the topic.
-
Give the topic permission to send messages to the queue.
-
Start processing the document by calling the StartLendingAnalysis operation.
-
Get the completion status from the Amazon SQS queue. The example code tracks the
job identifier (JobId) that's returned by the StartLendingAnalysis operation, and
it gets results only for matching job identifiers that are read from the
completion status. This is important if other applications are using the same
queue and topic. For simplicity, the example code deletes messages for jobs that
don't match. Consider sending those messages to an Amazon SQS dead-letter queue
for further investigation.
The results of the StartLendingAnalysis operation can be sent to an Amazon S3
bucket of your choice by using the OutputConfig feature. If you use this
feature, you may have to do some additional configuration of your User and
Service Role. For information on how to let Amazon Textract send encrypted
documents to your Amazon S3 bucket, see Permissions for Output
Configuration.
-
Get and display the processing results by calling the GetLendingAnalysis
operation or the GetLendingAnalysisSummary operation.
-
Once you are finished processing documents, be sure to delete the Amazon SNS topic
and the Amazon SQS queue. If you need to process additional documents, you can leave
the Amazon SNS topic and Amazon SQS queue as they are and reuse them for the
other documents.
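The completion-status step above can be sketched without any AWS calls. This is a minimal illustration, assuming the queue receives the default Amazon SNS envelope (raw message delivery turned off), so the Amazon SQS message body is JSON whose Message field is itself a JSON string carrying JobId and Status. The helper names and the sample payload are hypothetical and exist only for this sketch.

```python
import json


def parse_completion_message(sqs_message_body):
    """Unwrap the SNS envelope stored in an SQS message body.

    Assumes raw message delivery is off, so the notification from
    Amazon Textract is a JSON string inside the envelope's 'Message' field.
    """
    envelope = json.loads(sqs_message_body)
    return json.loads(envelope["Message"])


def is_our_job(notification, expected_job_id):
    """Return True only when the notification belongs to the job we started."""
    return str(notification.get("JobId")) == expected_job_id


# Hypothetical sample payload, shaped like the messages the example code reads.
sample_body = json.dumps({
    "Type": "Notification",
    "Message": json.dumps({"JobId": "example-job-id", "Status": "SUCCEEDED"}),
})

notification = parse_completion_message(sample_body)
print(notification["Status"], is_our_job(notification, "example-job-id"))
```

Comparing job identifiers before fetching results is what keeps the example safe when other applications share the same topic and queue.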
Performing Asynchronous Lending Analysis
The example code for this procedure is provided for Python and the AWS CLI. Before
you begin, install the appropriate AWS SDK. For more information, see Step 2: Set Up the AWS CLI and AWS SDKs.
-
Configure user access to Amazon Textract, and configure Amazon Textract
access to Amazon SNS. For more information, see Configuring Amazon Textract for Asynchronous Operations. To
complete this procedure, you need a multipage document file in PDF format.
You can skip steps 3 – 6 in the configuration instructions, because the
example code creates and configures the Amazon SNS topic and Amazon SQS
queue. If completing the CLI example, you don't need to set up an SQS queue.
-
Upload a multipage document file in PDF or TIFF format to your Amazon S3
bucket (you can also process single-page documents in JPEG, PNG, TIFF, or
PDF formats). For instructions, see Uploading Objects into Amazon S3 in the Amazon Simple Storage Service User Guide.
-
Use the following AWS SDK for Python (Boto3) or AWS CLI code to analyze
text in a multipage lending document. In the main function:
-
Replace the value of roleArn with the IAM role ARN that you saved in
Giving Amazon Textract Access to Your Amazon SNS Topic.
-
Replace the values of bucket and document with the bucket and document
file name that you specified in the preceding step 2.
-
Replace the value of the type
input parameter of the
ProcessDocument
function with the type of
processing that you want to use. For example, use
ProcessType.DETECTION
to detect text, or use
ProcessType.ANALYSIS
to analyze text.
-
For the Python example, replace the value of region_name with the Region
your client is operating in.
For the upcoming AWS CLI example code, do the following:
-
When calling the StartLendingAnalysis operation, replace S3Bucket with the
name of your S3 bucket, and replace FileName with the name of the file you
specified in step 2. Specify the Region of your bucket by replacing
region-name with the name of your Region. Note that the CLI example doesn't
use Amazon SQS.
-
When calling the GetLendingAnalysis operation or the GetLendingAnalysisSummary
operation, replace jobId with the job identifier returned by
StartLendingAnalysis. Specify the Region of your bucket by replacing
region-name with the name of your Region.
-
Run the code for your chosen SDK or the AWS CLI.
The operation might take a while to finish. After it's finished, the following
examples display a list of blocks for the detected or analyzed text:
- AWS CLI
-
To start the lending document analysis, use the following CLI
command. If you want to see the split documents, include the
output-config argument; otherwise, you can remove it:
aws textract start-lending-analysis \
--document-location '{"S3Object":{"Bucket":"S3Bucket","Name":"FileName"}}' \
--output-config '{"S3Bucket": "S3Bucket", "S3Prefix": "S3Prefix"}' \
--kms-key-id '1234abcd-12ab-34cd-56ef-1234567890ab' \
--region 'region-name'
To get the results of the lending document analysis, use the
following CLI command. The max-results argument is optional; remove it
if you don't want to limit the number of results returned:
aws textract get-lending-analysis \
--job-id 'jobId' \
--region 'region-name' \
--max-results 30
To retrieve a summary of the results:
aws textract get-lending-analysis-summary \
--job-id 'jobId' \
--region 'region-name'
- Python
-
import boto3
import json
import sys
import time


class DocumentProcessor:

    def __init__(self, role, bucket, document, region):
        self.roleArn = role
        self.bucket = bucket
        self.document = document
        self.region_name = region

        # Create every client in the same Region so the topic, queue, and job line up
        self.textract = boto3.client('textract', region_name=self.region_name)
        self.sqs = boto3.client('sqs', region_name=self.region_name)
        self.sns = boto3.client('sns', region_name=self.region_name)

    def ProcessDocument(self):
        jobFound = False

        # Start the asynchronous job; completion is announced on the SNS topic
        response = self.textract.start_lending_analysis(
            DocumentLocation={'S3Object': {'Bucket': self.bucket, 'Name': self.document}},
            NotificationChannel={'RoleArn': self.roleArn, 'SNSTopicArn': self.snsTopicArn})
        print('Processing type: Analysis')

        print('Start Job Id: ' + response['JobId'])
        dotLine = 0
        while not jobFound:
            sqsResponse = self.sqs.receive_message(QueueUrl=self.sqsQueueUrl, MessageAttributeNames=['ALL'],
                                                   MaxNumberOfMessages=10)

            if sqsResponse:

                # No messages yet: print a progress dot, wait, and poll again
                if 'Messages' not in sqsResponse:
                    if dotLine < 40:
                        print('.', end='')
                        dotLine = dotLine + 1
                    else:
                        print()
                        dotLine = 0
                    sys.stdout.flush()
                    time.sleep(5)
                    continue

                for message in sqsResponse['Messages']:
                    # The SQS body is an SNS envelope; its Message field holds the job notification
                    notification = json.loads(message['Body'])
                    textMessage = json.loads(notification['Message'])
                    print(textMessage['JobId'])
                    print(textMessage['Status'])
                    if str(textMessage['JobId']) == response['JobId']:
                        print('Matching Job Found:' + textMessage['JobId'])
                        jobFound = True
                        self.GetResults(textMessage['JobId'])
                        self.GetSummary(textMessage['JobId'])
                        self.sqs.delete_message(QueueUrl=self.sqsQueueUrl,
                                                ReceiptHandle=message['ReceiptHandle'])
                    else:
                        print("Job didn't match:" +
                              str(textMessage['JobId']) + ' : ' + str(response['JobId']))
                    # Delete the unknown message. Consider sending to dead letter queue
                    self.sqs.delete_message(QueueUrl=self.sqsQueueUrl,
                                            ReceiptHandle=message['ReceiptHandle'])

        print('Done!')

    def CreateTopicandQueue(self):

        # Timestamp suffix keeps the topic and queue names unique per run
        millis = str(int(round(time.time() * 1000)))

        # Create SNS topic
        snsTopicName = "AmazonTextractTopic" + millis

        topicResponse = self.sns.create_topic(Name=snsTopicName)
        self.snsTopicArn = topicResponse['TopicArn']

        # Create SQS queue
        sqsQueueName = "AmazonTextractQueue" + millis
        self.sqs.create_queue(QueueName=sqsQueueName)
        self.sqsQueueUrl = self.sqs.get_queue_url(QueueName=sqsQueueName)['QueueUrl']

        attribs = self.sqs.get_queue_attributes(QueueUrl=self.sqsQueueUrl,
                                                AttributeNames=['QueueArn'])['Attributes']

        sqsQueueArn = attribs['QueueArn']

        # Subscribe SQS queue to SNS topic
        self.sns.subscribe(
            TopicArn=self.snsTopicArn,
            Protocol='sqs',
            Endpoint=sqsQueueArn)

        # Authorize SNS to write to the SQS queue. Only sending is allowed,
        # and only for messages that originate from this topic.
        policy = """{{
  "Version":"2012-10-17",
  "Statement":[
    {{
      "Sid":"MyPolicy",
      "Effect":"Allow",
      "Principal" : {{"AWS" : "*"}},
      "Action":"sqs:SendMessage",
      "Resource": "{}",
      "Condition":{{
        "ArnEquals":{{
          "aws:SourceArn": "{}"
        }}
      }}
    }}
  ]
}}""".format(sqsQueueArn, self.snsTopicArn)

        self.sqs.set_queue_attributes(
            QueueUrl=self.sqsQueueUrl,
            Attributes={
                'Policy': policy
            })

    def DeleteTopicandQueue(self):
        self.sqs.delete_queue(QueueUrl=self.sqsQueueUrl)
        self.sns.delete_topic(TopicArn=self.snsTopicArn)

    # Display information about the fields extracted from each page
    def DisplayExtractInfo(self, response):
        results = response['Results']
        for page in results:
            print("Page Classification: {}".format(page["PageClassification"]["PageType"]))
            print("Page Number: {}".format(page["Page"]))
            for extract in page.get("Extractions", []):
                # An extraction might not be a lending document, so look up the key defensively
                for fields, vals in extract.get('LendingDocument', {}).items():
                    for val in vals:
                        # Skip entries that carry no Type (for example, signature detections)
                        if 'Type' not in val:
                            continue
                        print("Document Type: {}".format(val['Type']))
                        detections = val.get('ValueDetections', [])
                        for i in detections:
                            print(i['Text'])
                            print('Geometry')
                            print('\tBounding Box: {}'.format(i['Geometry']['BoundingBox']))
                            print('\tPolygon: {}'.format(i['Geometry']['Polygon']))

    def GetSummary(self, jobId):
        # The summary request is identified by the job identifier alone
        response = self.textract.get_lending_analysis_summary(JobId=jobId)
        doc_groups = response['DocumentGroups']
        print("Summary info:")
        for group in doc_groups:
            print("Document type: " + group['Type'])
            split_docs = group['SplitDocuments']
            for doc in split_docs:
                print(doc)
                for idx, page in doc.items():
                    print(str(idx) + " - " + str(page))

    def GetResults(self, jobId):
        maxResults = 1000
        paginationToken = None
        finished = False

        # Keep requesting pages until the response no longer includes NextToken
        while not finished:

            response = None
            if paginationToken is None:
                response = self.textract.get_lending_analysis(JobId=jobId,
                                                              MaxResults=maxResults)
            else:
                response = self.textract.get_lending_analysis(JobId=jobId,
                                                              MaxResults=maxResults,
                                                              NextToken=paginationToken)

            print('Detected Document Text')
            print('Pages: {}'.format(response['DocumentMetadata']['Pages']))

            self.DisplayExtractInfo(response)

            if 'NextToken' in response:
                paginationToken = response['NextToken']
            else:
                finished = True


def main():
    roleArn = ''
    bucket = ''
    document = ''
    region_name = ''

    analyzer = DocumentProcessor(roleArn, bucket, document, region_name)
    analyzer.CreateTopicandQueue()
    analyzer.ProcessDocument()
    analyzer.DeleteTopicandQueue()


if __name__ == "__main__":
    main()
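
The NextToken loop in GetResults can also be written as a generator, which separates paging from display. The following is a minimal sketch, not part of the original example: iter_lending_results is a hypothetical helper, and FakeTextract is a hypothetical stand-in for the boto3 client so that the sketch runs without AWS credentials. It assumes only what the example above already relies on, namely that get_lending_analysis accepts JobId, MaxResults, and NextToken and returns Results plus an optional NextToken.

```python
def iter_lending_results(client, job_id, max_results=30):
    """Yield each entry of 'Results' across all pages of GetLendingAnalysis."""
    kwargs = {"JobId": job_id, "MaxResults": max_results}
    while True:
        response = client.get_lending_analysis(**kwargs)
        yield from response.get("Results", [])
        token = response.get("NextToken")
        if not token:
            break  # no further pages
        kwargs["NextToken"] = token


class FakeTextract:
    """Hypothetical stand-in for the Textract client, for illustration only."""

    def __init__(self, pages):
        self._pages = pages

    def get_lending_analysis(self, JobId, MaxResults, NextToken=None):
        # Treat the token as the index of the next page to serve
        index = int(NextToken) if NextToken else 0
        response = {"Results": self._pages[index]}
        if index + 1 < len(self._pages):
            response["NextToken"] = str(index + 1)
        return response


fake = FakeTextract([[{"Page": 1}, {"Page": 2}], [{"Page": 3}]])
page_numbers = [result["Page"] for result in iter_lending_results(fake, "example-job-id")]
print(page_numbers)
```

With a real client, passing self.textract in place of the fake object should behave the same way, provided the job has finished.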