Amazon Textract examples using SDK for Python (Boto3) - AWS SDK Code Examples

There are more AWS SDK examples available in the AWS Doc SDK Examples GitHub repo.

Amazon Textract examples using SDK for Python (Boto3)

The following code examples show you how to perform actions and implement common scenarios by using the AWS SDK for Python (Boto3) with Amazon Textract.

Actions are code excerpts from larger programs and must be run in context. While actions show you how to call individual service functions, you can see actions in context in their related scenarios and cross-service examples.

Scenarios are code examples that show you how to accomplish a specific task by calling multiple functions within the same service.

Each example includes a link to GitHub, where you can find instructions on how to set up and run the code in context.

Topics

Actions

The following code example shows how to use AnalyzeDocument.

SDK for Python (Boto3)
Note

There's more on GitHub. Find the complete example and learn how to set up and run in the AWS Code Examples Repository.

class TextractWrapper: """Encapsulates Textract functions.""" def __init__(self, textract_client, s3_resource, sqs_resource): """ :param textract_client: A Boto3 Textract client. :param s3_resource: A Boto3 Amazon S3 resource. :param sqs_resource: A Boto3 Amazon SQS resource. """ self.textract_client = textract_client self.s3_resource = s3_resource self.sqs_resource = sqs_resource def analyze_file( self, feature_types, *, document_file_name=None, document_bytes=None ): """ Detects text and additional elements, such as forms or tables, in a local image file or from in-memory byte data. The image must be in PNG or JPG format. :param feature_types: The types of additional document features to detect. :param document_file_name: The name of a document image file. :param document_bytes: In-memory byte data of a document image. :return: The response from Amazon Textract, including a list of blocks that describe elements detected in the image. """ if document_file_name is not None: with open(document_file_name, "rb") as document_file: document_bytes = document_file.read() try: response = self.textract_client.analyze_document( Document={"Bytes": document_bytes}, FeatureTypes=feature_types ) logger.info("Detected %s blocks.", len(response["Blocks"])) except ClientError: logger.exception("Couldn't detect text.") raise else: return response
  • For API details, see AnalyzeDocument in AWS SDK for Python (Boto3) API Reference.

The following code example shows how to use DetectDocumentText.

SDK for Python (Boto3)
Note

There's more on GitHub. Find the complete example and learn how to set up and run in the AWS Code Examples Repository.

class TextractWrapper: """Encapsulates Textract functions.""" def __init__(self, textract_client, s3_resource, sqs_resource): """ :param textract_client: A Boto3 Textract client. :param s3_resource: A Boto3 Amazon S3 resource. :param sqs_resource: A Boto3 Amazon SQS resource. """ self.textract_client = textract_client self.s3_resource = s3_resource self.sqs_resource = sqs_resource def detect_file_text(self, *, document_file_name=None, document_bytes=None): """ Detects text elements in a local image file or from in-memory byte data. The image must be in PNG or JPG format. :param document_file_name: The name of a document image file. :param document_bytes: In-memory byte data of a document image. :return: The response from Amazon Textract, including a list of blocks that describe elements detected in the image. """ if document_file_name is not None: with open(document_file_name, "rb") as document_file: document_bytes = document_file.read() try: response = self.textract_client.detect_document_text( Document={"Bytes": document_bytes} ) logger.info("Detected %s blocks.", len(response["Blocks"])) except ClientError: logger.exception("Couldn't detect text.") raise else: return response

The following code example shows how to use GetDocumentAnalysis.

SDK for Python (Boto3)
Note

There's more on GitHub. Find the complete example and learn how to set up and run in the AWS Code Examples Repository.

class TextractWrapper: """Encapsulates Textract functions.""" def __init__(self, textract_client, s3_resource, sqs_resource): """ :param textract_client: A Boto3 Textract client. :param s3_resource: A Boto3 Amazon S3 resource. :param sqs_resource: A Boto3 Amazon SQS resource. """ self.textract_client = textract_client self.s3_resource = s3_resource self.sqs_resource = sqs_resource def get_analysis_job(self, job_id): """ Gets data for a previously started detection job that includes additional elements. :param job_id: The ID of the job to retrieve. :return: The job data, including a list of blocks that describe elements detected in the image. """ try: response = self.textract_client.get_document_analysis(JobId=job_id) job_status = response["JobStatus"] logger.info("Job %s status is %s.", job_id, job_status) except ClientError: logger.exception("Couldn't get data for job %s.", job_id) raise else: return response

The following code example shows how to use StartDocumentAnalysis.

SDK for Python (Boto3)
Note

There's more on GitHub. Find the complete example and learn how to set up and run in the AWS Code Examples Repository.

Start an asynchronous job to analyze a document.

class TextractWrapper: """Encapsulates Textract functions.""" def __init__(self, textract_client, s3_resource, sqs_resource): """ :param textract_client: A Boto3 Textract client. :param s3_resource: A Boto3 Amazon S3 resource. :param sqs_resource: A Boto3 Amazon SQS resource. """ self.textract_client = textract_client self.s3_resource = s3_resource self.sqs_resource = sqs_resource def start_analysis_job( self, bucket_name, document_file_name, feature_types, sns_topic_arn, sns_role_arn, ): """ Starts an asynchronous job to detect text and additional elements, such as forms or tables, in an image stored in an Amazon S3 bucket. Textract publishes a notification to the specified Amazon SNS topic when the job completes. The image must be in PNG, JPG, or PDF format. :param bucket_name: The name of the Amazon S3 bucket that contains the image. :param document_file_name: The name of the document image stored in Amazon S3. :param feature_types: The types of additional document features to detect. :param sns_topic_arn: The Amazon Resource Name (ARN) of an Amazon SNS topic where job completion notification is published. :param sns_role_arn: The ARN of an AWS Identity and Access Management (IAM) role that can be assumed by Textract and grants permission to publish to the Amazon SNS topic. :return: The ID of the job. """ try: response = self.textract_client.start_document_analysis( DocumentLocation={ "S3Object": {"Bucket": bucket_name, "Name": document_file_name} }, NotificationChannel={ "SNSTopicArn": sns_topic_arn, "RoleArn": sns_role_arn, }, FeatureTypes=feature_types, ) job_id = response["JobId"] logger.info( "Started text analysis job %s on %s.", job_id, document_file_name ) except ClientError: logger.exception("Couldn't analyze text in %s.", document_file_name) raise else: return job_id

The following code example shows how to use StartDocumentTextDetection.

SDK for Python (Boto3)
Note

There's more on GitHub. Find the complete example and learn how to set up and run in the AWS Code Examples Repository.

Start an asynchronous job to detect text in a document.

class TextractWrapper: """Encapsulates Textract functions.""" def __init__(self, textract_client, s3_resource, sqs_resource): """ :param textract_client: A Boto3 Textract client. :param s3_resource: A Boto3 Amazon S3 resource. :param sqs_resource: A Boto3 Amazon SQS resource. """ self.textract_client = textract_client self.s3_resource = s3_resource self.sqs_resource = sqs_resource def start_detection_job( self, bucket_name, document_file_name, sns_topic_arn, sns_role_arn ): """ Starts an asynchronous job to detect text elements in an image stored in an Amazon S3 bucket. Textract publishes a notification to the specified Amazon SNS topic when the job completes. The image must be in PNG, JPG, or PDF format. :param bucket_name: The name of the Amazon S3 bucket that contains the image. :param document_file_name: The name of the document image stored in Amazon S3. :param sns_topic_arn: The Amazon Resource Name (ARN) of an Amazon SNS topic where the job completion notification is published. :param sns_role_arn: The ARN of an AWS Identity and Access Management (IAM) role that can be assumed by Textract and grants permission to publish to the Amazon SNS topic. :return: The ID of the job. """ try: response = self.textract_client.start_document_text_detection( DocumentLocation={ "S3Object": {"Bucket": bucket_name, "Name": document_file_name} }, NotificationChannel={ "SNSTopicArn": sns_topic_arn, "RoleArn": sns_role_arn, }, ) job_id = response["JobId"] logger.info( "Started text detection job %s on %s.", job_id, document_file_name ) except ClientError: logger.exception("Couldn't detect text in %s.", document_file_name) raise else: return job_id