Designing an automated solution to analyze PDF files on the AWS Cloud

Tianxia Jia and Yanyan Zhang, Amazon Web Services (AWS)

October 2021

Organizations regularly use PDF files to store and transfer different data types, including text, tables, and forms. However, it can be challenging to automatically aggregate and analyze data from different PDF files. For example, an organization's business application might regularly ingest PDF files that share an identical format but that users must open and read individually. As a result, users find it difficult to generate useful insights from those PDF files: they must manually extract the relevant data and then use third-party tools for further analysis.

On the Amazon Web Services (AWS) Cloud, Amazon Textract automatically extracts information (for example, printed text, forms, and tables) from PDF files and produces a JSON-formatted file that contains the information extracted from the original PDF file. During post-processing, the extracted data is stored in Amazon DynamoDB, and you can generate business insights by using analytics and visualizations in Amazon QuickSight.
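To make the JSON output more concrete, the following is a minimal sketch of how post-processing code might read it. It assumes the response follows the Amazon Textract block structure (a `Blocks` list whose entries carry `BlockType`, `Id`, `Text`, and `Relationships`). The sample response is hand-written and heavily trimmed for illustration, and the function names `extract_lines` and `extract_key_values` are hypothetical helpers, not part of any AWS SDK.

```python
# Hand-written, trimmed sample in the shape of a Textract response.
# Real responses contain many more blocks and fields (geometry, confidence, ...).
SAMPLE_RESPONSE = {
    "Blocks": [
        {"Id": "l1", "BlockType": "LINE", "Text": "Report date: 2021-10-01"},
        {"Id": "w1", "BlockType": "WORD", "Text": "Report"},
        {"Id": "w2", "BlockType": "WORD", "Text": "date:"},
        {"Id": "w3", "BlockType": "WORD", "Text": "2021-10-01"},
        {
            "Id": "k1",
            "BlockType": "KEY_VALUE_SET",
            "EntityTypes": ["KEY"],
            "Relationships": [
                {"Type": "VALUE", "Ids": ["v1"]},
                {"Type": "CHILD", "Ids": ["w1", "w2"]},
            ],
        },
        {
            "Id": "v1",
            "BlockType": "KEY_VALUE_SET",
            "EntityTypes": ["VALUE"],
            "Relationships": [{"Type": "CHILD", "Ids": ["w3"]}],
        },
    ]
}


def extract_lines(response):
    """Return the text of every LINE block, in response order."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    ]


def extract_key_values(response):
    """Return form fields as a {key text: value text} dictionary."""
    blocks = {block["Id"]: block for block in response.get("Blocks", [])}

    def text_of(block):
        # Join the WORD children of a KEY or VALUE block into one string.
        words = []
        for rel in block.get("Relationships", []):
            if rel["Type"] == "CHILD":
                for child_id in rel["Ids"]:
                    child = blocks[child_id]
                    if child["BlockType"] == "WORD":
                        words.append(child["Text"])
        return " ".join(words)

    pairs = {}
    for block in blocks.values():
        is_key = (
            block.get("BlockType") == "KEY_VALUE_SET"
            and "KEY" in block.get("EntityTypes", [])
        )
        if not is_key:
            continue
        value_text = ""
        for rel in block.get("Relationships", []):
            if rel["Type"] == "VALUE":
                # Follow the VALUE relationship to the paired value block(s).
                value_text = " ".join(text_of(blocks[v]) for v in rel["Ids"])
        pairs[text_of(block)] = value_text
    return pairs


print(extract_lines(SAMPLE_RESPONSE))
print(extract_key_values(SAMPLE_RESPONSE))
```

The sketch keeps parsing separate from any AWS call, so the same functions can be run against a saved JSON file while you decide which fields your downstream applications need.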

This guide provides a serverless, automated PDF file analysis solution in four phases:

Ingestion phase – Prepare a PDF file type that your organization continuously generates (for example, a daily operations report) and that you need to regularly extract data from.
Processing phase – Extract the data values required by your downstream applications from the PDF files.
Data storage phase – Store the extracted data as a JSON file in Amazon Simple Storage Service (Amazon S3) and as a record in a DynamoDB table.
Analysis phase – Create dashboards in Amazon QuickSight to visualize and help analyze the data.
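As a rough illustration of how the ingestion phase can hand work to the processing phase, the sketch below shows the kind of logic a Lambda function might use to find newly uploaded PDF files in an Amazon S3 event notification. It assumes the standard S3 event shape (`Records[].s3.bucket.name` and `Records[].s3.object.key`, with URL-encoded keys). The helper names, the bucket name, and the `processed/` prefix are hypothetical choices for this example, not values that the guide prescribes.

```python
from urllib.parse import unquote_plus


def pdf_objects_from_event(event):
    """Return (bucket, key) pairs for the PDF objects named in an S3 event."""
    found = []
    for record in event.get("Records", []):
        s3_info = record.get("s3", {})
        bucket = s3_info.get("bucket", {}).get("name")
        raw_key = s3_info.get("object", {}).get("key")
        if not bucket or not raw_key:
            continue
        # Object keys arrive URL-encoded in S3 events (spaces become "+").
        key = unquote_plus(raw_key)
        if key.lower().endswith(".pdf"):
            found.append((bucket, key))
    return found


def output_key_for(pdf_key, prefix="processed/"):
    """Build the key for the extracted JSON file (hypothetical naming scheme)."""
    file_name = pdf_key.rsplit("/", 1)[-1]
    stem = file_name[:-4] if file_name.lower().endswith(".pdf") else file_name
    return f"{prefix}{stem}.json"


# Illustrative event containing one PDF upload and one non-PDF upload.
sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "example-bucket"},
                "object": {"key": "incoming/daily+report.pdf"}}},
        {"s3": {"bucket": {"name": "example-bucket"},
                "object": {"key": "incoming/notes.txt"}}},
    ]
}

for bucket, key in pdf_objects_from_event(sample_event):
    print(bucket, key, "->", output_key_for(key))
```

Filtering on the file extension inside the function is a defensive choice; in practice you could also restrict the S3 event notification itself to a suffix so that the function is invoked only for PDF files.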

The guide uses Amazon S3 to store the raw and processed data, AWS Lambda for compute, Amazon Textract to extract content from PDF files, DynamoDB to store the processed data, and Amazon QuickSight for analysis and visualizations. This guide is intended for data scientists, machine learning (ML) engineers, and solutions architects who want to automatically extract information and generate insights from PDF files.
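For the DynamoDB part of this design, one practical detail is that the boto3 DynamoDB resource interface does not accept Python `float` values and expects `decimal.Decimal` instead. The following sketch shows one way to prepare an extracted record before writing it. The `document_id` attribute name and the field names are assumptions made for illustration, and the actual write (for example, a `put_item` call on a table) is intentionally left out so that the example runs without AWS credentials.

```python
import json
from decimal import Decimal


def to_dynamodb_item(document_id, fields):
    """Convert extracted fields into a dict that is safe to store as an item.

    `document_id` is a hypothetical partition key name chosen for this sketch.
    """
    # Round-trip through JSON so that every float (including nested ones)
    # is re-parsed as a Decimal.
    item = json.loads(json.dumps(fields), parse_float=Decimal)
    item["document_id"] = document_id
    return item


# Illustrative values only; real field names depend on your PDF files.
extracted = {"site": "A", "pages": 3, "total": 12.5, "readings": [0.25, 1.75]}
item = to_dynamodb_item("report-2021-10-01", extracted)
print(item)
```

The JSON round trip is a simple way to convert nested structures; it assumes that the extracted fields are already JSON-serializable, which holds for data parsed from the Textract JSON output.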

Targeted business outcomes

You should expect the following three outcomes after designing an automated solution to analyze PDF files on the AWS Cloud:

Raw data from multiple PDF files is processed automatically and at scale by a solution that refreshes when new data becomes available.
Downstream modeling and analytics applications (for example, ML modeling in Amazon SageMaker AI) can access the extracted PDF file content.
Data dashboards in QuickSight show all PDF file contents to your end users.

