AWS Glue Best Practices: Building an Operationally Efficient Data Pipeline
Publication date: August 26, 2022 (Document revisions)
Abstract
Data integration is a critical element in building a data lake and a data warehouse, and one of the most critical elements in any data analytics ecosystem. It enables data from different sources to be cleaned, harmonized, transformed, and finally loaded. When building a data warehouse, most of the development effort goes into the data integration pipeline. An efficient, well-designed data integration pipeline is essential for making data available to, and trusted by, analytics consumers.
This whitepaper shows you some of the considerations and best practices for building and efficiently operating your data pipeline with AWS Glue.
Are you Well-Architected?
Refer to the AWS Well-Architected Framework.
For more expert guidance and best practices for your cloud architecture (reference architecture deployments, diagrams, and whitepapers), refer to the AWS Architecture Center.
Introduction
Data volumes and complexities are increasing at an unprecedented rate, exploding from terabytes to petabytes or even exabytes of data. Traditional on-premises approaches to building a data pipeline do not work well with a cloud-based strategy and, most of the time, do not provide the elasticity and cost effectiveness of cloud-native approaches.
AWS hears from customers that they want to extract more value from their data, but struggle to capture, store, and analyze all the data generated by today’s modern and digital businesses. Data is growing exponentially, coming from new sources. It is increasingly diverse, and needs to be securely accessed and analyzed by any number of applications and people.
With changing data and business needs, building a high-performing, cost-effective, and low-maintenance data pipeline is paramount. Introduced in 2017, AWS Glue is a fully managed, serverless data integration service that allows customers to scale based on their workload, with no infrastructure to manage.
The next section discusses common best practices for building and efficiently operating your data pipeline with AWS Glue. This document is intended for advanced users, data engineers and architects.
To get the most out of this whitepaper, it’s helpful to be familiar with AWS Glue.
- Refer to AWS Glue Best Practices: Building a Secure and Reliable Data Pipeline for best practices around security and reliability for your data pipelines with AWS Glue.
- Refer to AWS Glue Best Practices: Building a Performant and Cost Optimized Data Pipeline for best practices around performance efficiency and cost optimization for your data pipelines with AWS Glue.