Reference architecture with the AWS Glue product family

While working with customers, AWS has encountered several different architecture patterns in which customers use services from the AWS Glue product family to build their data pipelines. Following are some of the common architecture patterns, based on user personas.

Building a data pipeline

Here is a reference architecture for building a data pipeline with the AWS Glue product family.

Reference architecture for data pipeline with user personas

The steps the data takes in the architecture shown in the preceding figure are as follows:

  1. Data ingestion — Data is extracted from various data sources for further processing, including transactional systems such as customer relationship management/enterprise resource planning (CRM/ERP) applications, on-premises databases such as Oracle and SQL Server, other on-premises data stores, and Software as a Service (SaaS) applications such as Salesforce and SAP Concur.

  2. Job orchestration — When a new file is uploaded into an S3 landing zone, an orchestration workflow is triggered using AWS Step Functions, Amazon Managed Workflows for Apache Airflow (Amazon MWAA), or an AWS Glue workflow. Depending on the business requirements, workflows can also be triggered on a predefined, time-based schedule to process files at certain intervals. A minimal event-driven trigger is sketched after this list.

  3. Data cataloging (optional) — The job orchestrator triggers an AWS Glue workflow to crawl the location of the file and build or update AWS Glue Data Catalog. The AWS Glue Data Catalog contains references to data that is used as sources and targets of your ETL jobs in AWS Glue. 

  4. Data streaming — To process streaming data in near real time, customers commonly ingest the data into Amazon Kinesis Data Streams or Amazon MSK. This data can then be consumed by an AWS Glue streaming ETL application for further processing.

  5. Data processing — Data is processed and transformed; this includes converting data formats, performing data quality and integrity checks, deduplicating records, applying business transformations, and so on. A minimal processing job is sketched after this list.

  6. Data loading — After processing and transformation, data is loaded into data targets, including data lakes such as S3 locations, relational targets such as Amazon Redshift, Amazon RDS, and Amazon Aurora, or NoSQL targets such as Amazon OpenSearch Service and Amazon DynamoDB.
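
The event-driven trigger described in step 2 can be illustrated with a small AWS Lambda handler that starts an AWS Glue workflow whenever an object lands in the S3 landing zone. This is a minimal sketch, not a prescribed implementation: the workflow name and run properties are hypothetical, and AWS Step Functions or Amazon MWAA could fill the same role.

import boto3

# Hypothetical workflow name; replace with your own AWS Glue workflow.
GLUE_WORKFLOW_NAME = "etl-workflow"

glue = boto3.client("glue")

def lambda_handler(event, context):
    """Triggered by an S3 object-created event on the landing zone.

    Starts the AWS Glue workflow that crawls, processes, and loads the new file.
    """
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        print(f"New object s3://{bucket}/{key}; starting workflow {GLUE_WORKFLOW_NAME}")

        # Kick off the orchestration workflow; run properties let downstream
        # jobs discover which object triggered the run.
        run = glue.start_workflow_run(Name=GLUE_WORKFLOW_NAME)
        glue.put_workflow_run_properties(
            Name=GLUE_WORKFLOW_NAME,
            RunId=run["RunId"],
            RunProperties={"source_bucket": bucket, "source_key": key},
        )
    return {"status": "started"}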
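
Steps 3, 5, and 6 can be illustrated with a minimal AWS Glue PySpark job that reads a table from the AWS Glue Data Catalog, deduplicates and filters the data, and writes partitioned Parquet to an S3 target. The database, table, column, and path names are hypothetical placeholders; a production job would also handle schema mapping and error records.

import sys

from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Standard AWS Glue job setup.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Step 3: read the source table registered in the AWS Glue Data Catalog.
# "sales_db" and "raw_orders" are hypothetical catalog names.
orders = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="raw_orders"
)

# Step 5: process and transform -- deduplicate and drop obviously bad rows.
orders_df = orders.toDF().dropDuplicates(["order_id"]).filter("order_total > 0")
curated = DynamicFrame.fromDF(orders_df, glue_context, "curated_orders")

# Step 6: load the result into the data lake as partitioned Parquet.
glue_context.write_dynamic_frame.from_options(
    frame=curated,
    connection_type="s3",
    connection_options={
        "path": "s3://example-bucket/curated/orders/",
        "partitionKeys": ["order_date"],
    },
    format="parquet",
)

job.commit()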

Building a data lake

Following is a reference architecture for building a data lake with the AWS Glue product family:

Reference architecture for building a data lake

  1. Data ingestion — Data is extracted from various data sources, including transactional systems such as CRM/ERP, on-premises databases such as Oracle and SQL Server, other on-premises data stores, and SaaS applications such as Salesforce and SAP Concur, into a landing zone in S3 for further processing. In this step, Amazon AppFlow is commonly used for SaaS applications, AWS Database Migration Service (AWS DMS) is used for ingesting data from on-premises and cloud databases, and AWS Data Exchange is used for integrating third-party data into the data lake.

  2. Job orchestration — When a new file is uploaded into an S3 landing zone, a Lambda function or an event-driven AWS Glue workflow triggers an orchestration workflow using AWS Step Functions, Amazon MWAA, or AWS Glue workflows. Depending on the business requirements, workflows can also be triggered on a predefined, time-based schedule to process files at certain intervals.

  3. Data cataloging (optional) — The job orchestrator triggers an AWS Glue workflow to crawl the location of the file and build or update the AWS Glue Data Catalog. The AWS Glue Data Catalog contains references to data that is used as sources and targets for your ETL jobs in AWS Glue. A minimal crawler sketch follows this list.

  4. Data processing — The data is processed and transformed, which includes transforming the data, improving data quality, performing integrity checks, and so on.

  5. Data loading — In this step, the processed and transformed data is loaded into an S3-based curated zone with appropriate partitions and data format, which serves as the data lake layer.

  6. Unified governance — AWS Lake Formation is commonly used to implement unified governance on a data lake. Additionally, if you are looking for transactional capability and small-file compaction with AWS Lake Formation, governed tables can also be considered.

  7. Amazon QuickSight — Amazon QuickSight allows everyone in your organization to understand your data by asking questions in natural language, exploring interactive dashboards, or automatically looking for patterns and outliers powered by ML.

  8. Amazon Athena — Amazon Athena provides ad hoc querying capability on the data stored in the data lake. A minimal query sketch follows this list.

  9. Amazon SageMaker — Amazon SageMaker and AWS AI services can be used to build, train, and deploy ML models, and add intelligence to your applications.

  10. Logging, monitoring, and notification — Amazon CloudWatch can be used for monitoring, Amazon Simple Notification Service (Amazon SNS) can be used for notification, and AWS CloudTrail can be used for logging of events.
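
As an illustration of the data cataloging step, a crawler can be created and started programmatically with boto3. This is a minimal sketch under assumed names (the crawler, database, IAM role, and S3 path are all hypothetical); in practice the crawler is often provisioned once through infrastructure as code and only started from the workflow.

import boto3

glue = boto3.client("glue")

# Hypothetical names; the IAM role must allow AWS Glue to read the S3 path.
CRAWLER_NAME = "landing-zone-crawler"
DATABASE_NAME = "datalake_raw"
LANDING_PATH = "s3://example-bucket/landing/"
CRAWLER_ROLE = "arn:aws:iam::123456789012:role/GlueCrawlerRole"

def ensure_crawler():
    """Create the crawler if it does not exist yet."""
    try:
        glue.get_crawler(Name=CRAWLER_NAME)
    except glue.exceptions.EntityNotFoundException:
        glue.create_crawler(
            Name=CRAWLER_NAME,
            Role=CRAWLER_ROLE,
            DatabaseName=DATABASE_NAME,
            Targets={"S3Targets": [{"Path": LANDING_PATH}]},
        )

# Run the crawler to build or update tables in the AWS Glue Data Catalog.
ensure_crawler()
glue.start_crawler(Name=CRAWLER_NAME)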
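
For ad hoc querying with Amazon Athena, a minimal boto3 sketch looks like the following; the database, table, query, and results location are hypothetical.

import time

import boto3

athena = boto3.client("athena")

# Hypothetical query against a curated table in the data lake.
QUERY = "SELECT order_date, SUM(order_total) AS revenue FROM curated_orders GROUP BY order_date"

execution = athena.start_query_execution(
    QueryString=QUERY,
    QueryExecutionContext={"Database": "datalake_curated"},
    ResultConfiguration={"OutputLocation": "s3://example-bucket/athena-results/"},
)
query_id = execution["QueryExecutionId"]

# Poll until the query finishes, then print the first page of results.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])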

Building a streaming data pipeline

Here is a reference architecture for building a streaming data pipeline with the AWS Glue product family.

Reference architecture for streaming data pipeline

  1. Data sources — In the preceding architecture, there are multiple data sources. Near real-time data is generated by streaming data sources such as IoT devices, log and diagnostic data from application servers, and change data capture (CDC) from transactional data stores.

  2. Data streaming — Messages and events are streamed into streaming services such as Amazon Kinesis Data Streams or Amazon MSK.

  3. Stream data processing — In this step, you can create streaming ETL jobs that run continuously and consume data from streaming sources such as Amazon Kinesis Data Streams and Amazon MSK. The jobs cleanse and transform the data; a minimal streaming job is sketched after this list.

  4. Stream data loading — The processed data is typically loaded into S3 data lakes, Java Database Connectivity (JDBC) data stores such as Amazon Redshift, or NoSQL data stores such as Amazon DynamoDB or Amazon OpenSearch Service. After the data is loaded, it can be consumed using services such as Amazon QuickSight, Amazon Athena, Amazon SageMaker, Amazon Managed Grafana, and so on.

  5. Amazon QuickSight — Amazon QuickSight allows everyone in your organization to understand your data by asking questions in natural language, exploring interactive dashboards, or automatically looking for patterns and outliers powered by ML.

  6. Amazon Athena — Amazon Athena provides ad hoc querying capability on the data stored in the data lake.

  7. Amazon SageMaker — Amazon SageMaker and AWS AI services can be used to build, train, and deploy ML models, and add intelligence to your applications.

  8. Amazon Managed Grafana — Amazon Managed Grafana is a fully managed service for Grafana, an open-source analytics platform that can be used to query, visualize, alert on, and understand metrics, no matter where they are stored.

  9. Logging, monitoring, and notification — Amazon CloudWatch can be used for monitoring, Amazon SNS can be used for notification, and AWS CloudTrail can be used for event logging.
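
The stream processing and loading steps can be illustrated with a minimal AWS Glue streaming ETL job that consumes a Kinesis data stream registered in the Data Catalog and appends cleansed micro-batches to the data lake. The database, table, column, and S3 path names are hypothetical placeholders.

import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Standard AWS Glue job setup.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read from a Kinesis data stream registered as a Data Catalog table
# ("streaming_db" and "clickstream_events" are hypothetical names). The
# result is a Spark structured-streaming DataFrame.
events = glue_context.create_data_frame.from_catalog(
    database="streaming_db",
    table_name="clickstream_events",
    additional_options={"startingPosition": "TRIM_HORIZON", "inferSchema": "true"},
)

def process_batch(data_frame, batch_id):
    """Cleanse each micro-batch and append it to the data lake as Parquet."""
    if data_frame.count() == 0:
        return
    cleansed = data_frame.dropDuplicates().filter("event_type IS NOT NULL")
    cleansed.write.mode("append").parquet("s3://example-bucket/streaming/curated/")

# Process the stream in micro-batches; the checkpoint location lets the job
# resume where it left off after a restart.
glue_context.forEachBatch(
    frame=events,
    batch_function=process_batch,
    options={
        "windowSize": "60 seconds",
        "checkpointLocation": "s3://example-bucket/streaming/checkpoints/",
    },
)

job.commit()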