Orchestrate an ETL pipeline with validation, transformation, and partitioning using AWS Step Functions
Created by Sandip Gangapadhyay (AWS)
Summary
This pattern describes how to build a serverless extract, transform, and load (ETL) pipeline to validate, transform, compress, and partition a large CSV dataset for performance and cost optimization. The pipeline is orchestrated by AWS Step Functions and includes error handling, automated retry, and user notification features.
When a CSV file is uploaded to an Amazon Simple Storage Service (Amazon S3) bucket source folder, the ETL pipeline starts to run. The pipeline validates the content and the schema of the source CSV file, transforms the CSV file to a compressed Apache Parquet format, partitions the dataset by year, month, and day, and stores it in a separate folder for analytics tools to process.
The code that automates this pattern is available on GitHub, in the ETL Pipeline with AWS Step Functions repository.
Prerequisites and limitations
Prerequisites
An active AWS account.
AWS Command Line Interface (AWS CLI) installed and configured with your AWS account, so that you can create AWS resources by deploying an AWS CloudFormation stack. AWS CLI version 2 is recommended. For installation instructions, see Installing, updating, and uninstalling the AWS CLI version 2 in the AWS CLI documentation. For AWS CLI configuration instructions, see Configuration and credential file settings in the AWS CLI documentation.
An Amazon S3 bucket.
A CSV dataset with the correct schema. (The code repository included with this pattern provides a sample CSV file with the correct schema and data type that you can use.)
A web browser that is supported for use with the AWS Management Console. (See the list of supported browsers.)
AWS Glue console access.
AWS Step Functions console access.
Limitations
In AWS Step Functions, execution history is retained for a maximum of 90 days. For more information, see Quotas and Quotas for standard workflows in the AWS Step Functions documentation.
Product versions
Python 3.11 for AWS Lambda
AWS Glue version 2.0
Architecture
The workflow illustrated in the diagram consists of these high-level steps:
The user uploads a CSV file into the source folder in Amazon S3.
An Amazon S3 notification event initiates an AWS Lambda function that starts the Step Functions state machine.
The Lambda function validates the schema and data type of the raw CSV file.
Depending on the validation results:
If validation of the source file succeeds, the file moves to the stage folder for further processing.
If validation fails, the file moves to the error folder, and an error notification is sent through Amazon Simple Notification Service (Amazon SNS).
An AWS Glue crawler creates the schema of the raw file from the stage folder in Amazon S3.
An AWS Glue job transforms, compresses, and partitions the raw file into Parquet format.
The AWS Glue job also moves the file to the transform folder in Amazon S3.
The AWS Glue crawler creates the schema from the transformed file. The resulting schema can be used by any analytics job. You can also use Amazon Athena to run ad-hoc queries.
If the pipeline completes without errors, the source file is moved to the archive folder. If any errors are encountered, the file is moved to the error folder instead.
Amazon SNS sends a notification that indicates success or failure based on the pipeline completion status.
All the AWS resources used in this pattern are serverless. There are no servers to manage.
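As a concrete illustration of the event-driven entry point described above, the following is a minimal sketch of a Lambda handler that starts the Step Functions state machine when Amazon S3 delivers an object-created notification. The STATE_MACHINE_ARN environment variable and the input payload fields are illustrative assumptions; the repository's start_step_function.py is the authoritative implementation.

```python
import json
import os

import boto3

sfn = boto3.client("stepfunctions")


def lambda_handler(event, context):
    """Start the Step Functions state machine for each uploaded CSV object.

    Sketch only: assumes an S3 ObjectCreated notification event and a
    STATE_MACHINE_ARN environment variable (illustrative name).
    """
    state_machine_arn = os.environ["STATE_MACHINE_ARN"]
    executions = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        # Pass the object location to the state machine as its input payload.
        response = sfn.start_execution(
            stateMachineArn=state_machine_arn,
            input=json.dumps({"bucket": bucket, "key": key}),
        )
        executions.append(response["executionArn"])
    return {"startedExecutions": executions}
```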
Tools
AWS services
AWS Glue – AWS Glue is a fully managed ETL service that makes it easy for customers to prepare and load their data for analytics.
AWS Step Functions – AWS Step Functions is a serverless orchestration service that lets you combine AWS Lambda functions and other AWS services to build business-critical applications. Through the AWS Step Functions graphical console, you see your application’s workflow as a series of event-driven steps.
Amazon S3 – Amazon Simple Storage Service (Amazon S3) is an object storage service that offers industry-leading scalability, data availability, security, and performance.
Amazon SNS – Amazon Simple Notification Service (Amazon SNS) is a highly available, durable, secure, fully managed pub/sub messaging service that enables you to decouple microservices, distributed systems, and serverless applications.
AWS Lambda – AWS Lambda is a compute service that lets you run code without provisioning or managing servers. AWS Lambda runs your code only when needed and scales automatically, from a few requests per day to thousands per second.
Code
The code for this pattern is available on GitHub, in the ETL Pipeline with AWS Step Functions repository. The repository contains the following files and folders:
template.yml – AWS CloudFormation template for creating the ETL pipeline with AWS Step Functions.
parameter.json – Contains all parameters and parameter values. You update this file to change parameter values, as described in the Epics section.
myLayer/python folder – Contains the Python packages needed to create the required AWS Lambda layer for this project.
lambda folder – Contains the following Lambda functions:
  move_file.py – Moves the source dataset to the archive, transform, or error folder.
  check_crawler.py – Checks the status of the AWS Glue crawler as many times as configured by the RETRYLIMIT environment variable before it sends a failure message (a sketch of this check follows the list).
  start_crawler.py – Starts the AWS Glue crawler.
  start_step_function.py – Starts the AWS Step Functions state machine.
  start_codebuild.py – Starts the AWS CodeBuild project.
  validation.py – Validates the input raw dataset.
  s3object.py – Creates the required directory structure inside the S3 bucket.
  notification.py – Sends success or error notifications at the end of the pipeline.
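As an example of the crawler-status check, the following minimal sketch polls the AWS Glue crawler state and gives up after RETRYLIMIT attempts. The crawler_name parameter, the 30-second wait, and the return values are assumptions for illustration; the repository's check_crawler.py is the authoritative implementation.

```python
import os
import time

import boto3

glue = boto3.client("glue")


def check_crawler_status(crawler_name):
    """Poll the AWS Glue crawler until it is READY, up to RETRYLIMIT attempts.

    Sketch only: the wait interval, return values, and crawler_name parameter
    are illustrative; RETRYLIMIT is the environment variable described above.
    """
    retry_limit = int(os.environ.get("RETRYLIMIT", "10"))
    for attempt in range(1, retry_limit + 1):
        state = glue.get_crawler(Name=crawler_name)["Crawler"]["State"]
        if state == "READY":  # crawl finished; the catalog table is up to date
            return {"status": "SUCCEEDED", "attempts": attempt}
        time.sleep(30)  # illustrative pause between polls
    # Retry limit exhausted; the pipeline reports this as a failure.
    return {"status": "FAILED", "attempts": retry_limit}
```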
To use the sample code, follow the instructions in the Epics section.
Epics
Task | Description | Skills required |
---|---|---|
Clone the sample code repository. | Clone the ETL Pipeline with AWS Step Functions repository from GitHub to your local machine. | Developer |
Update parameter values. | In your local copy of the repository, edit the parameter.json file to update the parameter values for your environment. | Developer |
Upload the source code to the S3 bucket. | Before you deploy the CloudFormation template that automates the ETL pipeline, you must package the source files for the CloudFormation template and upload them to an S3 bucket. To do this, run the AWS CLI packaging command with your preconfigured profile (a representative example follows this table). | Developer |
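The repository's instructions give the exact command and explain each placeholder. A representative packaging command, assuming placeholder bucket and profile names, looks like the following:

```bash
# Package local artifacts referenced by template.yml and upload them to an
# S3 bucket. <your-artifact-bucket> and <your-profile> are placeholders.
aws cloudformation package \
  --template-file template.yml \
  --s3-bucket <your-artifact-bucket> \
  --output-template-file packaged.template.yml \
  --profile <your-profile>
```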
Task | Description | Skills required |
---|---|---|
Deploy the CloudFormation template. | To deploy the CloudFormation template, run the AWS CLI deployment command with your packaged template and parameter values (a representative example follows this table). | Developer |
Check progress. | On the AWS CloudFormation console, check the progress of the stack deployment. When the stack reaches CREATE_COMPLETE status, the pipeline resources have been deployed. | Developer |
Note the AWS Glue database name. | The Outputs tab for the stack displays the name of the AWS Glue database. | Developer |
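The repository's instructions give the exact deployment command and describe each placeholder. A representative invocation, assuming a placeholder stack name and profile and using the parameter.json file described in the Code section, looks like the following:

```bash
# Create the stack from the packaged template, passing the parameter values
# from parameter.json. The stack name, profile, and capability flag are
# placeholders; adjust them to match the repository's instructions.
aws cloudformation create-stack \
  --stack-name etl-pipeline-stack \
  --template-body file://packaged.template.yml \
  --parameters file://parameter.json \
  --capabilities CAPABILITY_NAMED_IAM \
  --profile <your-profile>
```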
Task | Description | Skills required |
---|---|---|
Start the ETL pipeline. | Upload the sample CSV file to the source folder in the S3 bucket. The upload notification starts the pipeline. | Developer |
Check for the partitioned dataset. | When the ETL pipeline completes, verify that the partitioned dataset is available in the Amazon S3 transform folder (a quick check with the AWS CLI is shown after this table). | Developer |
Check for the partitioned AWS Glue database. | On the AWS Glue console, verify that the AWS Glue crawler created a partitioned table for the transformed dataset in the AWS Glue database that you noted earlier. | Developer |
Run queries. | (Optional) Use Amazon Athena to run ad-hoc queries on the partitioned and transformed database. For instructions, see Running SQL Queries Using Amazon Athena in the AWS documentation. | Database analyst |
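To spot-check the year, month, and day partitioning, you can list the transform prefix with the AWS CLI. The bucket name and prefix below are placeholders, and the expected key layout assumes Hive-style partition keys:

```bash
# List the transformed, partitioned Parquet output. The bucket name and the
# transform/ prefix are placeholders for the values your stack creates.
aws s3 ls s3://<your-data-bucket>/transform/ --recursive
# Expected keys resemble:
#   transform/year=2024/month=07/day=15/part-00000-....snappy.parquet
```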
Troubleshooting
Issue | Solution |
---|---|
AWS Identity and Access Management (IAM) permissions for the AWS Glue job and crawler | If you further customize the AWS Glue job or the crawler, be sure to grant the appropriate IAM permissions in the IAM role used by the AWS Glue job, or provide data permissions to AWS Lake Formation. For more information, see the AWS documentation. |
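For reference, a customized AWS Glue job role typically needs policy statements along the following lines. This is a hedged sketch: the bucket name is a placeholder, and your customization may require different actions or more tightly scoped AWS Glue Data Catalog resources.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "GlueCatalogAccessExample",
      "Effect": "Allow",
      "Action": [
        "glue:GetTable",
        "glue:GetPartitions",
        "glue:BatchCreatePartition"
      ],
      "Resource": "*"
    },
    {
      "Sid": "S3DataAccessExample",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::<your-data-bucket>",
        "arn:aws:s3:::<your-data-bucket>/*"
      ]
    }
  ]
}
```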
Related resources
AWS service documentation
Additional information
The following diagram shows the AWS Step Functions workflow for a successful ETL pipeline, from the Step Functions Inspector panel.
The following diagram shows the AWS Step Functions workflow for an ETL pipeline that fails because of an input validation error, from the Step Functions Inspector panel.