Orchestrate an ETL pipeline with validation, transformation, and partitioning using AWS Step Functions

Created by Sandip Gangapadhyay (AWS)

Summary

This pattern describes how to build a serverless extract, transform, and load (ETL) pipeline to validate, transform, compress, and partition a large CSV dataset for performance and cost optimization. The pipeline is orchestrated by AWS Step Functions and includes error handling, automated retry, and user notification features.

When a CSV file is uploaded to an Amazon Simple Storage Service (Amazon S3) bucket source folder, the ETL pipeline starts to run. The pipeline validates the content and the schema of the source CSV file, transforms the CSV file to a compressed Apache Parquet format, partitions the dataset by year, month, and day, and stores it in a separate folder for analytics tools to process.

The code that automates this pattern is available on GitHub, in the ETL Pipeline with AWS Step Functions repository.

Prerequisites and limitations

Prerequisites

  • An active AWS account.

  • AWS Command Line Interface (AWS CLI) installed and configured with your AWS account, so that you can create AWS resources by deploying an AWS CloudFormation stack. AWS CLI version 2 is recommended. For installation instructions, see Installing, updating, and uninstalling the AWS CLI version 2 in the AWS CLI documentation. For AWS CLI configuration instructions, see Configuration and credential file settings in the AWS CLI documentation.

  • An Amazon S3 bucket.

  • A CSV dataset with the correct schema. (The code repository included with this pattern provides a sample CSV file with the correct schema and data type that you can use.)

  • A web browser that is supported for use with the AWS Management Console. (See the list of supported browsers.)

  • AWS Glue console access.

  • AWS Step Functions console access.

Limitations

  • AWS Step Functions retains execution history for a maximum of 90 days. For more information, see Quotas and Quotas for standard workflows in the AWS Step Functions documentation.

Product versions

  • Python 3.11 for AWS Lambda

  • AWS Glue version 2.0

Architecture

ETL process from S3 source bucket through Step Functions, AWS Glue, and Amazon SNS in 10 steps.

The workflow illustrated in the diagram consists of these high-level steps:

  1. The user uploads a CSV file into the source folder in Amazon S3.

  2. An Amazon S3 event notification invokes an AWS Lambda function that starts the Step Functions state machine (a minimal handler sketch follows this list).

  3. The Lambda function validates the schema and data type of the raw CSV file.

  4. Depending on the validation results:

    1. If validation of the source file succeeds, the file moves to the stage folder for further processing.

    2. If validation fails, the file moves to the error folder, and an error notification is sent through Amazon Simple Notification Service (Amazon SNS).

  5. An AWS Glue crawler creates the schema of the raw file from the stage folder in Amazon S3.

  6. An AWS Glue job transforms, compresses, and partitions the raw file into Parquet format (a simplified example appears below).

  7. The AWS Glue job also moves the file to the transform folder in Amazon S3.

  8. The AWS Glue crawler creates the schema from the transformed file. The resulting schema can be used by any analytics job. You can also use Amazon Athena to run ad-hoc queries.

  9. If the pipeline completes without errors, the source file is moved to the archive folder. If any errors are encountered, the file is moved to the error folder instead.

  10. Amazon SNS sends a notification that indicates success or failure based on the pipeline completion status.
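
For illustration, the following is a minimal sketch of the kind of Lambda handler that step 2 describes. It assumes boto3 is available in the Lambda runtime and that the state machine ARN is supplied through a STATE_MACHINE_ARN environment variable (a hypothetical name; the repository's start_step_function.py may be organized differently).

import json
import os

import boto3

sfn = boto3.client("stepfunctions")

def lambda_handler(event, context):
    # The S3 event notification carries the bucket and object key of the uploaded CSV file.
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]

    # Start the Step Functions state machine, passing the file location as input.
    response = sfn.start_execution(
        stateMachineArn=os.environ["STATE_MACHINE_ARN"],
        input=json.dumps({"bucket": bucket, "key": key}),
    )
    return {"executionArn": response["executionArn"]}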

All the AWS resources used in this pattern are serverless. There are no servers to manage.
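
The transformation in step 6 can be pictured with the following simplified PySpark sketch. It is illustrative only, not the repository's actual AWS Glue script, and it assumes the source CSV has a transaction_date column from which the year, month, and day partitions are derived; bucket and folder names are placeholders.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("etl-transform").getOrCreate()

# Read the validated CSV file from the stage folder.
df = spark.read.option("header", "true").csv("s3://my-etl-bucket/stage/")

# Derive partition columns from a date column (hypothetical column name).
df = (
    df.withColumn("txn_date", F.to_date("transaction_date"))
      .withColumn("year", F.year("txn_date"))
      .withColumn("month", F.month("txn_date"))
      .withColumn("day", F.dayofmonth("txn_date"))
)

# Write compressed Parquet, partitioned by year, month, and day, to the transform folder.
(
    df.write.mode("overwrite")
      .partitionBy("year", "month", "day")
      .option("compression", "snappy")
      .parquet("s3://my-etl-bucket/transform/")
)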

Tools

AWS services

  • AWS Glue – AWS Glue is a fully managed ETL service that makes it easy for customers to prepare and load their data for analytics.

  • AWS Step Functions – AWS Step Functions is a serverless orchestration service that lets you combine AWS Lambda functions and other AWS services to build business-critical applications. Through the AWS Step Functions graphical console, you see your application’s workflow as a series of event-driven steps.

  • Amazon S3 – Amazon Simple Storage Service (Amazon S3) is an object storage service that offers industry-leading scalability, data availability, security, and performance.

  • Amazon SNS – Amazon Simple Notification Service (Amazon SNS) is a highly available, durable, secure, fully managed pub/sub messaging service that enables you to decouple microservices, distributed systems, and serverless applications.

  • AWS Lambda – AWS Lambda is a compute service that lets you run code without provisioning or managing servers. AWS Lambda runs your code only when needed and scales automatically, from a few requests per day to thousands per second.

Code 

The code for this pattern is available on GitHub, in the ETL Pipeline with AWS Step Functions repository. The code repository contains the following files and folders:

  • template.yml – AWS CloudFormation template for creating the ETL pipeline with AWS Step Functions.

  • parameter.json –  Contains all parameters and parameter values. You update this file to change parameter values, as described in the Epics section.

  • myLayer/python folder –  Contains Python packages needed to create the required AWS Lambda layer for this project.

  • lambda folder –  Contains the following Lambda functions:

    • move_file.py –  Moves the source dataset to the archive, transform, or error folder.

    • check_crawler.py – Checks the status of the AWS Glue crawler as many times as configured by the RETRYLIMIT environment variable before it sends a failure message (a simplified sketch appears after this list).

    • start_crawler.py – Starts the AWS Glue crawler.

    • start_step_function.py – Starts the AWS Step Functions state machine.

    • start_codebuild.py – Starts the AWS CodeBuild project.

    • validation.py – Validates the input raw dataset.

    • s3object.py – Creates the required directory structure inside the S3 bucket.

    • notification.py – Sends success or error notifications at the end of the pipeline.

To use the sample code, follow the instructions in the Epics section.
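
As an illustration of the crawler check described above, the following is a minimal sketch that polls the AWS Glue crawler state up to a retry limit. It assumes boto3 and a crawler_name field in the function input; it is not the repository's actual check_crawler.py.

import os

import boto3

glue = boto3.client("glue")

def lambda_handler(event, context):
    # Number of status checks to allow before reporting failure (hypothetical default).
    retry_limit = int(os.environ.get("RETRYLIMIT", "10"))
    retry_count = int(event.get("retry_count", 0))

    # READY means the crawler has finished and is idle again.
    state = glue.get_crawler(Name=event["crawler_name"])["Crawler"]["State"]

    if state == "READY":
        return {"status": "SUCCEEDED"}
    if retry_count >= retry_limit:
        return {"status": "FAILED", "reason": "Retry limit reached"}
    # Otherwise, the state machine waits and invokes this function again.
    return {"status": "RUNNING", "retry_count": retry_count + 1}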

Epics

Task | Description | Skills required

Clone the sample code repository.

  1. Open the ETL Pipeline with AWS Step Functions repository.

  2. Choose Code on the main repository page, above the file list, and copy the URL listed under Clone with HTTPS.

  3. Change your working directory to the location where you want to store the sample files.

  4. At a terminal or command prompt, type the command:

    git clone <repoURL>

    where <repoURL> refers to the URL you copied in step 2.

Developer

Update parameter values.

In your local copy of the repository, edit the parameter.json file and update the default parameter values as follows:

  • pS3BucketName – The name of the S3 bucket for storing the datasets. The template will create this bucket for you. The bucket name must be globally unique.

  • pSourceFolder – The name of the folder inside the S3 bucket that will be used to upload the source CSV file.

  • pStageFolder – The name of the folder inside the S3 bucket that will be used as the staging area during the process.

  • pTransformFolder – The name of the folder inside the S3 bucket that will be used to store transformed and partitioned datasets.

  • pErrorFolder – The folder inside the S3 bucket that the source CSV file will be moved to if it can’t be validated.

  • pArchiveFolder – The name of the folder inside the S3 bucket that will be used to archive the source CSV file.

  • pEmailforNotification – A valid email address for receiving success/error notifications.

  • pPrefix – A prefix string that will be used in the AWS Glue crawler name.

  • pDatasetSchema – The dataset schema that the source file will be validated against. The Cerberus Python package is used for source dataset validation (see the example after this task). For more information, see the Cerberus website.

Developer
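
To show how Cerberus performs this kind of validation, here is a minimal, hypothetical example; the schema and field names are illustrative and are not the ones defined in parameter.json.

from cerberus import Validator

# Hypothetical schema: each field declares its expected type and constraints.
schema = {
    "customer_id": {"type": "integer", "required": True},
    "transaction_date": {"type": "string", "required": True},
    "amount": {"type": "float", "min": 0},
}

validator = Validator(schema)

record = {"customer_id": 101, "transaction_date": "2023-06-01", "amount": 250.75}

if validator.validate(record):
    print("Record is valid")
else:
    # validator.errors maps each failing field to its validation errors.
    print("Validation failed:", validator.errors)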

Upload the source code to the S3 bucket.

Before you deploy the CloudFormation template that automates the ETL pipeline, you must package the source files for the CloudFormation template and upload them to an S3 bucket. To do this, run the following AWS CLI command with your preconfigured profile:

aws cloudformation package --template-file template.yml --s3-bucket <bucket_name> --output-template-file packaged.template --profile <profile_name>

where:

  • <bucket_name> is the name of an existing S3 bucket in the AWS Region where you want to deploy the stack. This bucket is used to store the source code package for the CloudFormation template.

  • <profile_name> is a valid AWS CLI profile that you preconfigured when you set up AWS CLI.

Developer
Task | Description | Skills required

Deploy the CloudFormation template.

To deploy the CloudFormation template, run the following AWS CLI command:

aws cloudformation deploy --stack-name <stack_name> --template-file packaged.template --parameter-overrides file://parameter.json --capabilities CAPABILITY_IAM --profile <profile_name>

where:

  • <stack_name> is a unique identifier for the CloudFormation stack.

  • <profile_name> is your preconfigured AWS CLI profile.

Developer

Check progress.

On the AWS CloudFormation console, check the progress of stack creation. When the status is CREATE_COMPLETE, the stack has been deployed successfully.

Developer

Note the AWS Glue database name.

The Outputs tab for the stack displays the name of the AWS Glue database. The key name is GlueDBOutput.

Developer
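
If you prefer a script over the console, the following minimal boto3 sketch reads the GlueDBOutput value from the stack outputs; the stack name is a placeholder for the one you used when deploying.

import boto3

cfn = boto3.client("cloudformation")

# Replace with the stack name you used when deploying.
stack = cfn.describe_stacks(StackName="my-etl-stack")["Stacks"][0]

# Find the AWS Glue database name in the stack outputs.
for output in stack.get("Outputs", []):
    if output["OutputKey"] == "GlueDBOutput":
        print("AWS Glue database:", output["OutputValue"])
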
Task | Description | Skills required

Start the ETL pipeline.

  1. Navigate to the source folder (source, or the folder name you set for pSourceFolder in the parameter.json file) inside the S3 bucket.

  2. Upload a sample CSV file to this folder. (The code repository provides a sample file called Sample_Bank_Transaction_Raw_Dataset.csv that you can use.) Uploading the file starts the ETL pipeline through Step Functions.

  3. On the Step Functions console, check the ETL pipeline status.

Developer

Check for the partitioned dataset.

When the ETL pipeline completes, verify that the partitioned dataset is available in the Amazon S3 transform folder (transform, or the folder name you set for pTransformFolder in the parameter.json file).

Developer
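
To check from a script instead of the console, a minimal boto3 sketch such as the following lists the partitioned objects; the bucket and folder names are placeholders for the values in your parameter.json file.

import boto3

s3 = boto3.client("s3")

# Replace with your bucket name and transform folder prefix.
response = s3.list_objects_v2(Bucket="my-etl-bucket", Prefix="transform/")

# Partitioned Parquet objects appear under year=/month=/day= prefixes.
for obj in response.get("Contents", []):
    print(obj["Key"])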

Check for the partitioned AWS Glue database.

  1. On the AWS Glue console, select the AWS Glue database created by the stack (this is the database that you noted in the previous epic).

  2. Verify that the partitioned table is available in the AWS Glue Data Catalog.

Developer
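
Alternatively, the following minimal boto3 sketch confirms the table and a few of its partitions; the database name is a placeholder for the GlueDBOutput value you noted earlier.

import boto3

glue = boto3.client("glue")

database = "my_glue_database"  # the GlueDBOutput value from the stack outputs

# List the tables that the crawler created in the database.
for table in glue.get_tables(DatabaseName=database)["TableList"]:
    print("Table:", table["Name"])

    # Show a few partitions (year/month/day) registered for the table.
    partitions = glue.get_partitions(
        DatabaseName=database, TableName=table["Name"], MaxResults=5
    )
    for partition in partitions["PartitionList"]:
        print("  Partition values:", partition["Values"])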

Run queries.

(Optional) Use Amazon Athena to run ad-hoc queries on the partitioned and transformed database. For instructions, see Running SQL Queries Using Amazon Athena in the AWS documentation.

Database analyst
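
If you want to script the query instead of using the Athena console, the following minimal boto3 sketch submits an ad-hoc query against the partitioned table; the database, table name, and results location are placeholders.

import boto3

athena = boto3.client("athena")

# Placeholders: use your AWS Glue database, table name, and an S3 location for query results.
response = athena.start_query_execution(
    QueryString="SELECT year, month, day, COUNT(*) AS records "
                "FROM my_table GROUP BY year, month, day",
    QueryExecutionContext={"Database": "my_glue_database"},
    ResultConfiguration={"OutputLocation": "s3://my-etl-bucket/athena-results/"},
)
print("Query execution ID:", response["QueryExecutionId"])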

Troubleshooting

Issue | Solution

AWS Identity and Access Management (IAM) permissions for the AWS Glue job and crawler

If you further customize the AWS Glue job or the crawler, be sure to grant the appropriate IAM permissions in the IAM role used by the AWS Glue job, or provide data permissions to AWS Lake Formation. For more information, see the AWS documentation.

Related resources

AWS service documentation

Additional information

The following diagram shows the AWS Step Functions workflow for a successful ETL pipeline, from the Step Functions Inspector panel.

Step Functions workflow for validating the input .csv, crawling data, and running the AWS Glue job.

The following diagram shows the AWS Step Functions workflow for an ETL pipeline that fails because of an input validation error, from the Step Functions Inspector panel.

Step Functions workflow with failure so the file moves to the error folder.