Deploy an AWS Glue job with an AWS CodePipeline CI/CD pipeline - AWS Prescriptive Guidance

Deploy an AWS Glue job with an AWS CodePipeline CI/CD pipeline

Created by Bruno Klein (AWS) and Luis Henrique Massao Yamada (AWS)

Environment: Production

Technologies: DevOps; Big data

AWS services: AWS Glue; AWS CodeCommit; AWS CodePipeline; AWS Lambda

Summary

This pattern demonstrates how you can integrate AWS CodeCommit and AWS CodePipeline with AWS Glue, and use AWS Lambda to launch jobs as soon as a developer pushes changes to a remote AWS CodeCommit repository.

When a developer commits a change to an extract, transform, and load (ETL) repository and pushes it to AWS CodeCommit, the pipeline is invoked. The pipeline invokes a Lambda function that launches an AWS Glue job with these changes. The AWS Glue job performs the ETL task.

This solution is useful when businesses, developers, and data engineers want to launch jobs as soon as changes are committed and pushed to the target repositories. It increases automation and reproducibility, which helps avoid errors during the job launch and lifecycle.

Prerequisites and limitations

Prerequisites 

Limitations

  • The pipeline finishes as soon as the AWS Glue job is successfully launched. It does not wait for the job to complete.

  • The code provided in the attachment is intended for demo purposes only.
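To illustrate the first limitation, the following minimal sketch shows one way a caller could poll a job run until it reaches a terminal state. It is not part of the attachment. The function name `wait_for_job_run`, the polling parameters, and the set of states treated as terminal are assumptions made for illustration. The `glue` argument is expected to be a boto3 Glue client (for example, `boto3.client("glue")`), and only its `get_job_run` call is used.

```python
import time

# States that this sketch treats as terminal for an AWS Glue job run.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR", "EXPIRED"}


def wait_for_job_run(glue, job_name, run_id, poll_seconds=30, max_polls=20,
                     sleep=time.sleep):
    """Poll a Glue job run and return the last state that was observed.

    Returns a terminal state if one is reached within max_polls checks,
    otherwise the last non-terminal state (or None if max_polls < 1).
    """
    state = None
    for attempt in range(max_polls):
        response = glue.get_job_run(JobName=job_name, RunId=run_id)
        state = response["JobRun"]["JobRunState"]
        if state in TERMINAL_STATES:
            return state
        # Sleep between checks, but not after the final check.
        if attempt < max_polls - 1:
            sleep(poll_seconds)
    return state
```

Because a Lambda function can run for at most 15 minutes, a loop like this is only suitable for short jobs when it runs inside Lambda. For longer jobs, the polling would need to run elsewhere.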

Architecture

Target technology stack  

  • AWS Glue

  • AWS Lambda

  • AWS CodePipeline

  • AWS CodeCommit

Target architecture 

The process consists of these steps:

  1. The developer or data engineer makes a modification in the ETL code, commits, and pushes the change to AWS CodeCommit.

  2. The push initiates the pipeline.

  3. The pipeline initiates a Lambda function, which calls codecommit:GetFile on the repository and uploads the file to Amazon Simple Storage Service (Amazon S3).

  4. The Lambda function launches a new AWS Glue job with the ETL code.

  5. The Lambda function reports its result to the pipeline, which then finishes.
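Steps 3 through 5 can be sketched as a Lambda handler along the following lines. This is an illustrative outline, not the code from the attachment. The function name `handle_pipeline_event`, the `config` keys, and the environment variable names are hypothetical. The sketch also assumes that the AWS Glue job already exists and that its script location points to the uploaded S3 object, so "launching" the job means starting a job run.

```python
def handle_pipeline_event(event, codecommit, s3, glue, codepipeline, config):
    """Copy the ETL script to S3, start the Glue job, and report to CodePipeline."""
    # CodePipeline passes the ID of the current action's job in the event.
    job_id = event["CodePipeline.job"]["id"]
    try:
        # Step 3: read the script from the repository and upload it to Amazon S3.
        source = codecommit.get_file(
            repositoryName=config["repository"],
            filePath=config["script_path"],
            commitSpecifier=config["branch"],
        )
        s3.put_object(
            Bucket=config["bucket"],
            Key=config["script_key"],
            Body=source["fileContent"],
        )
        # Step 4: start a run of the Glue job (assumed to read its script from S3).
        run = glue.start_job_run(JobName=config["glue_job_name"])
        # Step 5: report success without waiting for the job run to complete.
        codepipeline.put_job_success_result(jobId=job_id)
        return run["JobRunId"]
    except Exception as error:
        # Report the failure so that the pipeline action does not hang.
        codepipeline.put_job_failure_result(
            jobId=job_id,
            failureDetails={"type": "JobFailed", "message": str(error)},
        )
        raise


def lambda_handler(event, context):
    # Imported here so that the helper above can be exercised without boto3.
    import os
    import boto3

    # Hypothetical environment variable names; adjust to your deployment.
    config = {
        "repository": os.environ["REPOSITORY_NAME"],
        "script_path": os.environ.get("SCRIPT_PATH", "etl.py"),
        "branch": os.environ.get("BRANCH", "main"),
        "bucket": os.environ["SCRIPT_BUCKET"],
        "script_key": os.environ.get("SCRIPT_KEY", "etl.py"),
        "glue_job_name": os.environ["GLUE_JOB_NAME"],
    }
    return handle_pipeline_event(
        event,
        boto3.client("codecommit"),
        boto3.client("s3"),
        boto3.client("glue"),
        boto3.client("codepipeline"),
        config,
    )
```

Passing the clients in as arguments keeps the core logic independent of boto3, which makes it straightforward to exercise with stub objects before deploying.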

Automation and scale

The sample attachment demonstrates how you can integrate AWS Glue with AWS CodePipeline. It provides a baseline example that you can customize or extend for your own use. For details, see the Epics section.

Tools

  • AWS CodePipeline – AWS CodePipeline is a fully managed continuous delivery service that helps you automate your release pipelines for fast and reliable application and infrastructure updates.

  • AWS CodeCommit – AWS CodeCommit is a fully managed source control service that hosts secure, Git-based repositories.

  • AWS Lambda – AWS Lambda is a serverless compute service that lets you run code without provisioning or managing servers.

  • AWS Glue – AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development.

  • Git client – Use Git from the command line, or a GUI or desktop tool, to commit the sample code and push it to the AWS CodeCommit repository.

  • AWS CDK – The AWS CDK is an open source software development framework that helps you define your cloud application resources by using familiar programming languages.

Epics

Task | Description | Skills required
Configure the AWS CLI.

Configure the AWS Command Line Interface (AWS CLI) to target and authenticate with your current AWS account. For instructions, see the AWS CLI documentation.

Developer, DevOps engineer
Extract the sample project files.

Extract the files from the attachment to create a folder that contains the sample project files.

Developer, DevOps engineer
Deploy the sample code.

After you extract the files, run the following commands from the extracted folder to create a baseline example:

    cdk bootstrap
    cdk deploy
    git init
    git remote add origin <code-commit-repository-url>
    git stage .
    git commit -m "adds sample code"
    git push --set-upstream origin main

After the last command, you can monitor the status of the pipeline and the AWS Glue job.

Developer, DevOps engineer
Customize the code.

Customize the code in the etl.py file to meet your business requirements. You can revise the ETL code, modify the pipeline stages, or extend the solution.

Data engineer
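The "Deploy the sample code" task mentions monitoring the pipeline and the AWS Glue job after the push. As a rough sketch of how that check could be scripted, the following helpers summarize the latest status of each pipeline stage and the state of the most recently started job run. The function names and the `"NotStarted"` placeholder are assumptions for illustration. The `codepipeline` and `glue` arguments are expected to be boto3 clients, and only the first page of job runs is inspected.

```python
def summarize_pipeline(codepipeline, pipeline_name):
    """Return a mapping of stage name to the status of its latest execution."""
    state = codepipeline.get_pipeline_state(name=pipeline_name)
    return {
        stage["stageName"]: stage.get("latestExecution", {}).get("status", "NotStarted")
        for stage in state.get("stageStates", [])
    }


def latest_glue_run_state(glue, job_name):
    """Return the state of the most recently started run, or None if there is none."""
    runs = glue.get_job_runs(JobName=job_name)["JobRuns"]
    if not runs:
        return None
    # Pick the newest run explicitly rather than relying on response order.
    newest = max(runs, key=lambda run: run["StartedOn"])
    return newest["JobRunState"]
```

For example, with `import boto3`, calling `summarize_pipeline(boto3.client("codepipeline"), "<pipeline-name>")` returns one entry per stage. The pipeline and job names depend on what the sample deployment creates in your account.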

Related resources

Attachments

To access additional content that is associated with this document, unzip the following file: attachment.zip