Deploy an AWS Glue job with an AWS CodePipeline CI/CD pipeline
Created by Bruno Klein (AWS) and Luis Henrique Massao Yamada (AWS)
Environment: Production | Technologies: DevOps; Big data | AWS services: AWS Glue; AWS CodeCommit; AWS CodePipeline; AWS Lambda |
Summary
This pattern demonstrates how you can integrate Amazon Web Services (AWS) CodeCommit and AWS CodePipeline with AWS Glue, and use AWS Lambda to launch jobs as soon as a developer pushes their changes to a remote AWS CodeCommit repository.
When a developer submits a change to an extract, transform, and load (ETL) repository and pushes the changes to AWS CodeCommit, a new pipeline is invoked. The pipeline initiates a Lambda function that launches an AWS Glue job with these changes. The AWS Glue job performs the ETL task.
This solution is helpful when businesses, developers, and data engineers want to launch jobs as soon as changes are committed and pushed to the target repositories. It provides a higher level of automation and reproducibility, which helps avoid errors during the job launch and lifecycle.
Prerequisites and limitations
Prerequisites
An active AWS account
Git installed on the local machine
AWS Cloud Development Kit (AWS CDK) installed on the local machine
Python installed on the local machine
The code in the Attachments section
Limitations
The pipeline finishes as soon as the AWS Glue job is successfully launched. It does not wait for the job to complete.
The code provided in the attachment is intended for demo purposes only.
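Because the pipeline does not wait for the AWS Glue job to complete, a caller that needs the final result has to poll for it separately. The following is a minimal sketch of such polling, not part of the attached sample. It assumes a boto3 Glue client is passed in as `glue`; the function name `wait_for_job_run` and its parameters are hypothetical.

```python
import time

# Job run states after which the run no longer changes.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR", "EXPIRED"}


def wait_for_job_run(glue, job_name, run_id,
                     poll_seconds=30, max_polls=60, sleep=time.sleep):
    """Poll glue:GetJobRun until the run reaches a terminal state.

    `glue` is assumed to be a boto3 Glue client, for example
    boto3.client("glue"). Returns the final JobRunState string.
    """
    for _ in range(max_polls):
        response = glue.get_job_run(JobName=job_name, RunId=run_id)
        state = response["JobRun"]["JobRunState"]
        if state in TERMINAL_STATES:
            return state
        # The run is still starting, running, or stopping; check again later.
        sleep(poll_seconds)
    raise TimeoutError(
        f"Job run {run_id} of {job_name} did not finish after {max_polls} polls"
    )
```

A helper like this is better suited to a workstation script or a separate pipeline step than to the launching Lambda function, because a Lambda invocation is limited to 15 minutes and ETL jobs can run longer.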
Architecture
Target technology stack
AWS Glue
AWS Lambda
AWS CodePipeline
AWS CodeCommit
Target architecture
The process consists of these steps:
The developer or data engineer makes a modification in the ETL code, commits, and pushes the change to AWS CodeCommit.
The push initiates the pipeline.
The pipeline initiates a Lambda function, which calls codecommit:GetFile on the repository and uploads the file to Amazon Simple Storage Service (Amazon S3).
The Lambda function launches a new AWS Glue job with the ETL code.
The Lambda function finishes the pipeline.
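The Lambda steps above can be sketched with boto3 as follows. This is an illustrative outline under stated assumptions, not the code from the attachment: the function name `launch_glue_job` and the environment variable names (`REPO_NAME`, `FILE_PATH`, `BUCKET`, `JOB_NAME`) are hypothetical. The clients are passed in as arguments so that the logic can be exercised without AWS access.

```python
import os


def launch_glue_job(event, codecommit, s3, glue, codepipeline,
                    repo_name, file_path, bucket, job_name):
    """Copy the ETL script from CodeCommit to S3, start the Glue job,
    and report the outcome back to CodePipeline."""
    # CodePipeline passes its job ID in the invocation event.
    pipeline_job_id = event["CodePipeline.job"]["id"]
    try:
        # Read the script from the repository (codecommit:GetFile).
        # No commitSpecifier is given, so the head commit is used.
        source = codecommit.get_file(repositoryName=repo_name, filePath=file_path)
        # Upload the script to Amazon S3.
        s3.put_object(Bucket=bucket, Key=file_path, Body=source["fileContent"])
        # Launch the Glue job; this call returns without waiting for the run.
        run = glue.start_job_run(JobName=job_name)
        # Finish the pipeline action.
        codepipeline.put_job_success_result(jobId=pipeline_job_id)
        return run["JobRunId"]
    except Exception as error:
        # Report the failure so that the pipeline action does not hang.
        codepipeline.put_job_failure_result(
            jobId=pipeline_job_id,
            failureDetails={"type": "JobFailed", "message": str(error)},
        )
        raise


def lambda_handler(event, context):
    import boto3  # provided by the AWS Lambda Python runtime

    # The environment variable names below are assumptions for this sketch.
    return launch_glue_job(
        event,
        boto3.client("codecommit"),
        boto3.client("s3"),
        boto3.client("glue"),
        boto3.client("codepipeline"),
        repo_name=os.environ["REPO_NAME"],
        file_path=os.environ["FILE_PATH"],
        bucket=os.environ["BUCKET"],
        job_name=os.environ["JOB_NAME"],
    )
```

Reporting success immediately after `start_job_run` is what makes the pipeline finish before the AWS Glue job does.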
Automation and scale
The sample attachment demonstrates how you can integrate AWS Glue with AWS CodePipeline. It provides a baseline example that you can customize or extend for your own use. For details, see the Epics section.
Tools
AWS CodePipeline – AWS CodePipeline is a fully managed continuous delivery service that helps you automate your release pipelines for fast and reliable application and infrastructure updates.
AWS CodeCommit – AWS CodeCommit is a fully managed source control service that hosts secure, Git-based repositories.
AWS Lambda – AWS Lambda is a serverless compute service that lets you run code without provisioning or managing servers.
AWS Glue – AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development.
Git client – Git provides GUI tools, or you can use the command line or a desktop tool to check out the required artifacts from GitHub.
AWS CDK – The AWS CDK is an open source software development framework that helps you define your cloud application resources by using familiar programming languages.
Epics
Task | Description | Skills required |
---|---|---|
Configure the AWS CLI. | Configure the AWS Command Line Interface (AWS CLI) to target and authenticate with your current AWS account. For instructions, see the AWS CLI documentation. | Developer, DevOps engineer |
Extract the sample project files. | Extract the files from the attachment to create a folder that contains the sample project files. | Developer, DevOps engineer |
Deploy the sample code. | After you extract the files, run the deployment commands from the extract location to create a baseline example. After the last command, you can monitor the status of the pipeline and the AWS Glue job. | Developer, DevOps engineer |
Customize the code. | Customize the code for the etl.py file in accordance with your business requirements. You can revise the ETL code, modify the pipeline stages, or extend the solution. | Data engineer |
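
The monitoring mentioned in the deployment task can also be done programmatically. The following is a rough sketch that assumes boto3 CodePipeline and Glue clients are passed in. The function name `summarize_status` and the `"NotStarted"` placeholder are hypothetical, and the pipeline and job names depend on your deployment.

```python
def summarize_status(codepipeline, glue, pipeline_name, job_name):
    """Return the latest status of each pipeline stage and of the most
    recently started Glue job run.

    `codepipeline` and `glue` are assumed to be boto3 clients.
    """
    state = codepipeline.get_pipeline_state(name=pipeline_name)
    stages = {
        # A stage that has never run has no latestExecution entry;
        # "NotStarted" is a placeholder chosen for this sketch.
        stage["stageName"]: stage.get("latestExecution", {}).get("status", "NotStarted")
        for stage in state["stageStates"]
    }

    # Only the first page of job runs is inspected in this sketch.
    runs = glue.get_job_runs(JobName=job_name)["JobRuns"]
    latest = max(runs, key=lambda run: run["StartedOn"]) if runs else None

    return {
        "stages": stages,
        "latest_glue_run": latest["JobRunState"] if latest else None,
    }
```

With real clients, a call might look like `summarize_status(boto3.client("codepipeline"), boto3.client("glue"), pipeline_name, job_name)`, where both names come from the deployed stack.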
Related resources
Attachments
To access additional content that is associated with this document, unzip the following file: attachment.zip