Run unit tests for Python ETL jobs in AWS Glue using the pytest framework - AWS Prescriptive Guidance

Run unit tests for Python ETL jobs in AWS Glue using the pytest framework

Created by Praveen Kumar Jeyarajan (AWS) and Vaidy Sankaran (AWS)

Code repository: aws-glue-jobs-unit-testing

Environment: Production

Technologies: DevOps; Big data; Software development & testing

AWS services: AWS CloudFormation; AWS CodeBuild; AWS CodeCommit; AWS CodePipeline; AWS Glue

Summary

You can run unit tests for Python extract, transform, and load (ETL) jobs for AWS Glue in a local development environment, but replicating those tests in a DevOps pipeline can be difficult and time-consuming. Unit testing can be especially challenging when you’re modernizing mainframe ETL processes on AWS technology stacks. This pattern shows you how to simplify unit testing while keeping existing functionality intact, avoiding disruptions to key application functionality when you release new features, and maintaining high-quality software. You can use the steps and code samples in this pattern to run unit tests for Python ETL jobs in AWS Glue by using the pytest framework in AWS CodePipeline. You can also use this pattern to test and deploy multiple AWS Glue jobs.

Prerequisites and limitations

Prerequisites

  • An active AWS account

  • An Amazon Elastic Container Registry (Amazon ECR) image URI for your AWS Glue library, downloaded from the Amazon ECR Public Gallery

  • Bash terminal (on any operating system) with a profile for the target AWS account and AWS Region

  • Python 3.9 or later

  • Pytest

Architecture

Technology stack

  • Amazon Elastic Container Registry (Amazon ECR)

  • AWS CodeBuild

  • AWS CodeCommit

  • AWS CodePipeline

  • AWS Glue

  • Pytest

  • Python

  • Python ETL library for AWS Glue

Target architecture

The following diagram describes how to incorporate unit testing for AWS Glue ETL processes that are based on Python into a typical enterprise-scale AWS DevOps pipeline.

Unit testing for AWS Glue ETL processes

The diagram shows the following workflow:

  1. In the source stage, CodePipeline uses a CodeCommit repository for source code, including a sample Python ETL job (sample.py), a unit test file (test_sample.py), and an AWS CloudFormation template. Then, CodePipeline transfers the most recent code from the main branch to the CodeBuild project for further processing.

  2. In the build and publish stage, the most recent code from the previous source stage is unit tested with the help of an AWS Glue public Amazon ECR image. Then, the test report is published to CodeBuild report groups. The container image in the public Amazon ECR repository for AWS Glue libraries includes all the binaries required to run and unit test PySpark-based ETL tasks in AWS Glue locally. The public container repository has three image tags, one for each version supported by AWS Glue. For demonstration purposes, this pattern uses the glue_libs_3.0.0_image_01 image tag. To use this container image as a runtime image in CodeBuild, copy the image URI that corresponds to the image tag that you intend to use, and then update the TestBuild resource in the pipeline.yml file from the GitHub repository.

  3. In the deploy stage, the CodeBuild project is launched and it publishes the code to an Amazon Simple Storage Service (Amazon S3) bucket if all the tests pass.

  4. The user deploys the AWS Glue task by using the CloudFormation template in the deploy folder.
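The gate between the build and deploy stages ("publish if all the tests pass") can be illustrated with a short sketch. It assumes the unit tests were run with pytest's --junitxml option so that a JUnit-style XML report exists; the sample report and the function names are hypothetical, and in the pattern itself the gating is handled by CodeBuild and CodePipeline rather than by custom code.

```python
import xml.etree.ElementTree as ET

# Hypothetical report, shaped like the JUnit XML that `pytest --junitxml=...` writes.
SAMPLE_REPORT = """<testsuites>
  <testsuite name="pytest" tests="3" failures="1" errors="0" skipped="0">
    <testcase classname="tests.test_sample" name="test_row_count"/>
    <testcase classname="tests.test_sample" name="test_schema"/>
    <testcase classname="tests.test_sample" name="test_null_handling">
      <failure message="assert 2 == 3">assert 2 == 3</failure>
    </testcase>
  </testsuite>
</testsuites>"""


def summarize_junit(xml_text):
    """Add up the test counters across every <testsuite> in the report."""
    root = ET.fromstring(xml_text)
    # The root element is either <testsuites> or a single <testsuite>.
    suites = [root] if root.tag == "testsuite" else root.findall("testsuite")
    totals = {"tests": 0, "failures": 0, "errors": 0, "skipped": 0}
    for suite in suites:
        for key in totals:
            totals[key] += int(suite.get(key, 0))
    return totals


def should_deploy(totals):
    """Deploy only if at least one test ran and none failed or errored."""
    return totals["tests"] > 0 and totals["failures"] == 0 and totals["errors"] == 0


if __name__ == "__main__":
    totals = summarize_junit(SAMPLE_REPORT)
    print(totals, "deploy" if should_deploy(totals) else "stop")
```

Requiring at least one executed test is a deliberate choice in this sketch: an empty report would otherwise count as a pass.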

Tools

AWS tools

  • Amazon Elastic Container Registry (Amazon ECR) is a managed container image registry service that’s secure, scalable, and reliable.

  • AWS CodeBuild is a fully managed build service that helps you compile source code, run unit tests, and produce artifacts that are ready to deploy.

  • AWS CodeCommit is a version control service that helps you privately store and manage Git repositories, without needing to manage your own source control system.

  • AWS CodePipeline helps you quickly model and configure the different stages of a software release and automate the steps required to release software changes continuously.

  • AWS Glue is a fully managed ETL service. It helps you reliably categorize, clean, enrich, and move data between data stores and data streams.

Other tools

  • Pytest is a framework for writing small unit tests that scale to support complex functional testing for applications and libraries.

  • Python is a high-level, interpreted, general-purpose programming language.

  • Python ETL library for AWS Glue is a repository for Python libraries that are used in the local development of PySpark batch jobs for AWS Glue.

Code

The code for this pattern is available in the GitHub aws-glue-jobs-unit-testing repository. The repository includes the following resources:

  • A sample Python-based AWS Glue job in the src folder

  • Associated unit test cases (built using the pytest framework) in the tests folder

  • A CloudFormation template (written in YAML) in the deploy folder
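The repository's sample job and tests are not reproduced here. As a rough illustration of the layout, the following sketch pairs a small transform with pytest-style tests. It is an assumption-laden stand-in: it uses plain Python lists of dictionaries instead of Spark DataFrames or a GlueContext so that it runs anywhere, and the function names (filter_adults, count_by_country) are hypothetical.

```python
# Stand-in for a job module such as src/sample.py (hypothetical contents).
def filter_adults(records, min_age=18):
    """Keep records whose 'age' is present and at least min_age."""
    return [r for r in records if r.get("age") is not None and r["age"] >= min_age]


def count_by_country(records):
    """Count records per 'country' value."""
    counts = {}
    for record in records:
        counts[record["country"]] = counts.get(record["country"], 0) + 1
    return counts


# Stand-in for a test module such as tests/test_sample.py (hypothetical contents).
# pytest collects functions named test_* and treats a failed assert as a failure.
ROWS = [
    {"name": "Ana", "age": 34, "country": "BR"},
    {"name": "Ben", "age": 17, "country": "US"},
    {"name": "Cy", "age": None, "country": "US"},
    {"name": "Dee", "age": 18, "country": "US"},
]


def test_filter_adults_drops_minors_and_missing_ages():
    names = [r["name"] for r in filter_adults(ROWS)]
    assert names == ["Ana", "Dee"]


def test_count_by_country_counts_filtered_rows():
    assert count_by_country(filter_adults(ROWS)) == {"BR": 1, "US": 1}
```

Keeping the transform logic in plain functions, separate from job setup, is what makes it possible to test without a running AWS Glue job.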

Best practices

Security for CodePipeline resources

It’s a best practice to use encryption and authentication for the source repositories that connect to your pipelines in CodePipeline. For more information, see Security best practices in the CodePipeline documentation.

Monitoring and logging for CodePipeline resources

It’s a best practice to use AWS logging features to determine what actions users take in your account and what resources they use. The log files show the following:

  • Time and date of actions

  • Source IP address of actions

  • Which actions failed due to inadequate permissions

Logging features are available in AWS CloudTrail and Amazon CloudWatch Events. You can use CloudTrail to log AWS API calls and related events made by or on behalf of your AWS account. For more information, see Logging CodePipeline API calls with AWS CloudTrail in the CodePipeline documentation.

You can use CloudWatch Events to monitor your AWS Cloud resources and applications running on AWS. You can also create alerts in CloudWatch Events. For more information, see Monitoring CodePipeline events in the CodePipeline documentation.
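As one possible illustration of such an alert, the sketch below builds an event pattern that matches failed pipeline executions. The pipeline name is the one this pattern mentions; the helper names are hypothetical, and the matches function is a deliberately simplified local check, not a full implementation of event pattern matching.

```python
import json


def failed_execution_pattern(pipeline_name):
    """Event pattern for failed executions of one pipeline."""
    return {
        "source": ["aws.codepipeline"],
        "detail-type": ["CodePipeline Pipeline Execution State Change"],
        "detail": {"state": ["FAILED"], "pipeline": [pipeline_name]},
    }


def matches(pattern, event):
    """Simplified matcher: every pattern key must list the event's value."""
    for key, allowed in pattern.items():
        if isinstance(allowed, dict):
            if not matches(allowed, event.get(key, {})):
                return False
        elif event.get(key) not in allowed:
            return False
    return True


if __name__ == "__main__":
    pattern = failed_execution_pattern("aws-glue-unit-test-pipeline")
    # The JSON form is what an event rule definition would carry.
    print(json.dumps(pattern, indent=2))
```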

Epics

Task | Description | Skills required

Prepare the code archive for deployment.

  1. Download code.zip from the GitHub aws-glue-jobs-unit-testing repository, or create the .zip file yourself by using a command-line tool. For example, you can create the .zip file on Linux or Mac by running the following commands in the terminal:

    git clone https://github.com/aws-samples/aws-glue-jobs-unit-testing.git
    cd aws-glue-jobs-unit-testing
    git checkout master
    zip -r code.zip src/ tests/ deploy/

  2. Sign in to the AWS Management Console and choose the AWS Region of your choice.

  3. Create an S3 bucket, and then upload the code.zip file (downloaded or created earlier) to that bucket.
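If the zip command-line tool is not available, the archive from step 1 could be produced with Python's standard library instead. This is a minimal sketch under that assumption; the function name build_code_zip is hypothetical, and unlike zip -r it stores only file entries, not directory entries.

```python
import zipfile
from pathlib import Path


def build_code_zip(repo_root, zip_path="code.zip", folders=("src", "tests", "deploy")):
    """Archive the given folders under repo_root, keeping repo-relative paths."""
    repo_root = Path(repo_root)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for folder in folders:
            base = repo_root / folder
            if not base.is_dir():
                raise FileNotFoundError(f"expected folder is missing: {base}")
            for path in sorted(base.rglob("*")):
                if path.is_file():
                    # Store src/sample.py rather than an absolute path.
                    archive.write(path, path.relative_to(repo_root).as_posix())
    return Path(zip_path)
```

Calling build_code_zip(".") from the cloned repository root would write code.zip to the current directory.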

DevOps engineer

Create the CloudFormation stack.

  1. Sign in to the AWS Management Console and then open the CloudFormation console.

  2. Choose Create stack, and then choose With new resources (standard).

  3. In the Specify template section of the Create stack page, choose Upload a template file, and then choose the pipeline.yml template (downloaded from the GitHub repository). Then, choose Next.

  4. For Stack name, enter glue-unit-testing-pipeline, or choose a stack name of your choice.

  5. For ApplicationStackName, use the prepopulated glue-codepipeline-app name. This is the name of the CloudFormation stack that’s created by the pipeline.

  6. For BranchName, use the prepopulated master name. This is the name of the branch created in the CodeCommit repository to check in the code from the .zip file in the S3 bucket.

  7. For BucketName, use the prepopulated aws-glue-artifacts-us-east-1 bucket name. This is the name of the S3 bucket that contains the .zip file and is used by the pipeline to store code artifacts.

  8. For CodeZipFile, use the prepopulated code.zip value. This is the key name of the sample code S3 object. The object should be a .zip file.

  9. For RepositoryName, use the prepopulated aws-glue-unit-testing name. This is the name of the CodeCommit repository that’s created by the stack.

  10. For TestReportGroupName, use the prepopulated glue-unittest-report name. This is the name of the CodeBuild test report group that’s created to store the unit test reports.

  11. Choose Next, and then choose Next again on the Configure stack options page.

  12. On the Review page, under Capabilities, choose the I acknowledge that CloudFormation might create IAM resources with custom names option.

  13. Choose Submit. After the stack creation is complete, you can see the created resources on the Resources tab. The stack creation takes approximately 5-7 minutes.

The stack automatically creates a CodeCommit repository and checks in the initial code from the .zip file that you uploaded to the S3 bucket. The stack also creates a pipeline in CodePipeline that uses the CodeCommit repository as the source. In the steps above, the CodeCommit repository is aws-glue-unit-test, and the pipeline is aws-glue-unit-test-pipeline.
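The console steps above could also be scripted. The following sketch assumes boto3 is installed and that the caller passes in a client created with boto3.client("cloudformation"); the helper names are hypothetical. The parameter keys and default values are the ones listed in the steps, and CAPABILITY_NAMED_IAM corresponds to the acknowledgment about IAM resources with custom names.

```python
# Defaults as prepopulated in the pipeline.yml template (per the steps above).
DEFAULT_PARAMETERS = {
    "ApplicationStackName": "glue-codepipeline-app",
    "BranchName": "master",
    "BucketName": "aws-glue-artifacts-us-east-1",
    "CodeZipFile": "code.zip",
    "RepositoryName": "aws-glue-unit-testing",
    "TestReportGroupName": "glue-unittest-report",
}


def build_parameters(overrides=None):
    """Merge overrides into the defaults and return CloudFormation's list format."""
    overrides = overrides or {}
    unknown = set(overrides) - set(DEFAULT_PARAMETERS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    values = {**DEFAULT_PARAMETERS, **overrides}
    return [{"ParameterKey": k, "ParameterValue": v} for k, v in values.items()]


def create_pipeline_stack(cfn, template_body, stack_name="glue-unit-testing-pipeline",
                          overrides=None):
    """Create the stack; cfn is a CloudFormation client supplied by the caller."""
    return cfn.create_stack(
        StackName=stack_name,
        TemplateBody=template_body,
        Parameters=build_parameters(overrides),
        Capabilities=["CAPABILITY_NAMED_IAM"],
    )
```

Because S3 bucket names are globally unique, BucketName would most likely need to be overridden with the name of the bucket that holds code.zip.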

AWS DevOps, DevOps engineer

Clean up the resources in your environment.

To avoid additional infrastructure costs, make sure that you delete the stack after experimenting with the examples provided in this pattern.

  1. Open the CloudFormation console, and then select the stack that you created.

  2. Choose Delete. This deletes all the resources that your stack created, including CodeCommit repositories, AWS Identity and Access Management (IAM) roles or policies, and CodeBuild projects.
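The same cleanup could be done from a script. This sketch assumes a boto3 CloudFormation client is passed in by the caller; the function name is hypothetical.

```python
def delete_stack_and_wait(cfn, stack_name):
    """Request stack deletion, then block until CloudFormation reports completion."""
    cfn.delete_stack(StackName=stack_name)
    # The waiter polls the stack status until deletion finishes or fails.
    cfn.get_waiter("stack_delete_complete").wait(StackName=stack_name)
```

For example, delete_stack_and_wait(boto3.client("cloudformation"), "glue-unit-testing-pipeline") would remove the stack created earlier, assuming the default stack name was used.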

AWS DevOps, DevOps engineer
Task | Description | Skills required

Run the unit tests in the pipeline.

  1. To test the deployed pipeline, sign in to the AWS Management Console, and then open the CodePipeline console.

  2. Select the pipeline created by the CloudFormation stack, and then choose Release change. The pipeline starts running (using the most recent code in the CodeCommit repository).

  3. After the Test_and_Build phase is finished, choose the Details tab, and then examine the logs.

  4. Choose the Reports tab, and then choose the test report from Report history to view the unit test results.

  5. After the deployment stage is complete, run and monitor the deployed AWS Glue job on the AWS Glue console. For more information, see Monitoring AWS Glue in the AWS Glue documentation.
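Choosing Release change in the console has a scripted counterpart. The sketch below assumes a boto3 CodePipeline client is passed in by the caller; the helper names are hypothetical.

```python
def release_change(codepipeline, pipeline_name):
    """Start a new pipeline execution and return its ID."""
    response = codepipeline.start_pipeline_execution(name=pipeline_name)
    return response["pipelineExecutionId"]


def execution_status(codepipeline, pipeline_name, execution_id):
    """Return the status string of one execution, for example InProgress or Succeeded."""
    response = codepipeline.get_pipeline_execution(
        pipelineName=pipeline_name,
        pipelineExecutionId=execution_id,
    )
    return response["pipelineExecution"]["status"]
```

A script could call release_change once and then poll execution_status until the pipeline leaves the InProgress state. The unit test report itself is still easiest to read on the Reports tab described in step 4.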

AWS DevOps, DevOps engineer

Troubleshooting

Issue | Solution

A pipeline with an Amazon S3, Amazon ECR, or CodeCommit source no longer starts automatically

If you change any configuration settings for an action that uses event rules in Amazon EventBridge or CloudWatch Events for change detection, the AWS Management Console might not detect a change where source identifiers are similar and have identical initial characters. Because the new event rule is not created by the console, the pipeline no longer starts automatically.

For example, changing a CodeCommit branch name from MyTestBranch-1 to MyTestBranch-2 is a minor change. Because the change is at the end of the branch name, the event rule for the source action might not update or create a rule for the new source settings.

This applies to the following source actions that use events in CloudWatch Events for change detection:

  • The S3 bucket name and S3 object key parameters or console identifiers when the source action is in Amazon S3

  • The repository name and image tag parameters or console identifiers when the source action is in Amazon ECR

  • The repository name and branch name parameters or console identifiers when the source action is in CodeCommit

To resolve the issue, do one of the following:

  • Change the configuration settings in Amazon S3, Amazon ECR, or CodeCommit, so that changes are made to the starting portion of the parameter value. For example, change your branch name from release-branch to 2nd-release-branch. Avoid a change at the end of the name, such as release-branch-2.

  • Change the configuration settings in Amazon S3, Amazon ECR, or CodeCommit for each pipeline. For example, change your branch name from myRepo/myBranch to myDeployRepo/myDeployBranch. Avoid a change at the end of the name, such as myRepo/myBranch2.

  • Instead of using the AWS Management Console, use the AWS Command Line Interface (AWS CLI) or AWS CloudFormation to create and update your change-detection event rules. For instructions on creating event rules for an Amazon S3 source action, see Amazon S3 source actions and CloudWatch Events. For instructions on creating event rules for an Amazon ECR action, see Amazon ECR source actions and CloudWatch Events. For instructions on creating event rules for a CodeCommit action, see CodeCommit source actions and CloudWatch Events. After you edit your action configuration in the console, accept the updated change-detection resources created by the console.
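To make the third option more concrete, the sketch below builds the kind of event pattern a change-detection rule for a CodeCommit source could use. The function name is hypothetical, and the account ID, Region, and branch names are placeholders. Generating the pattern from the exact branch name means that release-branch and release-branch-2 always yield distinct rules.

```python
import json


def codecommit_branch_pattern(region, account_id, repository, branch):
    """Event pattern that matches pushes to one branch of one repository."""
    return {
        "source": ["aws.codecommit"],
        "detail-type": ["CodeCommit Repository State Change"],
        "resources": [f"arn:aws:codecommit:{region}:{account_id}:{repository}"],
        "detail": {
            "event": ["referenceCreated", "referenceUpdated"],
            "referenceType": ["branch"],
            "referenceName": [branch],
        },
    }


if __name__ == "__main__":
    # Placeholder values; substitute your own Region, account ID, and names.
    pattern = codecommit_branch_pattern(
        "us-east-1", "111122223333", "aws-glue-unit-testing", "release-branch-2"
    )
    print(json.dumps(pattern, indent=2))
```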

Related resources