
Developing and testing AWS Glue job scripts locally

When you develop and test your AWS Glue for Spark job scripts, there are multiple available options:

  • AWS Glue Studio console

    • Visual editor

    • Script editor

    • AWS Glue Studio notebook

  • Interactive sessions

    • Jupyter notebook

  • Docker image

    • Local development

    • Remote development

  • AWS Glue ETL library

    • Local development

You can choose any of the above options based on your requirements.

If you prefer a no-code or low-code experience, the AWS Glue Studio visual editor is a good choice.

If you prefer an interactive notebook experience, AWS Glue Studio notebook is a good choice. For more information, see Using Notebooks with AWS Glue Studio and AWS Glue. If you want to use your own local environment, interactive sessions is a good choice. For more information, see Using interactive sessions with AWS Glue.

If you prefer a local or remote development experience, the Docker image is a good choice. It helps you develop and test AWS Glue for Spark job scripts anywhere you prefer without incurring AWS Glue costs.

If you prefer local development without Docker, installing the AWS Glue ETL library locally is a good choice.

Developing using AWS Glue Studio

The AWS Glue Studio visual editor is a graphical interface that makes it easy to create, run, and monitor extract, transform, and load (ETL) jobs in AWS Glue. You can visually compose data transformation workflows and seamlessly run them on AWS Glue's Apache Spark-based serverless ETL engine. You can inspect the schema and data results in each step of the job. For more information, see the AWS Glue Studio User Guide.

Developing using interactive sessions

Interactive sessions allow you to build and test applications from the environment of your choice. For more information, see Using interactive sessions with AWS Glue.

Developing using a Docker image

Note

The instructions in this section have not been tested on Microsoft Windows operating systems.

For local development and testing on Windows platforms, see the blog Building an AWS Glue ETL pipeline locally without an AWS account.

For a production-ready data platform, the development process and CI/CD pipeline for AWS Glue jobs is a key topic. You can flexibly develop and test AWS Glue jobs in a Docker container. AWS Glue hosts Docker images on Docker Hub to set up your development environment with additional utilities. You can use your preferred IDE, notebook, or REPL with the AWS Glue ETL library. This topic describes how to develop and test AWS Glue version 4.0 jobs in a Docker container using a Docker image.

The following Docker images are available for AWS Glue on Docker Hub.

  • For AWS Glue version 4.0: amazon/aws-glue-libs:glue_libs_4.0.0_image_01

  • For AWS Glue version 3.0: amazon/aws-glue-libs:glue_libs_3.0.0_image_01

  • For AWS Glue version 2.0: amazon/aws-glue-libs:glue_libs_2.0.0_image_01

These images are for x86_64. It is recommended that you test on this architecture. However, it may be possible to rework a local development solution on unsupported base images.

This example describes using amazon/aws-glue-libs:glue_libs_4.0.0_image_01 and running the container on a local machine. This container image has been tested with AWS Glue version 4.0 Spark jobs. This image contains the following:

  • Amazon Linux

  • AWS Glue ETL library (aws-glue-libs)

  • Apache Spark 3.3.0

  • Spark history server

  • Jupyter Lab

  • Livy

  • Other library dependencies (the same set as those in the AWS Glue job system)

Complete one of the following sections according to your requirements:

  • Set up the container to use spark-submit

  • Set up the container to use REPL shell (PySpark)

  • Set up the container to use Pytest

  • Set up the container to use Jupyter Lab

  • Set up the container to use Visual Studio Code

Prerequisites

Before you start, make sure that Docker is installed and the Docker daemon is running. For installation instructions, see the Docker documentation for Mac or Linux. The machine running Docker hosts the AWS Glue container. Also make sure that you have at least 7 GB of disk space for the image on the host running Docker.
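As a quick sanity check (optional), you can confirm that the Docker daemon is reachable and see how much space Docker is already using before pulling the image:

$ docker info
$ docker system df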

For more information about restrictions when developing AWS Glue code locally, see Local development restrictions.

Configuring AWS

To enable AWS API calls from the container, set up AWS credentials by following these steps. The following sections use this AWS named profile.

  1. Set up the AWS CLI, configuring a named profile. For more information about AWS CLI configuration, see Configuration and credential file settings in the AWS CLI documentation.

  2. Run the following command in a terminal:

    PROFILE_NAME="<your_profile_name>"

You may also need to set the AWS_REGION environment variable to specify the AWS Region to send requests to.
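For example, you can set the Region in your terminal and forward it to the container by adding an extra -e flag to the docker run commands shown later. The Region value below is only a placeholder; use the Region where your resources live.

$ AWS_REGION=us-east-1

Then add -e AWS_REGION=$AWS_REGION next to the existing -e AWS_PROFILE=$PROFILE_NAME option.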

Setting up and running the container

Setting up the container to run PySpark code through the spark-submit command includes the following high-level steps:

  1. Pull the image from Docker Hub.

  2. Run the container.

Pulling the image from Docker Hub

Run the following command to pull the image from Docker Hub:

docker pull amazon/aws-glue-libs:glue_libs_4.0.0_image_01
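Optionally, verify that the image is available locally by listing the images for the repository:

docker images amazon/aws-glue-libs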

Running the container

You can now run a container using this image. You can choose any of the following based on your requirements.

spark-submit

You can run an AWS Glue job script by running the spark-submit command on the container.

  1. Write the script and save it as sample.py under the ${WORKSPACE_LOCATION}/src directory. Sample code is included as the appendix in this topic.

    $ WORKSPACE_LOCATION=/local_path_to_workspace
    $ SCRIPT_FILE_NAME=sample.py
    $ mkdir -p ${WORKSPACE_LOCATION}/src
    $ vim ${WORKSPACE_LOCATION}/src/${SCRIPT_FILE_NAME}
  2. Run the following command to execute the spark-submit command on the container to submit a new Spark application:

    $ docker run -it -v ~/.aws:/home/glue_user/.aws -v $WORKSPACE_LOCATION:/home/glue_user/workspace/ -e AWS_PROFILE=$PROFILE_NAME -e DISABLE_SSL=true --rm -p 4040:4040 -p 18080:18080 --name glue_spark_submit amazon/aws-glue-libs:glue_libs_4.0.0_image_01 spark-submit /home/glue_user/workspace/src/$SCRIPT_FILE_NAME
    ...
    22/01/26 09:08:55 INFO DAGScheduler: Job 0 finished: fromRDD at DynamicFrame.scala:305, took 3.639886 s
    root
    |-- family_name: string
    |-- name: string
    |-- links: array
    |    |-- element: struct
    |    |    |-- note: string
    |    |    |-- url: string
    |-- gender: string
    |-- image: string
    |-- identifiers: array
    |    |-- element: struct
    |    |    |-- scheme: string
    |    |    |-- identifier: string
    |-- other_names: array
    |    |-- element: struct
    |    |    |-- lang: string
    |    |    |-- note: string
    |    |    |-- name: string
    |-- sort_name: string
    |-- images: array
    |    |-- element: struct
    |    |    |-- url: string
    |-- given_name: string
    |-- birth_date: string
    |-- id: string
    |-- contact_details: array
    |    |-- element: struct
    |    |    |-- type: string
    |    |    |-- value: string
    |-- death_date: string
    ...
  3. (Optional) Configure spark-submit to match your environment. For example, you can pass your dependencies with the --jars configuration. For more information, see Launching Applications with spark-submit in the Spark documentation. A sketch of such a command follows this list.
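The following is only a sketch of such a command; the JAR name and its location under the mounted workspace are hypothetical and illustrate where the --jars option goes:

$ docker run -it -v ~/.aws:/home/glue_user/.aws -v $WORKSPACE_LOCATION:/home/glue_user/workspace/ -e AWS_PROFILE=$PROFILE_NAME -e DISABLE_SSL=true --rm -p 4040:4040 -p 18080:18080 --name glue_spark_submit amazon/aws-glue-libs:glue_libs_4.0.0_image_01 spark-submit --jars /home/glue_user/workspace/jars/my-dependency.jar /home/glue_user/workspace/src/$SCRIPT_FILE_NAME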

REPL shell (PySpark)

You can run the REPL (read-eval-print loop) shell for interactive development.

Run the following command to execute the PySpark command on the container to start the REPL shell:

$ docker run -it -v ~/.aws:/home/glue_user/.aws -e AWS_PROFILE=$PROFILE_NAME -e DISABLE_SSL=true --rm -p 4040:4040 -p 18080:18080 --name glue_pyspark amazon/aws-glue-libs:glue_libs_4.0.0_image_01 pyspark
...
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.1.1-amzn-0
      /_/

Using Python version 3.7.10 (default, Jun  3 2021 00:02:01)
Spark context Web UI available at http://56e99d000c99:4040
Spark context available as 'sc' (master = local[*], app id = local-1643011860812).
SparkSession available as 'spark'.
>>>
Pytest

For unit testing, you can use pytest for AWS Glue Spark job scripts.

Run the following commands for preparation.

$ WORKSPACE_LOCATION=/local_path_to_workspace
$ SCRIPT_FILE_NAME=sample.py
$ UNIT_TEST_FILE_NAME=test_sample.py
$ mkdir -p ${WORKSPACE_LOCATION}/tests
$ vim ${WORKSPACE_LOCATION}/tests/${UNIT_TEST_FILE_NAME}

Run the following command to execute pytest on the test suite:

$ docker run -it -v ~/.aws:/home/glue_user/.aws -v $WORKSPACE_LOCATION:/home/glue_user/workspace/ -e AWS_PROFILE=$PROFILE_NAME -e DISABLE_SSL=true --rm -p 4040:4040 -p 18080:18080 --name glue_pytest amazon/aws-glue-libs:glue_libs_4.0.0_image_01 -c "python3 -m pytest"
starting org.apache.spark.deploy.history.HistoryServer, logging to /home/glue_user/spark/logs/spark-glue_user-org.apache.spark.deploy.history.HistoryServer-1-5168f209bd78.out
============================================================= test session starts =============================================================
platform linux -- Python 3.7.10, pytest-6.2.3, py-1.11.0, pluggy-0.13.1
rootdir: /home/glue_user/workspace
plugins: anyio-3.4.0
collected 1 item

tests/test_sample.py .                                                                                                                   [100%]

============================================================== warnings summary ===============================================================
tests/test_sample.py::test_counts
  /home/glue_user/spark/python/pyspark/sql/context.py:79: DeprecationWarning: Deprecated in 3.0.0. Use SparkSession.builder.getOrCreate() instead.
    DeprecationWarning)

-- Docs: https://docs.pytest.org/en/stable/warnings.html
======================================================== 1 passed, 1 warning in 21.07s ========================================================
Jupyter Lab

You can start Jupyter for interactive development and ad-hoc queries on notebooks.

  1. Run the following command to start Jupyter Lab:

    $ JUPYTER_WORKSPACE_LOCATION=/local_path_to_workspace/jupyter_workspace/
    $ docker run -it -v ~/.aws:/home/glue_user/.aws -v $JUPYTER_WORKSPACE_LOCATION:/home/glue_user/workspace/jupyter_workspace/ -e AWS_PROFILE=$PROFILE_NAME -e DISABLE_SSL=true --rm -p 4040:4040 -p 18080:18080 -p 8998:8998 -p 8888:8888 --name glue_jupyter_lab amazon/aws-glue-libs:glue_libs_4.0.0_image_01 /home/glue_user/jupyter/jupyter_start.sh
    ...
    [I 2022-01-24 08:19:21.368 ServerApp] Serving notebooks from local directory: /home/glue_user/workspace/jupyter_workspace
    [I 2022-01-24 08:19:21.368 ServerApp] Jupyter Server 1.13.1 is running at:
    [I 2022-01-24 08:19:21.368 ServerApp] http://faa541f8f99f:8888/lab
    [I 2022-01-24 08:19:21.368 ServerApp]  or http://127.0.0.1:8888/lab
    [I 2022-01-24 08:19:21.368 ServerApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
  2. Open http://127.0.0.1:8888/lab in your web browser on your local machine to see the Jupyter Lab UI.

    The Jupyter Lab UI.
  3. Choose Glue Spark Local (PySpark) under Notebook. You can start developing code in the interactive Jupyter notebook UI.

    Developing code in the notebook.
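When you are done, you can stop the container from another terminal; because it was started with --rm, the container is removed automatically once it stops:

$ docker stop glue_jupyter_lab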

Setting up the container to use Visual Studio Code

Prerequisites:

  1. Install Visual Studio Code.

  2. Install Python.

  3. Install Visual Studio Code Remote - Containers

  4. Open the workspace folder in Visual Studio Code.

  5. Choose Settings.

  6. Choose Workspace.

  7. Choose Open Settings (JSON).

  8. Paste the following JSON and save it.

    { "python.defaultInterpreterPath": "/usr/bin/python3", "python.analysis.extraPaths": [ "/home/glue_user/aws-glue-libs/PyGlue.zip:/home/glue_user/spark/python/lib/py4j-0.10.9-src.zip:/home/glue_user/spark/python/", ] }

Steps:

  1. Run the Docker container.

    $ docker run -it -v ~/.aws:/home/glue_user/.aws -v $WORKSPACE_LOCATION:/home/glue_user/workspace/ -e AWS_PROFILE=$PROFILE_NAME -e DISABLE_SSL=true --rm -p 4040:4040 -p 18080:18080 --name glue_pyspark amazon/aws-glue-libs:glue_libs_4.0.0_image_01 pyspark
  2. Start Visual Studio Code.

  3. Choose Remote Explorer on the left menu, and choose amazon/aws-glue-libs:glue_libs_4.0.0_image_01.

    The library in Visual Studio Code.
  4. Right-click and choose Attach to Container. If a dialog is shown, choose Got it.

  5. Open /home/glue_user/workspace/.

  6. Create a Glue PySpark script and choose Run.

    You will see the successful run of the script.

    The successful run of the script.

Appendix: AWS Glue job sample code for testing

This appendix provides scripts as AWS Glue job sample code for testing purposes.

sample.py: Sample code to utilize the AWS Glue ETL library with an Amazon S3 API call

import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions


class GluePythonSampleTest:
    def __init__(self):
        params = []
        if '--JOB_NAME' in sys.argv:
            params.append('JOB_NAME')
        args = getResolvedOptions(sys.argv, params)

        self.context = GlueContext(SparkContext.getOrCreate())
        self.job = Job(self.context)

        if 'JOB_NAME' in args:
            jobname = args['JOB_NAME']
        else:
            jobname = "test"
        self.job.init(jobname, args)

    def run(self):
        dyf = read_json(self.context, "s3://awsglue-datasets/examples/us-legislators/all/persons.json")
        dyf.printSchema()

        self.job.commit()


def read_json(glue_context, path):
    dynamicframe = glue_context.create_dynamic_frame.from_options(
        connection_type='s3',
        connection_options={
            'paths': [path],
            'recurse': True
        },
        format='json'
    )
    return dynamicframe


if __name__ == '__main__':
    GluePythonSampleTest().run()

The above code requires Amazon S3 permissions in AWS IAM. You need to grant the IAM managed policy arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess or an IAM custom policy that allows you to call ListBucket and GetObject for the Amazon S3 path.
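As a minimal sketch, assuming the named profile maps to an IAM user called glue-local-dev (a hypothetical name), you could attach the managed policy with the AWS CLI:

$ aws iam attach-user-policy --user-name glue-local-dev --policy-arn arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess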

test_sample.py: Sample code for unit test of sample.py.

import pytest
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
import sys
from src import sample


@pytest.fixture(scope="module", autouse=True)
def glue_context():
    sys.argv.append('--JOB_NAME')
    sys.argv.append('test_count')

    args = getResolvedOptions(sys.argv, ['JOB_NAME'])
    context = GlueContext(SparkContext.getOrCreate())
    job = Job(context)
    job.init(args['JOB_NAME'], args)

    yield context

    job.commit()


def test_counts(glue_context):
    dyf = sample.read_json(glue_context, "s3://awsglue-datasets/examples/us-legislators/all/persons.json")
    assert dyf.toDF().count() == 1961

Developing using the AWS Glue ETL library

The AWS Glue ETL library is available in a public Amazon S3 bucket, and can be consumed by the Apache Maven build system. This enables you to develop and test your Python and Scala extract, transform, and load (ETL) scripts locally, without the need for a network connection. Local development with the Docker image is recommended, as it provides an environment properly configured for the use of this library.

Local development is available for all AWS Glue versions, including AWS Glue versions 0.9, 1.0, 2.0, and later. For information about the versions of Python and Apache Spark that are available with AWS Glue, see the Glue version job property.

The library is released under the Amazon Software License (https://aws.amazon.com/asl).

Local development restrictions

Keep the following restrictions in mind when using the AWS Glue Scala library to develop locally.

Developing locally with Python

Complete some prerequisite steps and then use AWS Glue utilities to test and submit your Python ETL script.

Prerequisites for local Python development

Complete these steps to prepare for local Python development:

  1. Clone the AWS Glue Python repository from GitHub (https://github.com/awslabs/aws-glue-libs).

  2. Do one of the following:

    • For AWS Glue version 0.9, check out branch glue-0.9.

    • For AWS Glue version 1.0, check out branch glue-1.0. All versions above AWS Glue 0.9 support Python 3.

    • For AWS Glue version 2.0, check out branch glue-2.0.

    • For AWS Glue version 3.0, check out branch glue-3.0.

    • For AWS Glue version 4.0, check out the master branch.

  3. Install Apache Maven from the following location: https://aws-glue-etl-artifacts.s3.amazonaws.com/glue-common/apache-maven-3.6.0-bin.tar.gz.

  4. Install the Apache Spark distribution from one of the following locations:

  5. Export the SPARK_HOME environment variable, setting it to the root location extracted from the Spark archive. For example:

    • For AWS Glue version 0.9: export SPARK_HOME=/home/$USER/spark-2.2.1-bin-hadoop2.7

    • For AWS Glue versions 1.0 and 2.0: export SPARK_HOME=/home/$USER/spark-2.4.3-bin-spark-2.4.3-bin-hadoop2.8

    • For AWS Glue version 3.0: export SPARK_HOME=/home/$USER/spark-3.1.1-amzn-0-bin-3.2.1-amzn-3

    • For AWS Glue version 4.0: export SPARK_HOME=/home/$USER/spark-3.3.0-amzn-1-bin-3.3.3-amzn-0
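To confirm that SPARK_HOME points at a working distribution, you can print the Spark version; the exact version string depends on the archive you installed:

$ $SPARK_HOME/bin/spark-submit --version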

Running your Python ETL script

With the AWS Glue jar files available for local development, you can run the AWS Glue Python package locally.

Use the following utilities and frameworks to test and run your Python script. The commands in the following list are run from the root directory of the AWS Glue Python package.

  • AWS Glue Shell (./bin/gluepyspark): Enter and run Python scripts in a shell that integrates with AWS Glue ETL libraries.

  • AWS Glue Submit (./bin/gluesparksubmit): Submit a complete Python script for execution.

  • Pytest (./bin/gluepytest): Write and run unit tests of your Python code. The pytest module must be installed and available in the PATH. For more information, see the pytest documentation.
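For example, the following is a sketch of submitting the appendix sample script with gluesparksubmit; the script path and the job name are assumptions about your local layout:

$ ./bin/gluesparksubmit /local_path_to_workspace/src/sample.py --JOB_NAME sample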

Developing locally with Scala

Complete some prerequisite steps and then issue a Maven command to run your Scala ETL script locally.

Prerequisites for local Scala development

Complete these steps to prepare for local Scala development.

Step 1: Install software

In this step, you install software and set the required environment variable.

  1. Install Apache Maven from the following location: https://aws-glue-etl-artifacts.s3.amazonaws.com/glue-common/apache-maven-3.6.0-bin.tar.gz.

  2. Install the Apache Spark distribution from one of the following locations:

  3. Export the SPARK_HOME environment variable, setting it to the root location extracted from the Spark archive. For example:

    • For AWS Glue version 0.9: export SPARK_HOME=/home/$USER/spark-2.2.1-bin-hadoop2.7

    • For AWS Glue versions 1.0 and 2.0: export SPARK_HOME=/home/$USER/spark-2.4.3-bin-spark-2.4.3-bin-hadoop2.8

    • For AWS Glue version 3.0: export SPARK_HOME=/home/$USER/spark-3.1.1-amzn-0-bin-3.2.1-amzn-3

    • For AWS Glue version 4.0: export SPARK_HOME=/home/$USER/spark-3.3.0-amzn-1-bin-3.3.3-amzn-0

Step 2: Configure your Maven project

Use the following pom.xml file as a template for your AWS Glue Scala applications. It contains the required dependencies, repositories, and plugins elements. Replace the Glue version string with one of the following:

  • 4.0.0 for AWS Glue version 4.0

  • 3.0.0 for AWS Glue version 3.0

  • 1.0.0 for AWS Glue version 1.0 or 2.0

  • 0.9.0 for AWS Glue version 0.9

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.amazonaws</groupId>
    <artifactId>AWSGlueApp</artifactId>
    <version>1.0-SNAPSHOT</version>
    <name>${project.artifactId}</name>
    <description>AWS ETL application</description>
    <properties>
        <scala.version>2.11.1 for AWS Glue 2.0 or below, 2.12.7 for AWS Glue 3.0 and 4.0</scala.version>
        <glue.version>Glue version with three numbers (as mentioned earlier)</glue.version>
    </properties>
    <dependencies>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>${scala.version}</version>
            <!-- A "provided" dependency, this will be ignored when you package your application -->
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>com.amazonaws</groupId>
            <artifactId>AWSGlueETL</artifactId>
            <version>${glue.version}</version>
            <!-- A "provided" dependency, this will be ignored when you package your application -->
            <scope>provided</scope>
        </dependency>
    </dependencies>
    <repositories>
        <repository>
            <id>aws-glue-etl-artifacts</id>
            <url>https://aws-glue-etl-artifacts.s3.amazonaws.com/release/</url>
        </repository>
    </repositories>
    <build>
        <sourceDirectory>src/main/scala</sourceDirectory>
        <plugins>
            <plugin>
                <!-- see http://davidb.github.com/scala-maven-plugin -->
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <version>3.4.0</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.codehaus.mojo</groupId>
                <artifactId>exec-maven-plugin</artifactId>
                <version>1.6.0</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>java</goal>
                        </goals>
                    </execution>
                </executions>
                <configuration>
                    <systemProperties>
                        <systemProperty>
                            <key>spark.master</key>
                            <value>local[*]</value>
                        </systemProperty>
                        <systemProperty>
                            <key>spark.app.name</key>
                            <value>localrun</value>
                        </systemProperty>
                        <systemProperty>
                            <key>org.xerial.snappy.lib.name</key>
                            <value>libsnappyjava.jnilib</value>
                        </systemProperty>
                    </systemProperties>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-enforcer-plugin</artifactId>
                <version>3.0.0-M2</version>
                <executions>
                    <execution>
                        <id>enforce-maven</id>
                        <goals>
                            <goal>enforce</goal>
                        </goals>
                        <configuration>
                            <rules>
                                <requireMavenVersion>
                                    <version>3.5.3</version>
                                </requireMavenVersion>
                            </rules>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
            <!-- The shade plugin will be helpful in building a uberjar or fatjar.
                 You can use this jar in the AWS Glue runtime environment. For more information, see
                 https://maven.apache.org/plugins/maven-shade-plugin/ -->
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>3.2.4</version>
                <configuration>
                    <!-- any other shade configurations -->
                </configuration>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>
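With this pom.xml in place, one common way to produce the fat JAR mentioned in the shade plugin comment is the standard Maven package phase, run from the project root. The shaded JAR is written to the target/ directory and can then be used in the AWS Glue runtime environment.

mvn clean package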

Running your Scala ETL script

Run the following command from the Maven project root directory to run your Scala ETL script.

mvn exec:java -Dexec.mainClass="mainClass" -Dexec.args="--JOB-NAME jobName"

Replace mainClass with the fully qualified class name of the script's main class. Replace jobName with the desired job name.
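For example, assuming a hypothetical main class com.example.GlueApp and a job name of test, the command would look like the following:

mvn exec:java -Dexec.mainClass="com.example.GlueApp" -Dexec.args="--JOB-NAME test"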

Configuring a test environment

For examples of configuring a local test environment, see the following blog articles:

If you want to use development endpoints or notebooks for testing your ETL scripts, see Developing scripts using development endpoints.

Note

Development endpoints are not supported for use with AWS Glue version 2.0 jobs. For more information, see Running Spark ETL Jobs with Reduced Startup Times.