AWS Glue
Developer Guide

Tutorial: Set Up a REPL Shell to Author ETL Scripts Interactively

The goal of this tutorial is to create an environment where extract, transform, and load (ETL) scripts can be easily written and then ported to run as AWS Glue jobs. In AWS Glue, you can create a development endpoint and then use a REPL (Read–Eval–Print Loop) shell to test code in your AWS Glue scripts. This scenario demonstrates the basic setup of the environment.


Step 1: To Create a REPL Shell

  1. In the AWS Glue console, navigate to the development endpoints list. Choose the development endpoint name to open its details page.

    You use these details to open a REPL shell from your development endpoint.

  2. On your local computer, open a terminal window that can run SSH commands. Navigate to the folder where you downloaded the private SSH key that is associated with your development endpoint. This key corresponds to the public SSH key that you used when you created or updated the development endpoint. To protect your private key from accidental overwriting, type the following:

    chmod 400 private-key

    For example:

    chmod 400 my-name.pem
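
If you prefer to set the permission from a script instead of the shell, the same effect as chmod 400 can be sketched with the Python standard library. This is a minimal illustration only: the helper name make_read_only_for_owner is hypothetical, and a temporary file stands in for your real private key.

```python
import os
import stat
import tempfile


def make_read_only_for_owner(path):
    """Set the file mode to 0o400 (owner read-only), like `chmod 400 path`."""
    os.chmod(path, stat.S_IRUSR)
    # Return only the permission bits so the caller can verify them.
    return stat.S_IMODE(os.stat(path).st_mode)


# A throwaway file stands in for the private key (assumption for the demo).
with tempfile.NamedTemporaryFile(suffix=".pem", delete=False) as handle:
    demo_key_path = handle.name

mode = make_read_only_for_owner(demo_key_path)
print(oct(mode))

# Clean up the demo file.
os.remove(demo_key_path)
```

On a POSIX system, the printed mode should match the 400 used in the chmod command above.
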
  3. Connect to your development endpoint by using the SSH to REPL command shown on the endpoint's details page. Edit the command to specify the location and name of the private key file (private-key.pem) that allows you to connect to the development endpoint. This command starts gluepyspark, the shell in which you run AWS Glue commands.

    For example:

    ssh -i privatekey.pem glue@<development-endpoint-public-address> -t gluepyspark

    You see output similar to this:

    Python 2.7.12 (default, Sep 1 2016, 22:14:00)
    [GCC 4.8.3 20140911 (Red Hat 4.8.3-9)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    Setting default log level to "WARN".
    To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
    SLF4J: Class path contains multiple SLF4J bindings.
    SLF4J: Found binding in [jar:file:/usr/share/aws/glue/etl/jars/glue-assembly.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: Found binding in [jar:file:/usr/lib/spark/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: See for an explanation.
    SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
    17/07/27 17:25:15 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
    17/07/27 17:25:29 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /__ / .__/\_,_/_/ /_/\_\   version 2.1.0
          /_/

    Using Python version 2.7.12 (default, Sep 1 2016 22:14:00)
    SparkSession available as 'spark'.
    >>>

    For more information about the syntax of SSH, see SSH commands.
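
To make the parts of the connection command explicit, the following sketch assembles it from its components. It is illustrative only: the helper build_repl_ssh_command, the user name glue, and the address dev-endpoint.example.com are assumptions made for the example, so copy the actual command from your endpoint's details page.

```python
import shlex


def build_repl_ssh_command(key_path, endpoint_address, user="glue"):
    """Return the argument list for opening gluepyspark over SSH.

    The user name and address are placeholders (assumptions); use the
    values shown on the development endpoint's details page.
    """
    target = "{}@{}".format(user, endpoint_address)
    # -i selects the private key; -t allocates a terminal for the REPL.
    return ["ssh", "-i", key_path, target, "-t", "gluepyspark"]


command = build_repl_ssh_command("privatekey.pem", "dev-endpoint.example.com")
# Quote each part so the printed line can be pasted into a shell safely.
command_line = " ".join(shlex.quote(part) for part in command)
print(command_line)
```

Keeping the command as a list of arguments, rather than one string, avoids quoting problems if the key path contains spaces.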

  4. Verify that your REPL is set up correctly by running the statement print spark.version. The statement prints the version of Apache Spark that is running on your development endpoint (2.1.0 in the preceding sample output). Your environment is now ready to run Spark with AWS Glue.

  5. For example, run a script similar to the following in your REPL. This script reads the schema of the flightscsv table and prints it. It also displays data from the table. Notice the spark.stop() statement in the script: it stops the SparkSession that the REPL created at startup, before the script creates its own GlueContext.

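    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.transforms import SelectFields

    # Stop the session that the REPL started (the 'spark' object from the banner)
    spark.stop()

    # Create a GlueContext on top of a SparkContext
    glueContext = GlueContext(SparkContext.getOrCreate())

    # Read the flightscsv table from the flights-db database in the Data Catalog
    datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "flights-db", table_name = "flightscsv", transformation_ctx = "datasource0")

    # Print the schema of the table
    datasource0.printSchema()

    # Convert the DynamicFrame to a Spark DataFrame
    df = datasource0.toDF()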