
Tutorial: Set Up a Development Endpoint and Notebook to Author ETL Scripts Interactively

The goal of this tutorial is to create an environment in which you can create ETL (extract, transform, and load) scripts that can easily be ported to run as AWS Glue jobs. AWS Glue lets you create a development endpoint, spin up an Amazon EC2 cluster to run Apache Zeppelin notebooks, and create and test AWS Glue scripts. In this scenario, you query publicly available airline flight data.

The following concepts can help you understand the steps in this tutorial.

  • A development endpoint is set up similar to the AWS Glue serverless environment. When you use a development endpoint, you can develop ETL scripts that can be ported to run using AWS Glue.

  • One use of this endpoint is to create a notebook. An Apache Zeppelin notebook is a web-based notebook that enables interactive data analytics. The Zeppelin notebook is provisioned on an Amazon EC2 instance with access to AWS Glue libraries. Charges for using Amazon EC2 are separate from AWS Glue. You can view your Amazon EC2 instances in the Amazon EC2 console (https://console.aws.amazon.com/ec2/).

    An AWS CloudFormation stack is used to create the environment for the notebook. You can view the AWS CloudFormation stack in the AWS CloudFormation console (https://console.aws.amazon.com/cloudformation).

In this example, you create a development endpoint that can be used to query flight data that is stored in Amazon Simple Storage Service (Amazon S3).

Prerequisites

  • Set up your environment to use development endpoints and notebook servers. For more information, see Setting Up Your Environment for Development Endpoints.

  • Sign in to the AWS Management Console and open the AWS Glue console at https://console.aws.amazon.com/glue/.

  • Run a crawler to catalog the flights public data set located at s3://athena-examples/flight/parquet/.

  • For more information about creating crawlers, see the Add crawler tutorial on the AWS Glue console. Configure the crawler to create tables in a database named flightsdb. Also define the table name prefix as flights. When the crawler run completes, verify that the flightsparquet table is available in your AWS Glue Data Catalog.
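The crawler settings described above could also be expressed with the AWS SDK for Python. The following is a minimal sketch, assuming boto3 is installed and AWS credentials are configured. The crawler name and role ARN are hypothetical placeholders and are not part of this tutorial; only the database name, table prefix, and Amazon S3 path come from the steps above.

```python
def build_crawler_request(name, role_arn):
    """Return keyword arguments for the AWS Glue CreateCrawler API call."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": "flightsdb",   # database named in the prerequisites
        "TablePrefix": "flights",      # prefix added to each table the crawler creates
        "Targets": {
            "S3Targets": [{"Path": "s3://athena-examples/flight/parquet/"}]
        },
    }


def create_and_start_crawler(name, role_arn):
    """Create the crawler and start one run (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the request builder works without boto3

    glue = boto3.client("glue")
    glue.create_crawler(**build_crawler_request(name, role_arn))
    glue.start_crawler(Name=name)


if __name__ == "__main__":
    # Hypothetical crawler name and role ARN; replace them with your own values.
    request = build_crawler_request(
        "flights-crawler",
        "arn:aws:iam::123456789012:role/AWSGlueServiceRole-demo",
    )
    print(request["DatabaseName"], request["TablePrefix"])
```

Keeping the request in a separate function makes it possible to inspect the settings before any call is made to AWS Glue.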

Note

The flight table data comes from Flights data provided by the U.S. Department of Transportation, Bureau of Transportation Statistics. Desaturated from original.

Step 1: To Create a Development Endpoint

  1. In the AWS Glue console, navigate to the development endpoints list. Choose Add endpoint.

  2. Specify an endpoint name; for example, demo-endpoint.

  3. Choose an IAM role with permissions similar to the IAM role that you use to run AWS Glue ETL jobs. For more information, see Step 2: Create an IAM Role for AWS Glue.

  4. Specify an Amazon VPC, a subnet, and security groups. This information is used to create a development endpoint to securely connect to your data resources and issue Apache Spark commands. Consider the following suggestions when filling in the properties of your endpoint:

    • If you already set up a connection to your data stores, you can use the same connection to determine the Amazon VPC, subnet, and security groups for your development endpoint. Otherwise, specify these parameters individually.

    • Ensure that your Amazon VPC has Edit DNS hostnames set to yes. This parameter can be set in the Amazon VPC console (https://console.aws.amazon.com/vpc/). For more information, see Setting Up DNS in Your VPC.

    • For this tutorial, ensure that the Amazon VPC you select has an Amazon S3 VPC endpoint. For information about how to create an Amazon S3 VPC endpoint, see Amazon VPC Endpoints for Amazon S3.

    • Select a public subnet for your development endpoint. You can make a subnet a public subnet by adding a route to an internet gateway. For IPv4 traffic, create a route with Destination 0.0.0.0/0 and Target the internet gateway ID. Your subnet’s route table should be associated with an internet gateway, not a NAT gateway. This information can be set in the Amazon VPC console (https://console.aws.amazon.com/vpc/). For example:

      
      [Figure: An example of a route table with an internet gateway.]

      For more information, see Route tables for Internet Gateways. For information about how to create an internet gateway, see Internet Gateways.

    • Ensure that you choose a security group that has an inbound self-reference rule. This information can be set in the Amazon VPC console (https://console.aws.amazon.com/vpc/). For example:

      
      [Figure: An example of a self-referencing inbound rule.]

      For more information about how to set up your subnet, see Setting Up Your Environment for Development Endpoints.

    • The public SSH key that you use for your development endpoint should not be an Amazon EC2 key pair. Generate the key with ssh-keygen, which typically can be found in a bash shell on a Mac or Git for Windows. The key is a 2048-bit SSH-2 RSA key.

  5. Choose Create. After the development endpoint is created, wait for its provisioning status to move to Ready. Then proceed to the next step.
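The subnet and security group suggestions in step 4 can be checked programmatically. The following is a minimal sketch, assuming the input dictionaries have the shape that the boto3 calls describe_route_tables and describe_security_groups return for a single route table and a single security group. The helper names and the sample IDs are hypothetical.

```python
def has_internet_gateway_route(route_table):
    """True if the route table sends 0.0.0.0/0 to an internet gateway (igw-...)."""
    for route in route_table.get("Routes", []):
        if (route.get("DestinationCidrBlock") == "0.0.0.0/0"
                and route.get("GatewayId", "").startswith("igw-")):
            return True
    return False


def has_self_referencing_inbound_rule(security_group):
    """True if any inbound rule names the group itself as the traffic source."""
    group_id = security_group["GroupId"]
    for permission in security_group.get("IpPermissions", []):
        for pair in permission.get("UserIdGroupPairs", []):
            if pair.get("GroupId") == group_id:
                return True
    return False


# Hand-written sample data; the IDs are made up for illustration.
sample_route_table = {
    "Routes": [
        {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local"},
        {"DestinationCidrBlock": "0.0.0.0/0", "GatewayId": "igw-0abc1234"},
    ]
}
sample_security_group = {
    "GroupId": "sg-0123abcd",
    "IpPermissions": [
        {"IpProtocol": "-1", "UserIdGroupPairs": [{"GroupId": "sg-0123abcd"}]}
    ],
}

print(has_internet_gateway_route(sample_route_table))
print(has_self_referencing_inbound_rule(sample_security_group))
```

A default route that targets a NAT gateway is reported under a different key (NatGatewayId), so the first check returns False for it. That matches the guidance that the subnet's route table should use an internet gateway, not a NAT gateway.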

Step 2: To Create an Apache Zeppelin Notebook Server

To perform this procedure, you must have permission to create resources in AWS CloudFormation, Amazon EC2, and other services. For more information about required user permissions, see Step 3: Attach a Policy to IAM Users That Access AWS Glue.

  1. In the AWS Glue console, navigate to the development endpoints list. Choose Actions, Create notebook server.

    To create the notebook, an Amazon EC2 instance is created using an AWS CloudFormation stack on your development endpoint. A Zeppelin notebook HTTP server is started on port 443.

  2. Specify the AWS CloudFormation stack server name, for example demo-cf.

  3. Choose an IAM role with a trust relationship to Amazon EC2. For more information, see Step 5: Create an IAM Role for Notebooks.

  4. Create or use an existing Amazon EC2 key pair with the Amazon EC2 console (https://console.aws.amazon.com/ec2/). Remember where your private key is downloaded. This key is different from the SSH key you used when creating your development endpoint. The keys that Amazon EC2 uses are 2048-bit SSH-2 RSA keys. For more information about Amazon EC2 keys, see Amazon EC2 Key Pairs.

  5. Choose a user name and password to access your Apache Zeppelin notebook.

  6. Choose an Amazon S3 path for your notebook state to be stored in.

  7. Choose Create.

    You can view the status of the AWS CloudFormation stack in the AWS CloudFormation console Events tab (https://console.aws.amazon.com/cloudformation). You can view the Amazon EC2 instances created by AWS CloudFormation in the Amazon EC2 console (https://console.aws.amazon.com/ec2/). Search for instances that are tagged with key aws-glue-dev-endpoint with a value of the name of the development endpoint.

    After the notebook server is created, the stack status changes to CREATE_COMPLETE in the AWS CloudFormation console. Details about your notebook server also appear on the development endpoint details page. When creation is complete, go to the next step.
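The tag search and status check described in step 7 can be sketched with the AWS SDK for Python. This is an illustration only, assuming boto3 is installed and credentials are configured. The helper names are hypothetical; the tag key aws-glue-dev-endpoint and the status CREATE_COMPLETE come from the text above.

```python
def notebook_instance_filters(endpoint_name):
    """Filters for ec2.describe_instances that match the notebook server's tag."""
    return [{"Name": "tag:aws-glue-dev-endpoint", "Values": [endpoint_name]}]


def stack_is_ready(stack_status):
    """True once AWS CloudFormation reports the stack as fully created."""
    return stack_status == "CREATE_COMPLETE"


def find_notebook_instance_ids(endpoint_name):
    """Return the IDs of instances tagged for the endpoint (needs boto3)."""
    import boto3  # imported lazily so the helpers above work without boto3

    ec2 = boto3.client("ec2")
    response = ec2.describe_instances(
        Filters=notebook_instance_filters(endpoint_name)
    )
    return [
        instance["InstanceId"]
        for reservation in response["Reservations"]
        for instance in reservation["Instances"]
    ]


if __name__ == "__main__":
    # demo-endpoint is the example endpoint name used in Step 1.
    print(notebook_instance_filters("demo-endpoint"))
    print(stack_is_ready("CREATE_IN_PROGRESS"))
```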

Step 3: To Connect to Your Apache Zeppelin Notebook

  1. In the AWS Glue console, navigate to the development endpoints list. Choose the development endpoint name to open its details page.

    Details about your notebook server are also described on this page. You use these details to connect your Apache Zeppelin notebook from your web browser.

  2. On your local computer, open a terminal window. Leave the terminal window open while you use the notebook. Navigate to the folder where you downloaded your Amazon EC2 private key. To protect your Amazon EC2 private key from accidental overwriting, type the following:

    chmod 400 private-key

    For example:

    chmod 400 my-name.pem

  3. Open a web browser, and type the Notebook URL in the browser address bar to access the notebook using HTTPS on port 443. For example:

    https://ec2-xx-xxx-xxx-xx.compute-1.amazonaws.com:443

    The Zeppelin notebook opens in your web browser. Log in to the notebook using the user name and password you provided when you created the notebook server.

  4. Create a new note and name it demo note. For the Default Interpreter, choose Spark.

  5. Verify that your notebook is set up correctly by typing the statement spark.version and running it. It returns the version of Apache Spark that is running on your notebook server.

  6. Type the following script into your notebook and run it. This script reads the schema from the flightsparquet table and displays it. It also displays data from the table.

    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.transforms import SelectFields

    glueContext = GlueContext(spark.sparkContext)
    datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "flightsdb", table_name = "flightsparquet", transformation_ctx = "datasource0")
    datasource0.printSchema()
    df = datasource0.toDF()
    df.show()

    This script returns output similar to the following example:

    2.1.0
    -----------------+-------+-------+----------+---------------+----------+-----------------------+--------------------+
          geolocation|classid|topicid|questionid|datavaluetypeid|locationid|stratificationcategory1|     stratification1| ...
    -----------------+-------+-------+----------+---------------+----------+-----------------------+--------------------+
    .8405711220004...|    OWS|   OWS1|      Q036|          VALUE|         1|                 Income|   Data not reported| ...
    .8405711220004...|    OWS|   OWS1|      Q037|          VALUE|         1|            Age (years)|             55 - 64| ...
    .8405711220004...|     FV|    FV1|      Q018|          VALUE|         1|              Education|    College graduate| ...
    .8405711220004...|     FV|    FV1|      Q018|          VALUE|         1|              Education|Less than high sc...| ...
    .8405711220004...|     FV|    FV1|      Q019|          VALUE|         1|                 Income|   $25,000 - $34,999| ...

  7. When a notebook server is running, Apache Zeppelin does not emit error messages on failure. To debug issues with your notebook, you can view the Zeppelin logs. In a terminal window, navigate to the folder where you downloaded your Amazon EC2 private key, and then type the SSH to EC2 server command found on the details page. For example:

    ssh -i private-key.pem ec2-user@ec2-xx-xxx-xxx-xxx.compute-1.amazonaws.com

    Then navigate to zeppelin/logs for your user.

  8. When you're finished, close your web browser and any open terminal windows.
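The chmod 400 command in step 2 restricts the private key file so that only its owner can read it and no one can write to it. The following is a minimal sketch of the same permission change in Python, assuming a POSIX system (file modes behave differently on Windows). It works on a throwaway temporary file, not a real key, and the helper name is hypothetical.

```python
import os
import stat
import tempfile


def make_owner_read_only(path):
    """Apply mode 400: the owner may read; no one may write or execute."""
    os.chmod(path, stat.S_IRUSR)  # stat.S_IRUSR is the owner-read bit, 0o400
    return stat.S_IMODE(os.stat(path).st_mode)


# Demonstrate on a throwaway file instead of a real private key.
with tempfile.NamedTemporaryFile(delete=False) as handle:
    demo_path = handle.name

mode = make_owner_read_only(demo_path)
print(oct(mode))

# Restore write access before deleting the demonstration file.
os.chmod(demo_path, stat.S_IRUSR | stat.S_IWUSR)
os.remove(demo_path)
```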