Tutorial Setup: Prerequisites for the Development Endpoint Tutorials

Development endpoints create an environment where you can interactively test and debug ETL scripts in various ways before you run them as AWS Glue jobs. The tutorials in this section show you how to do this using different IDEs. All of them assume that you have set up a development endpoint and crawled sample data to create tables in your AWS Glue Data Catalog using the steps in the following sections.

Note

Your ETL scripts must target Python 2.7, because AWS Glue development endpoints do not support Python 3 yet.

Some tutorials use only Amazon Simple Storage Service (Amazon S3) data, and others use a mix of JDBC and Amazon S3 data. For that reason, you will set up two development endpoints: one that is not in a virtual private cloud (VPC) and one that is.

Crawling the Sample Data Used in the Tutorials

The first step is to create a crawler that crawls some sample data and records metadata about it in tables in your Data Catalog. The sample data is drawn from http://everypolitician.org/ and has been modified slightly for the purposes of the tutorials. It contains data in JSON format about United States legislators and the seats that they have held in the US House of Representatives and Senate.

  1. Sign in to the AWS Management Console and open the AWS Glue console at https://console.aws.amazon.com/glue/.

    In the AWS Glue console, choose Databases in the navigation pane, and then choose Add database. Name the database legislators.

  2. Choose Crawlers, and then choose Add crawler. Name the crawler legislator_crawler, assign it your AWS Glue role, and then choose Next.

  3. Leave Amazon S3 as the data store. Under Crawl data in, choose Specified path in another account. Then in the Include path box, type s3://awsglue-datasets/examples/us-legislators/all. Choose Next, and then choose Next again to confirm that you don't want to add another data store. Then choose Next to confirm that this crawler will be run on demand.

  4. For Database, choose the legislators database. Choose Next, and then choose Finish to complete the creation of the new crawler.

  5. Choose Crawlers in the navigation pane again. Select the check box next to the new legislator_crawler crawler, and choose Run crawler.

  6. Choose Databases in the navigation pane. Choose the legislators database, and then choose Tables in legislators. You should see six tables created by the crawler in your Data Catalog, containing metadata that the crawler retrieved.
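The console steps above can also be scripted. The following is a minimal sketch, assuming the boto3 SDK for Python and a Glue client created with `boto3.client("glue")`. The helper names (`crawler_request`, `create_and_run_crawler`, `list_table_names`) are illustrative and not part of the guide, and the role you pass in is whatever AWS Glue role you already use.

```python
# Sketch only: a scripted equivalent of the console steps above.
# `glue` is assumed to be a boto3 Glue client: boto3.client("glue").
DATABASE_NAME = "legislators"
CRAWLER_NAME = "legislator_crawler"
SAMPLE_DATA_PATH = "s3://awsglue-datasets/examples/us-legislators/all"


def crawler_request(role):
    """Build keyword arguments for the Glue CreateCrawler call.

    `role` is the name or ARN of your AWS Glue IAM role.
    No Schedule is set, so the crawler only runs on demand.
    """
    return {
        "Name": CRAWLER_NAME,
        "Role": role,
        "DatabaseName": DATABASE_NAME,
        "Targets": {"S3Targets": [{"Path": SAMPLE_DATA_PATH}]},
    }


def create_and_run_crawler(glue, role):
    """Create the database and the crawler, then start the crawler."""
    glue.create_database(DatabaseInput={"Name": DATABASE_NAME})
    glue.create_crawler(**crawler_request(role))
    glue.start_crawler(Name=CRAWLER_NAME)


def list_table_names(glue):
    """Return the table names recorded in the legislators database.

    Reads a single page of results, which is enough for a handful of tables.
    """
    response = glue.get_tables(DatabaseName=DATABASE_NAME)
    return [table["Name"] for table in response["TableList"]]
```

After the crawler run finishes, `list_table_names(glue)` should return the tables that the crawler created.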

Creating a Development Endpoint for Amazon S3 Data

Next, create a development endpoint for Amazon S3 data. A development endpoint must be created in a VPC when you use a JDBC data source or target, but this isn't necessary if you access only Amazon S3.

  1. In the AWS Glue console, choose Dev endpoints. Choose Add endpoint.

  2. Specify an endpoint name, such as demo-endpoint.

  3. Choose an IAM role with permissions similar to the IAM role that you use to run AWS Glue ETL jobs. For more information, see Step 2: Create an IAM Role for AWS Glue. Choose Next.

  4. In Networking, leave Skip networking information selected, and choose Next.

  5. In SSH Public Key, enter a public key generated by an SSH key generator program (do not use an Amazon EC2 key pair). Save the corresponding private key so that you can later connect to the development endpoint using SSH. Choose Next.

  6. In Review, choose Finish. After the development endpoint is created, wait for its provisioning status to move to READY.
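The same endpoint can be requested programmatically. This is a rough sketch, assuming a boto3 Glue client; the helper names, polling interval, and retry count are illustrative choices rather than values from the guide. Leaving out the subnet and security group parameters is assumed to correspond to leaving Skip networking information selected in the console.

```python
import time


def dev_endpoint_request(endpoint_name, role_arn, public_key):
    """Build keyword arguments for the Glue CreateDevEndpoint call.

    No SubnetId or SecurityGroupIds are included, so the endpoint is
    requested without VPC networking information.
    """
    return {
        "EndpointName": endpoint_name,
        "RoleArn": role_arn,
        "PublicKey": public_key,
    }


def wait_until_ready(glue, endpoint_name, poll_seconds=30, max_polls=60):
    """Poll GetDevEndpoint until the provisioning status is READY."""
    for _ in range(max_polls):
        response = glue.get_dev_endpoint(EndpointName=endpoint_name)
        if response["DevEndpoint"]["Status"] == "READY":
            return True
        time.sleep(poll_seconds)
    raise TimeoutError("development endpoint did not reach READY in time")
```

A typical call would be `glue.create_dev_endpoint(**dev_endpoint_request("demo-endpoint", role_arn, public_key))` followed by `wait_until_ready(glue, "demo-endpoint")`, where `role_arn` and `public_key` are your own values.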

Creating an Amazon S3 Location to Use for Output

If you don't already have a bucket, follow the instructions in Create a Bucket to set one up in Amazon S3 where you can save output from sample ETL scripts.
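If you prefer to create the bucket from a script, the following sketch builds the request for the Amazon S3 CreateBucket call, assuming boto3. The helper name and the bucket name in the usage note are hypothetical. The Region check reflects the fact that Amazon S3 does not accept `us-east-1` as an explicit location constraint.

```python
def create_bucket_request(bucket_name, region):
    """Build keyword arguments for the S3 CreateBucket call.

    A LocationConstraint is only set outside us-east-1, because S3
    rejects us-east-1 as an explicit location constraint.
    """
    request = {"Bucket": bucket_name}
    if region != "us-east-1":
        request["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return request
```

For example, `boto3.client("s3", region_name="us-west-2").create_bucket(**create_bucket_request("my-glue-output-bucket", "us-west-2"))`, where the bucket name is a placeholder that you replace with a globally unique name.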

Creating a Development Endpoint in a VPC

Next, create a development endpoint within a virtual private cloud (Amazon VPC) that you can use to access datasets using JDBC. Before you start, set up your environment as explained in Setting Up Your Environment for Development Endpoints.

  1. In the AWS Glue console, choose Dev endpoints in the navigation pane. Then choose Add endpoint.

  2. Specify an endpoint name, such as vpc-demo-endpoint.

  3. Choose an IAM role with permissions similar to the IAM role that you use to run AWS Glue ETL jobs. For more information, see Step 2: Create an IAM Role for AWS Glue. Choose Next.

  4. In Networking, specify an Amazon VPC, a subnet, and security groups. This information is used to create a development endpoint that can connect to your data resources securely. Consider the following suggestions when filling in the properties of your endpoint:

    • If you already set up a connection to your data stores, you can use the same connection to determine the Amazon VPC, subnet, and security groups for your endpoint. Otherwise, specify these parameters individually.

    • Ensure that your Amazon VPC has Edit DNS hostnames set to yes. This parameter can be set in the Amazon VPC console (https://console.aws.amazon.com/vpc/). For more information, see Setting Up DNS in Your VPC.

    • For this tutorial, ensure that the Amazon VPC you select has an Amazon S3 VPC endpoint. For information about how to create an Amazon S3 VPC endpoint, see Amazon VPC Endpoints for Amazon S3.

    • Choose a public subnet for your development endpoint. You can make a subnet public by adding a route to an internet gateway. For IPv4 traffic, create a route with Destination set to 0.0.0.0/0 and Target set to the internet gateway ID. Your subnet’s route table should be associated with an internet gateway, not a NAT gateway. This information can be set in the Amazon VPC console (https://console.aws.amazon.com/vpc/). For example:

      [Figure: An example of a route table with an internet gateway.]

      For more information, see Route tables for Internet Gateways. For information about how to create an internet gateway, see Internet Gateways.

    • Ensure that you choose a security group that has an inbound self-reference rule. This information can be set in the Amazon VPC console (https://console.aws.amazon.com/vpc/). For example:

      [Figure: An example of a self-referencing inbound rule.]

      For more information about how to set up your subnet, see Setting Up Your Environment for Development Endpoints.

    Choose Next.

  5. In SSH Public Key, enter a public key generated by an SSH key generator program (do not use an Amazon EC2 key pair). Save the corresponding private key so that you can later connect to the development endpoint using SSH. Choose Next.

  6. In Review, choose Finish. After the development endpoint is created, wait for its provisioning status to move to READY.
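The networking suggestions in step 4 can be checked before you request the endpoint. The sketch below assumes boto3 and works on dictionaries shaped like the entries returned by the Amazon EC2 DescribeRouteTables and DescribeSecurityGroups calls. The helper names are illustrative, and the checks cover only the IPv4 route and the self-reference rule described above.

```python
def has_internet_gateway_route(route_table):
    """Return True if the route table sends 0.0.0.0/0 to an internet gateway.

    `route_table` is one entry of DescribeRouteTables()["RouteTables"].
    Internet gateway IDs start with "igw-"; NAT gateways appear under a
    different key (NatGatewayId), so they do not match.
    """
    for route in route_table.get("Routes", []):
        if (route.get("DestinationCidrBlock") == "0.0.0.0/0"
                and route.get("GatewayId", "").startswith("igw-")):
            return True
    return False


def has_self_reference_rule(security_group):
    """Return True if an inbound rule references the group itself.

    `security_group` is one entry of DescribeSecurityGroups()["SecurityGroups"].
    """
    group_id = security_group["GroupId"]
    for permission in security_group.get("IpPermissions", []):
        for pair in permission.get("UserIdGroupPairs", []):
            if pair.get("GroupId") == group_id:
                return True
    return False


def vpc_dev_endpoint_request(endpoint_name, role_arn, public_key,
                             subnet_id, security_group_ids):
    """Build keyword arguments for a Glue CreateDevEndpoint call in a VPC."""
    return {
        "EndpointName": endpoint_name,
        "RoleArn": role_arn,
        "PublicKey": public_key,
        "SubnetId": subnet_id,
        "SecurityGroupIds": list(security_group_ids),
    }
```

If both checks pass, the request can be sent with `boto3.client("glue").create_dev_endpoint(**vpc_dev_endpoint_request(...))`, using your own subnet and security group IDs.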

You are now ready to try out the tutorials in this section: