AWS Glue
Developer Guide

Tutorial Setup: Prerequisites for the Development Endpoint Tutorials

Development endpoints create an environment where you can interactively test and debug ETL scripts in various ways before you run them as AWS Glue jobs. The tutorials in this section show you how to do this using different IDEs. All of them assume that you have set up a development endpoint and crawled sample data to create tables in your AWS Glue Data Catalog using the steps in the following sections.

Note

Your Python scripts must target Python 2.7, because AWS Glue development endpoints do not support Python 3 yet.

Because you're using only Amazon Simple Storage Service (Amazon S3) data in some cases, and a mix of JDBC and Amazon S3 data in others, you will set up one development endpoint that is not in a virtual private cloud (VPC) and one that is.

Crawling the Sample Data Used in the Tutorials

The first step is to create a crawler that can crawl some sample data and record metadata about it in tables in your Data Catalog. The sample data that is used is drawn from http://everypolitician.org/ and has been modified slightly for purposes of the tutorials. It contains data in JSON format about United States legislators and the seats that they have held in the US House of Representatives and Senate.

  1. Sign in to the AWS Management Console and open the AWS Glue console at https://console.aws.amazon.com/glue/.

    In the AWS Glue console, choose Databases in the navigation pane, and then choose Add database. Name the database legislators.

  2. Choose Crawlers, and then choose Add crawler. Name the crawler legislator_crawler, and then choose Next.

  3. Leave Amazon S3 as the data store. Under Crawl data in, choose Specified path in another account. Then in the Include path box, type s3://awsglue-datasets/examples/us-legislators/all. Choose Next, and then choose Next again to confirm that you don't want to add another data store.

  4. Provide an IAM role for the crawler to assume when it runs, and then choose Next. Then choose Next again to confirm that this crawler will be run on demand.

  5. For Database, choose the legislators database. Choose Next, and then choose Finish to complete the creation of the new crawler.

  6. Choose Crawlers in the navigation pane again. Select the check box next to the new legislator_crawler crawler, and choose Run crawler.

  7. Choose Databases in the navigation pane. Choose the legislators database, and then choose Tables in legislators. You should see six tables created by the crawler in your Data Catalog, containing metadata that the crawler retrieved.
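
The console steps above can also be sketched with the AWS SDK for Python (boto3), run from your own machine rather than on a development endpoint. This is a minimal sketch, not the tutorial's prescribed method: the helper names crawler_requests and run_crawler are hypothetical, the role you pass in is a placeholder for one you have already created, and the code assumes boto3 is installed with credentials configured. Omitting a schedule is assumed to correspond to choosing "run on demand" in the console.

```python
def crawler_requests(role_arn,
                     database="legislators",
                     crawler="legislator_crawler",
                     s3_path="s3://awsglue-datasets/examples/us-legislators/all"):
    """Build keyword arguments for the Glue calls that mirror the console steps.

    Hypothetical helper: it only assembles dictionaries, so it can be
    inspected without contacting AWS.
    """
    return {
        # Step 1: the database that will hold the crawled tables.
        "create_database": {"DatabaseInput": {"Name": database}},
        # Steps 2-5: an S3 crawler with no Schedule key (assumed on demand).
        "create_crawler": {
            "Name": crawler,
            "Role": role_arn,
            "DatabaseName": database,
            "Targets": {"S3Targets": [{"Path": s3_path}]},
        },
        # Step 6: run the crawler once.
        "start_crawler": {"Name": crawler},
    }


def run_crawler(role_arn):
    """Send the requests to AWS Glue (requires boto3 and credentials)."""
    import boto3  # imported here so the builder above works without boto3

    glue = boto3.client("glue")
    requests = crawler_requests(role_arn)
    glue.create_database(**requests["create_database"])
    glue.create_crawler(**requests["create_crawler"])
    glue.start_crawler(**requests["start_crawler"])
```

After the crawler finishes, glue.get_tables(DatabaseName="legislators") should list the tables described in step 7.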

Creating a Development Endpoint for Amazon S3 Data

Next, create a development endpoint for Amazon S3 data. A development endpoint that uses a JDBC data source or target must be created with a VPC. However, a VPC isn't necessary in this tutorial if you access only Amazon S3.

  1. In the AWS Glue console, choose Dev endpoints. Choose Add endpoint.

  2. Specify an endpoint name, such as demo-endpoint.

  3. Choose an IAM role with permissions similar to the IAM role that you use to run AWS Glue ETL jobs. For more information, see Step 2: Create an IAM Role for AWS Glue. Choose Next.

  4. In Networking, leave Skip networking information selected, and choose Next.

  5. In SSH Public Key, enter a public key generated by an SSH key generator program, such as ssh-keygen (do not use an Amazon EC2 key pair). The generated public key will be imported into your development endpoint. Save the corresponding private key so that you can later connect to the development endpoint using SSH. For more information, see ssh-keygen in Wikipedia. Choose Next.

    Note

    When generating the key on Microsoft Windows, use a current version of PuTTYgen and generate an RSA key. Do not upload a file with the public key. Instead, copy the key shown in the PuTTYgen field Public key for pasting into OpenSSH authorized_keys file and paste it into the AWS Glue console. The corresponding private key (.ppk) can be used in PuTTY to connect to the development endpoint. To connect to the development endpoint with SSH on Windows, convert the private key from .ppk format to OpenSSH .pem format using the PuTTYgen Conversion menu. For more information, see Connecting to Your Linux Instance from Windows Using PuTTY.

  6. In Review, choose Finish. After the development endpoint is created, wait for its provisioning status to move to READY.
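
As a rough programmatic counterpart to the steps above, the following sketch creates an endpoint without networking information and polls until its status is READY. The helper names, the polling interval, and the number of polls are assumptions chosen for illustration. The only status value the code relies on is READY, which is the one named in step 6.

```python
import time


def dev_endpoint_request(name, role_arn, public_key):
    """Build arguments for an endpoint with no VPC settings (hypothetical helper)."""
    return {
        "EndpointName": name,
        "RoleArn": role_arn,
        # Public keys read from a file usually end with a newline; strip it.
        "PublicKey": public_key.strip(),
    }


def wait_until_ready(get_status, poll_seconds=30, max_polls=40, sleep=time.sleep):
    """Call get_status() until it returns "READY" or max_polls is exhausted.

    Returns True when READY was seen, False otherwise. The sleep function
    is injectable so the loop can be exercised without real waiting.
    """
    for _ in range(max_polls):
        if get_status() == "READY":
            return True
        sleep(poll_seconds)
    return False


def create_and_wait(name, role_arn, public_key_path):
    """Create the endpoint and wait for it (requires boto3 and credentials)."""
    import boto3

    glue = boto3.client("glue")
    with open(public_key_path) as handle:
        public_key = handle.read()
    glue.create_dev_endpoint(**dev_endpoint_request(name, role_arn, public_key))
    return wait_until_ready(
        lambda: glue.get_dev_endpoint(EndpointName=name)["DevEndpoint"]["Status"]
    )
```

For example, create_and_wait("demo-endpoint", role_arn, "glue_key.pub") would use a public key file produced by ssh-keygen. The file name here is a placeholder.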

Creating an Amazon S3 Location to Use for Output

If you don't already have a bucket, follow the instructions in Create a Bucket to set one up in Amazon S3 where you can save output from sample ETL scripts.
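
If you prefer to create the output bucket from a script, a minimal boto3 sketch follows. The bucket name and Region are placeholders that you supply, and the helper names are hypothetical. The one detail worth noting is that Amazon S3 expects no location constraint for us-east-1, while other Regions need one.

```python
def bucket_request(bucket_name, region):
    """Build create_bucket arguments for the given Region (hypothetical helper)."""
    request = {"Bucket": bucket_name}
    if region != "us-east-1":
        # Outside us-east-1, the Region must be stated as a location constraint.
        request["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return request


def create_output_bucket(bucket_name, region):
    """Create the bucket (requires boto3 and credentials)."""
    import boto3

    s3 = boto3.client("s3", region_name=region)
    s3.create_bucket(**bucket_request(bucket_name, region))
```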

Creating a Development Endpoint with a VPC

Although a VPC development endpoint is not required for this tutorial, you need one if your ETL statements access both Amazon S3 and JDBC data stores. In this case, when you create a development endpoint, you specify network properties of the virtual private cloud (Amazon VPC) that contains your JDBC data stores. Before you start, set up your environment as explained in Setting Up Your Environment for Development Endpoints.

  1. In the AWS Glue console, choose Dev endpoints in the navigation pane. Then choose Add endpoint.

  2. Specify an endpoint name, such as vpc-demo-endpoint.

  3. Choose an IAM role with permissions similar to the IAM role that you use to run AWS Glue ETL jobs. For more information, see Step 2: Create an IAM Role for AWS Glue. Choose Next.

  4. In Networking, specify an Amazon VPC, a subnet, and security groups. This information is used to create a development endpoint that can connect to your data resources securely. Consider the following suggestions when filling in the properties of your endpoint:

    • If you already set up a connection to your data stores, you can use the same connection to determine the Amazon VPC, subnet, and security groups for your endpoint. Otherwise, specify these parameters individually.

    • Ensure that your Amazon VPC has Edit DNS hostnames set to yes. This parameter can be set in the Amazon VPC console (https://console.aws.amazon.com/vpc/). For more information, see Setting Up DNS in Your VPC.

    • For this tutorial, ensure that the Amazon VPC you select has an Amazon S3 VPC endpoint. For information about how to create an Amazon S3 VPC endpoint, see Amazon VPC Endpoints for Amazon S3.

    • Choose a public subnet for your development endpoint. You can make a subnet a public subnet by adding a route to an internet gateway. For IPv4 traffic, create a route with Destination set to 0.0.0.0/0 and Target set to the internet gateway ID. Your subnet’s route table should be associated with an internet gateway, not a NAT gateway. This information can be set in the Amazon VPC console (https://console.aws.amazon.com/vpc/). For example:

      [Figure: An example of a route table with an internet gateway.]

      For more information, see Route tables for Internet Gateways. For information about how to create an internet gateway, see Internet Gateways.

    • Ensure that you choose a security group that has an inbound self-reference rule. This information can be set in the Amazon VPC console (https://console.aws.amazon.com/vpc/). For example:

      [Figure: An example of a self-referencing inbound rule.]

      For more information about how to set up your subnet, see Setting Up Your Environment for Development Endpoints.

    Choose Next.

  5. In SSH Public Key, enter a public key generated by an SSH key generator program (do not use an Amazon EC2 key pair). Save the corresponding private key to later connect to the development endpoint using SSH. Choose Next.

    Note

    When generating the key on Microsoft Windows, use a current version of PuTTYgen and generate an RSA key. Do not upload a file with the public key. Instead, copy the key shown in the PuTTYgen field Public key for pasting into OpenSSH authorized_keys file and paste it into the AWS Glue console. The corresponding private key (.ppk) can be used in PuTTY to connect to the development endpoint. To connect to the development endpoint with SSH on Windows, convert the private key from .ppk format to OpenSSH .pem format using the PuTTYgen Conversion menu. For more information, see Connecting to Your Linux Instance from Windows Using PuTTY.

  6. In Review, choose Finish. After the development endpoint is created, wait for its provisioning status to move to READY.
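
Two of the networking suggestions in step 4 (a route to an internet gateway and a self-referencing inbound rule) can be checked before you create the endpoint. The sketch below assumes you pass in dictionaries shaped like the entries returned by the boto3 EC2 calls describe_route_tables (under "RouteTables") and describe_security_groups (under "SecurityGroups"). The function names are hypothetical, and the check covers IPv4 routes only.

```python
def has_internet_gateway_route(route_table):
    """Return True if the table routes 0.0.0.0/0 to an internet gateway.

    A default route through a NAT gateway carries a NatGatewayId instead
    of an igw- GatewayId, so it does not satisfy this check.
    """
    for route in route_table.get("Routes", []):
        if (route.get("DestinationCidrBlock") == "0.0.0.0/0"
                and route.get("GatewayId", "").startswith("igw-")):
            return True
    return False


def has_self_reference_rule(security_group):
    """Return True if any inbound rule names the group itself as the source."""
    group_id = security_group["GroupId"]
    for permission in security_group.get("IpPermissions", []):
        for pair in permission.get("UserIdGroupPairs", []):
            if pair.get("GroupId") == group_id:
                return True
    return False
```

With boto3, the inputs could come from calls such as ec2.describe_route_tables(Filters=[{"Name": "association.subnet-id", "Values": [subnet_id]}]) and ec2.describe_security_groups(GroupIds=[group_id]), where subnet_id and group_id are the values you plan to enter in the Networking step.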

You are now ready to try out the tutorials in this section: