Tutorial setup: prerequisites for the development endpoint tutorials
Development endpoints create an environment where you can interactively test and debug ETL scripts in various ways before you run them as AWS Glue jobs. The tutorials in this section show you how to do this using different IDEs. All of them assume that you have set up a development endpoint and crawled sample data to create tables in your AWS Glue Data Catalog using the steps in the following sections.
Because you're using only Amazon Simple Storage Service (Amazon S3) data in some cases, and a mix of JDBC and Amazon S3 data in others, you will set up one development endpoint that is not in a virtual private cloud (VPC) and one that is.
Crawling the sample data used in the tutorials
The first step is to create a crawler that can crawl some sample data and record metadata about it in tables in your Data Catalog. The sample data that is used is drawn from http://everypolitician.org/.
- Sign in to the AWS Management Console and open the AWS Glue console at https://console.aws.amazon.com/glue/. In the AWS Glue console, choose Databases in the navigation pane, and then choose Add database. Name the database legislators.
- Choose Crawlers, and then choose Add crawler. Name the crawler legislator_crawler, and then choose Next.
- Accept the default crawler source type (Data stores) and choose Next.
- Leave S3 as the data store. Under Crawl data in, choose Specified path in another account. Then in the Include path box, enter s3://awsglue-datasets/examples/us-legislators/all. Choose Next, and then choose Next again to confirm that you don't want to add another data store.
- Provide an IAM role for the crawler to assume when it runs. Provide a role that can access s3://awsglue-datasets/examples/us-legislators/all, or choose Create an IAM role and enter a name to create a role that has access to that location.
- Choose Next, and then choose Next again to confirm that this crawler will be run on demand.
- For Database, choose the legislators database. Choose Next, and then choose Finish to complete the creation of the new crawler.
- Choose Crawlers in the navigation pane again. Select the check box next to the new legislator_crawler crawler, and choose Run crawler.
- Choose Databases in the navigation pane. Choose the legislators database, and then choose Tables in legislators. You should see six tables created by the crawler in your Data Catalog, containing metadata that the crawler retrieved.
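The same console steps can also be scripted. The following is a minimal sketch, assuming the AWS SDK for Python (boto3) is installed and credentials are configured; the helper names (make_glue_client, crawl_sample_data) and the role ARN you pass in are illustrative and not part of the tutorial. It creates the legislators database and the legislator_crawler crawler, runs the crawler once, and returns the names of the tables it created.

```python
import time

DATABASE = "legislators"
CRAWLER = "legislator_crawler"
SAMPLE_PATH = "s3://awsglue-datasets/examples/us-legislators/all"


def make_glue_client():
    """Create an AWS Glue client (boto3 is imported lazily, only when needed)."""
    import boto3
    return boto3.client("glue")


def crawl_sample_data(glue, role_arn, poll_seconds=15):
    """Create the database and crawler, run the crawler, and list the tables.

    `role_arn` is an assumption: an IAM role that AWS Glue can assume and
    that can read SAMPLE_PATH.
    """
    glue.create_database(DatabaseInput={"Name": DATABASE})
    glue.create_crawler(
        Name=CRAWLER,
        Role=role_arn,
        DatabaseName=DATABASE,
        Targets={"S3Targets": [{"Path": SAMPLE_PATH}]},
    )
    # No schedule was given, so the crawler only runs on demand.
    glue.start_crawler(Name=CRAWLER)

    # A crawler reports READY again once the run has finished.
    while glue.get_crawler(Name=CRAWLER)["Crawler"]["State"] != "READY":
        time.sleep(poll_seconds)

    tables = glue.get_tables(DatabaseName=DATABASE)["TableList"]
    return sorted(table["Name"] for table in tables)
```

With credentials in place, a call such as crawl_sample_data(make_glue_client(), role_arn) would be expected to return the six table names that the console shows under Tables in legislators.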
Creating a development endpoint for Amazon S3 data
The next thing to do is to create a development endpoint for Amazon S3 data. When you use a JDBC data source or target, the development endpoint must be created with a VPC. However, this isn't necessary in this tutorial if you are only accessing Amazon S3.
- In the AWS Glue console, choose Dev endpoints. Choose Add endpoint.
- Specify an endpoint name, such as demo-endpoint.
- Choose an IAM role with permissions similar to the IAM role that you use to run AWS Glue ETL jobs. For more information, see Step 2: Create an IAM role for AWS Glue. Choose Next.
- In Networking, leave Skip networking information selected, and choose Next.
- In SSH Public Key, enter a public key generated by an SSH key generator program, such as ssh-keygen (do not use an Amazon EC2 key pair). The generated public key will be imported into your development endpoint. Save the corresponding private key so that you can later connect to the development endpoint using SSH. Choose Next. For more information, see ssh-keygen in Wikipedia.
  Note: When generating the key on Microsoft Windows, use a current version of PuTTYgen and paste the public key into the AWS Glue console from the PuTTYgen window. Generate an RSA key. Do not upload a file with the public key; instead, use the key shown in the PuTTYgen field "Public key for pasting into OpenSSH authorized_keys file". The corresponding private key (.ppk) can be used in PuTTY to connect to the development endpoint. To connect to the development endpoint with SSH on Windows, convert the private key from .ppk format to OpenSSH .pem format using the PuTTYgen Conversion menu. For more information, see Connecting to Your Linux Instance from Windows Using PuTTY.
- In Review, choose Finish. After the development endpoint is created, wait for its provisioning status to move to READY.
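As a rough programmatic equivalent of the procedure above, the sketch below builds a request for the AWS Glue CreateDevEndpoint API and polls until the endpoint reports READY. It assumes a boto3 Glue client is passed in; the function names are hypothetical, and treating FAILED as the terminal error status is an assumption rather than something stated in this tutorial. Leaving out subnet and security group settings corresponds to keeping Skip networking information selected.

```python
import time


def dev_endpoint_request(name, role_arn, public_key):
    """Build CreateDevEndpoint parameters for an endpoint outside a VPC."""
    return {
        "EndpointName": name,
        "RoleArn": role_arn,
        # Paste the public key text itself, not an Amazon EC2 key pair name.
        "PublicKey": public_key.strip(),
    }


def create_and_wait(glue, request, poll_seconds=30):
    """Create the development endpoint and block until it is READY."""
    glue.create_dev_endpoint(**request)
    name = request["EndpointName"]
    while True:
        endpoint = glue.get_dev_endpoint(EndpointName=name)["DevEndpoint"]
        status = endpoint["Status"]
        if status == "READY":
            return endpoint
        if status == "FAILED":  # assumed failure status
            raise RuntimeError(f"Development endpoint {name} failed to provision")
        time.sleep(poll_seconds)
```

For example, dev_endpoint_request("demo-endpoint", role_arn, open("glue_key.pub").read()) would produce the parameters for the endpoint named in the steps, where glue_key.pub is a placeholder for the public key file written by ssh-keygen.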
Creating an Amazon S3 location to use for output
If you don't already have a bucket, follow the instructions in Create a Bucket to set one up in Amazon S3 where you can save output from sample ETL scripts.
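If you prefer to create the bucket from a script, a minimal sketch using a boto3 S3 client might look like the following. The bucket name and Region are placeholders that you supply, and the helper names are illustrative. The only subtlety is that Amazon S3 expects a LocationConstraint for every Region except us-east-1.

```python
def create_bucket_request(bucket, region):
    """Build CreateBucket parameters for the given Region."""
    request = {"Bucket": bucket}
    if region != "us-east-1":
        # us-east-1 is the default location and must not be passed explicitly.
        request["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return request


def create_output_bucket(s3, bucket, region):
    """Create the bucket and return the S3 URI to use for ETL output."""
    s3.create_bucket(**create_bucket_request(bucket, region))
    return f"s3://{bucket}/"
```

Here s3 would typically be boto3.client("s3", region_name=region), and the returned URI can serve as the output location in the sample ETL scripts.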
Creating a development endpoint with a VPC
Although not required for this tutorial, a VPC development endpoint is needed if both Amazon S3 and JDBC data stores are accessed by your ETL statements. In this case, when you create a development endpoint you specify network properties of the virtual private cloud (Amazon VPC) that contains your JDBC data stores. Before you start, set up your environment as explained in Setting up networking for development for AWS Glue.
- In the AWS Glue console, choose Dev endpoints in the navigation pane. Then choose Add endpoint.
- Specify an endpoint name, such as vpc-demo-endpoint.
- Choose an IAM role with permissions similar to the IAM role that you use to run AWS Glue ETL jobs. For more information, see Step 2: Create an IAM role for AWS Glue. Choose Next.
- In Networking, specify an Amazon VPC, a subnet, and security groups. This information is used to create a development endpoint that can connect to your data resources securely. Consider the following suggestions when filling in the properties of your endpoint:
  - If you already set up a connection to your data stores, you can retrieve the connection details from the existing connection to use in configuring the Amazon VPC, Subnet, and Security groups parameters for your endpoint. Otherwise, specify these parameters individually.
  - Ensure that your Amazon VPC has Edit DNS hostnames set to yes. This parameter can be set in the Amazon VPC console (https://console.aws.amazon.com/vpc/). For more information, see Setting up DNS in your VPC.
  - For this tutorial, ensure that the Amazon VPC you select has an Amazon S3 VPC endpoint. For information about how to create an Amazon S3 VPC endpoint, see Amazon VPC endpoints for Amazon S3.
  - Choose a public subnet for your development endpoint. You can make a subnet a public subnet by adding a route to an internet gateway. For IPv4 traffic, create a route with Destination 0.0.0.0/0 and Target the internet gateway ID. Your subnet’s route table should be associated with an internet gateway, not a NAT gateway. This information can be set in the Amazon VPC console (https://console.aws.amazon.com/vpc/). For more information, see Route tables for Internet Gateways. For information about how to create an internet gateway, see Internet Gateways.
  - Ensure that you choose a security group that has an inbound self-reference rule. This information can be set in the Amazon VPC console (https://console.aws.amazon.com/vpc/). For more information about how to set up your subnet, see Setting up networking for development for AWS Glue.
  Choose Next.
- In SSH Public Key, enter a public key generated by an SSH key generator program (do not use an Amazon EC2 key pair). Save the corresponding private key so that you can later connect to the development endpoint using SSH. Choose Next.
  Note: When generating the key on Microsoft Windows, use a current version of PuTTYgen and paste the public key into the AWS Glue console from the PuTTYgen window. Generate an RSA key. Do not upload a file with the public key; instead, use the key shown in the PuTTYgen field "Public key for pasting into OpenSSH authorized_keys file". The corresponding private key (.ppk) can be used in PuTTY to connect to the development endpoint. To connect to the development endpoint with SSH on Windows, convert the private key from .ppk format to OpenSSH .pem format using the PuTTYgen Conversion menu. For more information, see Connecting to Your Linux Instance from Windows Using PuTTY.
- In Review, choose Finish. After the development endpoint is created, wait for its provisioning status to move to READY.
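The networking suggestions in the procedure above can be checked before you create the endpoint. The sketch below is one possible way to do that, assuming a boto3 EC2 client is passed in; all function names are hypothetical. It inspects the responses of DescribeVpcAttribute, DescribeRouteTables, and DescribeSecurityGroups, and then builds CreateDevEndpoint parameters that include the subnet and security groups. It does not check for the Amazon S3 VPC endpoint, and a subnet that relies on the VPC's main route table has no explicit association, so the route check would report it as not public.

```python
def is_public_route_table(route_table):
    """True if the table sends 0.0.0.0/0 to an internet gateway (igw-...)."""
    return any(
        route.get("DestinationCidrBlock") == "0.0.0.0/0"
        and route.get("GatewayId", "").startswith("igw-")
        for route in route_table.get("Routes", [])
    )


def has_self_reference(security_group):
    """True if an inbound rule allows traffic from the group itself."""
    group_id = security_group["GroupId"]
    return any(
        pair.get("GroupId") == group_id
        for permission in security_group.get("IpPermissions", [])
        for pair in permission.get("UserIdGroupPairs", [])
    )


def network_problems(ec2, vpc_id, subnet_id, security_group_ids):
    """Return a list of human-readable problems; empty means the checks passed."""
    problems = []
    attr = ec2.describe_vpc_attribute(VpcId=vpc_id, Attribute="enableDnsHostnames")
    if not attr["EnableDnsHostnames"]["Value"]:
        problems.append("DNS hostnames are not enabled on the VPC")

    tables = ec2.describe_route_tables(
        Filters=[{"Name": "association.subnet-id", "Values": [subnet_id]}]
    )["RouteTables"]
    if not any(is_public_route_table(table) for table in tables):
        problems.append("subnet has no 0.0.0.0/0 route to an internet gateway")

    groups = ec2.describe_security_groups(GroupIds=list(security_group_ids))
    if not any(has_self_reference(g) for g in groups["SecurityGroups"]):
        problems.append("no security group has an inbound self-reference rule")
    return problems


def vpc_dev_endpoint_request(name, role_arn, public_key, subnet_id, security_group_ids):
    """Build CreateDevEndpoint parameters for an endpoint inside a VPC."""
    return {
        "EndpointName": name,
        "RoleArn": role_arn,
        "PublicKey": public_key.strip(),
        "SubnetId": subnet_id,
        "SecurityGroupIds": list(security_group_ids),
    }
```

If network_problems returns an empty list, the request built by vpc_dev_endpoint_request can be passed to a Glue client as create_dev_endpoint(**request), after which you wait for READY as in the console steps.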
You are now ready to try out the tutorials in this section: