Menu
AWS Glue
Developer Guide

Adding a Connection to Your Data Store

Connections are used by crawlers and jobs in AWS Glue to access certain types of data stores. For information about adding a connection using the AWS Glue console, see Working with Connections on the AWS Glue Console.

When Is a Connection Used?

If your data store requires one, the connection is used when you crawl a data store to catalog its metadata in the AWS Glue Data Catalog. The connection is also used by any job that uses the data store as a source or target.

Defining a Connection in the AWS Glue Data Catalog

Some types of data stores require additional connection information to access your data. This information might include an additional user name and password (different from your AWS credentials), or other information that is required to connect to the data store.

AWS Glue can connect to the following data stores by using the JDBC protocol:

  • Amazon Redshift

  • Amazon Relational Database Service

    • Amazon Aurora

    • MariaDB

    • MySQL

    • PostgreSQL

  • Publicly accessible databases

    • Amazon Aurora

    • MariaDB

    • MySQL

    • PostgreSQL

A connection is not typically required for Amazon S3. However, to access Amazon S3 from within your virtual private cloud (VPC), an Amazon S3 VPC endpoint is required. For more information, see Amazon VPC Endpoints for Amazon S3.

In your connection information, you also must consider whether data is accessed through a VPC and then set up network parameters accordingly.

Connecting to a JDBC Data Store in a VPC

Typically, you create resources inside Amazon Virtual Private Cloud (Amazon VPC) so that they cannot be accessed over the public internet. By default, resources in a VPC can't be accessed from AWS Glue. To enable AWS Glue to access resources inside your VPC, you must provide additional VPC-specific configuration information that includes VPC subnet IDs and security group IDs. AWS Glue uses this information to set up elastic network interfaces that enable your function to connect securely to other resources in your private VPC.

Accessing VPC Data Using Elastic Network Interfaces

When AWS Glue connects to a JDBC data store in a VPC, AWS Glue creates an elastic network interface (with the prefix Glue_) in your account to access your VPC data. You can't delete this network interface as long as it's attached to AWS Glue. As part of creating the elastic network interface, AWS Glue associates one or more security groups to it. To enable AWS Glue to create the network interface, security groups that are associated with the resource must allow inbound access with a source rule. This rule contains a security group that is associated with the resource. This gives the elastic network interface access to your data store with the same security group.

To allow AWS Glue to communicate with its components, specify a security group with a self-referencing inbound rule for all TCP ports. By creating a self-referencing rule, you can restrict the source to the same security group in the VPC and not open it to all networks. The default security group for your VPC might already have a self-referencing inbound rule for ALL Traffic.

You can create rules in the Amazon VPC console. To update rule settings via the AWS Management Console, navigate to the VPC console (https://console.aws.amazon.com/vpc/), and select the appropriate security group. Specify the inbound rule for ALL TCP to have as its source the same security group name. For more information about security group rules, see Security Groups for Your VPC.

Each elastic network interface is assigned a private IP address from the IP address range in the subnets that you specify. The network interface is not assigned any public IP addresses. AWS Glue requires internet access (for example, to access AWS services that don't have VPC endpoints). You can configure a network address translation (NAT) instance inside your VPC, or you can use the Amazon VPC NAT gateway. For more information, see NAT Gateways in the Amazon VPC User Guide. You can't directly use an internet gateway attached to your VPC as a route in your subnet route table because that requires the network interface to have public IP addresses.

The VPC network attributes enableDnsHostnames and enableDnsSupport must be set to true. For more information, see Using DNS with your VPC.

Important

  • Don't put your data store in a public subnet or in a private subnet that doesn't have internet access. Instead, attach it only to private subnets that have internet access through a NAT instance or an Amazon VPC NAT gateway.

  • Currently, an extract, transform, and load (ETL) job can access resources in a single subnet only. If you have multiple data stores in a job, they must be on the same subnet.

Elastic Network Interface Properties

To create the elastic network interface, you must supply the following properties:

VPC

The name of the VPC that contains your data store.

Subnet

The subnet in the VPC that contains your data store.

Security groups

The security groups associated with your data store. AWS Glue associates these security groups with the elastic network interface that is attached to your VPC subnet.

Availability Zone

If you're connecting using the API or the AWS Command Line Interface, the Availability Zone of the subnet.

For information about managing a VPC with Amazon Redshift, see Managing Clusters in an Amazon Virtual Private Cloud (VPC).

For information about managing a VPC with Amazon RDS, see Working with an Amazon RDS DB Instance in a VPC.