Using connectors and connections with AWS Glue Studio

AWS Glue provides built-in support for the most commonly used data stores (such as Amazon Redshift, Amazon Aurora, Microsoft SQL Server, MySQL, MongoDB, and PostgreSQL) using JDBC connections. AWS Glue also allows you to use custom JDBC drivers in your extract, transform, and load (ETL) jobs. For data stores that are not natively supported, such as SaaS applications, you can use connectors.

A connector is an optional code package that assists with accessing data stores in AWS Glue Studio. You can subscribe to several connectors offered in AWS Marketplace.

When creating ETL jobs, you can use a natively supported data store, a connector from AWS Marketplace, or your own custom connectors. If you use a connector, you must first create a connection for the connector. A connection contains the properties that are required to connect to a particular data store. You use the connection with your data sources and data targets in the ETL job. Connectors and connections work together to facilitate access to the data stores.

Overview of using connectors and connections

A connection contains the properties that are required to connect to a particular data store. When you create a connection, it is stored in the AWS Glue Data Catalog. You choose a connector, and then create a connection based on that connector.

You can subscribe to connectors for non-natively supported data stores in AWS Marketplace, and then use those connectors when you're creating connections. Developers can also create their own connectors, and you can use them when creating connections.

Note

Connections created using custom or AWS Marketplace connectors in AWS Glue Studio appear in the AWS Glue console with type set to UNKNOWN.

The following steps describe the overall process of using connectors in AWS Glue Studio:

  1. Subscribe to a connector in AWS Marketplace, or develop your own connector and upload it to AWS Glue Studio. For more information, see Adding connectors to AWS Glue Studio.

  2. Review the connector usage information. You can find this information on the Usage tab on the connector product page. For example, if you click the Usage tab on this product page, AWS Glue Connector for Google BigQuery, you can see in the Additional Resources section a link to a blog about using this connector. Other connectors might contain links to the instructions in the Overview section, as shown on the connector product page for Cloudwatch Logs connector for AWS Glue.

  3. Create a connection. You choose which connector to use and provide additional information for the connection, such as login credentials, URI strings, and virtual private cloud (VPC) information. For more information, see Creating connections for connectors.

  4. Create an IAM role for your job. The job assumes the permissions of the IAM role that you specify when you create it. This IAM role must have the necessary permissions to authenticate with, extract data from, and write data to your data stores. For more information, see Review IAM permissions needed for ETL jobs and Permissions required for using connectors.

  5. Create an ETL job and configure the data source properties for your ETL job. Provide the connection options and authentication information as instructed by the custom connector provider. For more information, see Authoring jobs with custom connectors.

  6. Customize your ETL job by adding transforms or additional data stores, as described in Editing ETL jobs in AWS Glue Studio.

  7. If using a connector for the data target, configure the data target properties for your ETL job. Provide the connection options and authentication information as instructed by the custom connector provider. For more information, see Authoring jobs with custom connectors.

  8. Customize the job run environment by configuring job properties, as described in Modify the job properties.

  9. Run the job.

Adding connectors to AWS Glue Studio

A connector is a piece of code that facilitates communication between your data store and AWS Glue. You can either subscribe to a connector offered in AWS Marketplace, or you can create your own custom connector.

Subscribing to AWS Marketplace connectors

AWS Glue Studio makes it easy to add connectors from AWS Marketplace.

To add a connector from AWS Marketplace to AWS Glue Studio
  1. In the AWS Glue Studio console, choose Connectors in the console navigation pane.

  2. On the Connectors page, choose Go to AWS Marketplace.

  3. In AWS Marketplace, in Featured products, choose the connector you want to use. You can choose one of the featured connectors, or use search. You can search on the name or type of connector, and you can use options to refine the search results.

    If you want to use one of the featured connectors, choose View product. If you used search to locate a connector, then choose the name of the connector.

  4. On the product page for the connector, use the tabs to view information about the connector. If you decide to purchase this connector, choose Continue to Subscribe.

  5. Provide the payment information, and then choose Continue to Configure.

  6. On the Configure this software page, choose the method of deployment and the version of the connector to use. Then choose Continue to Launch.

  7. On the Launch this software page, you can review the Usage Instructions provided by the connector provider. When you're ready to continue, choose Activate connection in AWS Glue Studio.

    After a small amount of time, the console displays the Create marketplace connection page in AWS Glue Studio.

  8. Create a connection that uses this connector, as described in Creating connections for connectors.

    Alternatively, you can choose Activate connector only to skip creating a connection at this time. You must create a connection at a later date before you can use the connector.

Creating custom connectors

You can also build your own connector and then upload the connector code to AWS Glue Studio.

Custom connectors are integrated into AWS Glue Studio through the AWS Glue Spark runtime API. The AWS Glue Spark runtime allows you to plug in any connector that is compliant with the Spark, Athena, or JDBC interface. It allows you to pass in any connection option that is available with the custom connector.

You can encapsulate all your connection properties with AWS Glue Connections and supply the connection name to your ETL job. Integration with Data Catalog connections allows you to use the same connection properties across multiple calls in a single Spark application or across different applications.

You can specify additional options for the connection. The job script that AWS Glue Studio generates contains a Datasource entry that uses the connection to plug in your connector with the specified connection options. For example:

Datasource = glueContext.create_dynamic_frame.from_options(connection_type = "custom.jdbc", connection_options = {"dbTable":"Account","connectionName":"my-custom-jdbc-connection"}, transformation_ctx = "DataSource0")
To add a custom connector to AWS Glue Studio
  1. Create the code for your custom connector. For more information, see Developing custom connectors.

  2. Add support for AWS Glue features to your connector. Here are some examples of these features and how they are used within the job script generated by AWS Glue Studio:

    • Data type mapping – Your connector can typecast the columns while reading them from the underlying data store. For example, a dataTypeMapping of {"INTEGER":"STRING"} converts all columns of type Integer to columns of type String when parsing the records and constructing the DynamicFrame. This helps users to cast columns to types of their choice.

      DataSource0 = glueContext.create_dynamic_frame.from_options(connection_type = "custom.jdbc", connection_options = {"dataTypeMapping":{"INTEGER":"STRING"}, "connectionName":"test-connection-jdbc"}, transformation_ctx = "DataSource0")
    • Partitioning for parallel reads – AWS Glue allows parallel data reads from the data store by partitioning the data on a column. You must specify the partition column, the lower partition bound, the upper partition bound, and the number of partitions. This feature enables you to make use of data parallelism and multiple Spark executors allocated for the Spark application.

      DataSource0 = glueContext.create_dynamic_frame.from_options(connection_type = "custom.jdbc", connection_options = {"upperBound":"200","numPartitions":"4", "partitionColumn":"id","lowerBound":"0","connectionName":"test-connection-jdbc"}, transformation_ctx = "DataSource0")
    • Use AWS Secrets Manager for storing credentials – The Data Catalog connection can also contain a secretId for a secret stored in AWS Secrets Manager. The AWS secret can securely store authentication and credentials information and provide it to AWS Glue at runtime. Alternatively, you can specify the secretId from the Spark script as follows:

      DataSource = glueContext.create_dynamic_frame.from_options(connection_type = "custom.jdbc", connection_options = {"connectionName":"test-connection-jdbc", "secretId":"my-secret-id"}, transformation_ctx = "DataSource0")
    • Filtering the source data with row predicates and column projections – The AWS Glue Spark runtime also allows users to push down SQL queries to filter data at the source with row predicates and column projections. This allows your ETL job to load filtered data faster from data stores that support push-downs. An example SQL query pushed down to a JDBC data source is: SELECT id, name, department FROM department WHERE id < 200.

      DataSource = glueContext.create_dynamic_frame.from_options(connection_type = "custom.jdbc", connection_options = {"query":"SELECT id, name, department FROM department WHERE id < 200","connectionName":"test-connection-jdbc"}, transformation_ctx = "DataSource0")
    • Job bookmarks – AWS Glue supports incremental loading of data from JDBC sources. AWS Glue keeps track of the last processed record from the data store, and processes new data records in the subsequent ETL job runs. Job bookmarks use the primary key as the default column for the bookmark key, provided that this column increases or decreases sequentially. For more information about job bookmarks, see Job Bookmarks in the AWS Glue Developer Guide.

      DataSource0 = glueContext.create_dynamic_frame.from_options(connection_type = "custom.jdbc", connection_options = {"jobBookmarkKeys":["empno"], "jobBookmarkKeysSortOrder" :"asc", "connectionName":"test-connection-jdbc"}, transformation_ctx = "DataSource0")
  3. Package the custom connector as a JAR file and upload the file to Amazon S3.

  4. Test your custom connector. For more information, see the instructions on GitHub at Glue Custom Connectors: Local Validation Tests Guide.

  5. In the AWS Glue Studio console, choose Connectors in the console navigation pane.

  6. On the Connectors page, choose Create custom connector.

  7. On the Create custom connector page, enter the following information:

    • The path to the location of the custom code JAR file in Amazon S3.

    • A name for the connector that will be used by AWS Glue Studio.

    • Your connector type, which can be one of JDBC, Spark, or Athena.

    • The name of the entry point within your custom code that AWS Glue Studio calls to use the connector.

      • For JDBC connectors, this field should be the class name of your JDBC driver.

      • For Spark connectors, this field should be the fully qualified data source class name, or its alias, that you use when loading the Spark data source with the format operator.

    • (JDBC only) The base URL used by the JDBC connection for the data store.

    • (Optional) A description of the custom connector.

  8. Choose Create connector.

  9. From the Connectors page, create a connection that uses this connector, as described in Creating connections for connectors.

Creating connections for connectors

An AWS Glue connection is a Data Catalog object that stores connection information for a particular data store. Connections store login credentials, URI strings, virtual private cloud (VPC) information, and more. Creating connections in the Data Catalog saves the effort of having to specify all connection details every time you create a job.

Note

Connections created using the AWS Glue console do not appear in AWS Glue Studio.

To create a connection for a connector
  1. In the AWS Glue Studio console, choose Connectors in the console navigation pane.

  2. Choose the connector you want to create a connection for, and then choose Create connection.

  3. On the Create connection page, enter a name for your connection, and optionally a description.

  4. Enter the connection details. Depending on the type of connector you selected, you're prompted to enter additional information:

    • Enter the requested authentication information, such as a user name and password, or choose an AWS secret.

    • For connectors that use JDBC, enter the information required to create the JDBC URL for the data store.

    • If you use a virtual private cloud (VPC), then enter the network information for your VPC.

  5. Choose Create connection.

    You are returned to the Connectors page, and the informational banner indicates the connection that was created. You can now use the connection in your AWS Glue Studio jobs, as described in Create jobs that use a connector.
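You can also create the connection programmatically. The following is a minimal sketch that uses the AWS Glue CreateConnection API through boto3. The connection name, driver class, JAR location, secret, and VPC values are hypothetical placeholders, and the exact ConnectionProperties keys required depend on your connector, so verify them against the CreateConnection API reference before relying on this outline.

import boto3
glue = boto3.client("glue")
# The property keys shown (CONNECTOR_TYPE, CONNECTOR_CLASS_NAME, CONNECTOR_URL,
# SECRET_ID) are illustrative; the keys that apply depend on your connector.
glue.create_connection(
    ConnectionInput={
        "Name": "my-custom-jdbc-connection",          # hypothetical connection name
        "Description": "Connection for a custom JDBC connector",
        "ConnectionType": "CUSTOM",                   # use "MARKETPLACE" for connectors subscribed to in AWS Marketplace
        "ConnectionProperties": {
            "CONNECTOR_TYPE": "Jdbc",
            "CONNECTOR_CLASS_NAME": "com.example.jdbc.Driver",                 # hypothetical driver class
            "CONNECTOR_URL": "s3://amzn-s3-demo-bucket/connectors/my-connector.jar",  # hypothetical JAR path
            "SECRET_ID": "my-secret-id",              # credentials stored in AWS Secrets Manager
        },
        "PhysicalConnectionRequirements": {           # optional; only needed for VPC access
            "SubnetId": "subnet-0123456789abcdef0",
            "SecurityGroupIdList": ["sg-0123456789abcdef0"],
            "AvailabilityZone": "us-east-1a",
        },
    }
)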

Creating a Kafka connection

When creating a Kafka connection, selecting Kafka from the drop-down menu will display additional settings to configure:

  • Kafka cluster details

  • Authentication

  • Encryption

  • Network options

Configure Kafka cluster details

  1. Choose the cluster location. You can choose from an Amazon Managed Streaming for Apache Kafka (Amazon MSK) cluster or a customer-managed Apache Kafka cluster. For more information about Amazon MSK, see Amazon Managed Streaming for Apache Kafka (MSK).

    Note

    Amazon Managed Streaming for Apache Kafka only supports TLS and SASL/SCRAM-SHA-512 authentication methods.

    
  2. Enter the URLs for your Kafka bootstrap servers. You can enter more than one by separating each server with a comma. Include the port number at the end of the URL by appending :<port number>.

    For example: b-1.vpc-test-2.034a88o.kafka-us-east-1.amazonaws.com:9094

Select authentication method


AWS Glue supports the Simple Authentication and Security Layer (SASL) framework for authentication. The SASL framework supports various mechanisms of authentication, and AWS Glue offers both the SCRAM protocol (username and password) and GSSAPI (Kerberos protocol).

When choosing an authentication method from the drop-down menu, the following client authentication methods can be selected:

  • None - No authentication. This is useful if you create a connection for testing purposes.

  • SASL/SCRAM-SHA-512 - Choose this authentication method to specify authentication credentials. There are two options available:

    • Use AWS Secrets Manager (recommended) - if you select this option, you can store your credentials in AWS Secrets Manager and let AWS Glue access the information when needed. Specify the secret that stores the SSL or SASL authentication credentials.

      
    • Provide the user name and password directly.

  • SASL/GSSAPI (Kerberos) - If you select this option, you can select the location of the keytab file and the krb5.conf file, and enter the Kerberos principal name and Kerberos service name. The keytab and krb5.conf files must be stored in an Amazon S3 location. Because MSK does not yet support SASL/GSSAPI, this option is available only for customer-managed Apache Kafka clusters. For more information, see MIT Kerberos Documentation: Keytab.

  • SSL Client Authentication - If you select this option, you can select the location of the Kafka client keystore by browsing Amazon S3. Optionally, you can enter the Kafka client keystore password and Kafka client key password.


Configure encryption settings

  1. If the Kafka connection requires an SSL connection, select the Require SSL connection check box. Note that the connection will fail if it's unable to connect over SSL. SSL encryption can be used with any of the authentication methods (SASL/SCRAM-SHA-512, SASL/GSSAPI, or SSL client authentication) and is optional.

    If the authentication method is set to SSL client authentication, this option will be selected automatically and will be disabled to prevent any changes.

  2. (Optional) Choose the location of the private certificate from the certificate authority (CA). Note that the certificate must be stored in an Amazon S3 location. Choose Browse to choose the file from a connected S3 bucket. The path must be in the form s3://bucket/prefix/filename.pem, and it must end with the file name and the .pem extension.

  3. You can choose to skip validation of the certificate from the certificate authority (CA) by selecting the Skip validation of certificate from certificate authority (CA) check box. If this box is not selected, AWS Glue validates certificates signed with one of three algorithms:

    • SHA256withRSA

    • SHA384withRSA

    • SHA512withRSA



(Optional) Network options

The following optional steps configure the VPC, subnet, and security groups for the connection. If your AWS Glue job needs to run on Amazon EC2 instances in a virtual private cloud (VPC) subnet, you must provide this additional VPC-specific configuration information.

  1. Choose the VPC (virtual private cloud) that contains your data source.

  2. Choose the subnet within your VPC.

  3. Choose one or more security groups to allow access to the data store in your VPC subnet. Security groups are associated with the elastic network interface (ENI) attached to your subnet. You must choose at least one security group with a self-referencing inbound rule for all TCP ports.


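After you create the Kafka connection, a streaming ETL job can read from the cluster by referencing that connection. The following is a minimal sketch of such a read, assuming a Glue streaming job script; the connection name and topic name are hypothetical placeholders, and the options you need depend on your cluster and authentication settings.

from awsglue.context import GlueContext
from pyspark.context import SparkContext
glueContext = GlueContext(SparkContext.getOrCreate())
# The connection name and topic name below are hypothetical placeholders.
kafka_options = {
    "connectionName": "my-kafka-connection",   # the AWS Glue connection created in this section
    "topicName": "example-topic",
    "startingOffsets": "earliest",
    "inferSchema": "true",
    "classification": "json",
}
KafkaSource0 = glueContext.create_data_frame.from_options(
    connection_type="kafka",
    connection_options=kafka_options,
    transformation_ctx="KafkaSource0",
)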

Authoring jobs with custom connectors

You can use connectors and connections for both data source nodes and data target nodes in AWS Glue Studio.

Create jobs that use a connector for the data source

When you create a new job, you can choose a connector for the data source and data targets.

To create a job that uses connectors for the data source or data target
  1. Sign in to the AWS Management Console and open the AWS Glue Studio console at https://console.aws.amazon.com/gluestudio/.

  2. On the Connectors page, in the Your connections resource list, choose the connection you want to use in your job, and then choose Create job.

    Alternatively, on the AWS Glue Studio Jobs page, under Create job, choose Source and target added to the graph. In the Source drop-down list, choose the custom connector that you want to use in your job. You can also choose a connector for Target.

    
  3. Choose Create to open the visual job editor.

  4. Configure the data source node, as described in Configure source properties for nodes that use connectors.

  5. Continue creating your ETL job by adding transforms, additional data stores, and data targets, as described in Editing ETL jobs in AWS Glue Studio.

  6. Customize the job run environment by configuring job properties as described in Modify the job properties.

  7. Save and run the job.

Configure source properties for nodes that use connectors

After you create a job that uses a connector for the data source, the visual job editor displays a job graph with a data source node configured for the connector. You must configure the data source properties for that node.

To configure the properties for a data source node that uses a connector
  1. Choose the connector data source node in the job graph or add a new node and choose the connector for the Node type. Then, on the right side, in the node details panel, choose the Data source properties tab, if it's not already selected.

    
  2. In the Data source properties tab, choose the connection that you want to use for this job.

    Enter the additional information required for each connection type:

    JDBC
    • Data source input type: Choose to provide either a table name or a SQL query as the data source. Depending on your choice, you then need to provide the following additional information:

      • Table name: The name of the table in the data source. If the data source does not use the term table, then supply the name of an appropriate data structure, as indicated by the custom connector usage information (which is available in AWS Marketplace).

      • Filter predicate: A condition clause to use when reading the data source, similar to a WHERE clause, which is used to retrieve a subset of the data.

      • Query code: Enter a SQL query to use to retrieve a specific dataset from the data source. An example of a basic SQL query is:

        SELECT column_list FROM table_name WHERE where_clause
    • Schema: Because AWS Glue Studio is using information stored in the connection to access the data source instead of retrieving metadata information from a Data Catalog table, you must provide the schema metadata for the data source. Choose Add schema to open the schema editor.

      For instructions on how to use the schema editor, see Editing the schema in a custom transform node.

    • Partition column: (Optional) You can choose to partition the data reads by providing values for Partition column, Lower bound, Upper bound, and Number of partitions.

      The lowerBound and upperBound values are used to decide the partition stride, not to filter the rows in the table. All rows in the table are partitioned and returned. For an illustration of how these and the other JDBC options appear in the generated job script, see the sketch after this procedure.

      Note

      Column partitioning adds an extra partitioning condition to the query used to read the data. When using a query instead of a table name, you should validate that the query works with the specified partitioning condition. For example:

      • If your query format is "SELECT col1 FROM table1", then test the query by appending a WHERE clause at the end of the query that uses the partition column.

      • If your query format is "SELECT col1 FROM table1 WHERE col2=val", then test the query by extending the WHERE clause with AND and an expression that uses the partition column.

    • Data type casting: If the data source uses data types that are not available in JDBC, use this section to specify how a data type from the data source should be converted into JDBC data types. You can specify up to 50 different data type conversions. All columns in the data source that use the same data type are converted in the same way.

      For example, if you have three columns in the data source that use the Float data type, and you indicate that the Float data type should be converted to the JDBC String data type, then all three columns that use the Float data type are converted to String data types.

    • Job bookmark keys: Job bookmarks help AWS Glue maintain state information and prevent the reprocessing of old data. Specify one or more columns as bookmark keys. AWS Glue Studio uses bookmark keys to track data that has already been processed during a previous run of the ETL job. Any columns you use for custom bookmark keys must be strictly monotonically increasing or decreasing, but gaps are permitted.

      If you enter multiple bookmark keys, they're combined to form a single compound key. A compound job bookmark key should not contain duplicate columns. If you don't specify bookmark keys, AWS Glue Studio by default uses the primary key as the bookmark key, provided that the primary key is sequentially increasing or decreasing (with no gaps). If the table doesn't have a primary key, but the job bookmark property is enabled, you must provide custom job bookmark keys. Otherwise, the search for primary keys to use as the default will fail and the job run will fail.

    • Job bookmark keys sorting order: Choose whether the key values are sequentially increasing or decreasing.

    Spark
    • Schema: Because AWS Glue Studio is using information stored in the connection to access the data source instead of retrieving metadata information from a Data Catalog table, you must provide the schema metadata for the data source. Choose Add schema to open the schema editor.

      For instructions on how to use the schema editor, see Editing the schema in a custom transform node.

    • Connection options: Enter additional key-value pairs as needed to provide additional connection information or options. For example, you might enter a database name, table name, a user name, and password.

      For example, for OpenSearch, you enter the following key-value pairs, as described in Tutorial: Using the AWS Glue Connector for Elasticsearch :

      • es.net.http.auth.user : username

      • es.net.http.auth.pass : password

      • es.nodes : https://<Elasticsearch endpoint>

      • es.port : 443

      • path: <Elasticsearch resource>

      • es.nodes.wan.only : true

    For an example of the minimum connection options to use, see the sample test script MinimalSparkConnectorTest.scala on GitHub, which shows the connection options you would normally provide in a connection.

    Athena
    • Table name: The name of the table in the data source. If you're using a connector for reading from Athena-CloudWatch logs, you would enter the table name all_log_streams.

    • Athena schema name: Choose the schema in your Athena data source that corresponds to the database that contains the table. If you're using a connector for reading from Athena-CloudWatch logs, you would enter a schema name similar to /aws/glue/name.

    • Schema: Because AWS Glue Studio is using information stored in the connection to access the data source instead of retrieving metadata information from a Data Catalog table, you must provide the schema metadata for the data source. Choose Add schema to open the schema editor.

      For instructions on how to use the schema editor, see Editing the schema in a custom transform node.

    • Additional connection options: Enter additional key-value pairs as needed to provide additional connection information or options.

    For an example, see the README.md file at https://github.com/aws-samples/aws-glue-samples/tree/master/GlueCustomConnectors/development/Athena. In the steps in this document, the sample code shows the minimal required connection options, which are tableName, schemaName, and className. The code example specifies these options as part of the optionsMap variable, but you can specify them for your connection and then use the connection.

  3. (Optional) After providing the required information, you can view the resulting data schema for your data source by choosing the Output schema tab in the node details panel. The schema displayed on this tab is used by any child nodes that you add to the job graph.

  4. (Optional) After configuring the node properties and data source properties, you can preview the dataset from your data source by choosing the Data preview tab in the node details panel. The first time you choose this tab for any node in your job, you are prompted to provide an IAM role to access the data. There is a cost associated with using this feature, and billing starts as soon as you provide an IAM role.
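To illustrate how the JDBC options described in this procedure appear in the generated job script, the following is a minimal sketch of the read call that AWS Glue Studio might produce. The connection name, query, column names, and option values are hypothetical placeholders; the actual script depends on the options you configure in the console.

from awsglue.context import GlueContext
from pyspark.context import SparkContext
glueContext = GlueContext(SparkContext.getOrCreate())
# The connection name, query, and column names below are hypothetical placeholders.
DataSource0 = glueContext.create_dynamic_frame.from_options(
    connection_type="custom.jdbc",   # "marketplace.jdbc" for AWS Marketplace connectors
    connection_options={
        "connectionName": "my-custom-jdbc-connection",
        # Data source input type: a SQL query (alternatively, use "dbTable" with a filter predicate)
        "query": "SELECT id, name, department FROM department WHERE id < 200",
        # Partition column options for parallel reads
        "partitionColumn": "id",
        "lowerBound": "0",
        "upperBound": "200",
        "numPartitions": "4",
        # Data type casting
        "dataTypeMapping": {"FLOAT": "STRING"},
        # Job bookmark keys and sorting order
        "jobBookmarkKeys": ["id"],
        "jobBookmarkKeysSortOrder": "asc",
    },
    transformation_ctx="DataSource0",
)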

Configure target properties for nodes that use connectors

If you use a connector for the data target type, you must configure the properties of the data target node.

To configure the properties for a data target node that uses a connector
  1. Choose the connector data target node in the job graph. Then, on the right side, in the node details panel, choose the Data target properties tab, if it's not already selected.

  2. In the Data target properties tab, choose the connection to use for writing to the target.

    Enter the additional information required for each connection type:

    JDBC
    • Connection: Choose the connection to use with your connector. For information about how to create a connection, see Creating connections for connectors.

    • Table name: The name of the table in the data target. If the data target does not use the term table, then supply the name of an appropriate data structure, as indicated by the custom connector usage information (which is available in AWS Marketplace).

    • Batch size (Optional): Enter the number of rows or records to insert in the target table in a single operation. The default value is 1000 rows.

    Spark
    • Connection: Choose the connection to use with your connector. If you did not create a connection previously, choose Create connection to create one. For information about how to create a connection, see Creating connections for connectors.

    • Connection options: Enter additional key-value pairs as needed to provide additional connection information or options. You might enter a database name, table name, a user name, and password.

      For example, for OpenSearch, you enter the following key-value pairs, as described in Tutorial: Using the AWS Glue Connector for Elasticsearch :

      • es.net.http.auth.user : username

      • es.net.http.auth.pass : password

      • es.nodes : https://<Elasticsearch endpoint>

      • es.port : 443

      • path: <Elasticsearch resource>

      • es.nodes.wan.only : true

    For an example of the minimum connection options to use, see the sample test script MinimalSparkConnectorTest.scala on GitHub, which shows the connection options you would normally provide in a connection.

  3. After providing the required information, you can view the resulting data schema for your data source by choosing the Output schema tab in the node details panel.
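The data target configuration corresponds to a write call in the generated job script. The following is a minimal sketch, assuming a JDBC-style custom connector; the connection name and table name are hypothetical placeholders, and options such as batch size are configured through the console fields described above rather than shown here.

from awsglue.context import GlueContext
from pyspark.context import SparkContext
glueContext = GlueContext(SparkContext.getOrCreate())
# "DataSource0" stands in for a DynamicFrame produced earlier in the job, and the
# connection and table names are hypothetical placeholders.
DataSink0 = glueContext.write_dynamic_frame.from_options(
    frame=DataSource0,
    connection_type="custom.jdbc",   # "marketplace.jdbc" for AWS Marketplace connectors
    connection_options={
        "connectionName": "my-custom-jdbc-connection",
        "dbTable": "Account",        # table (or equivalent structure) in the data target
    },
    transformation_ctx="DataSink0",
)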

Managing connectors and connections

You use the Connectors page in AWS Glue Studio to manage your connectors and connections.

Viewing connector and connection details

You can view summary information about your connectors and connections in the Your connectors and Your connections resource tables on the Connectors page. To view detailed information, perform the following steps.

To view connector or connection details
  1. In the AWS Glue Studio console, choose Connectors in the console navigation pane.

  2. Choose the connector or connection that you want to view detailed information for.

  3. Choose Actions, and then choose View details to open the detail page for that connector or connection.

  4. On the detail page, you can choose to Edit or Delete the connector or connection.

    • For connectors, you can choose Create connection to create a new connection that uses the connector.

    • For connections, you can choose Create job to create a job that uses the connection.

Testing connections

You can test an existing connection in the Connections panel.

To test a connection:
  1. In the AWS Glue Studio console, choose Connections in the console navigation pane.

  2. Choose the connection that you want to test.

  3. Choose Actions and then choose Test connection.

  4. In the Test connection dialog box, choose a role or choose Create IAM role to create a new role through the IAM console. The role must have permissions on the data store.

  5. Choose Test connection to start the test.

Editing connectors and connections

You use the Connectors page to change the information stored in your connectors and connections.

To modify a connector or connection
  1. In the AWS Glue Studio console, choose Connectors in the console navigation pane.

  2. Choose the connector or connection that you want to change.

  3. Choose Actions, and then choose Edit.

    You can also choose View details and on the connector or connection detail page, you can choose Edit.

  4. On the Edit connector or Edit connection page, update the information, and then choose Save.

Deleting connectors and connections

You use the Connectors page to delete connectors and connections. If you delete a connector, any connections that were created for that connector are also deleted.

To remove connectors from AWS Glue Studio
  1. In the AWS Glue Studio console, choose Connectors in the console navigation pane.

  2. Choose the connector or connection you want to delete.

  3. Choose Actions, and then choose Delete.

    You can also choose View details, and on the connector or connection detail page, you can choose Delete.

  4. Verify that you want to remove the connector or connection by entering Delete, and then choose Delete.

    When deleting a connector, any connections that were created for that connector are also deleted.

Any jobs that use a deleted connection will no longer work. You can either edit the jobs to use a different data store, or remove the jobs. For information about how to delete a job, see Delete jobs.

If you delete a connector, this doesn't cancel the subscription for the connector in AWS Marketplace. To remove a subscription for a deleted connector, follow the instructions in Cancel a subscription for a connector .

Cancel a subscription for a connector

After you delete the connections and connector from AWS Glue Studio, you can cancel your subscription in AWS Marketplace if you no longer need the connector.

Note

If you cancel your subscription to a connector, this does not remove the connector or connection from your account. Any jobs that use the connector and related connections will no longer be able to use the connector and will fail.

Before you unsubscribe or re-subscribe to a connector from AWS Marketplace, you should delete existing connections and connectors associated with that AWS Marketplace product.

To unsubscribe from a connector in AWS Marketplace
  1. Sign in to the AWS Marketplace console at https://console.aws.amazon.com/marketplace.

  2. Choose Manage subscriptions.

  3. On the Manage subscriptions page, choose Manage next to the connector subscription that you want to cancel.

  4. Choose Actions and then choose Cancel subscription.

  5. Select the check box to acknowledge that running instances are charged to your account, and then choose Yes, cancel subscription.

Developing custom connectors

You can write the code that reads data from or writes data to your data store and formats the data for use with AWS Glue Studio jobs. You can create connectors for Spark, Athena, and JDBC data stores. Sample code posted on GitHub provides an overview of the basic interfaces you need to implement.

You need a local development environment for creating your connector code. You can use any IDE or even just a command line editor to write your connector.

Developing Spark connectors

You can create a Spark connector with Spark DataSource API V2 (Spark 2.4) to read data.

To create a custom Spark connector

Follow the steps in the AWS Glue GitHub sample library for developing Spark connectors, which is located at https://github.com/aws-samples/aws-glue-samples/tree/master/GlueCustomConnectors/development/Spark/README.md.

Developing Athena connectors

You can create an Athena connector to be used by AWS Glue and AWS Glue Studio to query a custom data source.

To create a custom Athena connector

Follow the steps in the AWS Glue GitHub sample library for developing Athena connectors, which is located at https://github.com/aws-samples/aws-glue-samples/tree/master/GlueCustomConnectors/development/Athena.

Developing JDBC connectors

You can create a connector that uses JDBC to access your data stores.

To create a custom JDBC connector
  1. Install the AWS Glue Spark runtime libraries in your local development environment. Refer to the instructions in the AWS Glue GitHub sample library at https://github.com/aws-samples/aws-glue-samples/tree/master/GlueCustomConnectors/development/GlueSparkRuntime/README.md.

  2. Implement the JDBC driver that is responsible for retrieving the data from the data source. Refer to the Java Documentation for Java SE 8.

    Create an entry point within your code that AWS Glue Studio uses to locate your connector. The Class name field should be the fully qualified class name of your JDBC driver.

  3. Use the GlueContext API to read data with the connector. Users can add more input options in the AWS Glue Studio console to configure the connection to the data source, if necessary. For a code example that shows how to read from and write to a JDBC database with a custom JDBC connector, see Custom and AWS Marketplace connectionType values.
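For example, the following is a minimal sketch of reading through a custom JDBC connector from a job script. The option names follow the custom connection options referenced above; the driver class, JDBC URL, secret, and table name are hypothetical placeholders. Alternatively, you can supply a connectionName so that these values come from a Data Catalog connection, as shown earlier in this guide.

from awsglue.context import GlueContext
from pyspark.context import SparkContext
glueContext = GlueContext(SparkContext.getOrCreate())
# The driver class, JDBC URL, secret, and table name below are hypothetical placeholders.
DataSource0 = glueContext.create_dynamic_frame.from_options(
    connection_type="custom.jdbc",
    connection_options={
        "className": "com.example.jdbc.Driver",       # the entry point (driver class) from step 2
        "url": "jdbc:example://dbhost:5432/sampledb",  # base JDBC URL for the data store
        "secretId": "my-secret-id",                    # credentials stored in AWS Secrets Manager
        "dbTable": "Account",
    },
    transformation_ctx="DataSource0",
)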

Examples of using custom connectors with AWS Glue Studio

For examples of using custom connectors, you can refer to blog posts published by AWS, such as those linked from the connector product pages in AWS Marketplace.

Developing AWS Glue connectors for AWS Marketplace

As an AWS partner, you can create custom connectors and upload them to AWS Marketplace to sell to AWS Glue customers.

The process for developing the connector code is the same as for custom connectors, but the process of uploading and verifying the connector code is more detailed. Refer to the instructions in Creating Connectors for AWS Marketplace on the GitHub website.

Restrictions for using connectors and connections in AWS Glue Studio

When you're using custom connectors or connectors from AWS Marketplace, take note of the following restrictions:

  • The testConnection API isn't supported with connections created for custom connectors.

  • Data Catalog connection password encryption isn't supported with custom connectors.

  • You can't use job bookmarks if you specify a filter predicate for a data source node that uses a JDBC connector.