
Creating custom connectors

You can also build your own connector and then upload the connector code to AWS Glue Studio.

Custom connectors are integrated into AWS Glue Studio through the AWS Glue Spark runtime API. The AWS Glue Spark runtime lets you plug in any connector that is compliant with the Spark, Athena, or JDBC interface, and pass in any connection option that is available for the custom connector.

You can encapsulate all your connection properties with AWS Glue Connections and supply the connection name to your ETL job. Integration with Data Catalog connections allows you to use the same connection properties across multiple calls in a single Spark application or across different applications.

You can specify additional options for the connection. The job script that AWS Glue Studio generates contains a Datasource entry that uses the connection to plug in your connector with the specified connection options. For example:

Datasource = glueContext.create_dynamic_frame.from_options(connection_type = "custom.jdbc", connection_options = {"dbTable":"Account","connectionName":"my-custom-jdbc-connection"}, transformation_ctx = "DataSource0")
To add a custom connector to AWS Glue Studio
  1. Create the code for your custom connector. For more information, see Developing custom connectors.

  2. Add support for AWS Glue features to your connector. Here are some examples of these features and how they are used within the job script generated by AWS Glue Studio:

    • Data type mapping – Your connector can typecast the columns while reading them from the underlying data store. For example, a dataTypeMapping of {"INTEGER":"STRING"} converts all columns of type Integer to columns of type String when parsing the records and constructing the DynamicFrame. This lets users cast columns to the types of their choice.

      DataSource0 = glueContext.create_dynamic_frame.from_options(connection_type = "custom.jdbc", connection_options = {"dataTypeMapping":{"INTEGER":"STRING"},"connectionName":"test-connection-jdbc"}, transformation_ctx = "DataSource0")
    • Partitioning for parallel reads – AWS Glue allows parallel data reads from the data store by partitioning the data on a column. You must specify the partition column, the lower partition bound, the upper partition bound, and the number of partitions. This feature enables you to make use of data parallelism and multiple Spark executors allocated for the Spark application.

      DataSource0 = glueContext.create_dynamic_frame.from_options(connection_type = "custom.jdbc", connection_options = {"upperBound":"200","numPartitions":"4", "partitionColumn":"id","lowerBound":"0","connectionName":"test-connection-jdbc"}, transformation_ctx = "DataSource0")
    • Use AWS Secrets Manager for storing credentials – The Data Catalog connection can also contain a secretId for a secret stored in AWS Secrets Manager. The secret can securely store authentication and credential information and provide it to AWS Glue at runtime. Alternatively, you can specify the secretId from the Spark script as follows:

      DataSource = glueContext.create_dynamic_frame.from_options(connection_type = "custom.jdbc", connection_options = {"connectionName":"test-connection-jdbc", "secretId":"my-secret-id"}, transformation_ctx = "DataSource0")
    • Filtering the source data with row predicates and column projections – The AWS Glue Spark runtime also allows users to push down SQL queries to filter data at the source with row predicates and column projections. This allows your ETL job to load filtered data faster from data stores that support push-downs. An example SQL query pushed down to a JDBC data source is: SELECT id, name, department FROM department WHERE id < 200.

      DataSource = glueContext.create_dynamic_frame.from_options(connection_type = "custom.jdbc", connection_options = {"query":"SELECT id, name, department FROM department WHERE id < 200","connectionName":"test-connection-jdbc"}, transformation_ctx = "DataSource0")
    • Job bookmarks – AWS Glue supports incremental loading of data from JDBC sources. AWS Glue keeps track of the last processed record from the data store, and processes new data records in the subsequent ETL job runs. Job bookmarks use the primary key as the default column for the bookmark key, provided that this column increases or decreases sequentially. For more information about job bookmarks, see Job Bookmarks in the AWS Glue Developer Guide.

      DataSource0 = glueContext.create_dynamic_frame.from_options(connection_type = "custom.jdbc", connection_options = {"jobBookmarkKeys":["empno"], "jobBookmarkKeysSortOrder":"asc", "connectionName":"test-connection-jdbc"}, transformation_ctx = "DataSource0")
  3. Package the custom connector as a JAR file and upload the file to Amazon S3.

  4. Test your custom connector. For more information, see the instructions on GitHub at Glue Custom Connectors: Local Validation Tests Guide.

  5. In the AWS Glue Studio console, choose Connectors in the console navigation pane.

  6. On the Connectors page, choose Create custom connector.

  7. On the Create custom connector page, enter the following information:

    • The path to the location of the custom code JAR file in Amazon S3.

    • A name for the connector that will be used by AWS Glue Studio.

    • Your connector type, which can be one of JDBC, Spark, or Athena.

    • The name of the entry point within your custom code that AWS Glue Studio calls to use the connector.

      • For JDBC connectors, this field should be the class name of your JDBC driver.

      • For Spark connectors, this field should be the fully qualified data source class name, or its alias, that you use when loading the Spark data source with the format operator.

    • (JDBC only) The base URL used by the JDBC connection for the data store.

    • (Optional) A description of the custom connector.

  8. Choose Create connector.

  9. From the Connectors page, create a connection that uses this connector, as described in Creating connections for connectors.
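To make the dataTypeMapping feature from step 2 concrete, here is a minimal local sketch of what a connector conceptually does when it applies a mapping such as {"INTEGER":"STRING"} while building records. The helper name and the in-memory record format are hypothetical; a real connector performs this cast while parsing rows into a DynamicFrame, not on Python dicts.

```python
def apply_type_mapping(rows, column_types, mapping):
    """Cast column values whose declared source type appears in `mapping`.

    Illustrative only -- not part of the AWS Glue API. `rows` is a list of
    dicts, `column_types` maps column name to its source type name, and
    `mapping` is the dataTypeMapping, e.g. {"INTEGER": "STRING"}.
    """
    casters = {"STRING": str, "INTEGER": int}
    converted_rows = []
    for row in rows:
        converted = {}
        for col, value in row.items():
            source_type = column_types[col]
            # Fall back to the source type when no mapping entry exists.
            target_type = mapping.get(source_type, source_type)
            converted[col] = casters[target_type](value)
        converted_rows.append(converted)
    return converted_rows

rows = [{"id": 1, "name": "Ana"}, {"id": 2, "name": "Bo"}]
column_types = {"id": "INTEGER", "name": "STRING"}
result = apply_type_mapping(rows, column_types, {"INTEGER": "STRING"})
print(result)  # id values are cast from int to str; name is unchanged
```

The same idea scales to any source-to-target type pair the connector chooses to support.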
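The partitioned-read options from step 2 (partitionColumn, lowerBound, upperBound, numPartitions) can also be illustrated locally. The sketch below generates one WHERE predicate per partition, mirroring the way Spark's JDBC partitioning splits a numeric column range across parallel reads; the helper is hypothetical and not part of the AWS Glue API.

```python
def partition_predicates(column, lower, upper, num_partitions):
    """Split the range [lower, upper) into one SQL predicate per partition.

    Illustrative only -- shows how the partitioning options translate into
    parallel per-executor queries. The first partition also picks up NULLs
    and the outer partitions are unbounded so no rows are dropped.
    """
    stride = (upper - lower) // num_partitions
    predicates = []
    for i in range(num_partitions):
        start = lower + i * stride
        if i == 0:
            predicates.append(f"{column} < {start + stride} OR {column} IS NULL")
        elif i == num_partitions - 1:
            predicates.append(f"{column} >= {start}")
        else:
            predicates.append(f"{column} >= {start} AND {column} < {start + stride}")
    return predicates

# Using the option values from the example in step 2:
# lowerBound=0, upperBound=200, numPartitions=4, partitionColumn="id"
for predicate in partition_predicates("id", 0, 200, 4):
    print(predicate)
```

Each predicate becomes the filter for one parallel read, so the four Spark executors in the example each scan a quarter of the id range.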

Adding connectors to AWS Glue Studio

A connector is a piece of code that facilitates communication between your data store and AWS Glue. You can either subscribe to a connector offered in AWS Marketplace, or you can create your own custom connector.

Subscribing to AWS Marketplace connectors

AWS Glue Studio makes it easy to add connectors from AWS Marketplace.

To add a connector from AWS Marketplace to AWS Glue Studio
  1. In the AWS Glue Studio console, choose Connectors in the console navigation pane.

  2. On the Connectors page, choose Go to AWS Marketplace.

  3. In AWS Marketplace, in Featured products, choose the connector you want to use. You can choose one of the featured connectors, or use search. You can search on the name or type of connector, and you can use options to refine the search results.

    If you want to use one of the featured connectors, choose View product. If you used search to locate a connector, then choose the name of the connector.

  4. On the product page for the connector, use the tabs to view information about the connector. If you decide to purchase this connector, choose Continue to Subscribe.

  5. Provide the payment information, and then choose Continue to Configure.

  6. On the Configure this software page, choose the method of deployment and the version of the connector to use. Then choose Continue to Launch.

  7. On the Launch this software page, you can review the Usage Instructions provided by the connector provider. When you're ready to continue, choose Activate connection in AWS Glue Studio.

    After a short time, the console displays the Create marketplace connection page in AWS Glue Studio.

  8. Create a connection that uses this connector, as described in Creating connections for connectors.

    Alternatively, you can choose Activate connector only to skip creating a connection at this time. You must create a connection at a later date before you can use the connector.