JDBC connections

Certain, typically relational, database types support connecting through the JDBC standard. For more information about JDBC, see the Java JDBC API documentation. AWS Glue natively supports connecting to certain databases through their JDBC connectors; the JDBC libraries are provided in AWS Glue Spark jobs. When connecting to these database types using AWS Glue libraries, you have access to a standard set of options.

The JDBC connectionType values include the following:

  • "connectionType": "sqlserver": Designates a connection to a Microsoft SQL Server database.

  • "connectionType": "mysql": Designates a connection to a MySQL database.

  • "connectionType": "oracle": Designates a connection to an Oracle database.

  • "connectionType": "postgresql": Designates a connection to a PostgreSQL database.

  • "connectionType": "redshift": Designates a connection to an Amazon Redshift database. For more information, see Redshift connections.

The following table lists the JDBC driver versions that AWS Glue supports.

Product              | Glue 4.0                 | Glue 3.0                    | Glue 0.9, 1.0, 2.0
---------------------|--------------------------|-----------------------------|----------------------------
Microsoft SQL Server | 9.4.0                    | 7.x                         | 6.x
MySQL                | 8.0.23                   | 8.0.23                      | 5.1
Oracle Database      | 21.7                     | 21.1                        | 11.2
PostgreSQL           | 42.3.6                   | 42.2.18                     | 42.1.x
MongoDB              | 4.7.2                    | 4.0.0                       | 2.0.0
Amazon Redshift *    | redshift-jdbc42-2.1.0.16 | redshift-jdbc41-1.2.12.1017 | redshift-jdbc41-1.2.12.1017

* For the Amazon Redshift connection type, all other option name/value pairs that are included in the connection options for a JDBC connection, including formatting options, are passed directly to the underlying SparkSQL DataSource. In AWS Glue 4.0 and later versions, the AWS Glue native connector for Amazon Redshift in Spark jobs uses the Amazon Redshift integration for Apache Spark. For more information, see Amazon Redshift integration for Apache Spark. For previous versions, see Amazon Redshift data source for Spark.

To configure your Amazon VPC to connect to Amazon RDS data stores using JDBC, refer to Setting up Amazon VPC for JDBC connections to Amazon RDS data stores from AWS Glue.

Note

AWS Glue jobs are only associated with one subnet during a run. This may impact your ability to connect to multiple data sources through the same job. This behavior is not limited to JDBC sources.

JDBC connection option reference

If you already have a JDBC AWS Glue connection defined, you can reuse its configuration properties, such as url, user, and password, so that you don't have to specify them in the code as connection options. This feature is available in AWS Glue 3.0 and later versions. To do so, use the following connection properties (a short sketch follows the list):

  • "useConnectionProperties": Set it to "true" to indicate you want to use the configuration from a connection.

  • "connectionName": Enter the connection name to retrieve the configuration from. The connection must be defined in the same Region as the job.
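
For example, a minimal sketch (Python) that reuses an existing AWS Glue connection might look like the following; the connection name my-jdbc-connection and the table name are placeholders for resources defined in your account:

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Reuse url/user/password stored in an existing Glue connection
# instead of passing them in code (AWS Glue 3.0 and later).
connection_options = {
    "useConnectionProperties": "true",
    "connectionName": "my-jdbc-connection",  # placeholder connection name
    "dbtable": "test",                       # placeholder table name
}

dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="mysql",
    connection_options=connection_options,
)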

Use these connection options with JDBC connections:

  • "url": (Required) The JDBC URL for the database.

  • "dbtable": (Required) The database table to read from. For JDBC data stores that support schemas within a database, specify schema.table-name. If a schema is not provided, then the default "public" schema is used.

  • "user": (Required) The user name to use when connecting.

  • "password": (Required) The password to use when connecting.

  • (Optional) The following options allow you to supply a custom JDBC driver. Use these options if you must use a driver that AWS Glue does not natively support.

    ETL jobs can use different JDBC driver versions for the data source and target, even if the source and target are the same database product. This allows you to migrate data between source and target databases with different versions. To use these options, you must first upload the JAR file of the JDBC driver to Amazon S3.

    • "customJdbcDriverS3Path": The Amazon S3 path of the custom JDBC driver.

    • "customJdbcDriverClassName": The class name of the JDBC driver.

  • "bulkSize": (Optional) Used to configure parallel inserts for speeding up bulk loads into JDBC targets. Specify an integer value for the degree of parallelism to use when writing or inserting data. This option is helpful for improving the performance of writes into databases such as Amazon Aurora.

  • "hashfield": (Optional) A string used to specify the name of a column in the JDBC table that divides the data into partitions when reading from JDBC tables in parallel. Provide "hashfield" OR "hashexpression". For more information, see Reading from JDBC tables in parallel.

  • "hashexpression": (Optional) A SQL select clause that returns a whole number. Used to divide the data in a JDBC table into partitions when reading from JDBC tables in parallel. Provide "hashfield" OR "hashexpression". For more information, see Reading from JDBC tables in parallel.

  • "hashpartitions": (Optional) A positive integer. Used to specify the number of parallel reads of the JDBC table when reading from JDBC tables in parallel. Default: 7. For more information, see Reading from JDBC tables in parallel. A short sketch using these partitioning options follows this list.

  • "sampleQuery": (Optional) A custom SQL query statement. Used to specify a subset of information in a table to retrieve a sample of the table contents. When configured without regard to your data, it can be less efficient than DynamicFrame methods, causing timeouts or out of memory errors. For more information, see Use sampleQuery.

  • "enablePartitioningForSampleQuery": (Optional) A boolean. Default: false. Used to enable reading from JDBC tables in parallel when specifying sampleQuery. If set to true, sampleQuery must end with "where" or "and" for AWS Glue to append partitioning conditions. For more information, see Use sampleQuery.

  • "sampleSize": (Optional) A positive integer. Limits the number of rows returned by the sample query. Works only when enablePartitioningForSampleQuery is true. If partitioning is not enabled, you should instead directly add "limit x" in the sampleQuery to limit the size. For more information, see Use sampleQuery.
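
For example, a minimal sketch (Python) of a parallel read using "hashfield" and "hashpartitions" might look like the following; the endpoint, credentials, table name, and hash column are placeholders:

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Parallel read: AWS Glue splits the read into "hashpartitions" tasks,
# partitioning rows by the values in the "hashfield" column.
connection_options = {
    "url": "jdbc:mysql://<jdbc-host-name>:3306/db",  # placeholder endpoint
    "dbtable": "test",
    "user": "admin",
    "password": "pwd",
    "hashfield": "id",       # hypothetical numeric or primary-key column
    "hashpartitions": "10",  # number of parallel reads
}

dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="mysql",
    connection_options=connection_options,
)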

Use sampleQuery

This section explains how to use sampleQuery, sampleSize and enablePartitioningForSampleQuery.

sampleQuery can be an efficient way to sample a few rows of your dataset. By default, the query is run by a single executor. When configured without regard to your data, it can be less efficient than DynamicFrame methods, causing timeouts or out of memory errors. Running SQL on the underlying database as part of your ETL pipeline is generally only needed for performance purposes. If you are trying to preview a few rows of your dataset, consider using show. If you are trying to transform your dataset using SQL, consider using toDF to define a SparkSQL transform against your data in a DataFrame form.
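
For illustration, a minimal sketch (Python) of these DynamicFrame-based alternatives might look like the following; the connection values are placeholders and the view name and query are hypothetical:

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

connection_options = {
    "url": "jdbc:mysql://<jdbc-host-name>:3306/db",  # placeholder endpoint
    "dbtable": "test",
    "user": "admin",
    "password": "pwd",
}

dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="mysql",
    connection_options=connection_options,
)

# Preview a few rows without pushing SQL down to the database.
dyf.show(5)

# Transform with SparkSQL against a DataFrame instead of running SQL
# on the underlying database.
df = dyf.toDF()
df.createOrReplaceTempView("test")
sample = glueContext.spark_session.sql("SELECT name FROM test WHERE age > 0")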

While your query may manipulate a variety of tables, dbtable remains required.

Using sampleQuery to retrieve a sample of your table

When using the default sampleQuery behavior to retrieve a sample of your data, AWS Glue does not expect substantial throughput, so it runs your query on a single executor. To limit the data you retrieve and avoid performance problems, we suggest that you provide SQL with a LIMIT clause.

Example Use sampleQuery without partitioning

The following code example shows how to use sampleQuery without partitioning.

// A full SQL query statement.
val query = s"select name from $tableName where age > 0 limit 1"
val connectionOptions = JsonOptions(Map(
  "url" -> url,
  "dbtable" -> tableName,
  "user" -> user,
  "password" -> password,
  "sampleQuery" -> query
))
val dyf = glueContext.getSource("mysql", connectionOptions)
  .getDynamicFrame()

Using sampleQuery against larger datasets

If you're reading a large dataset, you might need to enable JDBC partitioning to query a table in parallel. For more information, see Reading from JDBC tables in parallel. To use sampleQuery with JDBC partitioning, set enablePartitioningForSampleQuery to true. Enabling this feature requires you to make some changes to your sampleQuery.

When using JDBC partitioning with sampleQuery, your query must end with "where" or "and" for AWS Glue to append partitioning conditions.

If you would like to limit the results of your sampleQuery when reading from JDBC tables in parallel, set the "sampleSize" parameter rather than specifying a LIMIT clause.

Example Use sampleQuery with JDBC partitioning

The following code example shows how to use sampleQuery with JDBC partitioning.

// Note that the query should end with "where" or "and" when used with JDBC partitioning.
val query = s"select name from $tableName where age > 0 and"
// Enable JDBC partitioning by setting hashfield.
// To use sampleQuery with partitioning, set enablePartitioningForSampleQuery.
// Use sampleSize to limit the size of the returned data.
val connectionOptions = JsonOptions(Map(
  "url" -> url,
  "dbtable" -> tableName,
  "user" -> user,
  "password" -> password,
  "hashfield" -> primaryKey,
  "sampleQuery" -> query,
  "enablePartitioningForSampleQuery" -> true,
  "sampleSize" -> "1"
))
val dyf = glueContext.getSource("mysql", connectionOptions)
  .getDynamicFrame()

Notes and Restrictions:

Sample queries cannot be used together with job bookmarks. The bookmark state is ignored when both are configured.

Use custom JDBC driver

The following code examples show how to read from and write to JDBC databases with custom JDBC drivers. They demonstrate reading from one version of a database product, and writing to a later version of the same product.

Python
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext, SparkConf
from awsglue.context import GlueContext
from awsglue.job import Job
import time
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# Construct JDBC connection options
connection_mysql5_options = {
    "url": "jdbc:mysql://<jdbc-host-name>:3306/db",
    "dbtable": "test",
    "user": "admin",
    "password": "pwd"}

connection_mysql8_options = {
    "url": "jdbc:mysql://<jdbc-host-name>:3306/db",
    "dbtable": "test",
    "user": "admin",
    "password": "pwd",
    "customJdbcDriverS3Path": "s3://DOC-EXAMPLE-BUCKET/mysql-connector-java-8.0.17.jar",
    "customJdbcDriverClassName": "com.mysql.cj.jdbc.Driver"}

connection_oracle11_options = {
    "url": "jdbc:oracle:thin:@//<jdbc-host-name>:1521/ORCL",
    "dbtable": "test",
    "user": "admin",
    "password": "pwd"}

connection_oracle18_options = {
    "url": "jdbc:oracle:thin:@//<jdbc-host-name>:1521/ORCL",
    "dbtable": "test",
    "user": "admin",
    "password": "pwd",
    "customJdbcDriverS3Path": "s3://DOC-EXAMPLE-BUCKET/ojdbc10.jar",
    "customJdbcDriverClassName": "oracle.jdbc.OracleDriver"}

# Read from JDBC databases with custom driver
df_mysql8 = glueContext.create_dynamic_frame.from_options(
    connection_type="mysql",
    connection_options=connection_mysql8_options)

# Read DynamicFrame from MySQL 5 and write to MySQL 8
df_mysql5 = glueContext.create_dynamic_frame.from_options(
    connection_type="mysql",
    connection_options=connection_mysql5_options)
glueContext.write_from_options(
    frame_or_dfc=df_mysql5,
    connection_type="mysql",
    connection_options=connection_mysql8_options)

# Read DynamicFrame from Oracle 11 and write to Oracle 18
df_oracle11 = glueContext.create_dynamic_frame.from_options(
    connection_type="oracle",
    connection_options=connection_oracle11_options)
glueContext.write_from_options(
    frame_or_dfc=df_oracle11,
    connection_type="oracle",
    connection_options=connection_oracle18_options)
Scala
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.MappingSpec
import com.amazonaws.services.glue.errors.CallSite
import com.amazonaws.services.glue.util.GlueArgParser
import com.amazonaws.services.glue.util.Job
import com.amazonaws.services.glue.util.JsonOptions
import com.amazonaws.services.glue.DynamicFrame
import org.apache.spark.SparkContext
import scala.collection.JavaConverters._

object GlueApp {
  val MYSQL_5_URI: String = "jdbc:mysql://<jdbc-host-name>:3306/db"
  val MYSQL_8_URI: String = "jdbc:mysql://<jdbc-host-name>:3306/db"
  val ORACLE_11_URI: String = "jdbc:oracle:thin:@//<jdbc-host-name>:1521/ORCL"
  val ORACLE_18_URI: String = "jdbc:oracle:thin:@//<jdbc-host-name>:1521/ORCL"

  // Construct JDBC connection options
  lazy val mysql5JsonOption = jsonOptions(MYSQL_5_URI)
  lazy val mysql8JsonOption = customJDBCDriverJsonOptions(MYSQL_8_URI, "s3://DOC-EXAMPLE-BUCKET/mysql-connector-java-8.0.17.jar", "com.mysql.cj.jdbc.Driver")
  lazy val oracle11JsonOption = jsonOptions(ORACLE_11_URI)
  lazy val oracle18JsonOption = customJDBCDriverJsonOptions(ORACLE_18_URI, "s3://DOC-EXAMPLE-BUCKET/ojdbc10.jar", "oracle.jdbc.OracleDriver")

  def main(sysArgs: Array[String]): Unit = {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)
    val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME").toArray)
    Job.init(args("JOB_NAME"), glueContext, args.asJava)

    // Read from JDBC database with custom driver
    val df_mysql8: DynamicFrame = glueContext.getSource("mysql", mysql8JsonOption).getDynamicFrame()

    // Read DynamicFrame from MySQL 5 and write to MySQL 8
    val df_mysql5: DynamicFrame = glueContext.getSource("mysql", mysql5JsonOption).getDynamicFrame()
    glueContext.getSink("mysql", mysql8JsonOption).writeDynamicFrame(df_mysql5)

    // Read DynamicFrame from Oracle 11 and write to Oracle 18
    val df_oracle11: DynamicFrame = glueContext.getSource("oracle", oracle11JsonOption).getDynamicFrame()
    glueContext.getSink("oracle", oracle18JsonOption).writeDynamicFrame(df_oracle11)

    Job.commit()
  }

  private def jsonOptions(url: String): JsonOptions = {
    new JsonOptions(
      s"""{"url": "${url}",
         |"dbtable":"test",
         |"user": "admin",
         |"password": "pwd"}""".stripMargin)
  }

  private def customJDBCDriverJsonOptions(url: String, customJdbcDriverS3Path: String, customJdbcDriverClassName: String): JsonOptions = {
    new JsonOptions(
      s"""{"url": "${url}",
         |"dbtable":"test",
         |"user": "admin",
         |"password": "pwd",
         |"customJdbcDriverS3Path": "${customJdbcDriverS3Path}",
         |"customJdbcDriverClassName" : "${customJdbcDriverClassName}"}""".stripMargin)
  }
}