JDBC connections
Certain, typically relational, database types support connecting through the JDBC standard. For more information about JDBC, see the Java JDBC API documentation.
The JDBC connectionType values include the following:

- "connectionType": "sqlserver": Designates a connection to a Microsoft SQL Server database.
- "connectionType": "mysql": Designates a connection to a MySQL database.
- "connectionType": "oracle": Designates a connection to an Oracle database.
- "connectionType": "postgresql": Designates a connection to a PostgreSQL database.
- "connectionType": "redshift": Designates a connection to an Amazon Redshift database. For more information, see Redshift connections.
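The connectionType value selects the source when you create a DynamicFrame. The following is a minimal sketch of a PostgreSQL read, following the getSource pattern used later on this page; the URL, table, user, and password values are placeholders.

```scala
// Minimal sketch: read a table over JDBC using the "postgresql" connection type.
// The url, dbtable, user, and password values are placeholders.
val connectionOptions = JsonOptions(Map(
  "url" -> "jdbc:postgresql://dbhost:5432/mydb",
  "dbtable" -> "public.mytable",
  "user" -> user,
  "password" -> password
))
val dyf = glueContext.getSource("postgresql", connectionOptions)
  .getDynamicFrame()
```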
The following table lists the JDBC driver versions that AWS Glue supports.
Product | JDBC driver versions for Glue 4.0 | JDBC driver versions for Glue 3.0 | JDBC driver versions for Glue 0.9, 1.0, 2.0
---|---|---|---
Microsoft SQL Server | 9.4.0 | 7.x | 6.x |
MySQL | 8.0.23 | 8.0.23 | 5.1 |
Oracle Database | 21.7 | 21.1 | 11.2 |
PostgreSQL | 42.3.6 | 42.2.18 | 42.1.x |
MongoDB | 4.7.2 | 4.0.0 | 2.0.0 |
Amazon Redshift * | redshift-jdbc42-2.1.0.16 | redshift-jdbc41-1.2.12.1017 | redshift-jdbc41-1.2.12.1017 |
* For the Amazon Redshift connection type, all other option name/value pairs that are included in connection options for a JDBC connection, including formatting options, are passed directly to the underlying SparkSQL DataSource. For Spark jobs in AWS Glue 4.0 and later versions, the AWS Glue native connector for Amazon Redshift uses the Amazon Redshift integration for Apache Spark. For more information, see Amazon Redshift integration for Apache Spark. In previous versions, see Amazon Redshift data source for Spark.
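As a hedged illustration of this pass-through behavior, the following sketch reads from Amazon Redshift. The redshiftTmpDir and aws_iam_role values are placeholders, and any option pairs the connector does not recognize are forwarded to the underlying data source.

```scala
// Sketch of a Redshift read; all values shown are placeholders.
// Unrecognized option pairs pass through to the SparkSQL DataSource.
val connectionOptions = JsonOptions(Map(
  "url" -> "jdbc:redshift://cluster-host:5439/dev",
  "dbtable" -> "public.sales",
  "user" -> user,
  "password" -> password,
  "redshiftTmpDir" -> "s3://my-bucket/temp/", // S3 staging directory for Redshift
  "aws_iam_role" -> roleArn                   // IAM role ARN used for UNLOAD/COPY
))
val dyf = glueContext.getSource("redshift", connectionOptions)
  .getDynamicFrame()
```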
To configure your Amazon VPC to connect to Amazon RDS data stores using JDBC, refer to Setting up Amazon VPC for JDBC connections to Amazon RDS data stores from AWS Glue.
Note
AWS Glue jobs are only associated with one subnet during a run. This may impact your ability to connect to multiple data sources through the same job. This behavior is not limited to JDBC sources.
JDBC connection option reference
If you already have a JDBC AWS Glue connection defined, you can reuse the configuration properties defined in it, such as url, user, and password, so you don't have to specify them in the code as connection options. This feature is available in AWS Glue 3.0 and later versions. To do so, use the following connection properties:
- "useConnectionProperties": Set it to "true" to indicate that you want to use the configuration from a connection.
- "connectionName": Enter the name of the connection to retrieve the configuration from. The connection must be defined in the same Region as the job.
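A minimal sketch of what this looks like in practice, assuming a Glue connection named "my-glue-connection" (a hypothetical name) already exists in the same Region:

```scala
// Reuse url, user, and password from an existing Glue connection,
// so credentials don't appear in the job code.
// "my-glue-connection" is a hypothetical connection name.
val connectionOptions = JsonOptions(Map(
  "useConnectionProperties" -> "true",
  "connectionName" -> "my-glue-connection",
  "dbtable" -> "public.mytable"
))
val dyf = glueContext.getSource("postgresql", connectionOptions)
  .getDynamicFrame()
```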
Use these connection options with JDBC connections:

- "url": (Required) The JDBC URL for the database.
- "dbtable": (Required) The database table to read from. For JDBC data stores that support schemas within a database, specify schema.table-name. If a schema is not provided, then the default "public" schema is used.
- "user": (Required) The user name to use when connecting.
- "password": (Required) The password to use when connecting.
- (Optional) The following options allow you to supply a custom JDBC driver. Use these options if you must use a driver that AWS Glue does not natively support. ETL jobs can use different JDBC driver versions for the data source and target, even if the source and target are the same database product. This allows you to migrate data between source and target databases with different versions. To use these options, you must first upload the JAR file of the JDBC driver to Amazon S3.
  - "customJdbcDriverS3Path": The Amazon S3 path of the custom JDBC driver.
  - "customJdbcDriverClassName": The class name of the JDBC driver.
- "bulkSize": (Optional) Used to configure parallel inserts for speeding up bulk loads into JDBC targets. Specify an integer value for the degree of parallelism to use when writing or inserting data. This option is helpful for improving the performance of writes into databases such as Amazon Aurora.
- "hashfield": (Optional) A string, used to specify the name of a column in the JDBC table to be used to divide the data into partitions when reading from JDBC tables in parallel. Provide "hashfield" OR "hashexpression". For more information, see Reading from JDBC tables in parallel, and see the sketch after this list.
- "hashexpression": (Optional) A SQL select clause returning a whole number. Used to divide the data in a JDBC table into partitions when reading from JDBC tables in parallel. Provide "hashfield" OR "hashexpression". For more information, see Reading from JDBC tables in parallel.
- "hashpartitions": (Optional) A positive integer. Used to specify the number of parallel reads of the JDBC table when reading from JDBC tables in parallel. Default: 7. For more information, see Reading from JDBC tables in parallel.
- "sampleQuery": (Optional) A custom SQL query statement. Used to specify a subset of information in a table to retrieve a sample of the table contents. When configured without regard to your data, it can be less efficient than DynamicFrame methods, causing timeouts or out of memory errors. For more information, see Use sampleQuery.
- "enablePartitioningForSampleQuery": (Optional) A boolean. Default: false. Used to enable reading from JDBC tables in parallel when specifying sampleQuery. If set to true, sampleQuery must end with "where" or "and" for AWS Glue to append partitioning conditions. For more information, see Use sampleQuery.
- "sampleSize": (Optional) A positive integer. Limits the number of rows returned by the sample query. Works only when enablePartitioningForSampleQuery is true. If partitioning is not enabled, you should instead directly add "limit x" in the sampleQuery to limit the size. For more information, see Use sampleQuery.
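The following is a hedged sketch of the parallel-read options above: a read partitioned by hashing on a numeric column. The table name and column are placeholder values.

```scala
// Read a JDBC table in parallel by hashing on a column.
// "hr.employees" and "emp_no" are placeholder values.
val connectionOptions = JsonOptions(Map(
  "url" -> url,
  "dbtable" -> "hr.employees",
  "user" -> user,
  "password" -> password,
  "hashfield" -> "emp_no",  // column used to divide the data into partitions
  "hashpartitions" -> "10"  // number of parallel reads (the default is 7)
))
val dyf = glueContext.getSource("mysql", connectionOptions)
  .getDynamicFrame()
```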
Use sampleQuery
This section explains how to use sampleQuery, sampleSize, and enablePartitioningForSampleQuery.
sampleQuery
can be an efficient way to sample a few rows of your dataset. By default, the
query is run by a single executor. When configured without regard to your data, it can be less efficient
than DynamicFrame methods, causing timeouts or out of memory errors. Running SQL on the underlying database
as part of your ETL pipeline is generally only needed for performance purposes. If you are trying to
preview a few rows of your dataset, consider using show. If you are trying to transform your
dataset using SQL, consider using toDF to define a SparkSQL transform against your data in a DataFrame form.
While your query may manipulate a variety of tables, dbtable
remains required.
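For instance, a SQL transform can be expressed against a DataFrame instead of being pushed down through sampleQuery. This is a minimal sketch, assuming an existing DynamicFrame named dyf; DynamicFrame(df, glueContext) wraps the result back up for downstream Glue transforms.

```scala
import com.amazonaws.services.glue.DynamicFrame

// Convert the DynamicFrame to a DataFrame and run SparkSQL against it,
// rather than running SQL on the underlying database.
val spark = glueContext.sparkSession
val df = dyf.toDF()
df.createOrReplaceTempView("people")
val transformed = spark.sql("select name from people where age > 0")
// Wrap the result back into a DynamicFrame.
val resultDyf = DynamicFrame(transformed, glueContext)
```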
Using sampleQuery to retrieve a sample of your table
When using the default sampleQuery behavior to retrieve a sample of your data, AWS Glue does not expect substantial throughput, so it runs your query on a single executor. To limit the data retrieved and avoid performance problems, we suggest that you provide SQL with a LIMIT clause.
Example Use sampleQuery without partitioning
The following code example shows how to use sampleQuery without partitioning.

```scala
// A full SQL query statement. Note the s interpolator so that $tableName is substituted.
val query = s"select name from $tableName where age > 0 limit 1"
val connectionOptions = JsonOptions(Map(
  "url" -> url,
  "dbtable" -> tableName,
  "user" -> user,
  "password" -> password,
  "sampleQuery" -> query
))
val dyf = glueContext.getSource("mysql", connectionOptions)
  .getDynamicFrame()
```
Using sampleQuery against larger datasets
If you're reading a large dataset, you might need to enable JDBC partitioning to query a table in parallel. For more information, see Reading from JDBC tables in parallel. To use sampleQuery with JDBC partitioning, set enablePartitioningForSampleQuery to true. Enabling this feature requires you to make some changes to your sampleQuery.
When using JDBC partitioning with sampleQuery, your query must end with "where" or "and" for AWS Glue to append partitioning conditions.

If you would like to limit the results of your sampleQuery when reading from JDBC tables in parallel, set the "sampleSize" parameter rather than specifying a LIMIT clause.
Example Use sampleQuery with JDBC partitioning
The following code example shows how to use sampleQuery with JDBC partitioning.

```scala
// Note that the query must end with "where" or "and" when used with JDBC partitioning.
val query = s"select name from $tableName where age > 0 and"
// Enable JDBC partitioning by setting hashfield.
// To use sampleQuery with partitioning, set enablePartitioningForSampleQuery.
// Use sampleSize to limit the size of the returned data.
val connectionOptions = JsonOptions(Map(
  "url" -> url,
  "dbtable" -> tableName,
  "user" -> user,
  "password" -> password,
  "hashfield" -> primaryKey,
  "sampleQuery" -> query,
  "enablePartitioningForSampleQuery" -> true,
  "sampleSize" -> "1"
))
val dyf = glueContext.getSource("mysql", connectionOptions)
  .getDynamicFrame()
```
Notes and restrictions:

- Sample queries cannot be used together with job bookmarks. The bookmark state is ignored when both are configured.
Use custom JDBC driver
The following code examples show how to read from and write to JDBC databases with custom JDBC drivers. They demonstrate reading from one version of a database product, and writing to a later version of the same product.
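The following is a hedged sketch of the pattern those examples follow, reading through a custom Oracle driver; the S3 path, driver class, URL, and table are placeholder values, not Glue defaults.

```scala
// Read using a custom JDBC driver whose JAR has been uploaded to Amazon S3.
// The S3 path, class name, URL, and table are placeholder values.
val connectionOptions = JsonOptions(Map(
  "url" -> "jdbc:oracle:thin://@dbhost:1521/ORCL",
  "dbtable" -> "hr.employees",
  "user" -> user,
  "password" -> password,
  "customJdbcDriverS3Path" -> "s3://my-bucket/drivers/ojdbc8.jar",
  "customJdbcDriverClassName" -> "oracle.jdbc.OracleDriver"
))
val dyf = glueContext.getSource("oracle", connectionOptions)
  .getDynamicFrame()
```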