AWS Glue Scala GlueContext APIs

Package: com.amazonaws.services.glue

class GlueContext( @transient val sc : SparkContext, val defaultSourcePartitioner : PartitioningStrategy ) extends SQLContext(sc)

GlueContext is the entry point for reading and writing a DynamicFrame from and to Amazon Simple Storage Service (Amazon S3), the AWS Glue Data Catalog, JDBC, and so on. This class provides utility functions to create DataSource trait and DataSink objects that can in turn be used to read and write DynamicFrames.

You can also use GlueContext to set a target number of partitions (default 20) in the DynamicFrame if the number of partitions created from the source is less than a minimum threshold for partitions (default 10).
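
The following is a minimal sketch of the typical entry-point setup for an AWS Glue Scala job: a GlueContext is created from a SparkContext and then used to obtain sources, sinks, and the SparkSession. The object name and structure are illustrative, not part of the API; the JsonOptions import is included because the later sketches use it.

import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.util.JsonOptions
import org.apache.spark.SparkContext

object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    val sc = new SparkContext()
    val glueContext = new GlueContext(sc)   // minimum partitions 10, target partitions 20
    val spark = glueContext.getSparkSession
    // read and write DynamicFrames through glueContext (see the methods below)
  }
}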

def getCatalogSink

def getCatalogSink( database : String, tableName : String, redshiftTmpDir : String = "", transformationContext : String = "", additionalOptions : JsonOptions = JsonOptions.empty, catalogId : String = null ) : DataSink

Creates a DataSink that writes to a location specified in a table that is defined in the Data Catalog.

  • database — The database name in the Data Catalog.

  • tableName — The table name in the Data Catalog.

  • redshiftTmpDir — The temporary staging directory to be used with certain data sinks. Set to empty by default.

  • transformationContext — The transformation context that is associated with the sink to be used by job bookmarks. Set to empty by default.

  • additionalOptions – Additional options provided to AWS Glue.

  • catalogId — The catalog ID (account ID) of the Data Catalog being accessed. When null, the default account ID of the caller is used.

Returns the DataSink.
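
A minimal usage sketch, assuming a glueContext created as in the setup sketch above, an existing DynamicFrame named dyf, and a hypothetical catalog table my_db.my_table; writeDynamicFrame comes from the returned DataSink:

val catalogSink = glueContext.getCatalogSink(
  database = "my_db",
  tableName = "my_table",
  transformationContext = "datasink1"
)
catalogSink.writeDynamicFrame(dyf)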

def getCatalogSource

def getCatalogSource( database : String, tableName : String, redshiftTmpDir : String = "", transformationContext : String = "", pushDownPredicate : String = " ", additionalOptions : JsonOptions = JsonOptions.empty, catalogId : String = null ) : DataSource

Creates a DataSource trait that reads data from a table definition in the Data Catalog.

  • database — The database name in the Data Catalog.

  • tableName — The table name in the Data Catalog.

  • redshiftTmpDir — The temporary staging directory to be used with certain data sources. Set to empty by default.

  • transformationContext — The transformation context that is associated with the source to be used by job bookmarks. Set to empty by default.

  • pushDownPredicate – Filters partitions without having to list and read all the files in your dataset. For more information, see Pre-Filtering Using Pushdown Predicates.

  • additionalOptions – Additional options provided to AWS Glue.

  • catalogId — The catalog ID (account ID) of the Data Catalog being accessed. When null, the default account ID of the caller is used.

Returns the DataSource.
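
A minimal sketch of reading a partitioned table with a pushdown predicate, assuming a glueContext and a hypothetical catalog table my_db.my_partitioned_table partitioned by year and month; getDynamicFrame comes from the returned DataSource:

val catalogSource = glueContext.getCatalogSource(
  database = "my_db",
  tableName = "my_partitioned_table",
  pushDownPredicate = "year == '2023' and month == '06'",
  transformationContext = "datasource0"
)
val dyf = catalogSource.getDynamicFrame()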

def getJDBCSink

def getJDBCSink( catalogConnection : String, options : JsonOptions, redshiftTmpDir : String = "", transformationContext : String = "", catalogId: String = null ) : DataSink

Creates a DataSink that writes to a JDBC database that is specified in a Connection object in the Data Catalog. The Connection object has information to connect to a JDBC sink, including the URL, user name, password, VPC, subnet, and security groups.

  • catalogConnection — The name of the connection in the Data Catalog that contains the JDBC URL to write to.

  • options — A string of JSON name-value pairs that provide additional information that is required to write to a JDBC data store. This includes:

    • dbtable (required) — The name of the JDBC table. For JDBC data stores that support schemas within a database, specify schema.table-name. If a schema is not provided, then the default "public" schema is used. The following example shows an options parameter that points to a schema named test and a table named test_table in database test_db.

      options = JsonOptions("""{"dbtable": "test.test_table", "database": "test_db"}""")
    • database (required) — The name of the JDBC database.

    • Any additional options passed directly to the SparkSQL JDBC writer. For more information, see Redshift data source for Spark.

  • redshiftTmpDir — A temporary staging directory to be used with certain data sinks. Set to empty by default.

  • transformationContext — The transformation context that is associated with the sink to be used by job bookmarks. Set to empty by default.

  • catalogId — The catalog ID (account ID) of the Data Catalog being accessed. When null, the default account ID of the caller is used.

Example code:

getJDBCSink(
  catalogConnection = "my-connection-name",
  options = JsonOptions("""{"dbtable": "my-jdbc-table", "database": "my-jdbc-db"}"""),
  redshiftTmpDir = "",
  transformationContext = "datasink4"
)

Returns the DataSink.

def getSink

def getSink( connectionType : String, options : JsonOptions, transformationContext : String = "" ) : DataSink

Creates a DataSink that writes data to a destination like Amazon Simple Storage Service (Amazon S3), JDBC, or the AWS Glue Data Catalog.

Returns the DataSink.
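
A minimal sketch, assuming a glueContext, an existing DynamicFrame dyf, and a hypothetical S3 output path; when the destination also needs an explicit output format, getSinkWithFormat (below) is usually the more convenient call:

val s3Sink = glueContext.getSink(
  connectionType = "s3",
  options = JsonOptions("""{"path": "s3://my-bucket/output/"}"""),
  transformationContext = "datasink2"
)
s3Sink.writeDynamicFrame(dyf)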

def getSinkWithFormat

def getSinkWithFormat( connectionType : String, options : JsonOptions, transformationContext : String = "", format : String = null, formatOptions : JsonOptions = JsonOptions.empty ) : DataSink

Creates a DataSink that writes data to a destination like Amazon S3, JDBC, or the Data Catalog, and also sets the format for the data to be written out to the destination.

  • connectionType — The type of the connection. See Connection Types and Options for ETL in AWS Glue.

  • options — A string of JSON name-value pairs that provide additional information to establish a connection with the data sink. See Connection Types and Options for ETL in AWS Glue.

  • transformationContext — The transformation context that is associated with the sink to be used by job bookmarks. Set to empty by default.

  • format — The format of the data to be written out to the destination.

  • formatOptions — A string of JSON name-value pairs that provide additional options for formatting data at the destination. See Format Options.

Returns the DataSink.
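
A minimal sketch of writing a DynamicFrame to Amazon S3 as Parquet, assuming a glueContext, an existing DynamicFrame dyf, and a hypothetical output path:

val parquetSink = glueContext.getSinkWithFormat(
  connectionType = "s3",
  options = JsonOptions("""{"path": "s3://my-bucket/output/"}"""),
  transformationContext = "datasink3",
  format = "parquet"
)
parquetSink.writeDynamicFrame(dyf)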

def getSource

def getSource( connectionType : String, connectionOptions : JsonOptions, transformationContext : String = "", pushDownPredicate : String = " " ) : DataSource

Creates a DataSource trait that reads data from a source like Amazon S3, JDBC, or the AWS Glue Data Catalog.

  • connectionType — The type of the data source. See Connection Types and Options for ETL in AWS Glue.

  • connectionOptions — A string of JSON name-value pairs that provide additional information for establishing a connection with the data source. See Connection Types and Options for ETL in AWS Glue.

  • transformationContext — The transformation context that is associated with the source to be used by job bookmarks. Set to empty by default.

  • pushDownPredicate — Predicate on partition columns.

Returns the DataSource.
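
A minimal sketch of the call shape, assuming a glueContext and a hypothetical S3 input path; when the source format must be specified explicitly, getSourceWithFormat (below) is usually the more convenient call:

val s3Source = glueContext.getSource(
  connectionType = "s3",
  connectionOptions = JsonOptions("""{"paths": ["s3://my-bucket/input/"]}"""),
  transformationContext = "datasource1"
)
val dyf = s3Source.getDynamicFrame()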

def getSourceWithFormat

def getSourceWithFormat( connectionType : String, options : JsonOptions, transformationContext : String = "", format : String = null, formatOptions : JsonOptions = JsonOptions.empty ) : DataSource

Creates a DataSource trait that reads data from a source like Amazon S3, JDBC, or the AWS Glue Data Catalog, and also sets the format of data stored in the source.

  • connectionType — The type of the data source. See Connection Types and Options for ETL in AWS Glue.

  • options — A string of JSON name-value pairs that provide additional information for establishing a connection with the data source. See Connection Types and Options for ETL in AWS Glue.

  • transformationContext — The transformation context that is associated with the source to be used by job bookmarks. Set to empty by default.

  • format — The format of the data stored at the source. When the connectionType is "s3", format can be one of "avro", "csv", "grokLog", "ion", "json", "xml", "parquet", or "orc".

  • formatOptions — A string of JSON name-value pairs that provide additional options for parsing data at the source. See Format Options.

Returns the DataSource.
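
A minimal sketch of reading CSV data from Amazon S3, assuming a glueContext and a hypothetical input path; withHeader and separator are CSV format options (see Format Options):

val csvSource = glueContext.getSourceWithFormat(
  connectionType = "s3",
  options = JsonOptions("""{"paths": ["s3://my-bucket/input/"]}"""),
  transformationContext = "datasource2",
  format = "csv",
  formatOptions = JsonOptions("""{"withHeader": true, "separator": ","}""")
)
val dyf = csvSource.getDynamicFrame()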

def getSparkSession

def getSparkSession : SparkSession

Gets the SparkSession object associated with this GlueContext. Use this SparkSession object to register tables and UDFs for use with DataFrames created from DynamicFrames.

Returns the SparkSession.
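
A minimal sketch, assuming a glueContext and an existing DynamicFrame dyf; toDF converts the DynamicFrame to a Spark DataFrame so it can be registered as a temporary view (the view name and column are hypothetical) and queried with Spark SQL:

val spark = glueContext.getSparkSession
val df = dyf.toDF()
df.createOrReplaceTempView("events")
val okEvents = spark.sql("SELECT * FROM events WHERE status = 'OK'")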

def this

def this( sc : SparkContext, minPartitions : Int, targetPartitions : Int )

Creates a GlueContext object using the specified SparkContext, minimum partitions, and target partitions.

  • sc — The SparkContext.

  • minPartitions — The minimum number of partitions.

  • targetPartitions — The target number of partitions.

Returns the GlueContext.
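
A minimal sketch of constructing a GlueContext with custom partition thresholds; the values are illustrative:

val sc = new SparkContext()
// repartition a DynamicFrame to 40 partitions when its source yields fewer than 20
val glueContext = new GlueContext(sc, 20, 40)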

def this

def this( sc : SparkContext )

Creates a GlueContext object with the provided SparkContext. Sets the minimum partitions to 10 and target partitions to 20.

  • sc — The SparkContext.

Returns the GlueContext.

def this

def this( sparkContext : JavaSparkContext )

Creates a GlueContext object with the provided JavaSparkContext. Sets the minimum partitions to 10 and target partitions to 20.

  • sparkContext — The JavaSparkContext.

Returns the GlueContext.