GlueContext Class

Wraps the Apache SparkSQL SQLContext object, and thereby provides mechanisms for interacting with the Apache Spark platform.

Working with Datasets in Amazon S3

purge_table

purge_table(database, table_name, options={}, transformation_ctx="", catalog_id=None)

Deletes files from Amazon S3 for the specified catalog's database and table. If all files in a partition are deleted, that partition is also deleted from the catalog.

If you want to be able to recover deleted objects, you can enable object versioning on the Amazon S3 bucket. When an object is deleted from a bucket that doesn't have object versioning enabled, the object can't be recovered. For more information about how to recover deleted objects in a version-enabled bucket, see How can I retrieve an Amazon S3 object that was deleted? in the AWS Support Knowledge Center.

  • database – The database to use.

  • table_name – The name of the table to use.

  • options – Options to filter files to be deleted and for manifest file generation.

    • retentionPeriod – The number of hours to retain files. Files newer than the retention period are retained. The default is 168 hours (7 days).

    • partitionPredicate – Partitions satisfying this predicate are deleted. Files within the retention period in these partitions are not deleted. The default is "" (empty).

    • excludeStorageClasses – Files with a storage class in the excludeStorageClasses set are not deleted. The default is Set() – an empty set.

    • manifestFilePath – An optional path for manifest file generation. All files that were successfully purged are recorded in Success.csv, and those that failed in Failed.csv.

  • transformation_ctx – The transformation context to use (optional). Used in the manifest file path.

  • catalog_id – The catalog ID of the Data Catalog being accessed (the account ID of the Data Catalog). Set to None by default. None defaults to the catalog ID of the calling account in the service.

glueContext.purge_table(
    "database",
    "table",
    {"partitionPredicate": "(month=='march')",
     "retentionPeriod": 1,
     "excludeStorageClasses": ["STANDARD_IA"],
     "manifestFilePath": "s3://bucketmanifest/"}
)

purge_s3_path

purge_s3_path(s3_path, options={}, transformation_ctx="")

Deletes files from the specified Amazon S3 path recursively.

If you want to be able to recover deleted objects, you can enable object versioning on the Amazon S3 bucket. When an object is deleted from a bucket that doesn't have object versioning enabled, the object can't be recovered. For more information about how to recover deleted objects in a version-enabled bucket, see How can I retrieve an Amazon S3 object that was deleted? in the AWS Support Knowledge Center.

  • s3_path – The path in Amazon S3 of the files to be deleted in the format s3://<bucket>/<prefix>/

  • options – Options to filter files to be deleted and for manifest file generation.

    • retentionPeriod – The number of hours to retain files. Files newer than the retention period are retained. The default is 168 hours (7 days).

    • partitionPredicate – Partitions satisfying this predicate are deleted. Files within the retention period in these partitions are not deleted. The default is "" (empty).

    • excludeStorageClasses – Files with a storage class in the excludeStorageClasses set are not deleted. The default is Set() – an empty set.

    • manifestFilePath – An optional path for manifest file generation. All files that were successfully purged are recorded in Success.csv, and those that failed in Failed.csv.

  • transformation_ctx – The transformation context to use (optional). Used in the manifest file path.

glueContext.purge_s3_path(
    "s3://bucket/path/",
    {"retentionPeriod": 1,
     "excludeStorageClasses": ["STANDARD_IA"],
     "manifestFilePath": "s3://bucketmanifest/"}
)

transition_table

transition_table(database, table_name, transition_to, options={}, transformation_ctx="", catalog_id=None)

Transitions the storage class of the files stored on Amazon S3 for the specified catalog's database and table.

You can transition between any two storage classes, with one exception: you can transition to the GLACIER and DEEP_ARCHIVE storage classes, but to transition from them you must use an S3 RESTORE operation instead.

If you're running AWS Glue ETL jobs that read files or partitions from Amazon S3, you can exclude some Amazon S3 storage class types. For more information, see Excluding Amazon S3 Storage Classes.

  • database – The database to use.

  • table_name – The name of the table to use.

  • transition_to – The Amazon S3 storage class to transition to.

  • options – Options to filter files to be transitioned and for manifest file generation.

    • retentionPeriod – The number of hours to retain files. Files newer than the retention period are retained. The default is 168 hours (7 days).

    • partitionPredicate – Partitions satisfying this predicate are transitioned. Files within the retention period in these partitions are not transitioned. The default is "" (empty).

    • excludeStorageClasses – Files with a storage class in the excludeStorageClasses set are not transitioned. The default is Set() – an empty set.

    • manifestFilePath – An optional path for manifest file generation. All files that were successfully transitioned are recorded in Success.csv, and those that failed in Failed.csv.

    • accountId – The AWS account ID used to run the transition transform. Mandatory for this transform.

    • roleArn – The ARN of the AWS IAM role used to run the transition transform. Mandatory for this transform.

  • transformation_ctx – The transformation context to use (optional). Used in the manifest file path.

  • catalog_id – The catalog ID of the Data Catalog being accessed (the account ID of the Data Catalog). Set to None by default. None defaults to the catalog ID of the calling account in the service.

glueContext.transition_table(
    "database",
    "table",
    "STANDARD_IA",
    {"retentionPeriod": 1,
     "excludeStorageClasses": ["STANDARD_IA"],
     "manifestFilePath": "s3://bucketmanifest/",
     "accountId": "123456789012",
     "roleArn": "arn:aws:iam::123456789012:role/example-role"}
)

transition_s3_path

transition_s3_path(s3_path, transition_to, options={}, transformation_ctx="")

Transitions the storage class of the files in the specified Amazon S3 path recursively.

You can transition between any two storage classes, with one exception: you can transition to the GLACIER and DEEP_ARCHIVE storage classes, but to transition from them you must use an S3 RESTORE operation instead.

If you're running AWS Glue ETL jobs that read files or partitions from Amazon S3, you can exclude some Amazon S3 storage class types. For more information, see Excluding Amazon S3 Storage Classes.

  • s3_path – The path in Amazon S3 of the files to be transitioned in the format s3://<bucket>/<prefix>/

  • transition_to – The Amazon S3 storage class to transition to.

  • options – Options to filter files to be transitioned and for manifest file generation.

    • retentionPeriod – The number of hours to retain files. Files newer than the retention period are retained. The default is 168 hours (7 days).

    • partitionPredicate – Partitions satisfying this predicate are transitioned. Files within the retention period in these partitions are not transitioned. The default is "" (empty).

    • excludeStorageClasses – Files with a storage class in the excludeStorageClasses set are not transitioned. The default is Set() – an empty set.

    • manifestFilePath – An optional path for manifest file generation. All files that were successfully transitioned are recorded in Success.csv, and those that failed in Failed.csv.

    • accountId – The AWS account ID used to run the transition transform. Mandatory for this transform.

    • roleArn – The ARN of the AWS IAM role used to run the transition transform. Mandatory for this transform.

  • transformation_ctx – The transformation context to use (optional). Used in the manifest file path.

glueContext.transition_s3_path(
    "s3://bucket/prefix/",
    "STANDARD_IA",
    {"retentionPeriod": 1,
     "excludeStorageClasses": ["STANDARD_IA"],
     "manifestFilePath": "s3://bucketmanifest/",
     "accountId": "123456789012",
     "roleArn": "arn:aws:iam::123456789012:role/example-role"}
)

Creating

__init__

__init__(sparkContext)

  • sparkContext – The Apache Spark context to use.
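
The following is a minimal sketch of constructing a GlueContext from a SparkContext; job-argument handling and job bookmark setup are omitted. Later sketches in this section assume a GlueContext named glueContext created this way.

from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext.getOrCreate()     # get or reuse the Spark context
glueContext = GlueContext(sc)       # wrap it in a GlueContext
spark = glueContext.spark_session   # the underlying SparkSession, if plain Spark APIs are needed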

getSource

getSource(connection_type, transformation_ctx = "", **options)

Creates a DataSource object that can be used to read DynamicFrames from external sources.

  • connection_type – The connection type to use, such as Amazon Simple Storage Service (Amazon S3), Amazon Redshift, and JDBC. Valid values include s3, mysql, postgresql, redshift, sqlserver, oracle, and dynamodb.

  • transformation_ctx – The transformation context to use (optional).

  • options – A collection of optional name-value pairs. For more information, see Connection Types and Options for ETL in AWS Glue.

The following is an example of using getSource.

>>> data_source = context.getSource("file", paths=["/in/path"])
>>> data_source.setFormat("json")
>>> myFrame = data_source.getFrame()

create_dynamic_frame_from_rdd

create_dynamic_frame_from_rdd(data, name, schema=None, sample_ratio=None, transformation_ctx="")

Returns a DynamicFrame that is created from an Apache Spark Resilient Distributed Dataset (RDD).

  • data – The data source to use.

  • name – The name of the data to use.

  • schema – The schema to use (optional).

  • sample_ratio – The sample ratio to use (optional).

  • transformation_ctx – The transformation context to use (optional).
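
For illustration, the following sketch builds a DynamicFrame from a small in-memory RDD; the row contents and the name "sample_data" are placeholders.

from pyspark.sql import Row

# Placeholder records distributed as an RDD
rows = glueContext.spark_session.sparkContext.parallelize([
    Row(name="a", value=1),
    Row(name="b", value=2),
])
dyf = glueContext.create_dynamic_frame_from_rdd(rows, "sample_data")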

create_dynamic_frame_from_catalog

create_dynamic_frame_from_catalog(database, table_name, redshift_tmp_dir, transformation_ctx="", push_down_predicate="", additional_options={}, catalog_id=None)

Returns a DynamicFrame that is created using a catalog database and table name.

  • database – The database to read from.

  • table_name – The name of the table to read from.

  • redshift_tmp_dir – An Amazon Redshift temporary directory to use (optional).

  • transformation_ctx – The transformation context to use (optional).

  • push_down_predicate – Filters partitions without having to list and read all the files in your dataset. For more information, see Pre-Filtering Using Pushdown Predicates.

  • additional_options – Additional options provided to AWS Glue.

  • catalog_id — The catalog ID (account ID) of the Data Catalog being accessed. When None, the default account ID of the caller is used.
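
For illustration, a read from a cataloged table might look like the following sketch; the database name, table name, and pushdown predicate are placeholders.

dyf = glueContext.create_dynamic_frame_from_catalog(
    database="example_db",                    # placeholder Data Catalog database
    table_name="example_table",               # placeholder table
    redshift_tmp_dir="",                      # not needed for Amazon S3-backed tables
    transformation_ctx="read_example",
    push_down_predicate="(year == '2020')"    # prune partitions before reading
)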

create_dynamic_frame_from_options

create_dynamic_frame_from_options(connection_type, connection_options={}, format=None, format_options={}, transformation_ctx = "")

Returns a DynamicFrame created with the specified connection and format.

  • connection_type – The connection type, such as Amazon S3, Amazon Redshift, and JDBC. Valid values include s3, mysql, postgresql, redshift, sqlserver, oracle, and dynamodb.

  • connection_options – Connection options, such as paths and database table (optional). For a connection_type of s3, a list of Amazon S3 paths is defined.

    connection_options = {"paths": ["s3://aws-glue-target/temp"]}

    For JDBC connections, several properties must be defined. Note that the database name must be part of the URL. It can optionally be included in the connection options.

    connection_options = {"url": "jdbc-url/database", "user": "username", "password": "password","dbtable": "table-name", "redshiftTmpDir": "s3-tempdir-path"}

    The dbtable property is the name of the JDBC table. For JDBC data stores that support schemas within a database, specify schema.table-name. If a schema is not provided, then the default "public" schema is used.

    For more information, see Connection Types and Options for ETL in AWS Glue.

  • format – A format specification (optional). This is used for an Amazon S3 or an AWS Glue connection that supports multiple formats. See Format Options for ETL Inputs and Outputs in AWS Glue for the formats that are supported.

  • format_options – Format options for the specified format. See Format Options for ETL Inputs and Outputs in AWS Glue for the formats that are supported.

  • transformation_ctx – The transformation context to use (optional).
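
For illustration, reading JSON files directly from an Amazon S3 prefix might look like the following sketch; the bucket and prefix are placeholders.

dyf = glueContext.create_dynamic_frame_from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://example-bucket/input/"]},   # placeholder path
    format="json",
    transformation_ctx="read_json_example"
)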

Writing

getSink

getSink(connection_type, format = None, transformation_ctx = "", **options)

Gets a DataSink object that can be used to write DynamicFrames to external sources. Check the SparkSQL format first to be sure to get the expected sink.

  • connection_type – The connection type to use, such as Amazon S3, Amazon Redshift, and JDBC. Valid values include s3, mysql, postgresql, redshift, sqlserver, and oracle.

  • format – The SparkSQL format to use (optional).

  • transformation_ctx – The transformation context to use (optional).

  • options – A collection of option name-value pairs.

For example:

>>> data_sink = context.getSink("s3")
>>> data_sink.setFormat("json")
>>> data_sink.writeFrame(myFrame)

write_dynamic_frame_from_options

write_dynamic_frame_from_options(frame, connection_type, connection_options={}, format=None, format_options={}, transformation_ctx = "")

Writes and returns a DynamicFrame using the specified connection and format.

  • frame – The DynamicFrame to write.

  • connection_type – The connection type, such as Amazon S3, Amazon Redshift, and JDBC. Valid values include s3, mysql, postgresql, redshift, sqlserver, and oracle.

  • connection_options – Connection options, such as path and database table (optional). For a connection_type of s3, an Amazon S3 path is defined.

    connection_options = {"path": "s3://aws-glue-target/temp"}

    For JDBC connections, several properties must be defined. Note that the database name must be part of the URL. It can optionally be included in the connection options.

    connection_options = {"url": "jdbc-url/database", "user": "username", "password": "password","dbtable": "table-name", "redshiftTmpDir": "s3-tempdir-path"}

    The dbtable property is the name of the JDBC table. For JDBC data stores that support schemas within a database, specify schema.table-name. If a schema is not provided, then the default "public" schema is used.

    For more information, see Connection Types and Options for ETL in AWS Glue.

  • format – A format specification (optional). This is used for an Amazon S3 or an AWS Glue connection that supports multiple formats. See Format Options for ETL Inputs and Outputs in AWS Glue for the formats that are supported.

  • format_options – Format options for the specified format. See Format Options for ETL Inputs and Outputs in AWS Glue for the formats that are supported.

  • transformation_ctx – A transformation context to use (optional).
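
For illustration, writing a DynamicFrame to Amazon S3 as Parquet might look like the following sketch; dyf is a previously created DynamicFrame and the bucket name is a placeholder.

glueContext.write_dynamic_frame_from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://example-bucket/output/"},   # placeholder target path
    format="parquet",
    transformation_ctx="write_parquet_example"
)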

write_from_options

write_from_options(frame_or_dfc, connection_type, connection_options={}, format={}, format_options={}, transformation_ctx = "")

Writes and returns a DynamicFrame or DynamicFrameCollection that is created with the specified connection and format information.

  • frame_or_dfc – The DynamicFrame or DynamicFrameCollection to write.

  • connection_type – The connection type, such as Amazon S3, Amazon Redshift, and JDBC. Valid values include s3, mysql, postgresql, redshift, sqlserver, and oracle.

  • connection_options – Connection options, such as path and database table (optional). For a connection_type of s3, an Amazon S3 path is defined.

    connection_options = {"path": "s3://aws-glue-target/temp"}

    For JDBC connections, several properties must be defined. Note that the database name must be part of the URL. It can optionally be included in the connection options.

    connection_options = {"url": "jdbc-url/database", "user": "username", "password": "password","dbtable": "table-name", "redshiftTmpDir": "s3-tempdir-path"}

    The dbtable property is the name of the JDBC table. For JDBC data stores that support schemas within a database, specify schema.table-name. If a schema is not provided, then the default "public" schema is used.

    For more information, see Connection Types and Options for ETL in AWS Glue.

  • format – A format specification (optional). This is used for an Amazon S3 or an AWS Glue connection that supports multiple formats. See Format Options for ETL Inputs and Outputs in AWS Glue for the formats that are supported.

  • format_options – Format options for the specified format. See Format Options for ETL Inputs and Outputs in AWS Glue for the formats that are supported.

  • transformation_ctx – A transformation context to use (optional).
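
For illustration, the call mirrors write_dynamic_frame_from_options but also accepts a DynamicFrameCollection. In the following sketch, dfc is a previously created collection and the bucket name is a placeholder.

glueContext.write_from_options(
    frame_or_dfc=dfc,                                             # a single DynamicFrame also works here
    connection_type="s3",
    connection_options={"path": "s3://example-bucket/output/"},   # placeholder target path
    format="csv"
)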

write_dynamic_frame_from_catalog

write_dynamic_frame_from_catalog(frame, database, table_name, redshift_tmp_dir, transformation_ctx="", additional_options={}, catalog_id=None)

Writes and returns a DynamicFrame using a catalog database and a table name.

  • frame – The DynamicFrame to write.

  • database – The database to write to.

  • table_name – The name of the table to write to.

  • redshift_tmp_dir – An Amazon Redshift temporary directory to use (optional).

  • transformation_ctx – The transformation context to use (optional).

  • catalog_id — The catalog ID (account ID) of the Data Catalog being accessed. When None, the default account ID of the caller is used.
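
For illustration, writing a DynamicFrame to a table already defined in the Data Catalog might look like the following sketch; the database and table names are placeholders.

glueContext.write_dynamic_frame_from_catalog(
    frame=dyf,
    database="example_db",          # placeholder Data Catalog database
    table_name="example_output",    # placeholder target table
    redshift_tmp_dir="",            # only needed for Amazon Redshift targets
    transformation_ctx="write_catalog_example"
)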

write_dynamic_frame_from_jdbc_conf

write_dynamic_frame_from_jdbc_conf(frame, catalog_connection, connection_options={}, redshift_tmp_dir = "", transformation_ctx = "", catalog_id = None)

Writes and returns a DynamicFrame using the specified JDBC connection information.

  • frame – The DynamicFrame to write.

  • catalog_connection – A catalog connection to use.

  • connection_options – Connection options, such as path and database table (optional). For more information, see Connection Types and Options for ETL in AWS Glue.

  • redshift_tmp_dir – An Amazon Redshift temporary directory to use (optional).

  • transformation_ctx – A transformation context to use (optional).

  • catalog_id — The catalog ID (account ID) of the Data Catalog being accessed. When None, the default account ID of the caller is used.
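
For illustration, writing through a cataloged JDBC connection might look like the following sketch; the connection name, table, and temporary directory are placeholders.

glueContext.write_dynamic_frame_from_jdbc_conf(
    frame=dyf,
    catalog_connection="example-jdbc-connection",    # placeholder Data Catalog connection
    connection_options={"dbtable": "schema.example_table", "database": "example_db"},
    redshift_tmp_dir="s3://example-bucket/temp/"     # only needed for Amazon Redshift targets
)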

write_from_jdbc_conf

write_from_jdbc_conf(frame_or_dfc, catalog_connection, connection_options={}, redshift_tmp_dir = "", transformation_ctx = "", catalog_id = None)

Writes and returns a DynamicFrame or DynamicFrameCollection using the specified JDBC connection information.

  • frame_or_dfc – The DynamicFrame or DynamicFrameCollection to write.

  • catalog_connection – A catalog connection to use.

  • connection_options – Connection options, such as path and database table (optional). For more information, see Connection Types and Options for ETL in AWS Glue.

  • redshift_tmp_dir – An Amazon Redshift temporary directory to use (optional).

  • transformation_ctx – A transformation context to use (optional).

  • catalog_id — The catalog ID (account ID) of the Data Catalog being accessed. When None, the default account ID of the caller is used.
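
The call is the same as write_dynamic_frame_from_jdbc_conf except that a DynamicFrameCollection can also be passed. A minimal sketch, with placeholder names:

glueContext.write_from_jdbc_conf(
    frame_or_dfc=dfc,                                 # a single DynamicFrame also works here
    catalog_connection="example-jdbc-connection",     # placeholder Data Catalog connection
    connection_options={"dbtable": "schema.example_table", "database": "example_db"}
)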

Extracting

extract_jdbc_conf

extract_jdbc_conf(connection_name, catalog_id = None)

Returns a dict with keys user, password, vendor, and url from the connection object in the Data Catalog.

  • connection_name – The name of the connection in the Data Catalog

  • catalog_id — The catalog ID (account ID) of the Data Catalog being accessed. When None, the default account ID of the caller is used.
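
For illustration, the returned dict can be used to set up a plain Spark JDBC read, as in the following sketch; the connection and table names are placeholders, and depending on the connection, the returned URL may still need the database name appended.

conf = glueContext.extract_jdbc_conf("example-jdbc-connection")   # placeholder connection name
df = (glueContext.spark_session.read
      .format("jdbc")
      .option("url", conf["url"])
      .option("user", conf["user"])
      .option("password", conf["password"])
      .option("dbtable", "schema.example_table")                  # placeholder table
      .load())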