GlueContext Class
Wraps the Apache Spark SparkContext object, and thereby provides mechanisms for interacting with the Apache Spark platform.
Working with Datasets in Amazon S3
purge_table
purge_table(database, table_name, options={}, transformation_ctx="", catalog_id=None)
Deletes files from Amazon S3 for the specified catalog's database and table. If all files in a partition are deleted, that partition is also deleted from the catalog.
If you want to be able to recover deleted objects, you can enable object
versioning on the Amazon S3 bucket. When an object is deleted from a bucket that
doesn't have object versioning enabled, the object can't be recovered. For more information
about how to recover deleted objects in a version-enabled bucket, see How can I retrieve an Amazon S3 object that was deleted?
- database – The database to use.
- table_name – The name of the table to use.
- options – Options to filter files to be deleted and for manifest file generation.
  - retentionPeriod – Specifies a period in number of hours to retain files. Files newer than the retention period are retained. Set to 168 hours (7 days) by default.
  - partitionPredicate – Partitions satisfying this predicate are deleted. Files within the retention period in these partitions are not deleted. Set to "" – empty by default.
  - excludeStorageClasses – Files with storage class in the excludeStorageClasses set are not deleted. The default is Set() – an empty set.
  - manifestFilePath – An optional path for manifest file generation. All files that were successfully purged are recorded in Success.csv, and those that failed in Failed.csv.
- transformation_ctx – The transformation context to use (optional). Used in the manifest file path.
- catalog_id – The catalog ID of the Data Catalog being accessed (the account ID of the Data Catalog). Set to None by default. None defaults to the catalog ID of the calling account in the service.
glueContext.purge_table("database", "table", {"partitionPredicate": "(month=='march')", "retentionPeriod": 1, "excludeStorageClasses": ["STANDARD_IA"], "manifestFilePath": "s3://bucketmanifest/"})
purge_s3_path
purge_s3_path(s3_path, options={}, transformation_ctx="")
Deletes files from the specified Amazon S3 path recursively.
If you want to be able to recover deleted objects, you can enable object
versioning on the Amazon S3 bucket. When an object is deleted from a bucket that
doesn't have object versioning enabled, the object can't be recovered. For more information
about how to recover deleted objects in a version-enabled bucket, see How can I retrieve an Amazon S3 object that was deleted?
- s3_path – The path in Amazon S3 of the files to be deleted in the format s3://<bucket>/<prefix>/.
- options – Options to filter files to be deleted and for manifest file generation.
  - retentionPeriod – Specifies a period in number of hours to retain files. Files newer than the retention period are retained. Set to 168 hours (7 days) by default.
  - partitionPredicate – Partitions satisfying this predicate are deleted. Files within the retention period in these partitions are not deleted. Set to "" – empty by default.
  - excludeStorageClasses – Files with storage class in the excludeStorageClasses set are not deleted. The default is Set() – an empty set.
  - manifestFilePath – An optional path for manifest file generation. All files that were successfully purged are recorded in Success.csv, and those that failed in Failed.csv.
- transformation_ctx – The transformation context to use (optional). Used in the manifest file path.
- catalog_id – The catalog ID of the Data Catalog being accessed (the account ID of the Data Catalog). Set to None by default. None defaults to the catalog ID of the calling account in the service.
glueContext.purge_s3_path("s3://bucket/path/", {"retentionPeriod": 1, "excludeStorageClasses": ["STANDARD_IA"], "manifestFilePath": "s3://bucketmanifest/"})
transition_table
transition_table(database, table_name, transition_to, options={}, transformation_ctx="", catalog_id=None)
Transitions the storage class of the files stored on Amazon S3 for the specified catalog's database and table.
You can transition between any two storage classes. For the GLACIER and DEEP_ARCHIVE storage classes, you can only transition to these classes; to transition out of GLACIER or DEEP_ARCHIVE, use an S3 RESTORE instead.
If you're running AWS Glue ETL jobs that read files or partitions from Amazon S3, you can exclude some Amazon S3 storage class types. For more information, see Excluding Amazon S3 Storage Classes.
- database – The database to use.
- table_name – The name of the table to use.
- transition_to – The Amazon S3 storage class to transition to.
- options – Options to filter files to be transitioned and for manifest file generation.
  - retentionPeriod – Specifies a period in number of hours to retain files. Files newer than the retention period are retained. Set to 168 hours (7 days) by default.
  - partitionPredicate – Partitions satisfying this predicate are transitioned. Files within the retention period in these partitions are not transitioned. Set to "" – empty by default.
  - excludeStorageClasses – Files with storage class in the excludeStorageClasses set are not transitioned. The default is Set() – an empty set.
  - manifestFilePath – An optional path for manifest file generation. All files that were successfully transitioned are recorded in Success.csv, and those that failed in Failed.csv.
  - accountId – The AWS account ID to run the transition transform. Mandatory for this transform.
  - roleArn – The AWS role to run the transition transform. Mandatory for this transform.
- transformation_ctx – The transformation context to use (optional). Used in the manifest file path.
- catalog_id – The catalog ID of the Data Catalog being accessed (the account ID of the Data Catalog). Set to None by default. None defaults to the catalog ID of the calling account in the service.
glueContext.transition_table("database", "table", "STANDARD_IA", {"retentionPeriod": 1, "excludeStorageClasses": ["STANDARD_IA"], "manifestFilePath": "s3://bucketmanifest/", "accountId": "123456789012", "roleArn": "arn:aws:iam::123456789012:role/example-role"})
transition_s3_path
transition_s3_path(s3_path, transition_to, options={}, transformation_ctx="")
Transitions the storage class of the files in the specified Amazon S3 path recursively.
You can transition between any two storage classes. For the GLACIER and DEEP_ARCHIVE storage classes, you can only transition to these classes; to transition out of GLACIER or DEEP_ARCHIVE, use an S3 RESTORE instead.
If you're running AWS Glue ETL jobs that read files or partitions from Amazon S3, you can exclude some Amazon S3 storage class types. For more information, see Excluding Amazon S3 Storage Classes.
- s3_path – The path in Amazon S3 of the files to be transitioned in the format s3://<bucket>/<prefix>/.
- transition_to – The Amazon S3 storage class to transition to.
- options – Options to filter files to be transitioned and for manifest file generation.
  - retentionPeriod – Specifies a period in number of hours to retain files. Files newer than the retention period are retained. Set to 168 hours (7 days) by default.
  - partitionPredicate – Partitions satisfying this predicate are transitioned. Files within the retention period in these partitions are not transitioned. Set to "" – empty by default.
  - excludeStorageClasses – Files with storage class in the excludeStorageClasses set are not transitioned. The default is Set() – an empty set.
  - manifestFilePath – An optional path for manifest file generation. All files that were successfully transitioned are recorded in Success.csv, and those that failed in Failed.csv.
  - accountId – The AWS account ID to run the transition transform. Mandatory for this transform.
  - roleArn – The AWS role to run the transition transform. Mandatory for this transform.
- transformation_ctx – The transformation context to use (optional). Used in the manifest file path.
glueContext.transition_s3_path("s3://bucket/prefix/", "STANDARD_IA", {"retentionPeriod": 1, "excludeStorageClasses": ["STANDARD_IA"], "manifestFilePath": "s3://bucketmanifest/", "accountId": "123456789012", "roleArn": "arn:aws:iam::123456789012:role/example-role"})
Creating
__init__
__init__(sparkContext)
- sparkContext – The Apache Spark context to use.
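For example, a job script typically creates a GlueContext from the active SparkContext (the variable names here are illustrative):
from pyspark.context import SparkContext
from awsglue.context import GlueContext

# Reuse the running SparkContext and wrap it in a GlueContext
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)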
getSource
getSource(connection_type, transformation_ctx = "", **options)
Creates a DataSource object that can be used to read DynamicFrames from external sources.
- connection_type – The connection type to use, such as Amazon Simple Storage Service (Amazon S3), Amazon Redshift, and JDBC. Valid values include s3, mysql, postgresql, redshift, sqlserver, oracle, and dynamodb.
- transformation_ctx – The transformation context to use (optional).
- options – A collection of optional name-value pairs. For more information, see Connection Types and Options for ETL in AWS Glue.
The following is an example of using getSource.
>>> data_source = context.getSource("file", paths=["/in/path"])
>>> data_source.setFormat("json")
>>> myFrame = data_source.getFrame()
create_dynamic_frame_from_rdd
create_dynamic_frame_from_rdd(data, name, schema=None, sample_ratio=None, transformation_ctx="")
Returns a DynamicFrame
that is created from an Apache Spark Resilient Distributed
Dataset (RDD).
- data – The data source to use.
- name – The name of the data to use.
- schema – The schema to use (optional).
- sample_ratio – The sample ratio to use (optional).
- transformation_ctx – The transformation context to use (optional).
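A minimal sketch, assuming sc and glueContext were created as shown for __init__ above:
from pyspark.sql import Row

# Build a small RDD of Rows and wrap it in a DynamicFrame
rdd = sc.parallelize([Row(name="a", value=1), Row(name="b", value=2)])
dyf = glueContext.create_dynamic_frame_from_rdd(rdd, "sample_data")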
create_dynamic_frame_from_catalog
create_dynamic_frame_from_catalog(database, table_name, redshift_tmp_dir, transformation_ctx = "", push_down_predicate = "", additional_options = {}, catalog_id = None)
Returns a DynamicFrame
that is created using a catalog database and table
name.
- database – The database to read from.
- table_name – The name of the table to read from.
- redshift_tmp_dir – An Amazon Redshift temporary directory to use (optional).
- transformation_ctx – The transformation context to use (optional).
- push_down_predicate – Filters partitions without having to list and read all the files in your dataset. For more information, see Pre-Filtering Using Pushdown Predicates.
- additional_options – A collection of optional name-value pairs. The possible options include those listed in Connection Types and Options for ETL in AWS Glue except for endpointUrl, streamName, bootstrap.servers, security.protocol, topicName, classification, and delimiter.
- catalog_id – The catalog ID (account ID) of the Data Catalog being accessed. When None, the default account ID of the caller is used.
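For example, to read a catalog table while pruning partitions with a pushdown predicate (the database, table, and predicate values are placeholders):
# Read only the partitions that satisfy the predicate
dyf = glueContext.create_dynamic_frame_from_catalog(
    database="my_database",
    table_name="my_table",
    redshift_tmp_dir="",
    push_down_predicate="year == '2021'",
    transformation_ctx="read_my_table")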
create_dynamic_frame_from_options
create_dynamic_frame_from_options(connection_type, connection_options={},
format=None, format_options={}, transformation_ctx = "")
Returns a DynamicFrame
created with the specified connection and
format.
- connection_type – The connection type, such as Amazon S3, Amazon Redshift, and JDBC. Valid values include s3, mysql, postgresql, redshift, sqlserver, oracle, and dynamodb.
- connection_options – Connection options, such as paths and database table (optional). For a connection_type of s3, a list of Amazon S3 paths is defined.
  connection_options = {"paths": ["s3://aws-glue-target/temp"]}
  For JDBC connections, several properties must be defined. Note that the database name must be part of the URL. It can optionally be included in the connection options.
  connection_options = {"url": "jdbc-url/database", "user": "username", "password": "password", "dbtable": "table-name", "redshiftTmpDir": "s3-tempdir-path"}
  The dbtable property is the name of the JDBC table. For JDBC data stores that support schemas within a database, specify schema.table-name. If a schema is not provided, then the default "public" schema is used.
  For more information, see Connection Types and Options for ETL in AWS Glue.
- format – A format specification (optional). This is used for an Amazon S3 or an AWS Glue connection that supports multiple formats. See Format Options for ETL Inputs and Outputs in AWS Glue for the formats that are supported.
- format_options – Format options for the specified format. See Format Options for ETL Inputs and Outputs in AWS Glue for the formats that are supported.
- transformation_ctx – The transformation context to use (optional).
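For example, a sketch of reading JSON files from an Amazon S3 prefix (the bucket and path are placeholders):
# Read all JSON objects under the given S3 prefix into a DynamicFrame
dyf = glueContext.create_dynamic_frame_from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-bucket/input/"]},
    format="json",
    transformation_ctx="read_json_input")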
add_ingestion_time_columns
add_ingestion_time_columns(dataFrame, timeGranularity = "")
Appends ingestion time columns like ingest_year, ingest_month, ingest_day, ingest_hour, and ingest_minute to the input DataFrame. This function is automatically generated in the script generated by AWS Glue when you specify a Data Catalog table with Amazon S3 as the target. This function automatically updates the partition with ingestion time columns on the output table, which allows the output data to be automatically partitioned on ingestion time without requiring explicit ingestion time columns in the input data.
- dataFrame – The dataFrame to append the ingestion time columns to.
- timeGranularity – The granularity of the time columns. Valid values are "day", "hour", and "minute". For example, if "hour" is passed in to the function, the original dataFrame will have "ingest_year", "ingest_month", "ingest_day", and "ingest_hour" time columns appended.
Returns the data frame after appending the time granularity columns.
Example:
dynamic_frame = DynamicFrame.fromDF(glueContext.add_ingestion_time_columns(dataFrame, "hour"), glueContext, "dynamic_frame")
Writing
getSink
getSink(connection_type, format = None, transformation_ctx = "", **options)
Gets a DataSink
object that can be used to write DynamicFrames
to external sources. Check the SparkSQL format
first to be sure to get the expected sink.
- connection_type – The connection type to use, such as Amazon S3, Amazon Redshift, and JDBC. Valid values include s3, mysql, postgresql, redshift, sqlserver, and oracle.
- format – The SparkSQL format to use (optional).
- transformation_ctx – The transformation context to use (optional).
- options – A collection of option name-value pairs.
For example:
>>> data_sink = context.getSink("s3")
>>> data_sink.setFormat("json")
>>> data_sink.writeFrame(myFrame)
write_dynamic_frame_from_options
write_dynamic_frame_from_options(frame, connection_type, connection_options={}, format=None,
format_options={}, transformation_ctx = "")
Writes and returns a DynamicFrame
using the specified connection and
format.
- frame – The DynamicFrame to write.
- connection_type – The connection type, such as Amazon S3, Amazon Redshift, and JDBC. Valid values include s3, mysql, postgresql, redshift, sqlserver, and oracle.
- connection_options – Connection options, such as path and database table (optional). For a connection_type of s3, an Amazon S3 path is defined.
  connection_options = {"path": "s3://aws-glue-target/temp"}
  For JDBC connections, several properties must be defined. Note that the database name must be part of the URL. It can optionally be included in the connection options.
  connection_options = {"url": "jdbc-url/database", "user": "username", "password": "password", "dbtable": "table-name", "redshiftTmpDir": "s3-tempdir-path"}
  The dbtable property is the name of the JDBC table. For JDBC data stores that support schemas within a database, specify schema.table-name. If a schema is not provided, then the default "public" schema is used.
  For more information, see Connection Types and Options for ETL in AWS Glue.
- format – A format specification (optional). This is used for an Amazon S3 or an AWS Glue connection that supports multiple formats. See Format Options for ETL Inputs and Outputs in AWS Glue for the formats that are supported.
- format_options – Format options for the specified format. See Format Options for ETL Inputs and Outputs in AWS Glue for the formats that are supported.
- transformation_ctx – A transformation context to use (optional).
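For example, to write a DynamicFrame to Amazon S3 as Parquet (the bucket and path are placeholders):
# Write the DynamicFrame out as Parquet files under the given S3 prefix
glueContext.write_dynamic_frame_from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/output/"},
    format="parquet",
    transformation_ctx="write_parquet_output")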
write_from_options
write_from_options(frame_or_dfc, connection_type,
connection_options={}, format={}, format_options={}, transformation_ctx = "")
Writes and returns a DynamicFrame
or DynamicFrameCollection
that is created with the specified connection and format information.
- frame_or_dfc – The DynamicFrame or DynamicFrameCollection to write.
- connection_type – The connection type, such as Amazon S3, Amazon Redshift, and JDBC. Valid values include s3, mysql, postgresql, redshift, sqlserver, and oracle.
- connection_options – Connection options, such as path and database table (optional). For a connection_type of s3, an Amazon S3 path is defined.
  connection_options = {"path": "s3://aws-glue-target/temp"}
  For JDBC connections, several properties must be defined. Note that the database name must be part of the URL. It can optionally be included in the connection options.
  connection_options = {"url": "jdbc-url/database", "user": "username", "password": "password", "dbtable": "table-name", "redshiftTmpDir": "s3-tempdir-path"}
  The dbtable property is the name of the JDBC table. For JDBC data stores that support schemas within a database, specify schema.table-name. If a schema is not provided, then the default "public" schema is used.
  For more information, see Connection Types and Options for ETL in AWS Glue.
- format – A format specification (optional). This is used for an Amazon S3 or an AWS Glue connection that supports multiple formats. See Format Options for ETL Inputs and Outputs in AWS Glue for the formats that are supported.
- format_options – Format options for the specified format. See Format Options for ETL Inputs and Outputs in AWS Glue for the formats that are supported.
- transformation_ctx – A transformation context to use (optional).
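The call mirrors write_dynamic_frame_from_options; a minimal sketch, assuming dyf is a DynamicFrame created earlier:
# frame_or_dfc can also be a DynamicFrameCollection
glueContext.write_from_options(
    frame_or_dfc=dyf,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/output/"},
    format="json")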
write_dynamic_frame_from_catalog
write_dynamic_frame_from_catalog(frame, database, table_name, redshift_tmp_dir, transformation_ctx = "", additional_options = {}, catalog_id = None)
Writes and returns a DynamicFrame
using a catalog database and a table
name.
- frame – The DynamicFrame to write.
- database – The database to write to.
- table_name – The name of the table to write to.
- redshift_tmp_dir – An Amazon Redshift temporary directory to use (optional).
- transformation_ctx – The transformation context to use (optional).
- catalog_id – The catalog ID (account ID) of the Data Catalog being accessed. When None, the default account ID of the caller is used.
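For example, to write to a table that is already defined in the Data Catalog (the database and table names are placeholders):
# Write dyf to the table defined in the Data Catalog
glueContext.write_dynamic_frame_from_catalog(
    frame=dyf,
    database="my_database",
    table_name="my_output_table",
    redshift_tmp_dir="",
    transformation_ctx="write_my_output_table")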
write_dynamic_frame_from_jdbc_conf
write_dynamic_frame_from_jdbc_conf(frame, catalog_connection, connection_options={},
redshift_tmp_dir = "", transformation_ctx = "", catalog_id = None)
Writes and returns a DynamicFrame
using the specified JDBC connection
information.
- frame – The DynamicFrame to write.
- catalog_connection – A catalog connection to use.
- connection_options – Connection options, such as path and database table (optional). For more information, see Connection Types and Options for ETL in AWS Glue.
- redshift_tmp_dir – An Amazon Redshift temporary directory to use (optional).
- transformation_ctx – A transformation context to use (optional).
- catalog_id – The catalog ID (account ID) of the Data Catalog being accessed. When None, the default account ID of the caller is used.
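A minimal sketch, assuming a JDBC connection named my-jdbc-connection already exists in the Data Catalog (the connection, table, and path values are placeholders):
# Write dyf to a JDBC table through the cataloged connection
glueContext.write_dynamic_frame_from_jdbc_conf(
    frame=dyf,
    catalog_connection="my-jdbc-connection",
    connection_options={"dbtable": "schema.table_name", "database": "my_database"},
    redshift_tmp_dir="s3://my-bucket/temp/")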
write_from_jdbc_conf
write_from_jdbc_conf(frame_or_dfc, catalog_connection, connection_options={}, redshift_tmp_dir = "", transformation_ctx = "", catalog_id = None)
Writes and returns a DynamicFrame
or DynamicFrameCollection
using the specified JDBC connection information.
- frame_or_dfc – The DynamicFrame or DynamicFrameCollection to write.
- catalog_connection – A catalog connection to use.
- connection_options – Connection options, such as path and database table (optional). For more information, see Connection Types and Options for ETL in AWS Glue.
- redshift_tmp_dir – An Amazon Redshift temporary directory to use (optional).
- transformation_ctx – A transformation context to use (optional).
- catalog_id – The catalog ID (account ID) of the Data Catalog being accessed. When None, the default account ID of the caller is used.
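Usage mirrors write_dynamic_frame_from_jdbc_conf, except that the first argument can also be a DynamicFrameCollection; for example (placeholder names as above):
glueContext.write_from_jdbc_conf(
    frame_or_dfc=dyf,
    catalog_connection="my-jdbc-connection",
    connection_options={"dbtable": "schema.table_name", "database": "my_database"},
    redshift_tmp_dir="s3://my-bucket/temp/")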
Extracting
extract_jdbc_conf
extract_jdbc_conf(connection_name, catalog_id = None)
Returns a dict with keys user, password, vendor, and url from the connection object in the Data Catalog.
- connection_name – The name of the connection in the Data Catalog.
- catalog_id – The catalog ID (account ID) of the Data Catalog being accessed. When None, the default account ID of the caller is used.
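For example, to retrieve the connection properties for a Data Catalog connection named my-jdbc-connection (a placeholder name):
# The returned dict exposes the user, password, vendor, and url keys
jdbc_conf = glueContext.extract_jdbc_conf("my-jdbc-connection")
print(jdbc_conf["url"], jdbc_conf["user"], jdbc_conf["vendor"])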