DynamicFrameReader class

Methods

__init__

__init__(glue_context)
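
You typically do not construct a DynamicFrameReader yourself. A GlueContext creates one and exposes it as its create_dynamic_frame attribute, which is how the from_* methods below are usually invoked. A minimal sketch:

  from pyspark.context import SparkContext
  from awsglue.context import GlueContext

  sc = SparkContext.getOrCreate()
  glueContext = GlueContext(sc)

  # glueContext.create_dynamic_frame is a DynamicFrameReader instance
  reader = glueContext.create_dynamic_frame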

from_rdd

from_rdd(data, name, schema=None, sampleRatio=None)

Reads a DynamicFrame from a Resilient Distributed Dataset (RDD).

  • data – The dataset to read from.

  • name – The name to assign to the resulting DynamicFrame.

  • schema – The schema to read (optional).

  • sampleRatio – The sample ratio (optional).
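
For example, a minimal sketch that reads a DynamicFrame from an RDD of Row objects (the data and the name "sample_frame" are illustrative):

  from pyspark.sql import Row

  # glueContext is an existing GlueContext; sc is its underlying SparkContext
  rdd = sc.parallelize([Row(name="a", value=1), Row(name="b", value=2)])
  dyf = glueContext.create_dynamic_frame.from_rdd(rdd, name="sample_frame")
  dyf.printSchema()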

from_options

from_options(connection_type, connection_options={}, format=None, format_options={}, transformation_ctx="", push_down_predicate="")

Reads a DynamicFrame using the specified connection and format.

  • connection_type – The connection type. Valid values include s3, mysql, postgresql, redshift, sqlserver, oracle, and dynamodb.

  • connection_options – Connection options, such as path and database table (optional). For a connection_type of s3, Amazon S3 paths are defined in an array.

    connection_options = {"paths": [ "s3://mybucket/object_a", "s3://mybucket/object_b"]}

    For JDBC connections, several properties must be defined. Note that the database name must be part of the URL. It can optionally be included in the connection options.

    Warning

    Storing passwords in your script is not recommended. Consider using boto3 to retrieve them from AWS Secrets Manager or the AWS Glue Data Catalog.
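
    For example, a minimal sketch of that approach, assuming a Secrets Manager secret named my-jdbc-secret whose JSON payload contains a password key:

    import json
    import boto3

    # Retrieve the secret value and parse its JSON payload
    secret = boto3.client("secretsmanager").get_secret_value(SecretId="my-jdbc-secret")
    passwordVariable = json.loads(secret["SecretString"])["password"]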

    connection_options = {"url": "jdbc-url/database", "user": "username", "password": passwordVariable,"dbtable": "table-name", "redshiftTmpDir": "s3-tempdir-path"}

    For a JDBC connection that performs parallel reads, you can set the hashfield option. For example:

    connection_options = {"url": "jdbc-url/database", "user": "username", "password": passwordVariable,"dbtable": "table-name", "redshiftTmpDir": "s3-tempdir-path" , "hashfield": "month"}

    For more information, see Reading from JDBC tables in parallel.

  • format – A format specification (optional). This is used for an Amazon Simple Storage Service (Amazon S3) or an AWS Glue connection that supports multiple formats. See Data format options for inputs and outputs in AWS Glue for Spark for the formats that are supported.

  • format_options – Format options for the specified format. See Data format options for inputs and outputs in AWS Glue for Spark for the formats that are supported.

  • transformation_ctx – The transformation context to use (optional).

  • push_down_predicate – Filters partitions without having to list and read all the files in your dataset. For more information, see Pre-filtering using pushdown predicates.
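
For example, a minimal sketch that reads JSON objects from Amazon S3 (the bucket and paths are placeholders):

  # glueContext is an existing GlueContext
  dyf = glueContext.create_dynamic_frame.from_options(
      connection_type="s3",
      connection_options={"paths": ["s3://mybucket/object_a", "s3://mybucket/object_b"]},
      format="json",
      transformation_ctx="datasource0")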

from_catalog

from_catalog(database, table_name, redshift_tmp_dir="", transformation_ctx="", push_down_predicate="", additional_options={})

Reads a DynamicFrame using the specified catalog namespace and table name.

  • database – The database to read from.

  • table_name – The name of the table to read from.

  • redshift_tmp_dir – An Amazon Redshift temporary directory to use (optional if not reading data from Redshift).

  • transformation_ctx – The transformation context to use (optional).

  • push_down_predicate – Filters partitions without having to list and read all the files in your dataset. For more information, see Pre-filtering using pushdown predicates.

  • additional_options – Additional options provided to AWS Glue.

    • To use a JDBC connection that performs parallel reads, you can set the hashfield, hashexpression, or hashpartitions options. For example:

      additional_options = {"hashfield": "month"}

      For more information, see Reading from JDBC tables in parallel.

    • To pass a catalog expression that filters based on the index columns, you can use the catalogPartitionPredicate option (a sketch appears at the end of this section).

      catalogPartitionPredicate — You can pass a catalog expression to filter based on the index columns. This pushes the filtering down to the server side. For more information, see AWS Glue Partition Indexes. Note that push_down_predicate and catalogPartitionPredicate use different syntaxes: the former uses standard Spark SQL syntax and the latter uses the JSQL parser.

      For more information, see Managing partitions for ETL output in AWS Glue.

    • To read from Lake Formation governed tables, you can use these additional options:

      • transactionId – (String) The transaction ID at which to read the Governed table contents. If this transaction is not committed, the read will be treated as part of that transaction and will see its writes. If this transaction is committed, its writes will be visible in this read. If this transaction has aborted, an error will be returned. Cannot be specified along with asOfTime.

        Note

        Either transactionId or asOfTime must be set to access the governed table.

      • asOfTime – (TimeStamp: yyyy-[m]m-[d]d hh:mm:ss) The time as of when to read the table contents. Cannot be specified along with transactionId.

      • query – (Optional) A PartiQL query statement used as an input to the Lake Formation planner service. If not set, the default setting is to select all data from the table. For more details about PartiQL, see PartiQL Support in Row Filter Expressions in the AWS Lake Formation Developer Guide.

      Example: Using a PartiQL query statement when reading from a governed table in Lake Formation

      txId = glueContext.start_transaction(read_only=False)
      datasource0 = glueContext.create_dynamic_frame.from_catalog(
          database = db,
          table_name = tbl,
          transformation_ctx = "datasource0",
          additional_options = {
              "transactionId": txId,
              "query": "SELECT * FROM tblName WHERE partitionKey=value;"
          })
      ...
      glueContext.commit_transaction(txId)
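
    Example: a sketch of the catalogPartitionPredicate option described above (the database, table, and partition columns are placeholders)

    # Push partition filtering down to the Data Catalog via partition indexes
    dyf = glueContext.create_dynamic_frame.from_catalog(
        database = "mydb",
        table_name = "mytable",
        transformation_ctx = "datasource1",
        additional_options = {"catalogPartitionPredicate": "year='2021' and month='04'"})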