Built-In Transforms - AWS Glue

Built-In Transforms

AWS Glue provides a set of built-in transforms that you can use to process your data. You can call these transforms from your ETL script. Your data passes from transform to transform in a data structure called a DynamicFrame, which is an extension to an Apache Spark SQL DataFrame. The DynamicFrame contains your data, and you reference its schema to process your data. For more information about these transforms, see AWS Glue PySpark Transforms Reference.

AWS Glue provides the following built-in transforms:


Maps source columns and data types from a DynamicFrame to target columns and data types in a returned DynamicFrame. You specify the mapping argument, which is a list of tuples that contain source column, source type, target column, and target type.


Removes a field from a DynamicFrame. The output DynamicFrame contains fewer fields than the input. You specify which fields to remove using the paths argument. The paths argument points to a field in the schema tree structure using dot notation. For example, to remove field B, which is a child of field A in the tree, type A.B for the path.


Removes null fields from a DynamicFrame. The output DynamicFrame does not contain fields of the null type in the schema.


Selects records from a DynamicFrame and returns a filtered DynamicFrame. You specify a function, such as a Lambda function, which determines whether a record is output (function returns true) or not (function returns false).


Equijoin of two DynamicFrames. You specify the key fields in the schema of each frame to compare for equality. The output DynamicFrame contains rows where keys match.


Applies a function to the records of a DynamicFrame and returns a transformed DynamicFrame. The supplied function is applied to each input record and transforms it to an output record. The map transform can add fields, delete fields, and perform lookups using an external API operation. If there is an exception, processing continues, and the record is marked as an error.


Applies a transform to each DynamicFrame in a DynamicFrameCollection.


Converts a DynamicFrame to a relational (rows and columns) form. Based on the data's schema, this transform flattens nested structures and creates DynamicFrames from arrays structures. The output is a collection of DynamicFrames that can result in data written to multiple tables.


Renames a field in a DynamicFrame. The output is a DynamicFrame with the specified field renamed. You provide the new name and the path in the schema to the field to be renamed.


Use ResolveChoice to specify how a column should be handled when it contains values of multiple types. You can choose to either cast the column to a single data type, discard one or more of the types, or retain all types in either separate columns or a structure. You can select a different resolution policy for each column or specify a global policy that is applied to all columns.


Selects fields from a DynamicFrame to keep. The output is a DynamicFrame with only the selected fields. You provide the paths in the schema to the fields to keep.


Selects one DynamicFrame from a collection of DynamicFrames. The output is the selected DynamicFrame. You provide an index to the DynamicFrame to select.


Writes sample data from a DynamicFrame. Output is a JSON file in Amazon S3. You specify the Amazon S3 location and how to sample the DynamicFrame. Sampling can be a specified number of records from the beginning of the file or a probability factor used to pick records to write.


Splits fields into two DynamicFrames. Output is a collection of DynamicFrames: one with selected fields, and one with the remaining fields. You provide the paths in the schema to the selected fields.


Splits rows in a DynamicFrame based on a predicate. The output is a collection of two DynamicFrames: one with selected rows, and one with the remaining rows. You provide the comparison based on fields in the schema. For example, A > 4.


Unboxes a string field from a DynamicFrame. The output is a DynamicFrame with the selected string field reformatted. The string field can be parsed and replaced with several fields. You provide a path in the schema for the string field to reformat and its current format type. For example, you might have a CSV file that has one field that is in JSON format {"a": 3, "b": "foo", "c": 1.2}. This transform can reformat the JSON into three fields: an int, a string, and a double.

For examples of using these transforms in a job script, see the AWS blog Building an AWS Glue ETL pipeline locally without an AWS account.