Built-In Transforms
AWS Glue provides a set of built-in transforms that you can use to process your data. You
can call these transforms from your ETL script. Your data passes from transform to
transform in a data structure called a DynamicFrame, which is an extension to an
Apache Spark SQL DataFrame. The DynamicFrame contains your data, and you reference its
schema to process your data. For more information about these transforms, see AWS Glue
PySpark Transforms Reference.
AWS Glue provides the following built-in transforms:
- ApplyMapping
  Maps source columns and data types from a DynamicFrame to target columns and data types
  in a returned DynamicFrame. You specify the mapping argument, which is a list of tuples
  that contain source column, source type, target column, and target type.
- DropFields
  Removes a field from a DynamicFrame. The output DynamicFrame contains fewer fields than
  the input. You specify which fields to remove using the paths argument. The paths
  argument points to a field in the schema tree structure using dot notation. For example,
  to remove field B, which is a child of field A in the tree, type A.B for the path.
- DropNullFields
  Removes null fields from a DynamicFrame. The output DynamicFrame does not contain fields
  of the null type in the schema.
- Filter
  Selects records from a DynamicFrame and returns a filtered DynamicFrame. You specify a
  function, such as a lambda function, which determines whether a record is output
  (function returns true) or not (function returns false).
- Join
  Performs an equijoin of two DynamicFrames. You specify the key fields in the schema of
  each frame to compare for equality. The output DynamicFrame contains rows where keys
  match.
- Map
  Applies a function to the records of a DynamicFrame and returns a transformed
  DynamicFrame. The supplied function is applied to each input record and transforms it to
  an output record. The map transform can add fields, delete fields, and perform lookups
  using an external API operation. If there is an exception, processing continues, and the
  record is marked as an error.
- MapToCollection
  Applies a transform to each DynamicFrame in a DynamicFrameCollection.
- Relationalize
  Converts a DynamicFrame to a relational (rows and columns) form. Based on the data's
  schema, this transform flattens nested structures and creates DynamicFrames from array
  structures. The output is a collection of DynamicFrames that can result in data written
  to multiple tables.
- RenameField
  Renames a field in a DynamicFrame. The output is a DynamicFrame with the specified field
  renamed. You provide the new name and the path in the schema to the field to be renamed.
- ResolveChoice
  Use ResolveChoice to specify how a column should be handled when it contains values of
  multiple types. You can choose to either cast the column to a single data type, discard
  one or more of the types, or retain all types in either separate columns or a structure.
  You can select a different resolution policy for each column or specify a global policy
  that is applied to all columns.
- SelectFields
  Selects fields from a DynamicFrame to keep. The output is a DynamicFrame with only the
  selected fields. You provide the paths in the schema to the fields to keep.
- SelectFromCollection
  Selects one DynamicFrame from a collection of DynamicFrames. The output is the selected
  DynamicFrame. You provide an index to the DynamicFrame to select.
- Spigot
  Writes sample data from a DynamicFrame. The output is a JSON file in Amazon S3. You
  specify the Amazon S3 location and how to sample the DynamicFrame. Sampling can be a
  specified number of records from the beginning of the file or a probability factor used
  to pick records to write.
- SplitFields
  Splits fields into two DynamicFrames. The output is a collection of DynamicFrames: one
  with selected fields, and one with the remaining fields. You provide the paths in the
  schema to the selected fields.
- SplitRows
  Splits rows in a DynamicFrame based on a predicate. The output is a collection of two
  DynamicFrames: one with selected rows, and one with the remaining rows. You provide the
  comparison based on fields in the schema. For example, A > 4.
- Unbox
  Unboxes a string field from a DynamicFrame. The output is a DynamicFrame with the
  selected string field reformatted. The string field can be parsed and replaced with
  several fields. You provide a path in the schema for the string field to reformat and its
  current format type. For example, you might have a CSV file that has one field in JSON
  format, such as {"a": 3, "b": "foo", "c": 1.2}. This transform can reformat the JSON into
  three fields: an int, a string, and a double.
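
To make the per-record behavior of a few of these transforms concrete, the following is a minimal sketch in plain Python that operates on dictionaries rather than on a DynamicFrame. It is not the AWS Glue API: the helper names (apply_mapping, drop_field, split_rows, unbox), the small cast table, and the sample rows are all assumptions made for illustration. It only mirrors the descriptions above of the mapping tuples, the dot-notation path, the row predicate, and the JSON string field.

```python
import json

# Hypothetical helpers that imitate the described behavior on plain dicts.
# They are illustrations only and are not part of AWS Glue.

def apply_mapping(record, mappings):
    """Rename and cast top-level fields using
    (source, source_type, target, target_type) tuples."""
    casts = {"int": int, "long": int, "double": float, "string": str}
    out = {}
    for source, _source_type, target, target_type in mappings:
        if source in record:
            out[target] = casts[target_type](record[source])
    return out

def drop_field(record, path):
    """Return a copy of record without the field at a dot-notation path
    such as 'A.B'. The input record is left unchanged."""
    head, _, rest = path.partition(".")
    out = dict(record)
    if head not in out:
        return out
    if not rest:
        del out[head]
    elif isinstance(out[head], dict):
        out[head] = drop_field(out[head], rest)
    return out

def split_rows(records, predicate):
    """Split records into (selected, remaining) based on a predicate."""
    selected = [r for r in records if predicate(r)]
    remaining = [r for r in records if not predicate(r)]
    return selected, remaining

def unbox(record, field):
    """Parse the JSON string stored in a top-level field and replace the
    string with the parsed structure."""
    out = dict(record)
    out[field] = json.loads(out[field])
    return out

# Sample data invented for this sketch.
rows = [
    {"id": "1", "A": {"B": "x", "C": 5}, "payload": '{"a": 3, "b": "foo", "c": 1.2}'},
    {"id": "2", "A": {"B": "y", "C": 3}, "payload": '{"a": 7, "b": "bar", "c": 0.5}'},
]

mapped = apply_mapping(rows[0], [("id", "string", "record_id", "int")])
dropped = drop_field(rows[0], "A.B")
selected, remaining = split_rows(rows, lambda r: r["A"]["C"] > 4)
unboxed = unbox(rows[0], "payload")
```

In this sketch, mapped holds only the renamed and cast field, dropped keeps A.C but not A.B, selected contains the row where A.C is greater than 4, and the payload field of unboxed holds an int, a string, and a double.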
For examples of using these transforms in a job script, see the AWS blog post Building an
AWS Glue ETL pipeline locally without an AWS account.
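
As a rough sketch of how several built-in transforms might be chained in a PySpark job script, the function below reads a table from the Data Catalog and passes the resulting DynamicFrame through ApplyMapping, DropFields, and Filter. The function name build_pipeline, the column names (id, score, notes), and the database and table arguments are placeholders assumed for illustration. The code runs only where the awsglue library and a GlueContext are available, such as inside an AWS Glue job.

```python
def build_pipeline(glue_context, database, table_name):
    """Chain three built-in transforms over a catalog table.

    All column names below are placeholders; adjust them to your schema.
    """
    # Imported lazily so this module can be loaded where awsglue is absent.
    from awsglue.transforms import ApplyMapping, DropFields, Filter

    # Read the source table into a DynamicFrame.
    source = glue_context.create_dynamic_frame.from_catalog(
        database=database, table_name=table_name
    )

    # Rename "id" to "record_id" and cast it from string to long.
    mapped = ApplyMapping.apply(
        frame=source,
        mappings=[
            ("id", "string", "record_id", "long"),
            ("score", "double", "score", "double"),
            ("notes", "string", "notes", "string"),
        ],
    )

    # Remove a field that is not needed downstream.
    trimmed = DropFields.apply(frame=mapped, paths=["notes"])

    # Keep only the records for which the function returns true.
    return Filter.apply(frame=trimmed, f=lambda record: record["score"] > 4)
```

Each step takes a DynamicFrame and returns a new one, so the intermediate frames can be inspected or written out separately while you develop the script.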