Format Options for ETL Inputs and Outputs in AWS Glue
Various AWS Glue PySpark and Scala methods and transforms specify their input and/or output format using a format parameter and a format_options parameter. These parameters can take the following values.
Currently, the only formats that streaming ETL jobs support are JSON, CSV, Parquet, ORC, Avro, and Grok.
For jobs that access AWS Lake Formation governed tables, AWS Glue supports reading and writing all formats supported by Lake Formation governed tables. For the current list of supported formats for AWS Lake Formation governed tables, see Notes and Restrictions for Governed Tables in the AWS Lake Formation Developer Guide.
For writing Apache Parquet, AWS Glue ETL supports writing to a governed table only by specifying an option for a custom Parquet writer type optimized for Dynamic Frames. When writing to a governed table with the parquet format, add the key useGlueParquetWriter with a value of true in the table parameters, as in the sketch below.
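For example, a minimal sketch of adding this key with the AWS SDK for Python (Boto3). The database and table names are hypothetical, and governed tables may additionally require a Lake Formation transaction context, which is omitted here:

    import boto3

    glue = boto3.client("glue")

    # Fetch the existing definition; update_table replaces it wholesale,
    # so carry over the fields that should be kept.
    table = glue.get_table(DatabaseName="my_db", Name="my_governed_table")["Table"]

    parameters = dict(table.get("Parameters", {}))
    parameters["useGlueParquetWriter"] = "true"

    glue.update_table(
        DatabaseName="my_db",
        TableInput={
            "Name": table["Name"],
            "StorageDescriptor": table["StorageDescriptor"],
            "PartitionKeys": table.get("PartitionKeys", []),
            "TableType": table.get("TableType", "GOVERNED"),
            "Parameters": parameters,
        },
    )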
format="avro"
This value designates the Apache Avro
You can use the following format_options
values with
format="avro"
:
- version — Specifies the version of Apache Avro reader/writer format to support. The default is "1.7". You can specify format_options={"version": "1.8"} to enable Avro logical type reading and writing. For more information, see the Apache Avro 1.7.7 Specification and the Apache Avro 1.8.2 Specification. The Apache Avro 1.8 connector supports the following logical type conversions:
For the reader: this table shows the conversion between Avro data type (logical type and Avro primitive type) and AWS Glue DynamicFrame data type for Avro reader 1.7 and 1.8.
| Avro Data Type: Logical Type | Avro Data Type: Avro Primitive Type | AWS Glue DynamicFrame Data Type: Avro Reader 1.7 | AWS Glue DynamicFrame Data Type: Avro Reader 1.8 |
|---|---|---|---|
| Decimal | bytes | BINARY | Decimal |
| Decimal | fixed | BINARY | Decimal |
| Date | int | INT | Date |
| Time (millisecond) | int | INT | INT |
| Time (microsecond) | long | LONG | LONG |
| Timestamp (millisecond) | long | LONG | Timestamp |
| Timestamp (microsecond) | long | LONG | LONG |
| Duration (not a logical type) | fixed of 12 | BINARY | BINARY |
For the writer: this table shows the conversion between AWS Glue DynamicFrame data type and Avro data type for Avro writer 1.7 and 1.8.
| AWS Glue DynamicFrame Data Type | Avro Data Type: Avro Writer 1.7 | Avro Data Type: Avro Writer 1.8 |
|---|---|---|
| Decimal | String | decimal |
| Date | String | date |
| Timestamp | String | timestamp-micros |
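For example, the following read (the S3 path is illustrative) enables the 1.8 reader so that Avro logical types are converted as shown in the tables above:

    datasource0 = glueContext.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={"paths": ["s3://s3path"]},
        format="avro",
        format_options={"version": "1.8"},  # enables logical type reading
        transformation_ctx="datasource0",
    )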
format="csv"
This value designates comma-separated-values
as the data format (for example,
see RFC 4180
You can use the following format_options
values with
format="csv"
:
- separator — Specifies the delimiter character. The default is a comma: ",", but any other character can be specified.
- escaper — Specifies a character to use for escaping. This option is used only when reading CSV files. The default value is none. If enabled, the character that immediately follows is used as-is, except for a small set of well-known escapes (\n, \r, \t, and \0).
- quoteChar — Specifies the character to use for quoting. The default is a double quote: '"'. Set this to -1 to turn off quoting entirely.
- multiLine — A Boolean value that specifies whether a single record can span multiple lines. This can occur when a field contains a quoted new-line character. You must set this option to True if any record spans multiple lines. The default value is False, which allows for more aggressive file-splitting during parsing.
- withHeader — A Boolean value that specifies whether to treat the first line as a header. The default value is False. This option can be used in the DynamicFrameReader class.
- writeHeader — A Boolean value that specifies whether to write the header to output. The default value is True. This option can be used in the DynamicFrameWriter class.
- skipFirst — A Boolean value that specifies whether to skip the first data line. The default value is False.
- optimizePerformance — A Boolean value that specifies whether to use the advanced SIMD CSV reader along with Apache Arrow based columnar memory formats. Only available in AWS Glue 3.0.
The following example shows how to specify the format options within an AWS Glue ETL job script.
    glueContext.write_dynamic_frame.from_options(
        frame=datasource1,
        connection_type="s3",
        connection_options={"path": "s3://s3path"},
        format="csv",
        format_options={
            "quoteChar": -1,
            "separator": "|",
        },
        transformation_ctx="datasink2",
    )
Using the vectorized SIMD CSV reader with Apache Arrow columnar format
AWS Glue version 3.0 adds support for using Apache Arrow as the in-memory columnar format, which allows for batching records together in columnar fashion. Batching reduces CPU bottlenecks related to handling changing schema across record batches and increases retrieval efficiency of data from column buffers. AWS Glue vectorizes the CSV readers with the use of CPU SIMD instruction sets and microparallel algorithms. With these optimizations, AWS Glue version 3.0 achieves a significant performance speedup compared to using row-based AWS Glue DynamicFrames or Spark DataFrames.
To use the optimized reader with Arrow format, set "optimizePerformance" to true in the format_options or table property.

    datasource0 = glueContext.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={"paths": ["s3://s3path"]},
        format="csv",
        format_options={
            "optimizePerformance": True,
            "separator": ",",
        },
        transformation_ctx="datasource0",
    )
Limitations for the vectorized CSV reader
Note the following limitations:
- It doesn't support the multiLine and escaper format options. It uses the default escaper of the double quote character '"'. When these options are set, AWS Glue automatically falls back to using the row-based CSV reader.
- It doesn't support creating a DynamicFrame with ChoiceType.
- It doesn't support creating a DynamicFrame with error records.
- It doesn't support reading CSV files with multibyte characters, such as Japanese or Chinese characters.
format="ion"
This value designates Amazon Ion
Currently, AWS Glue does not support ion
for output.
There are no format_options
values for format="ion"
.
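For example, a minimal read of Ion data from Amazon S3 (the path is illustrative):

    datasource0 = glueContext.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={"paths": ["s3://s3path"]},
        format="ion",
        transformation_ctx="datasource0",
    )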
format="grokLog"
This value designates a log data format specified by one or more Logstash Grok patterns
(for example, see Logstash
Reference (6.2]: Grok filter plugin
Currently, AWS Glue does not support groklog
for output.
You can use the following format_options
values with
format="grokLog"
:
- logFormat — Specifies the Grok pattern that matches the log's format.
- customPatterns — Specifies additional Grok patterns used here.
- MISSING — Specifies the signal to use in identifying missing values. The default is '-'.
- LineCount — Specifies the number of lines in each log record. The default is '1', and currently only single-line records are supported.
- StrictMode — A Boolean value that specifies whether strict mode is turned on. In strict mode, the reader doesn't do automatic type conversion or recovery. The default value is "false".
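For example, a sketch of reading Apache access logs (the path is illustrative); %{COMMONAPACHELOG} is one of the patterns shipped with the standard Grok pattern set:

    datasource0 = glueContext.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={"paths": ["s3://s3path"]},
        format="grokLog",
        format_options={"logFormat": "%{COMMONAPACHELOG}"},
        transformation_ctx="datasource0",
    )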
format="json"
This value designates a JSON
You can use the following format_options
values with
format="json"
:
- jsonPath — A JsonPath expression that identifies an object to be read into records. This is particularly useful when a file contains records nested inside an outer array. For example, the following JsonPath expression targets the id field of a JSON object:

      format="json", format_options={"jsonPath": "$.id"}

- multiLine — A Boolean value that specifies whether a single record can span multiple lines. This can occur when a field contains a quoted new-line character. You must set this option to "true" if any record spans multiple lines. The default value is "false", which allows for more aggressive file-splitting during parsing.
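For example, a sketch of reading records nested inside a hypothetical top-level "records" array, with multi-line records allowed (the path and field name are illustrative):

    datasource0 = glueContext.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={"paths": ["s3://s3path"]},
        format="json",
        format_options={
            "jsonPath": "$.records[*]",  # hypothetical outer array
            "multiLine": "true",
        },
        transformation_ctx="datasource0",
    )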
format="orc"
This value designates Apache ORC
There are no format_options
values for format="orc"
. However,
any options that are accepted by the underlying SparkSQL code can be passed to it by way of
the connection_options
map parameter.
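For example, a minimal ORC read (the path is illustrative); because there are no format_options, any SparkSQL ORC options would be passed through connection_options:

    datasource0 = glueContext.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={"paths": ["s3://s3path"]},
        format="orc",
        transformation_ctx="datasource0",
    )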
format="parquet"
This value designates Apache
ParquetuseGlueParquetWriter
option the writer computes and modifies the
schema dynamically, as data comes in.
You can use the following format_options
values:
- useGlueParquetWriter — Specifies the use of a custom Parquet writer type that is optimized for Dynamic Frames. The output format is Parquet.
See the following example, where sink is the object returned by glue_context.getSink().

    sink.setFormat("parquet", useGlueParquetWriter=True)
Limitations when specifying useGlueParquetWriter:

- The writer supports only schema evolution, such as adding or removing columns, but not changing column types, such as with ResolveChoice.
- The writer is not able to store a schema-only file.
- The option can only be passed as a format for data targets.
Additionally, any options that are accepted by the underlying SparkSQL code can be passed to this format by way of the connection_options map parameter. For example, a Spark configuration such as mergeSchema can be set for the AWS Glue Spark reader to merge the schema of all files.
- compression — Specifies the compression codec used when writing Parquet files. The compression codec used with the glueparquet format is fully compatible with org.apache.parquet.hadoop.metadata.CompressionCodecName*, which includes support for "uncompressed", "snappy", "gzip", and "lzo". The default value is "snappy".
- blockSize — Specifies the size in bytes of a row group being buffered in memory. The default value is 134217728, or 128 MB.
- pageSize — Specifies the size in bytes of the smallest unit that must be read fully to access a single record. The default value is 1048576, or 1 MB.
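For example, a sketch of a write that combines the Glue Parquet writer with these options (the path and the chosen values are illustrative):

    glueContext.write_dynamic_frame.from_options(
        frame=datasource1,
        connection_type="s3",
        connection_options={"path": "s3://s3path"},
        format="parquet",
        format_options={
            "useGlueParquetWriter": True,
            "compression": "gzip",   # default is "snappy"
            "blockSize": 134217728,  # 128 MB row groups
            "pageSize": 1048576,     # 1 MB pages
        },
        transformation_ctx="datasink2",
    )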
format="xml"
This value designates XML as the data format, parsed through a fork of the XML Data Source for Apache Spark
Currently, AWS Glue does not support "xml" for output.
You can use the following format_options
values with
format="xml"
:
- rowTag — Specifies the XML tag in the file to treat as a row (see the sketch after this list). Row tags cannot be self-closing.
- encoding — Specifies the character encoding. The default value is "UTF-8".
- excludeAttribute — A Boolean value that specifies whether you want to exclude attributes in elements or not. The default value is "false".
- treatEmptyValuesAsNulls — A Boolean value that specifies whether to treat white space as a null value. The default value is "false".
- attributePrefix — A prefix for attributes to differentiate them from elements. This prefix is used for field names. The default value is "_".
- valueTag — The tag used for a value when there are attributes in the element that have no child. The default is "_VALUE".
- ignoreSurroundingSpaces — A Boolean value that specifies whether the white space that surrounds values should be ignored. The default value is "false".
- withSchema — A string value that contains the expected schema. If you do not use this option, AWS Glue infers the schema from the XML data.
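For example, a minimal sketch of reading XML where each <record> element is treated as a row (the tag and path are illustrative):

    datasource0 = glueContext.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={"paths": ["s3://s3path"]},
        format="xml",
        format_options={"rowTag": "record"},
        transformation_ctx="datasource0",
    )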
Example
This is an example of using the withSchema format option to specify the schema for XML data.

    from awsglue.gluetypes import *
    import json

    schema = StructType([
        Field("id", IntegerType()),
        Field("name", StringType()),
        Field("nested", StructType([
            Field("x", IntegerType()),
            Field("y", StringType()),
            Field("z", ChoiceType([IntegerType(), StringType()])),
        ])),
    ])

    datasource0 = glueContext.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={"paths": ["s3://xml_bucket/someprefix"]},
        format="xml",
        format_options={"withSchema": json.dumps(schema.jsonValue())},
        transformation_ctx="",
    )