Format Options for ETL Inputs and Outputs in AWS Glue
Various AWS Glue PySpark and Scala methods and transforms specify their input and/or output format using a format parameter and a format_options parameter. These parameters can take the following values.
Currently, the only formats that streaming ETL jobs support are JSON, CSV, Parquet, ORC, Avro, and Grok.
format="avro"
This value designates the Apache Avro data format.
You can use the following format_options values with format="avro":
- version — Specifies the version of the Apache Avro reader/writer format to support. The default is "1.7". You can specify format_options={"version": "1.8"} to enable Avro logical type reading and writing. For more information, see the Apache Avro 1.7.7 Specification and the Apache Avro 1.8.2 Specification. The Apache Avro 1.8 connector supports the following logical type conversions:
For the reader: this table shows the conversion between the Avro data type (logical type and Avro primitive type) and the AWS Glue DynamicFrame data type for Avro reader 1.7 and 1.8.
| Avro Data Type: Logical Type | Avro Data Type: Avro Primitive Type | Glue DynamicFrame Data Type: Avro Reader 1.7 | Glue DynamicFrame Data Type: Avro Reader 1.8 |
|---|---|---|---|
| Decimal | bytes | BINARY | Decimal |
| Decimal | fixed | BINARY | Decimal |
| Date | int | INT | Date |
| Time (millisecond) | int | INT | INT |
| Time (microsecond) | long | LONG | LONG |
| Timestamp (millisecond) | long | LONG | Timestamp |
| Timestamp (microsecond) | long | LONG | LONG |
| Duration (not a logical type) | fixed of 12 | BINARY | BINARY |
For the writer: this table shows the conversion between the AWS Glue DynamicFrame data type and the Avro data type for Avro writer 1.7 and 1.8.
| AWS Glue DynamicFrame Data Type | Avro Data Type: Avro Writer 1.7 | Avro Data Type: Avro Writer 1.8 |
|---|---|---|
| Decimal | String | decimal |
| Date | String | date |
| Timestamp | String | timestamp-micros |
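For example, the following is a minimal sketch of a read that enables Avro 1.8 logical type support; the S3 path and the frame name are placeholders, not values from this documentation.

```python
# Read Avro data with logical type support enabled (Avro reader 1.8).
# "s3://examplebucket/avro/" and datasource0 are hypothetical names.
datasource0 = glueContext.create_dynamic_frame.from_options(
    connection_type = "s3",
    connection_options = {"paths": ["s3://examplebucket/avro/"]},
    format = "avro",
    format_options = {"version": "1.8"},
    transformation_ctx = "datasource0")
```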
format="csv"
This value designates comma-separated values (CSV) as the data format (for example, see RFC 4180).
You can use the following format_options values with format="csv":
- separator — Specifies the delimiter character. The default is a comma: ",", but any other character can be specified.
- escaper — Specifies a character to use for escaping. The default value is none. If enabled, the character that immediately follows is used as-is, except for a small set of well-known escapes (\n, \r, \t, and \0).
- quoteChar — Specifies the character to use for quoting. The default is a double quote: '"'. Set this to -1 to disable quoting entirely.
- multiline — A Boolean value that specifies whether a single record can span multiple lines. This can occur when a field contains a quoted new-line character. You must set this option to True if any record spans multiple lines. The default value is False, which allows for more aggressive file-splitting during parsing.
- withHeader — A Boolean value that specifies whether to treat the first line as a header. The default value is False. This option can be used in the DynamicFrameReader class.
- writeHeader — A Boolean value that specifies whether to write the header to output. The default value is True. This option can be used in the DynamicFrameWriter class.
- skipFirst — A Boolean value that specifies whether to skip the first data line. The default value is False.
The following example shows how to specify the format options within an AWS Glue ETL job script.
```python
glueContext.write_dynamic_frame.from_options(
    frame = datasource1,
    connection_type = "s3",
    connection_options = {"path": "s3://s3path"},
    format = "csv",
    format_options = {"quoteChar": -1, "separator": "|"},
    transformation_ctx = "datasink2")
```
format="ion"
This value designates Amazon Ion as the data format.
Currently, AWS Glue does not support ion for output.
There are no format_options values for format="ion".
format="grokLog"
This value designates a log data format specified by one or more Logstash Grok patterns (for example, see the Logstash Reference (6.2): Grok filter plugin).
Currently, AWS Glue does not support grokLog for output.
You can use the following format_options values with format="grokLog":
- logFormat — Specifies the Grok pattern that matches the log's format.
- customPatterns — Specifies additional Grok patterns used here.
- MISSING — Specifies the signal to use in identifying missing values. The default is '-'.
- LineCount — Specifies the number of lines in each log record. The default is '1', and currently only single-line records are supported.
- StrictMode — A Boolean value that specifies whether strict mode is enabled. In strict mode, the reader doesn't do automatic type conversion or recovery. The default value is "false".
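For example, the following is a minimal sketch of reading Apache-style access logs with the standard %{COMBINEDAPACHELOG} Grok pattern; the S3 path and the frame name are placeholders.

```python
# Parse Apache combined access logs using a built-in Grok pattern.
# "s3://examplebucket/logs/" and datasource0 are hypothetical names.
datasource0 = glueContext.create_dynamic_frame.from_options(
    connection_type = "s3",
    connection_options = {"paths": ["s3://examplebucket/logs/"]},
    format = "grokLog",
    format_options = {"logFormat": "%{COMBINEDAPACHELOG}"},
    transformation_ctx = "datasource0")
```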
format="json"
This value designates a JSON (JavaScript Object Notation) data format.
You can use the following format_options values with format="json":
- jsonPath — A JsonPath expression that identifies an object to be read into records. This is particularly useful when a file contains records nested inside an outer array. For example, the following JsonPath expression targets the id field of a JSON object: format="json", format_options={"jsonPath": "$.id"}
- multiline — A Boolean value that specifies whether a single record can span multiple lines. This can occur when a field contains a quoted new-line character. You must set this option to "true" if any record spans multiple lines. The default value is "false", which allows for more aggressive file-splitting during parsing.
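For example, the following is a minimal sketch that uses jsonPath to pull the records nested inside a top-level array out of each file; the S3 path and the "records" field name are placeholders.

```python
# Read only the objects under the top-level "records" array of each file.
# "s3://examplebucket/json/" and the "records" field are hypothetical.
datasource0 = glueContext.create_dynamic_frame.from_options(
    connection_type = "s3",
    connection_options = {"paths": ["s3://examplebucket/json/"]},
    format = "json",
    format_options = {"jsonPath": "$.records[*]"},
    transformation_ctx = "datasource0")
```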
format="orc"
This value designates Apache ORC as the data format.
There are no format_options values for format="orc". However, any options that are accepted by the underlying SparkSQL code can be passed to it by way of the connection_options map parameter.
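For example, the following is a minimal sketch of reading ORC data; the S3 path and the frame name are placeholders.

```python
# Read ORC data; no format_options are needed for this format.
# "s3://examplebucket/orc/" and datasource0 are hypothetical names.
datasource0 = glueContext.create_dynamic_frame.from_options(
    connection_type = "s3",
    connection_options = {"paths": ["s3://examplebucket/orc/"]},
    format = "orc",
    transformation_ctx = "datasource0")
```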
format="parquet"
This value designates Apache Parquet as the data format.
There are no format_options values for format="parquet". However, any options that are accepted by the underlying SparkSQL code can be passed to it by way of the connection_options map parameter.
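For example, the following is a minimal sketch that passes the SparkSQL Parquet option mergeSchema through connection_options, assuming it is forwarded to the underlying reader as described above; the S3 path is a placeholder.

```python
# Read Parquet data, forwarding the SparkSQL "mergeSchema" option through
# connection_options as described above. The S3 path is hypothetical.
datasource0 = glueContext.create_dynamic_frame.from_options(
    connection_type = "s3",
    connection_options = {"paths": ["s3://examplebucket/parquet/"],
                          "mergeSchema": "true"},
    format = "parquet",
    transformation_ctx = "datasource0")
```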
format="glueparquet"
This value designates a custom Parquet writer type that is optimized for Dynamic Frames as the data format. A precomputed schema is not required before writing. As data comes in, glueparquet computes and modifies the schema dynamically.
You can use the following format_options values with format="glueparquet":
- compression — Specifies the compression codec used when writing Parquet files. The compression codec used with the glueparquet format is fully compatible with org.apache.parquet.hadoop.metadata.CompressionCodecName, which includes support for "uncompressed", "snappy", "gzip", and "lzo". The default value is "snappy".
- blockSize — Specifies the size of a row group being buffered in memory. The default value is "128MB".
- pageSize — Specifies the size of the smallest unit that must be read fully to access a single record. The default value is "1MB".
Limitations:
- glueparquet supports only a schema shrinkage or expansion, but not a type change.
- glueparquet is not able to store a schema-only file.
- glueparquet can only be passed as a format for data sinks.
format="xml"
This value designates XML as the data format, parsed through a fork of the XML Data Source for Apache Spark parser.
Currently, AWS Glue does not support "xml" for output.
You can use the following format_options values with format="xml":
- rowTag — Specifies the XML tag in the file to treat as a row. Row tags cannot be self-closing.
- encoding — Specifies the character encoding. The default value is "UTF-8".
- excludeAttribute — A Boolean value that specifies whether you want to exclude attributes in elements or not. The default value is "false".
- treatEmptyValuesAsNulls — A Boolean value that specifies whether to treat white space as a null value. The default value is "false".
- attributePrefix — A prefix for attributes to differentiate them from elements. This prefix is used for field names. The default value is "_".
- valueTag — The tag used for a value when there are attributes in the element that have no child. The default is "_VALUE".
- ignoreSurroundingSpaces — A Boolean value that specifies whether the white space that surrounds values should be ignored. The default value is "false".
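For example, the following is a minimal sketch of reading XML where each <record> element becomes one row; the S3 path and the "record" tag name are placeholders.

```python
# Read XML, treating each <record> element as a single row.
# "s3://examplebucket/xml/" and the "record" tag are hypothetical names.
datasource0 = glueContext.create_dynamic_frame.from_options(
    connection_type = "s3",
    connection_options = {"paths": ["s3://examplebucket/xml/"]},
    format = "xml",
    format_options = {"rowTag": "record"},
    transformation_ctx = "datasource0")
```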