Format Options for ETL Inputs and Outputs in AWS Glue - AWS Glue

Format Options for ETL Inputs and Outputs in AWS Glue

Various AWS Glue PySpark and Scala methods and transforms specify their input and/or output format using a format parameter and a format_options parameter. These parameters can take the following values.

Note

Currently, the only formats that streaming ETL jobs support are JSON, CSV, Parquet, and ORC.

format="avro"

This value designates the Apache Avro data format.

You can use the following format_options values with format="avro":

  • version — Specifies the version of Apache Avro reader/writer format to support. The default is "1.7". You can specify format_options={"version": “1.8”} to enable Avro logical type reading and writing. For more information, see the Apache Avro 1.7.7 Specification and Apache Avro 1.8.2 Specification.

    The Apache Avro 1.8 connector supports the following logical type conversions:

For the reader: this table shows the conversion between Avro data type (logical type and Avro primitive type) and Glue DynamicFrame data type for Avro reader 1.7 and 1.8.

Avro Data Type:

Logical Type

Avro Data Type:

Avro Primitive Type

GlueDynamicFrame Data Type:

Avro Reader 1.7

GlueDynamicFrame Data Type:

Avro Reader 1.8

Decimal bytes BINARY Decimal
Decimal fixed BINARY Decimal
Date int INT Date
Time (millisecond) int INT INT
Time (microsecond) long LONG LONG
Timestamp (millisecond) long LONG Timestamp
Timestamp (microsecond) long LONG LONG
Duration (not a logical type) fixed of 12 BINARY BINARY

For the writer: this table shows the conversion between Glue DynamicFrame data type and Avro data type for Avro writer 1.7 and 1.8.

Glue DynamicFrame Data Type Avro Data Type:

Avro Writer 1.7

Avro Data Type:

Avro Writer 1.8

Decimal String decimal
Date String date
Timestamp String timestamp-micros

format="csv"

This value designates comma-separated-values as the data format (for example, see RFC 4180 and RFC 7111).

You can use the following format_options values with format="csv":

  • separator — Specifies the delimiter character. The default is a comma: ',', but any other character can be specified.

  • escaper — Specifies a character to use for escaping. The default value is "none". If enabled, the character which immediately follows is used as-is, except for a small set of well-known escapes (\n, \r, \t, and \0).

  • quoteChar — Specifies the character to use for quoting. The default is a double quote: '"'. Set this to '-1' to disable quoting entirely.

  • multiline — A Boolean value that specifies whether a single record can span multiple lines. This can occur when a field contains a quoted new-line character. You must set this option to "true" if any record spans multiple lines. The default value is "false", which allows for more aggressive file-splitting during parsing.

  • withHeader — A Boolean value that specifies whether to treat the first line as a header. The default value is "false". This option can be used in the DynamicFrameReader class.

  • writeHeader — A Boolean value that specifies whether to write the header to output. The default value is "true". This option can be used in the DynamicFrameWriter class.

  • skipFirst — A Boolean value that specifies whether to skip the first data line. The default value is "false".

format="ion"

This value designates Amazon Ion as the data format. (For more information, see the Amazon Ion Specification.)

Currently, AWS Glue does not support ion for output.

There are no format_options values for format="ion".

format="grokLog"

This value designates a log data format specified by one or more Logstash grok patterns (for example, see Logstash Reference (6.2]: Grok filter plugin).

Currently, AWS Glue does not support groklog for output.

You can use the following format_options values with format="grokLog":

  • logFormat — Specifies the grok pattern that matches the log's format.

  • customPatterns — Specifies additional grok patterns used here.

  • MISSING — Specifies the signal to use in identifying missing values. The default is '-'.

  • LineCount — Specifies the number of lines in each log record. The default is '1', and currently only single-line records are supported.

  • StrictMode — A Boolean value that specifies whether strict mode is enabled. In strict mode, the reader doesn't do automatic type conversion or recovery. The default value is "false".

format="json"

This value designates a JSON (JavaScript Object Notation) data format.

Currently, AWS Glue does not support format_options for json output.

You can use the following format_options values with format="json":

  • jsonPath — A JsonPath expression that identifies an object to be read into records. This is particularly useful when a file contains records nested inside an outer array. For example, the following JsonPath expression targets the id field of a JSON object.

    format="json", format_options={"jsonPath": "$.id"}
  • multiline — A Boolean value that specifies whether a single record can span multiple lines. This can occur when a field contains a quoted new-line character. You must set this option to "true" if any record spans multiple lines. The default value is "false", which allows for more aggressive file-splitting during parsing.

format="orc"

This value designates Apache ORC as the data format. (For more information, see the LanguageManual ORC.)

There are no format_options values for format="orc". However, any options that are accepted by the underlying SparkSQL code can be passed to it by way of the connection_options map parameter.

format="parquet"

This value designates Apache Parquet as the data format.

There are no format_options values for format="parquet". However, any options that are accepted by the underlying SparkSQL code can be passed to it by way of the connection_options map parameter.

format="glueparquet"

This value designates a custom Parquet writer type that is optimized for Dynamic Frames as the data format. A precomputed schema is not required before writing. As data comes in, glueparquet computes and modifies the schema dynamically.

You can use the following format_options values with format="glueparquet":

  • compression — Specifies the compression codec used when writing Parquet files. The compression codec used with the glueparquet format is fully compatible with org.apache.parquet.hadoop.metadata.CompressionCodecName *, which includes support for "uncompressed", "snappy", "gzip" and "lzo". The default value is "snappy".

  • blockSize — Specifies the size of a row group being buffered in memory. The default value is "128MB".

  • pageSize — Specifies the size of the smallest unit that must be read fully to access a single record. The default value is "1MB".

Limitations:

  • glueparquet supports only a schema shrinkage or expansion, but not a type change.

  • glueparquet is not able to store a schema-only file.

  • glueparquet can only be passed as a format for data sinks.

format="xml"

This value designates XML as the data format, parsed through a fork of the XML Data Source for Apache Spark parser.

Currently, AWS Glue does not support "xml" for output.

You can use the following format_options values with format="xml":

  • rowTag — Specifies the XML tag in the file to treat as a row. Row tags cannot be self-closing.

  • encoding — Specifies the character encoding. The default value is "UTF-8".

  • excludeAttribute — A Boolean value that specifies whether you want to exclude attributes in elements or not. The default value is "false".

  • treatEmptyValuesAsNulls — A Boolean value that specifies whether to treat white space as a null value. The default value is "false".

  • attributePrefix — A prefix for attributes to differentiate them from elements. This prefix is used for field names. The default value is "_".

  • valueTag — The tag used for a value when there are attributes in the element that have no child. The default is "_VALUE".

  • ignoreSurroundingSpaces — A Boolean value that specifies whether the white space that surrounds values should be ignored. The default value is "false".