Using the Avro format in AWS Glue

AWS Glue retrieves data from sources and writes data to targets stored and transported in various data formats. If your data is stored or transported in the Avro data format, this document introduces the features available for using your data in AWS Glue.

AWS Glue supports using the Avro format. Avro is a performance-oriented, row-based data format. For an introduction to the format by the standard authority, see the Apache Avro 1.8.2 Documentation.

You can use AWS Glue to read Avro files from Amazon S3 and from streaming sources, as well as write Avro files to Amazon S3. You can read and write bzip2 and gzip archives containing Avro files from S3. Additionally, you can write deflate, snappy, and xz archives containing Avro files. You configure compression behavior through the Amazon S3 connection parameters rather than through the configuration discussed on this page.
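For example, a write can request Snappy-compressed Avro output through the connection options rather than the format options. The following is a minimal sketch, assuming the compression key from the Amazon S3 connection option reference and the write API shown later on this page:

glueContext.write_dynamic_frame.from_options(
    frame=dynamicFrame,  # an existing DynamicFrame
    connection_type="s3",
    connection_options={
        "path": "s3://s3path",
        "compression": "snappy"  # set compression here, not in format_options
    },
    format="avro"
)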

The following table shows which common AWS Glue operations support the Avro format option.

Read      | Write     | Streaming read | Group small files | Job bookmarks
Supported | Supported | Supported*     | Unsupported       | Supported

*Supported with restrictions. For more information, see Notes and restrictions for Avro streaming sources.

Example: Read Avro files or folders from S3

Prerequisites: You will need the S3 paths (s3path) to the Avro files or folders that you want to read.

Configuration: In your function options, specify format="avro". In your connection_options, use the paths key to specify s3path. You can configure how the reader interacts with S3 in the connection_options. For details, see Data format options for ETL inputs and outputs in AWS Glue: Amazon S3 connection option reference. You can configure how the reader interprets Avro files in your format_options. For details, see Avro configuration reference.

The following AWS Glue ETL script shows the process of reading Avro files or folders from S3:

Python

For this example, use the create_dynamic_frame.from_options method.

from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

dynamicFrame = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://s3path"]},
    format="avro"
)
Scala

For this example, use the getSourceWithFormat operation.

import com.amazonaws.services.glue.util.JsonOptions
import com.amazonaws.services.glue.GlueContext
import org.apache.spark.SparkContext

object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)

    val dynamicFrame = glueContext.getSourceWithFormat(
      connectionType = "s3",
      format = "avro",
      options = JsonOptions("""{"paths": ["s3://s3path"]}""")
    ).getDynamicFrame()
  }
}

Example: Write Avro files and folders to S3

Prerequisites: You will need an initialized DataFrame (dataFrame) or DynamicFrame (dynamicFrame). You will also need your expected S3 output path, s3path.

Configuration: In your function options, specify format="avro". In your connection_options, use the path key to specify your s3path. You can further alter how the writer interacts with S3 in the connection_options. For details, see Data format options for ETL inputs and outputs in AWS Glue: Amazon S3 connection option reference. You can alter how the writer interprets Avro files in your format_options. For details, see Avro configuration reference.

The following AWS Glue ETL script shows the process of writing Avro files or folders to S3.

Python

For this example, use the write_dynamic_frame.from_options method.

from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

glueContext.write_dynamic_frame.from_options(
    frame=dynamicFrame,
    connection_type="s3",
    format="avro",
    connection_options={"path": "s3://s3path"}
)
Scala

For this example, use the getSinkWithFormat method.

import com.amazonaws.services.glue.util.JsonOptions
import com.amazonaws.services.glue.{DynamicFrame, GlueContext}
import org.apache.spark.SparkContext

object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)

    glueContext.getSinkWithFormat(
      connectionType = "s3",
      options = JsonOptions("""{"path": "s3://s3path"}"""),
      format = "avro"
    ).writeDynamicFrame(dynamicFrame)
  }
}

Avro configuration reference

You can use the following format_options values wherever AWS Glue libraries specify format="avro":

  • version — Specifies the version of Apache Avro reader/writer format to support. The default is "1.7". You can specify format_options={"version": "1.8"} to enable Avro logical type reading and writing; a short example follows the tables below. For more information, see the Apache Avro 1.7.7 Specification and Apache Avro 1.8.2 Specification.

    The Apache Avro 1.8 connector supports the following logical type conversions:

For the reader: this table shows the conversion between the Avro data type (logical type and Avro primitive type) and the AWS Glue DynamicFrame data type for Avro readers 1.7 and 1.8.

Logical type                  | Avro primitive type | DynamicFrame type (Avro reader 1.7) | DynamicFrame type (Avro reader 1.8)
Decimal                       | bytes               | BINARY                              | Decimal
Decimal                       | fixed               | BINARY                              | Decimal
Date                          | int                 | INT                                 | Date
Time (millisecond)            | int                 | INT                                 | INT
Time (microsecond)            | long                | LONG                                | LONG
Timestamp (millisecond)       | long                | LONG                                | Timestamp
Timestamp (microsecond)       | long                | LONG                                | LONG
Duration (not a logical type) | fixed of 12         | BINARY                              | BINARY

For the writer: this table shows the conversion between the AWS Glue DynamicFrame data type and the Avro data type for Avro writers 1.7 and 1.8.

DynamicFrame type | Avro type (Avro writer 1.7) | Avro type (Avro writer 1.8)
Decimal           | String                      | decimal
Date              | String                      | date
Timestamp         | String                      | timestamp-micros
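The following is a minimal sketch of enabling the 1.8 reader through format_options, reusing the Python read example from earlier on this page:

dynamicFrame = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://s3path"]},
    format="avro",
    format_options={"version": "1.8"}  # read Avro logical types per the 1.8 spec
)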

Avro Spark DataFrame support

To use Avro from the Spark DataFrame API, you need to install the Spark Avro plugin for the corresponding Spark version. The version of Spark available in your job is determined by your AWS Glue version. For more information about Spark versions, see AWS Glue versions. This plugin is maintained by Apache; we do not make specific guarantees of support.

In AWS Glue 2.0, use version 2.4.3 of the Spark Avro plugin. You can find this JAR on Maven Central; see org.apache.spark:spark-avro_2.12:2.4.3.

In AWS Glue 3.0, use version 3.1.1 of the Spark Avro plugin. You can find this JAR on Maven Central; see org.apache.spark:spark-avro_2.12:3.1.1.

To include extra JARs in an AWS Glue ETL job, use the --extra-jars job parameter. For more information about job parameters, see AWS Glue job parameters. You can also configure this parameter in the AWS Management Console.
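For example, after supplying the plugin JAR through --extra-jars (for instance, from an S3 location you control, such as the hypothetical s3://your-bucket/jars/spark-avro_2.12-3.1.1.jar), a job can use the Spark DataFrame reader and writer directly. The following is a minimal sketch for AWS Glue 3.0; the output path is illustrative:

from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# Read Avro with the Spark DataFrame API (requires the spark-avro plugin JAR)
df = spark.read.format("avro").load("s3://s3path")

# Write the DataFrame back out as Avro
df.write.format("avro").save("s3://s3path-output")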