Using the Ion format in AWS Glue

AWS Glue retrieves data from sources and writes data to targets stored and transported in various data formats. If your data is stored or transported in the Ion data format, this document introduces you to the available features for using your data in AWS Glue.

AWS Glue supports using the Ion format. This format represents data structures (that aren't row or column based) in interchangeable binary and plaintext representations. For an introduction to the format by the authors, see Amazon Ion. (For more information, see the Amazon Ion Specification.)

You can use AWS Glue to read Ion files from Amazon S3. You can also read bzip and gzip archives containing Ion files from S3. You configure compression behavior in the S3 connection parameters rather than in the configuration discussed on this page.
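
As a minimal sketch of where compression is configured, the snippet below builds a connection_options dictionary for a compressed Ion source. It assumes the compressionType key from the Amazon S3 connection option reference, with the values "gzip" and "bzip2"; the helper name ion_s3_connection_options is hypothetical, and s3://s3path is the same placeholder used elsewhere on this page. The AWS Glue call is left as a comment because the awsglue library is only available inside the Glue job environment.

```python
from typing import Dict, List, Optional

# Values assumed from the Amazon S3 connection option reference.
_ALLOWED_COMPRESSION = {"gzip", "bzip2"}


def ion_s3_connection_options(
    paths: List[str], compression: Optional[str] = None
) -> Dict[str, object]:
    """Build connection_options for reading Ion files from S3 (hypothetical helper)."""
    if not paths:
        raise ValueError("at least one S3 path is required")
    options: Dict[str, object] = {"paths": list(paths)}
    if compression is not None:
        if compression not in _ALLOWED_COMPRESSION:
            raise ValueError(f"unsupported compression type: {compression}")
        # Compression belongs to the S3 connection, not to the Ion format.
        options["compressionType"] = compression
    return options


options = ion_s3_connection_options(["s3://s3path"], compression="gzip")
print(options)

# In a Glue job, the dictionary would be passed to the reader like this:
# dynamicFrame = glueContext.create_dynamic_frame.from_options(
#     connection_type="s3",
#     connection_options=options,
#     format="ion",
# )
```

Keeping the compression setting in the connection options mirrors the split described above: the Ion format itself takes no compression-related configuration.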

The following table shows which common AWS Glue operations support the Ion format option.

Read       Write        Streaming read  Group small files  Job bookmarks
Supported  Unsupported  Unsupported     Supported          Unsupported
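
Because grouping small files is supported for Ion, a read over many small files can opt in through the S3 connection options. The following is a sketch only: it assumes the groupFiles and groupSize keys described in the Amazon S3 connection option reference, the helper name with_file_grouping is hypothetical, and the 1 MiB target is an arbitrary illustration rather than a recommended value.

```python
from typing import Dict


def with_file_grouping(
    connection_options: Dict[str, object], group_size_bytes: int
) -> Dict[str, object]:
    """Return a copy of connection_options with small-file grouping enabled."""
    if group_size_bytes <= 0:
        raise ValueError("group_size_bytes must be positive")
    grouped = dict(connection_options)
    # "inPartition" asks the reader to group files within each S3 partition.
    grouped["groupFiles"] = "inPartition"
    # Target group size in bytes, passed as a string
    # (assumed per the S3 connection option reference).
    grouped["groupSize"] = str(group_size_bytes)
    return grouped


base_options = {"paths": ["s3://s3path"]}
grouped_options = with_file_grouping(base_options, 1024 * 1024)
print(grouped_options)

# In a Glue job, pass grouped_options as connection_options with format="ion".
```

The helper returns a copy so the original options can still be reused for an ungrouped read.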

Example: Read Ion files and folders from S3

Prerequisites: You will need the S3 paths (s3path) to the Ion files or folders that you want to read.

Configuration: In your function options, specify format="ion". In your connection_options, use the paths key to specify your s3path. You can configure how the reader interacts with S3 in the connection_options. For details, see Connection types and options for ETL in AWS Glue: Amazon S3 connection option reference.

The following AWS Glue ETL script shows the process of reading Ion files or folders from S3:

Python

For this example, use the create_dynamic_frame.from_options method.

# Example: Read ION from S3
from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

dynamicFrame = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://s3path"]},
    format="ion"
)

Scala

For this example, use the getSourceWithFormat operation.

// Example: Read ION from S3
import com.amazonaws.services.glue.util.JsonOptions
import com.amazonaws.services.glue.GlueContext
import org.apache.spark.SparkContext

object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)

    val dynamicFrame = glueContext.getSourceWithFormat(
      connectionType="s3",
      format="ion",
      options=JsonOptions("""{"paths": ["s3://s3path"], "recurse": true}""")
    ).getDynamicFrame()
  }
}

Ion configuration reference

There are no format_options values for format="ion".