Using the JSON format in AWS Glue

AWS Glue retrieves data from sources and writes data to targets stored and transported in various data formats. If your data is stored or transported in the JSON data format, this document introduces you to available features for using your data in AWS Glue.

AWS Glue supports using the JSON format. This format represents data structures with consistent shape but flexible contents that aren't row or column based. JSON is defined by parallel standards issued by several authorities, one of which is ECMA-404. For an introduction to the format by a commonly referenced source, see Introducing JSON.
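
For illustration, a small, hypothetical document of the kind the read examples below handle might look like the following; the record has a consistent shape, but its contents are free-form values rather than rows or columns:

{
    "id": {
        "name": "sensor-1",
        "value": 42
    }
}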

You can use AWS Glue to read JSON files from Amazon S3, as well as bzip2- and gzip-compressed JSON files. You configure compression behavior through the Amazon S3 connection parameters rather than through the configuration discussed on this page.
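
As a minimal sketch of that compression configuration, assuming the compressionType option described in the Amazon S3 connection option reference, reading gzip-compressed JSON might look like this:

# Sketch: read gzip-compressed JSON from S3. "compressionType" is an
# Amazon S3 connection option, not a JSON format option; see the Amazon S3
# connection option reference for the authoritative list of values.
from pyspark.context import SparkContext
from awsglue.context import GlueContext

glueContext = GlueContext(SparkContext.getOrCreate())
dynamicFrame = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={
        "paths": ["s3://s3path"],
        "compressionType": "gzip",  # or "bzip2"
    },
    format="json",
)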

Read               Supported
Write              Supported
Streaming read     Supported
Group small files  Supported
Job bookmarks      Supported

Example: Read JSON files or folders from S3

Prerequisites: You will need the S3 paths (s3path) to the JSON files or folders you would like to read.

Configuration: In your function options, specify format="json". In your connection_options, use the paths key to specify your s3path. You can further alter how your read operation traverses S3 in the connection options; for details, consult Amazon S3 connection option reference. You can configure how the reader interprets JSON files in your format_options. For details, see JSON configuration reference.

The following AWS Glue ETL script shows the process of reading JSON files or folders from S3:

Python

For this example, use the create_dynamic_frame.from_options method.

# Example: Read JSON from S3
# For show, we handle a nested JSON file that we can limit with the JsonPath parameter
# For show, we also handle a JSON where a single entry spans multiple lines
# Consider whether optimizePerformance is right for your workflow.
from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

dynamicFrame = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://s3path"]},
    format="json",
    format_options={
        "jsonPath": "$.id",
        "multiline": True,
        # "optimizePerformance": True,  -> not compatible with jsonPath, multiline
    }
)

You can also use DataFrames in a script (pyspark.sql.DataFrame).

dataFrame = spark.read \
    .option("multiLine", "true") \
    .json("s3://s3path")
Scala

For this example, use the getSourceWithFormat operation.

// Example: Read JSON from S3
// For show, we handle a nested JSON file that we can limit with the JsonPath parameter
// For show, we also handle a JSON where a single entry spans multiple lines
// Consider whether optimizePerformance is right for your workflow.
import com.amazonaws.services.glue.util.JsonOptions
import com.amazonaws.services.glue.{DynamicFrame, GlueContext}
import org.apache.spark.SparkContext

object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)
    val dynamicFrame = glueContext.getSourceWithFormat(
      formatOptions = JsonOptions("""{"jsonPath": "$.id", "multiline": true, "optimizePerformance": false}"""),
      connectionType = "s3",
      format = "json",
      options = JsonOptions("""{"paths": ["s3://s3path"], "recurse": true}""")
    ).getDynamicFrame()
  }
}

You can also use DataFrames in a script (org.apache.spark.sql.DataFrame).

val dataFrame = spark.read
    .option("multiLine", "true")
    .json("s3://s3path")

Example: Write JSON files and folders to S3

Prerequisites: You will need an initialized DataFrame (dataFrame) or DynamicFrame (dynamicFrame). You will also need your expected S3 output path, s3path.

Configuration: In your function options, specify format="json". In your connection_options, use the path key to specify s3path. You can further alter how the writer interacts with S3 in the connection_options. For details, see Amazon S3 connection option reference. You can configure how the writer writes JSON files in your format_options. For details, see JSON configuration reference.

The following AWS Glue ETL script shows the process of writing JSON files or folders to S3:

Python

For this example, use the write_dynamic_frame.from_options method.

# Example: Write JSON to S3
from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

glueContext.write_dynamic_frame.from_options(
    frame=dynamicFrame,
    connection_type="s3",
    connection_options={"path": "s3://s3path"},
    format="json"
)

You can also use DataFrames in a script (pyspark.sql.DataFrame).

dataFrame.write.json("s3://s3path/")
Scala

For this example, use the getSinkWithFormat method.

// Example: Write JSON to S3
import com.amazonaws.services.glue.util.JsonOptions
import com.amazonaws.services.glue.{DynamicFrame, GlueContext}
import org.apache.spark.SparkContext

object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)
    glueContext.getSinkWithFormat(
      connectionType = "s3",
      options = JsonOptions("""{"path": "s3://s3path"}"""),
      format = "json"
    ).writeDynamicFrame(dynamicFrame)
  }
}

You can also use DataFrames in a script (org.apache.spark.sql.DataFrame).

dataFrame.write.json("s3://s3path")

JSON configuration reference

You can use the following format_options values with format="json":

  • jsonPath — A JsonPath expression that identifies an object to be read into records. This is particularly useful when a file contains records nested inside an outer array. For example, the following JsonPath expression targets the id field of a JSON object (see the sketch after this list).

    format="json", format_options={"jsonPath": "$.id"}
  • multiLine — A Boolean value that specifies whether a single record can span multiple lines. This can occur when a field contains a quoted new-line character. You must set this option to "true" if any record spans multiple lines. The default value is "false", which allows for more aggressive file-splitting during parsing.

  • optimizePerformance — A Boolean value that specifies whether to use the advanced SIMD JSON reader along with Apache Arrow based columnar memory formats. Only available in AWS Glue 3.0. Not compatible with multiLine or jsonPath. Providing either of those options will instruct AWS Glue to fall back to the standard reader.

  • withSchema — A String value that specifies a table schema in the format described in Manually specify the XML schema. Only used with optimizePerformance when reading from non-Catalog connections.
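
As a minimal sketch of how the first two options combine, consider the hypothetical document shown in the introduction: jsonPath selects the nested object under id as the record, and multiLine must be enabled because that record spans several lines. This reuses the glueContext from the read example above.

# Sketch: jsonPath and multiLine applied together (hypothetical input shown
# in the introduction). Note that either option forces the standard,
# non-vectorized reader.
dynamicFrame = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://s3path"]},
    format="json",
    format_options={
        "jsonPath": "$.id",   # each record is the object under the top-level "id" key
        "multiLine": True,    # a single record spans multiple lines
    },
)
# The resulting records carry the fields "name" and "value".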

Using the vectorized SIMD JSON reader with Apache Arrow columnar format

AWS Glue version 3.0 adds a vectorized reader for JSON data. Under certain conditions, it performs 2x faster than the standard reader. This reader comes with certain limitations that you should be aware of before use, documented in this section.

To use the optimized reader, set "optimizePerformance" to True in the format_options or table property. You will also need to provide withSchema unless you are reading from the Data Catalog. withSchema expects an input as described in Manually specify the XML schema.

# Read from S3 data source
glueContext.create_dynamic_frame.from_options(
    connection_type = "s3",
    connection_options = {"paths": ["s3://s3path"]},
    format = "json",
    format_options = {
        "optimizePerformance": True,
        "withSchema": SchemaString
    }
)

# Read from catalog table
glueContext.create_dynamic_frame.from_catalog(
    database = database,
    table_name = table,
    additional_options = {
        # The vectorized reader for JSON can read your schema from a catalog table property.
        "optimizePerformance": True,
    }
)

For more information about building a SchemaString in the AWS Glue library, see PySpark extension types.
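
As a minimal sketch, assuming the Field/StructType helpers from awsglue.gluetypes described in PySpark extension types, one way to build a SchemaString looks like the following; the flat two-field schema is hypothetical, and it is kept flat because the vectorized reader does not support nested values (see the limitations below).

import json

# Sketch: serialize an AWS Glue schema for use as the "withSchema" value.
# The field names are hypothetical; keep the schema flat, because nested
# objects and arrays cause a fallback to the standard reader.
from awsglue.gluetypes import Field, IntegerType, StringType, StructType

schema = StructType([
    Field("name", StringType()),
    Field("value", IntegerType()),
])
SchemaString = json.dumps(schema.jsonValue())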

Limitations for the vectorized JSON reader

Note the following limitations:

  • JSON elements with nested objects or array values are not supported. If they are present, AWS Glue falls back to the standard reader.

  • A schema must be provided, either from the Catalog or with withSchema.

  • Not compatible with multiLine or jsonPath. Providing either of those options will instruct AWS Glue to fall back to the standard reader.

  • Providing input records that do not match the input schema will cause the reader to fail.

  • Error records will not be created.

  • JSON files with multi-byte characters (such as Japanese or Chinese characters) are not supported.