
Using the CSV format in AWS Glue

AWS Glue retrieves data from sources and writes data to targets stored and transported in various data formats. If your data is stored or transported in the CSV data format, this document introduces the features available for using your data in AWS Glue.

AWS Glue supports using the comma-separated value (CSV) format. This format is a minimal, row-based data format. CSVs often don't strictly conform to a standard, but you can refer to RFC 4180 and RFC 7111 for more information.

You can use AWS Glue to read CSVs from Amazon S3 and from streaming sources, as well as write CSVs to Amazon S3. You can read and write bzip2 and gzip archives containing CSV files from S3. You configure compression behavior with the S3 connection parameters instead of in the configuration discussed on this page.
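For example, compression is set through the compression key of the S3 connection options rather than through format_options. The following is a sketch only, assuming an initialized glueContext and dynamicFrame as in the examples later on this page; "s3://s3path" is a placeholder.

```python
# Sketch: compression is an S3 connection parameter, not a format option.
# Assumes glueContext and dynamicFrame are already initialized; the path
# is a placeholder.
glueContext.write_dynamic_frame.from_options(
    frame=dynamicFrame,
    connection_type="s3",
    connection_options={
        "path": "s3://s3path",
        "compression": "gzip",  # write gzip-compressed CSV output
    },
    format="csv",
)
```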

The following table shows which common AWS Glue features support the CSV format option.

  • Read – Supported

  • Write – Supported

  • Streaming read – Supported

  • Group small files – Supported

  • Job bookmarks – Supported

Example: Read CSV files or folders from S3

Prerequisites: You will need the S3 paths (s3path) to the CSV files or folders that you want to read.

Configuration: In your function options, specify format="csv". In your connection_options, use the paths key to specify s3path. You can configure how the reader interacts with S3 in connection_options. For details, see Connection types and options for ETL in AWS Glue: S3 connection parameters. You can configure how the reader interprets CSV files in your format_options. For details, see CSV Configuration Reference.

The following AWS Glue ETL script shows the process of reading CSV files or folders from S3.

We provide a custom CSV reader with performance optimizations for common workflows through the optimizePerformance configuration key. To determine if this reader is right for your workload, see Optimize read performance with vectorized SIMD CSV reader.

Python

For this example, use the create_dynamic_frame.from_options method.

# Example: Read CSV from S3
# For show, we handle a CSV with a header row. Set the withHeader option.
# Consider whether optimizePerformance is right for your workflow.

from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

dynamicFrame = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://s3path"]},
    format="csv",
    format_options={
        "withHeader": True,
        # "optimizePerformance": True,
    },
)

You can also use DataFrames in a script (pyspark.sql.DataFrame).

dataFrame = spark.read \
    .format("csv") \
    .option("header", "true") \
    .load("s3://s3path")
Scala

For this example, use the getSourceWithFormat operation.

// Example: Read CSV from S3
// For show, we handle a CSV with a header row. Set the withHeader option.
// Consider whether optimizePerformance is right for your workflow.

import com.amazonaws.services.glue.util.JsonOptions
import com.amazonaws.services.glue.{DynamicFrame, GlueContext}
import org.apache.spark.SparkContext

object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)
    val dynamicFrame = glueContext.getSourceWithFormat(
      formatOptions = JsonOptions("""{"withHeader": true}"""),
      connectionType = "s3",
      format = "csv",
      options = JsonOptions("""{"paths": ["s3://s3path"], "recurse": true}""")
    ).getDynamicFrame()
  }
}

You can also use DataFrames in a script (org.apache.spark.sql.DataFrame).

val dataFrame = spark.read
  .option("header", "true")
  .format("csv")
  .load("s3://s3path")

Example: Write CSV files and folders to S3

Prerequisites: You will need an initialized DataFrame (dataFrame) or a DynamicFrame (dynamicFrame). You will also need your expected S3 output path, s3path.

Configuration: In your function options, specify format="csv". In your connection_options, use the path key to specify s3path. You can configure how the writer interacts with S3 in connection_options. For details, see Connection types and options for ETL in AWS Glue: S3 connection parameters. You can configure how your operation writes the contents of your files in format_options. For details, see CSV Configuration Reference.

The following AWS Glue ETL script shows the process of writing CSV files and folders to S3.

Python

For this example, use the write_dynamic_frame.from_options method.

# Example: Write CSV to S3
# For show, customize how we write string type values. Set quoteChar to -1
# so our values are not quoted.

from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

glueContext.write_dynamic_frame.from_options(
    frame=dynamicFrame,
    connection_type="s3",
    connection_options={"path": "s3://s3path"},
    format="csv",
    format_options={
        "quoteChar": -1,
    },
)

You can also use DataFrames in a script (pyspark.sql.DataFrame).

dataFrame.write \
    .format("csv") \
    .option("quote", None) \
    .mode("append") \
    .save("s3://s3path")
Scala

For this example, use the getSinkWithFormat method.

// Example: Write CSV to S3
// For show, customize how we write string type values. Set quoteChar to -1
// so our values are not quoted.

import com.amazonaws.services.glue.util.JsonOptions
import com.amazonaws.services.glue.{DynamicFrame, GlueContext}
import org.apache.spark.SparkContext

object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)
    glueContext.getSinkWithFormat(
      connectionType = "s3",
      options = JsonOptions("""{"path": "s3://s3path"}"""),
      format = "csv"
    ).writeDynamicFrame(dynamicFrame)
  }
}

You can also use DataFrames in a script (org.apache.spark.sql.DataFrame).

dataFrame.write
  .format("csv")
  .option("quote", null)
  .mode("append")
  .save("s3://s3path")

CSV configuration reference

You can use the following format_options wherever AWS Glue libraries specify format="csv":

  • separator – Specifies the delimiter character. The default is a comma, but any other character can be specified.

    • Type: Text, Default: ","

  • escaper – Specifies a character to use for escaping. This option is used only when reading CSV files, not when writing. If set, the character immediately following the escape character is used as-is, except for a small set of well-known escapes (\n, \r, \t, and \0).

    • Type: Text, Default: none

  • quoteChar – Specifies the character to use for quoting. The default is a double quote. Set this to -1 to turn off quoting entirely.

    • Type: Text, Default: '"'

  • multiLine – Specifies whether a single record can span multiple lines. This can occur when a field contains a quoted new-line character. You must set this option to True if any record spans multiple lines. Enabling multiLine might decrease performance because it requires more cautious file-splitting while parsing.

    • Type: Boolean, Default: false

  • withHeader – Specifies whether to treat the first line as a header. This option can be used in the DynamicFrameReader class.

    • Type: Boolean, Default: false

  • writeHeader – Specifies whether to write the header to output. This option can be used in the DynamicFrameWriter class.

    • Type: Boolean, Default: true

  • skipFirst – Specifies whether to skip the first data line.

    • Type: Boolean, Default: false

  • optimizePerformance – Specifies whether to use the advanced SIMD CSV reader along with Apache Arrow–based columnar memory formats. Only available in AWS Glue 3.0+.

    • Type: Boolean, Default: false

  • strictCheckForQuoting – When writing CSVs, AWS Glue may add quotes to values it interprets as strings, to prevent ambiguity in the output. To save time when deciding what to write, AWS Glue may quote in certain situations where quotes are not necessary. Enabling a strict check performs a more intensive computation and quotes only when strictly necessary. Only available in AWS Glue 3.0+.

    • Type: Boolean, Default: false
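These options follow standard CSV dialect semantics. As an illustration only (using Python's standard csv module rather than the AWS Glue reader), a non-default separator and quoting interact as follows: a field containing the separator stays intact when it is quoted.

```python
import csv
import io

# Illustrative only: Python's stdlib csv module, not the Glue CSV reader.
# A pipe-separated file where quoted fields contain a comma and a pipe.
raw = 'id|name|note\n1|"Doe, Jane"|"a|b"\n'

rows = list(csv.reader(io.StringIO(raw), delimiter="|", quotechar='"'))

print(rows[0])  # ['id', 'name', 'note']
print(rows[1])  # ['1', 'Doe, Jane', 'a|b']  -- the quoted pipe is not a separator
```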

Optimize read performance with vectorized SIMD CSV reader

AWS Glue version 3.0 adds an optimized CSV reader that can significantly speed up overall job performance compared to row-based CSV readers.

The optimized reader:

  • Uses CPU SIMD instructions to read from disk

  • Immediately writes records to memory in a columnar format (Apache Arrow)

  • Divides the records into batches

This saves processing time when records would otherwise be batched or converted to a columnar format later on, for example when changing schemas or retrieving data by column.

To use the optimized reader, set "optimizePerformance" to true in the format_options or table property.

glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://s3path"]},
    format="csv",
    format_options={
        "optimizePerformance": True,
        "separator": ",",
    },
    transformation_ctx="datasink2",
)
Limitations for the vectorized CSV reader

Note the following limitations of the vectorized CSV reader:

  • It doesn't support the multiLine and escaper format options. It uses the default escape character, the double quote ('"'). When these options are set, AWS Glue automatically falls back to the row-based CSV reader.

  • It doesn't support creating a DynamicFrame with ChoiceType.

  • It doesn't support creating a DynamicFrame with error records.

  • It doesn't support reading CSV files with multibyte characters such as Japanese or Chinese characters.