Open CSV SerDe for processing CSV

Use the Open CSV SerDe to create Athena tables from comma-separated value (CSV) data.

Serialization library name

The serialization library name for the Open CSV SerDe is org.apache.hadoop.hive.serde2.OpenCSVSerde. For source code information, see CSV SerDe in the Apache documentation.

Using the Open CSV SerDe

To use this SerDe, specify its fully qualified class name after ROW FORMAT SERDE. Also specify the delimiters inside SERDEPROPERTIES, as in the following example.

...
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = ",",
  "quoteChar" = "`",
  "escapeChar" = "\\"
)

Ignore headers

To ignore headers in your data when you define a table, you can use the skip.header.line.count table property, as in the following example.

TBLPROPERTIES ("skip.header.line.count"="1")

For examples, see the CREATE TABLE statements in Querying Amazon VPC flow logs and Querying Amazon CloudFront logs.
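The following is a minimal sketch, separate from the examples referenced above, that combines the Open CSV SerDe with the skip.header.line.count property in a single CREATE TABLE statement. The table name, column names, and the path under the Amazon S3 bucket are placeholders.

CREATE EXTERNAL TABLE example_with_header (
  id string,
  name string,
  amount string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'separatorChar' = ',',
  'quoteChar' = '"',
  'escapeChar' = '\\'
)
STORED AS TEXTFILE
LOCATION 's3://amzn-s3-demo-bucket/example-with-header/'
TBLPROPERTIES ("skip.header.line.count"="1");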

Considerations for string data

The Open CSV SerDe has the following characteristics for string data:

  • Uses double quotes (") as the default quote character, and allows you to specify separator, quote, and escape characters, such as:

    WITH SERDEPROPERTIES ("separatorChar" = ",", "quoteChar" = "`", "escapeChar" = "\\" )
  • You cannot escape \t or \n directly. To escape them, use "escapeChar" = "\\". For an example, see Example: Escaping \t or \n.

  • The Open CSV SerDe does not support embedded line breaks in CSV files.

Considerations for non-string data

For data types other than STRING, the Open CSV SerDe behaves as follows:

  • Recognizes BOOLEAN, BIGINT, INT, and DOUBLE data types.

  • Does not recognize empty or null values in columns defined as a numeric data type and leaves them as strings. One workaround is to create the column that has the null values as string and then use CAST to convert the field in a query to a numeric data type, supplying a default value of 0 for nulls (a sketch of this workaround follows this list). For more information, see When I query CSV data in Athena, I get the error HIVE_BAD_DATA: Error parsing field value in the AWS Knowledge Center.

  • For columns specified with the timestamp data type in your CREATE TABLE statement, recognizes TIMESTAMP data if it is specified in the UNIX numeric format in milliseconds, such as 1579059880000. For an example, see Example: Using the TIMESTAMP type and DATE type specified in the UNIX numeric format.

    • The Open CSV SerDe does not support TIMESTAMP in the JDBC-compliant java.sql.Timestamp format, such as "YYYY-MM-DD HH:MM:SS.fffffffff" (9 decimal places of precision).

  • For columns specified with the DATE data type in your CREATE TABLE statement, recognizes values as dates if the values represent the number of days that elapsed since January 1, 1970. For example, the value 18276 in a column with the date data type renders as 2020-01-15 when queried. In this UNIX format, each day is considered to have 86,400 seconds.

  • To further convert columns to the desired type, you can create a view over the table and use CAST in the view definition, as shown in the sketch that follows this list.
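As a sketch of the workaround for numeric columns that contain empty or null values, assume a hypothetical table sales_raw in which the amount column was created as string. Instead of a plain CAST, the view below uses TRY_CAST, an Athena SQL function that returns NULL when a value cannot be parsed, together with COALESCE to supply a default of 0.

CREATE OR REPLACE VIEW sales_typed AS
SELECT
  id,
  -- Empty or unparsable strings become NULL via TRY_CAST, then default to 0
  COALESCE(TRY_CAST(amount AS bigint), 0) AS amount
FROM sales_raw;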

Examples

Example: Querying simple CSV data

The following example assumes you have CSV data saved in the location s3://amzn-s3-demo-bucket/mycsv/ with the following contents:

"a1","a2","a3","a4" "1","2","abc","def" "a","a1","abc3","ab4"

Use a CREATE TABLE statement to create an Athena table based on the data. Reference OpenCSVSerde (note the "d" in lowercase) after ROW FORMAT SERDE and specify the separator character, quote character, and escape character in WITH SERDEPROPERTIES, as in the following example.

CREATE EXTERNAL TABLE myopencsvtable (
  col1 string,
  col2 string,
  col3 string,
  col4 string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'separatorChar' = ',',
  'quoteChar' = '"',
  'escapeChar' = '\\'
)
STORED AS TEXTFILE
LOCATION 's3://amzn-s3-demo-bucket/mycsv/';

Query all values in the table:

SELECT * FROM myopencsvtable;

The query returns the following values:

col1    col2    col3    col4
-----------------------------
a1      a2      a3      a4
1       2       abc     def
a       a1      abc3    ab4

Example: Using the TIMESTAMP type and DATE type specified in the UNIX numeric format

Consider the following three columns of comma-separated data. The values in each column are enclosed in double quotes.

"unixvalue creationdate 18276 creationdatetime 1579059880000","18276","1579059880000"

The following statement creates a table in Athena from the specified Amazon S3 bucket location.

CREATE EXTERNAL TABLE IF NOT EXISTS testtimestamp1(
  `profile_id` string,
  `creationdate` date,
  `creationdatetime` timestamp
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
LOCATION 's3://amzn-s3-demo-bucket'

Next, run the following query:

SELECT * FROM testtimestamp1

The query returns the following result, showing the date and time data:

profile_id                                                     creationdate    creationdatetime
unixvalue creationdate 18276 creationdatetime 1579059880000    2020-01-15      2020-01-15 03:44:40.000

Example: Escaping \t or \n

Consider the following test data:

" \\t\\t\\n 123 \\t\\t\\n ",abc " 456 ",xyz

The following statement creates a table in Athena, specifying that "escapeChar" = "\\".

CREATE EXTERNAL TABLE test1 (
  f1 string,
  s2 string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ("separatorChar" = ",", "escapeChar" = "\\")
LOCATION 's3://amzn-s3-demo-bucket/dataset/test1/'

Next, run the following query:

SELECT * FROM test1;

It returns this result, correctly escaping \t or \n:

f1                   s2
\t\t\n 123 \t\t\n    abc
456                  xyz