OpenCSVSerDe for processing CSV
When you create an Athena table for CSV data, determine the SerDe to use based on the types of values your data contains:
-
If your data contains values enclosed in double quotes (
"
), you can use the OpenCSV SerDeto deserialize the values in Athena. If your data does not contain values enclosed in double quotes ( "
), you can omit specifying any SerDe. In this case, Athena uses the defaultLazySimpleSerDe
. For information, see LazySimpleSerDe for CSV, TSV, and custom-delimited files. -
If your data has UNIX numeric
TIMESTAMP
values (for example,1579059880000
), use the OpenCSVSerDe. If your data uses thejava.sql.Timestamp
format, use the LazySimpleSerDe.
CSV SerDe (OpenCSVSerDe)
The OpenCSV
SerDe
-
Uses double quotes (
"
) as the default quote character, and allows you to specify separator, quote, and escape characters, such as:WITH SERDEPROPERTIES ("separatorChar" = ",", "quoteChar" = "`", "escapeChar" = "\\" )
-
Cannot escape
\t
or\n
directly. To escape them, use"escapeChar" = "\\"
. See the example in this topic. -
Does not support embedded line breaks in CSV files.
For data types other than STRING
, the OpenCSVSerDe behaves as
follows:
-
Recognizes
BOOLEAN
,BIGINT
,INT
, andDOUBLE
data types. -
Does not recognize empty or null values in columns defined as a numeric data type, leaving them as
string
. One workaround is to create the column with the null values asstring
and then useCAST
to convert the field in a query to a numeric data type, supplying a default value of0
for nulls. For more information, see When I query CSV data in Athena, I get the error HIVE_BAD_DATA: Error parsing field valuein the AWS Knowledge Center. -
For columns specified with the
timestamp
data type in yourCREATE TABLE
statement, recognizesTIMESTAMP
data if it is specified in the UNIX numeric format in milliseconds, such as1579059880000
.-
The OpenCSVSerDe does not support
TIMESTAMP
in the JDBC-compliantjava.sql.Timestamp
format, such as"YYYY-MM-DD HH:MM:SS.fffffffff"
(9 decimal place precision).
-
-
For columns specified with the
DATE
data type in yourCREATE TABLE
statement, recognizes values as dates if the values represent the number of days that elapsed since January 1, 1970. For example, the value18276
in a column with thedate
data type renders as2020-01-15
when queried. In this UNIX format, each day is considered to have 86,400 seconds.-
The OpenCSVSerDe does not support
DATE
in any other format directly. To process timestamp data in other formats, you can define the column asstring
and then use time conversion functions to return the desired results in yourSELECT
query. For more information, see the article When I query a table in Amazon Athena, the TIMESTAMP result is emptyin the AWS knowledge center .
-
-
To further convert columns to the desired type in a table, you can create a view over the table and use
CAST
to convert to the desired type.
Example: Using the TIMESTAMP type and DATE type specified in the UNIX numeric format.
Consider the following three columns of comma-separated data. The values in each column are enclosed in double quotes.
"unixvalue creationdate 18276 creationdatetime 1579059880000","18276","1579059880000"
The following statement creates a table in Athena from the specified Amazon S3 bucket location.
CREATE EXTERNAL TABLE IF NOT EXISTS testtimestamp1( `profile_id` string, `creationdate` date, `creationdatetime` timestamp ) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' LOCATION 's3://
DOC-EXAMPLE-BUCKET
'
Next, run the following query:
SELECT * FROM testtimestamp1
The query returns the following result, showing the date and time data:
profile_id creationdate creationdatetime unixvalue creationdate 18276 creationdatetime 1579146280000 2020-01-15 2020-01-15 03:44:40.000
Example: Escaping \t
or \n
Consider the following test data:
" \\t\\t\\n 123 \\t\\t\\n ",abc " 456 ",xyz
The following statement creates a table in Athena, specifying that
"escapeChar" = "\\"
.
CREATE EXTERNAL TABLE test1 ( f1 string, s2 string) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' WITH SERDEPROPERTIES ("separatorChar" = ",", "escapeChar" = "\\") LOCATION 's3://
DOC-EXAMPLE-BUCKET
/dataset/test1/'
Next, run the following query:
SELECT * FROM test1;
It returns this result, correctly escaping \t
or
\n
:
f1 s2 \t\t\n 123 \t\t\n abc 456 xyz
SerDe name
Library name
To use this SerDe, specify its fully qualified class name after ROW FORMAT
SERDE
. Also specify the delimiters inside SERDEPROPERTIES
,
as follows:
... ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' WITH SERDEPROPERTIES ( "separatorChar" = ",", "quoteChar" = "`", "escapeChar" = "\\" )
Ignoring headers
To ignore headers in your data when you define a table, you can use the
skip.header.line.count
table property, as in the following
example.
TBLPROPERTIES ("skip.header.line.count"="1")
For examples, see the CREATE TABLE
statements in Querying Amazon VPC flow logs and Querying Amazon CloudFront logs.
Example
This example presumes data in CSV saved in
s3://
with
the following contents:DOC-EXAMPLE-BUCKET
/mycsv/
"a1","a2","a3","a4"
"1","2","abc","def"
"a","a1","abc3","ab4"
Use a CREATE TABLE
statement to create an Athena table based on the
data. Reference the OpenCSVSerDe class after ROW FORMAT SERDE
and
specify the character separator, quote character, and escape character in WITH
SERDEPROPERTIES
, as in the following example.
CREATE EXTERNAL TABLE myopencsvtable ( col1 string, col2 string, col3 string, col4 string ) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' WITH SERDEPROPERTIES ( 'separatorChar' = ',', 'quoteChar' = '"', 'escapeChar' = '\\' ) STORED AS TEXTFILE LOCATION 's3://
DOC-EXAMPLE-BUCKET
/mycsv/';
Query all values in the table:
SELECT * FROM myopencsvtable;
The query returns the following values:
col1 col2 col3 col4
-----------------------------
a1 a2 a3 a4
1 2 abc def
a a1 abc3 ab4