OpenCSVSerDe for Processing CSV
When you create a table from CSV data in Athena, determine what types of values it contains:
-
If data contains values enclosed in double quotes (
"
), you can use the OpenCSV SerDeto deserialize the values in Athena. In the following sections, note the behavior of this SerDe with STRING
data types. -
If data does not contain values enclosed in double quotes (
"
), you can omit specifying any SerDe. In this case, Athena uses the defaultLazySimpleSerDe
. For information, see LazySimpleSerDe for CSV, TSV, and Custom-Delimited Files.
CSV SerDe (OpenCSVSerDe)
The OpenCSV
SerDe
-
Converts all column type values to
STRING
. -
To recognize data types other than
STRING
, relies on the Presto parser and converts the values fromSTRING
into those data types if it can recognize them. -
Uses double quotes (
"
) as the default quote character, and allows you to specify separator, quote, and escape characters, such as:WITH SERDEPROPERTIES ("separatorChar" = ",", "quoteChar" = "`", "escapeChar" = "\\" )
-
Cannot escape
\t
or\n
directly. To escape them, use"escapeChar" = "\\"
. See the example in this topic. -
Does not support embedded line breaks in CSV files.
-
Does not support empty fields in columns defined as a numeric data type.
When you use Athena with OpenCSVSerDe, the SerDe converts all column types to
STRING
. Next, the parser in Athena parses the values from
STRING
into actual types based on what it finds. For example, it
parses the values into BOOLEAN
, BIGINT
, INT
,
and DOUBLE
data types when it can discern them. If the values are in
TIMESTAMP
in the UNIX format, Athena parses them as
TIMESTAMP
. If the values are in TIMESTAMP
in Hive
format, Athena parses them as INT
. DATE
type values are
also parsed as INT
.
To further convert columns to the desired type in a table, you can create a view over the table and use CAST
to
convert to the desired type.
For data types other than STRING
, when the parser in
Athena can recognize them, this SerDe behaves as follows:
-
Recognizes
BOOLEAN
,BIGINT
,INT
, andDOUBLE
data types and parses them without changes. The parser does not recognize empty or null values in columns defined as a numeric data type, leaving them as the default data type ofSTRING
. The workaround is to declare the column asSTRING
and thenCAST
it in aSELECT
query or view. -
Recognizes the
TIMESTAMP
type if it is specified in the UNIX numeric format, such as1564610311
. -
Does not support
TIMESTAMP
in the JDBC-compliantjava.sql.Timestamp
format, such as"YYYY-MM-DD HH:MM:SS.fffffffff"
(9 decimal place precision). If you are processing CSV data from Hive, use the UNIX numeric format. -
Recognizes the
DATE
type if it is specified in the UNIX numeric format, such as1562112000
. -
Does not support
DATE
in another format. If you are processing CSV data from Hive, use the UNIX numeric format.
For information about using the TIMESTAMP
and DATE
columns when they are not specified in the UNIX numeric format, see the article
When I query a table in Amazon Athena, the TIMESTAMP result is empty
Example: Using the TIMESTAMP type and DATE type specified in the UNIX numeric format.
Consider the following test data:
"unixvalue creationdate 18276 creationdatetime 1579146280000","18276","1579146280000"
The following statement creates a table in Athena from the specified Amazon S3 bucket location.
CREATE EXTERNAL TABLE IF NOT EXISTS testtimestamp1( `profile_id` string, `creationdate` date, `creationdatetime` timestamp ) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' LOCATION 's3://
<location>
'
Next, run the following query:
select * from testtimestamp1
The query returns the following result, showing the date and time data:
profile_id creationdate creationdatetime 1 unixvalue creationdate 18276 creationdatetime 1579146280000 2020-01-15 2020-01-16 03:44:40.000
Example: Escaping \t
or \n
Consider the following test data:
" \\t\\t\\n 123 \\t\\t\\n ",abc " 456 ",xyz
The following statement creates a table in Athena, specifying that
"escapeChar" = "\\"
.
CREATE EXTERNAL TABLE test1 ( f1 string, s2 string) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' WITH SERDEPROPERTIES ("separatorChar" = ",", "escapeChar" = "\\") LOCATION 's3://
user-test-region
/dataset/test1/'
Next, run the following query:
select * from test1;
It returns this result, correctly escaping \t
or
\n
:
f1 s2 \t\t\n 123 \t\t\n abc 456 xyz
SerDe Name
Library Name
To use this SerDe, specify its fully qualified class name in ROW
FORMAT
. Also specify the delimiters inside SERDEPROPERTIES
,
as follows:
... ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' WITH SERDEPROPERTIES ( "separatorChar" = ",", "quoteChar" = "`", "escapeChar" = "\\" )
Ignoring Headers
To ignore headers in your data when you define a table, you can use the skip.header.line.count
table property, as in the following example.
TBLPROPERTIES ("skip.header.line.count"="1")
For examples, see the CREATE TABLE
statements in Querying Amazon VPC Flow Logs and Querying Amazon CloudFront Logs.
Example
This example presumes data in CSV saved in s3://mybucket/mycsv/
with
the following contents:
"a1","a2","a3","a4"
"1","2","abc","def"
"a","a1","abc3","ab4"
Use a CREATE TABLE
statement to create an Athena table based on the
data, and reference the OpenCSVSerDe class in ROW FORMAT
, also
specifying SerDe properties for character separator, quote character, and escape
character, as follows:
CREATE EXTERNAL TABLE myopencsvtable ( col1 string, col2 string, col3 string, col4 string ) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' WITH SERDEPROPERTIES ( 'separatorChar' = ',', 'quoteChar' = '"', 'escapeChar' = '\\' ) STORED AS TEXTFILE LOCATION 's3://location/of/csv/';
Query all values in the table:
SELECT * FROM myopencsvtable;
The query returns the following values:
col1 col2 col3 col4
-----------------------------
a1 a2 a3 a4
1 2 abc def
a a1 abc3 ab4
The flight table data comes from Flights