OpenCSVSerDe for Processing CSV
When you create a table from a CSV file in Athena, determine what types of values it contains:
-
If the file contains values enclosed in quotes, use the OpenCSV SerDe to deserialize the values in Athena.
-
If the file does not contain values enclosed in quotes, you can omit specifying any SerDe. In this case, Athena uses the default
LazySimpleSerDe
. For information, see LazySimpleSerDe for CSV, TSV, and Custom-Delimited Files.
CSV SerDe (OpenCSVSerde)
The OpenCSV SerDe behaves as follows:
-
Allows you to specify separator, quote, and escape characters, such as:
WITH SERDEPROPERTIES ("separatorChar" = ",", "quoteChar" = "`", "escapeChar" = "\\" )
-
Does not support embedded line breaks in CSV files.
-
Converts all column type values to
STRING
. -
To recognize data types other than
STRING
, relies on the Presto parser and converts the values fromSTRING
into those data types if it can recognize them.
In particular, for data types other than STRING
this SerDe behaves as follows:
-
Recognizes
BOOLEAN
,BIGINT
,INT
, andDOUBLE
data types and parses them without changes. -
Recognizes the
TIMESTAMP
type if it is specified in the UNIX format, such asyyyy-mm-dd hh:mm:ss[.f...]
, as the typeLONG
. -
Does not support
TIMESTAMP
in the JDBC-compliantjava.sql.Timestamp
format, such as"YYYY-MM-DD HH:MM:SS.fffffffff"
(9 decimal place precision). If you are processing CSV files from Hive, use the UNIX format forTIMESTAMP
. -
Recognizes the
DATE
type if it is specified in the UNIX format, such asYYYY-MM-DD
, as the typeLONG
. -
Does not support
DATE
in another format. If you are processing CSV files from Hive, use the UNIX format forDATE
.
SerDe Name
Library Name
To use this SerDe, specify its fully qualified class name in ROW FORMAT
, also specify the delimiters inside SERDEPROPERTIES
, as follows:
... ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' WITH SERDEPROPERTIES ( "separatorChar" = ",", "quoteChar" = "`", "escapeChar" = "\\" )
Example
This example presumes a source CSV file saved in s3://mybucket/mycsv/
with the following data contents:
"a1","a2","a3","a4" "1","2","abc","def" "a","a1","abc3","ab4"
Use a CREATE TABLE
statement to create an Athena table based on this CSV file and reference the OpenCSVSerDe
class in ROW FORMAT
, also specifying SerDe properties for character separator, quote character, and escape
character.
CREATE EXTERNAL TABLE myopencsvtable ( col1 string, col2 string, col3 string, col4 string ) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' WITH SERDEPROPERTIES ( 'separatorChar' = ',', 'quoteChar' = '\"', 'escapeChar' = '\\' ) STORED AS TEXTFILE LOCATION 's3://location/of/csv/';
Query all values in the table.
SELECT * FROM myopencsvtable;
The query returns the following values.
col1 col2 col3 col4 ----------------------------- a1 a2 a3 a4 1 2 abc def a a1 abc3 ab4
Note
The flight table data comes from Flights provided by US Department of Transportation, Bureau of Transportation Statistics. Desaturated from original.