Using path extractors
Amazon Ion is a document style file format, but Apache Hive is a flat columnar format.
You can use special Amazon Ion SerDe properties called path extractors
to map
between the two formats. Path extractors flatten the hierarchical Amazon Ion format, map Amazon Ion
values to Hive columns, and can be used to rename fields.
Athena can generate the extractors for you, but you can also define your own extractors if necessary.
Generated path extractors
By default, Athena searches for top level Amazon Ion values that match Hive column names and creates path extractors at runtime based on these matching values. If your Amazon Ion data format matches the Hive table schema, Athena dynamically generates the extractors for you, and you do not need to add any additional path extractors. These default path extractors are not stored in the table metadata.
The following example shows how Athena generates extractors based on column name.
-- Example Amazon Ion Document { identification: { name: "John Smith", driver_license: "XXXX" }, alias: "Johnny" } -- Example DDL CREATE EXTERNAL TABLE example_schema2 ( identification MAP<STRING, STRING>, alias STRING ) STORED AS ION LOCATION 's3://
DOC-EXAMPLE-BUCKET
/path_extraction1/'
The following example extractors are generated by Athena. The first extracts the
identification
field to the identification
column, and
the second extracts the alias
field to the alias
column.
'ion.identification.path_extractor' = '(identification)' 'ion.alias.path_extractor' = '(alias)'
The following example shows the extracted table.
| identification | alias | |----------------------------------------------------|----------| |{["name", "driver_license"],["John Smith", "XXXX"]} | "Johnny" |
Specifying your own path extractors
If your Amazon Ion fields do not map neatly to Hive columns, you can specify your own
path extractors. In the WITH SERDEPROPERTIES
clause of your
CREATE TABLE
statement, use the following syntax.
WITH SERDEPROPERTIES ( "ion.path_extractor.case_sensitive" = "
<Boolean>
", "ion.<column_name>
.path_extractor" = "<path_extractor_expression>
" )
By default, path extractors are case insensitive. To override this setting,
set the ion.path_extractor.case_sensitive SerDe property to
true
.
Using search paths in path extractors
The SerDe property syntax for path extractor contains a
<path_extractor_expression>
:
"ion.
<column_name>
.path_extractor" = "<path_extractor_expression>
"
You can use the <path_extractor_expression>
to specify a search path that parses the Amazon Ion document and finds matching data.
The search path is enclosed in parenthesis and can contain one or more of the
following components separated by spaces.
-
Wild card – Matches all values.
-
Index – Matches the value at the specified numerical index. Indices are zero-based.
-
Text – Matches all values whose field names match are equivalent to the specified text.
-
Annotations – Matches values specified by a wrapped path component that has the annotations specified.
The following example shows an Amazon Ion document and some example search paths.
-- Amazon Ion document { foo: ["foo1", "foo2"] , bar: "myBarValue", bar: A::"annotatedValue" } -- Example search paths (foo 0) # matches "foo1" (1) # matches "myBarValue" (*) # matches ["foo1", "foo2"], "myBarValue" and A::"annotatedValue" () # matches {foo: ["foo1", "foo2"] , bar: "myBarValue", bar: A::"annotatedValue"} (bar) # matches "myBarValue" and A::"annotatedValue" (A::bar) # matches A::"annotatedValue"
Extractor examples
Flattening and renaming fields
The following example shows a set of search paths that flatten and rename fields. The example uses search paths to do the following:
-
Map the
nickname
column to thealias
field -
Map the
name
column to thename
subfield located in theidentification
struct.
Following is the example Amazon Ion document.
-- Example Amazon Ion Document { identification: { name: "John Smith", driver_license: "XXXX" }, alias: "Johnny" }
The following is the example CREATE TABLE
statement that defines
the path extractors.
-- Example DDL Query CREATE EXTERNAL TABLE example_schema2 ( name STRING, nickname STRING ) ROW FORMAT SERDE 'com.amazon.ionhiveserde.IonHiveSerDe' WITH SERDEPROPERTIES ( 'ion.nickname.path_extractor' = '(alias)', 'ion.name.path_extractor' = '(identification name)' ) STORED AS ION LOCATION 's3://
DOC-EXAMPLE-BUCKET
/path_extraction2/'
The following example shows the extracted data.
-- Extracted Table | name | nickname | |--------------|--------------| | "John Smith" | "Johnny" |
For more information about search paths and additional search path examples,
see the Ion Java Path Extraction
Extracting flight data to text format
The following example CREATE TABLE
query uses WITH
SERDEPROPERTIES
to add path extractors to extract flight data and
specify the output encoding as Amazon Ion text. The example uses the STORED AS
ION
syntax.
CREATE EXTERNAL TABLE flights_ion ( yr INT, quarter INT, month INT, dayofmonth INT, dayofweek INT, flightdate STRING, uniquecarrier STRING, airlineid INT, ) ROW FORMAT SERDE 'com.amazon.ionhiveserde.IonHiveSerDe' WITH SERDEPROPERTIES ( 'ion.encoding' = 'TEXT', 'ion.yr.path_extractor'='(year)', 'ion.quarter.path_extractor'='(results quarter)', 'ion.month.path_extractor'='(date month)') STORED AS ION LOCATION 's3://
DOC-EXAMPLE-BUCKET
/'