Using path extractors - Amazon Athena

Using path extractors

Amazon Ion is a document style file format, but Apache Hive is a flat columnar format. You can use special Amazon Ion SerDe properties called path extractors to map between the two formats. Path extractors flatten the hierarchical Amazon Ion format, map Amazon Ion values to Hive columns, and can be used to rename fields.

Athena can generate the extractors for you, but you can also define your own extractors if necessary.

Generated path extractors

By default, Athena searches for top level Amazon Ion values that match Hive column names and creates path extractors at runtime based on these matching values. If your Amazon Ion data format matches the Hive table schema, Athena dynamically generates the extractors for you, and you do not need to add any additional path extractors. These default path extractors are not stored in the table metadata.

The following example shows how Athena generates extractors based on column name.

-- Example Amazon Ion Document { identification: { name: "John Smith", driver_license: "XXXX" }, alias: "Johnny" } -- Example DDL CREATE EXTERNAL TABLE example_schema2 ( identification MAP<STRING, STRING>, alias STRING ) STORED AS ION LOCATION 's3://DOC-EXAMPLE-BUCKET/path_extraction1/'

The following example extractors are generated by Athena. The first extracts the identification field to the identification column, and the second extracts the alias field to the alias column.

'ion.identification.path_extractor' = '(identification)' 'ion.alias.path_extractor' = '(alias)'

The following example shows the extracted table.

| identification | alias | |----------------------------------------------------|----------| |{["name", "driver_license"],["John Smith", "XXXX"]} | "Johnny" |

Specifying your own path extractors

If your Amazon Ion fields do not map neatly to Hive columns, you can specify your own path extractors. In the WITH SERDEPROPERTIES clause of your CREATE TABLE statement, use the following syntax.

WITH SERDEPROPERTIES ( "ion.path_extractor.case_sensitive" = "<Boolean>", "ion.<column_name>.path_extractor" = "<path_extractor_expression>" )

By default, path extractors are case insensitive. To override this setting, set the ion.path_extractor.case_sensitive SerDe property to true.

Using search paths in path extractors

The SerDe property syntax for path extractor contains a <path_extractor_expression>:

"ion.<column_name>.path_extractor" = "<path_extractor_expression>"

You can use the <path_extractor_expression> to specify a search path that parses the Amazon Ion document and finds matching data. The search path is enclosed in parenthesis and can contain one or more of the following components separated by spaces.

  • Wild card – Matches all values.

  • Index – Matches the value at the specified numerical index. Indices are zero-based.

  • Text – Matches all values whose field names match are equivalent to the specified text.

  • Annotations – Matches values specified by a wrapped path component that has the annotations specified.

The following example shows an Amazon Ion document and some example search paths.

-- Amazon Ion document { foo: ["foo1", "foo2"] , bar: "myBarValue", bar: A::"annotatedValue" } -- Example search paths (foo 0) # matches "foo1" (1) # matches "myBarValue" (*) # matches ["foo1", "foo2"], "myBarValue" and A::"annotatedValue" () # matches {foo: ["foo1", "foo2"] , bar: "myBarValue", bar: A::"annotatedValue"} (bar) # matches "myBarValue" and A::"annotatedValue" (A::bar) # matches A::"annotatedValue"

Extractor examples

Flattening and renaming fields

The following example shows a set of search paths that flatten and rename fields. The example uses search paths to do the following:

  • Map the nickname column to the alias field

  • Map the name column to the name subfield located in the identification struct.

Following is the example Amazon Ion document.

-- Example Amazon Ion Document { identification: { name: "John Smith", driver_license: "XXXX" }, alias: "Johnny" }

The following is the example CREATE TABLE statement that defines the path extractors.

-- Example DDL Query CREATE EXTERNAL TABLE example_schema2 ( name STRING, nickname STRING ) ROW FORMAT SERDE '' WITH SERDEPROPERTIES ( 'ion.nickname.path_extractor' = '(alias)', '' = '(identification name)' ) STORED AS ION LOCATION 's3://DOC-EXAMPLE-BUCKET/path_extraction2/'

The following example shows the extracted data.

-- Extracted Table | name | nickname | |--------------|--------------| | "John Smith" | "Johnny" |

For more information about search paths and additional search path examples, see the Ion Java Path Extraction page on GitHub.

Extracting flight data to text format

The following example CREATE TABLE query uses WITH SERDEPROPERTIES to add path extractors to extract flight data and specify the output encoding as Amazon Ion text. The example uses the STORED AS ION syntax.

CREATE EXTERNAL TABLE flights_ion ( yr INT, quarter INT, month INT, dayofmonth INT, dayofweek INT, flightdate STRING, uniquecarrier STRING, airlineid INT, ) ROW FORMAT SERDE '' WITH SERDEPROPERTIES ( 'ion.encoding' = 'TEXT', 'ion.yr.path_extractor'='(year)', 'ion.quarter.path_extractor'='(results quarter)', 'ion.month.path_extractor'='(date month)') STORED AS ION LOCATION 's3://DOC-EXAMPLE-BUCKET/'