Amazon Athena

Grok

Logstash Grok is a filter that you can use with Amazon Athena by specifying the Grok SerDe when you create a table. Grok applies patterns to deserialize unstructured text files, usually logs. Each Grok pattern is essentially a named regular expression, so the patterns are easier to read, identify, and reuse than raw regular expressions. Grok provides a set of predefined patterns. You can also create custom patterns.
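To illustrate the idea that a Grok pattern is a named regular expression, the following Python sketch expands %{NAME} and %{NAME:field} references into an ordinary regular expression. It is a simplified illustration, not the Grok SerDe's implementation: the PATTERNS table, the grok_to_regex function, and the sample input are hypothetical, and the real predefined patterns are more thorough.

```python
import re

# Hypothetical, heavily simplified pattern library (assumption for illustration).
PATTERNS = {
    "INT": r"[+-]?\d+",
    "WORD": r"\w+",
    "GREEDYDATA": r".*",
}

# Matches %{NAME} or %{NAME:field}.
GROK_REF = re.compile(r"%\{(\w+)(?::(\w+))?\}")


def grok_to_regex(expression, patterns=PATTERNS):
    """Expand Grok references in `expression` into a plain regular expression."""

    def expand(match):
        name, field = match.group(1), match.group(2)
        # A pattern body may itself reference other patterns, so expand recursively.
        body = grok_to_regex(patterns[name], patterns)
        if field:
            return f"(?P<{field}>{body})"  # %{NAME:field} becomes a named group
        return f"(?:{body})"  # %{NAME} becomes a non-capturing group

    return GROK_REF.sub(expand, expression)


regex = grok_to_regex("%{WORD:method} %{INT:status}")
fields = re.match(regex, "GET 200").groupdict()
print(regex)
print(fields)
```

The named groups in the expanded expression play the role that column names play in a table definition: each %{NAME:field} reference yields one extracted value.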

Specify the Grok SerDe by using the ROW FORMAT SERDE 'com.amazonaws.glue.serde.GrokSerDe' clause, followed by the WITH SERDEPROPERTIES clause, which specifies the patterns to match in your data. Within this clause, the input.format expression is required and defines the patterns to match in the data file. The input.grokCustomPatterns expression is optional and defines a named custom pattern, which you can then use within the input.format expression.

The STORED AS INPUTFORMAT and OUTPUTFORMAT clauses shown in the following example are required. The LOCATION clause specifies an Amazon S3 location (a bucket or a folder within a bucket), which can contain multiple source data files. All files in that location are deserialized to create the table.

Example

This example uses a single fictional text file saved in s3://mybucket/groksample with the following data, which represents Postfix maillog entries.

Feb  9 07:15:00 m4eastmail postfix/smtpd[19305]: B88C4120838: connect from unknown[192.168.55.4]
Feb  9 07:15:00 m4eastmail postfix/smtpd[20444]: B58C4330038: client=unknown[192.168.55.4]
Feb  9 07:15:03 m4eastmail postfix/cleanup[22835]: BDC22A77854: message-id=<31221401257553.5004389LCBF@m4eastmail.example.com>

The following statement creates a table in Athena called mygroktable from the source data file, using a custom pattern and the predefined patterns that you specify.

CREATE EXTERNAL TABLE `mygroktable`(
   `SYSLOGBASE` string,
   `queue_id` string,
   `syslog_message` string
   )
ROW FORMAT SERDE
   'com.amazonaws.glue.serde.GrokSerDe'
WITH SERDEPROPERTIES (
   'input.grokCustomPatterns' = 'POSTFIX_QUEUEID [0-9A-F]{7,12}',
   'input.format' = '%{SYSLOGBASE} %{POSTFIX_QUEUEID:queue_id}: %{GREEDYDATA:syslog_message}'
   )
STORED AS INPUTFORMAT
   'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
   's3://mybucket/groksample';
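
To see roughly how the input.format expression in this statement splits a log line into the three columns, the following Python sketch applies an approximate equivalent regular expression to two of the sample lines. The SYSLOGBASE expression here is a simplified stand-in (an assumption) for the predefined Grok pattern, and the lowercase group name syslogbase is chosen for illustration only. POSTFIX_QUEUEID is the custom pattern from the statement.

```python
import re

# Simplified stand-in for the predefined SYSLOGBASE pattern (assumption):
# timestamp, host, program name with optional [pid], and a trailing colon.
SYSLOGBASE = r"\w{3} +\d{1,2} \d{2}:\d{2}:\d{2} \S+ [\w./-]+(?:\[\d+\])?:"
# Custom pattern taken from input.grokCustomPatterns in the statement.
POSTFIX_QUEUEID = r"[0-9A-F]{7,12}"
# GREEDYDATA matches the rest of the line.
GREEDYDATA = r".*"

# Mirrors '%{SYSLOGBASE} %{POSTFIX_QUEUEID:queue_id}: %{GREEDYDATA:syslog_message}'.
LINE = re.compile(
    rf"(?P<syslogbase>{SYSLOGBASE}) "
    rf"(?P<queue_id>{POSTFIX_QUEUEID}): "
    rf"(?P<syslog_message>{GREEDYDATA})"
)

samples = [
    "Feb  9 07:15:00 m4eastmail postfix/smtpd[19305]: B88C4120838: connect from unknown[192.168.55.4]",
    "Feb  9 07:15:03 m4eastmail postfix/cleanup[22835]: BDC22A77854: message-id=<31221401257553.5004389LCBF@m4eastmail.example.com>",
]

# Each matched line yields one row with three string fields.
rows = [LINE.match(line).groupdict() for line in samples]
for row in rows:
    print(row)
```

Under these assumptions, each line produces one row whose queue_id holds the Postfix queue ID and whose syslog_message holds the text after it.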
