Amazon Athena
User Guide

Querying Amazon VPC Flow Logs

Amazon Virtual Private Cloud flow logs capture information about the IP traffic going to and from network interfaces in a VPC. Use the logs to investigate network traffic patterns and identify threats and risks across your VPC network.

Before you begin querying the logs in Athena, enable VPC flow logs and export log data to Amazon S3. After you create the logs, let them run for a few minutes to collect some data.

Creating the Table for VPC Flow Logs

To create the Amazon VPC table

  1. Copy and paste the following DDL statement into the Athena console.

  2. Modify the LOCATION 's3://your_log_bucket/prefix/' to point to the S3 bucket that contains your log data.

    CREATE EXTERNAL TABLE IF NOT EXISTS vpc_flow_logs ( ts string, version int, account string, interfaceid string, sourceaddress string, destinationaddress string, sourceport int, destinationport int, protocol int, numpackets int, numbytes int, starttime int, endtime int, action string, logstatus string ) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe' WITH SERDEPROPERTIES ( "input.regex" = "^([^ ]+)\\s+([0-9]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([0-9]+)\\s+([0-9]+)\\s+([0-9]+)\\s+([0-9]+)\\s+([0-9]+)\\s+([0-9]+)\\s+([0-9]+)\\s+([^ ]+)\\s+([^ ]+)$" ) LOCATION 's3://your_log_bucket/prefix/';
  3. Run the query in Athena console. After the query completes, Athena registers the vpc_flow_logs table, making the data in it ready for you to issue queries.

Example Queries for Amazon VPC Flow Logs

The following query lists all of the rejected TCP connections. The query uses Date and Time Functions and Operators to convert the timestamp field ts, and extracts only the day of the week for which these events occurred.

SELECT day_of_week(from_iso8601_timestamp(ts)) AS day, interfaceid, sourceaddress, action, protocol FROM vpc_flow_logs WHERE action = 'REJECT' AND protocol = 6 LIMIT 100;

To see which one of your servers is receiving the highest number of HTTPS requests, use this query. It counts the number of packets received on HTTPS port 443, groups them by destination IP address, and returns the top 10.

SELECT SUM(numpackets) AS packetcount, destinationaddress FROM vpc_flow_logs WHERE destinationport = 443 GROUP BY destinationaddress ORDER BY packetcount DESC LIMIT 10;

For more information, see the AWS Big Data blog post Analyzing VPC Flow Logs with Amazon Kinesis Firehose, Athena, and Amazon QuickSight.