Querying flow logs using Amazon Athena

Amazon Athena is an interactive query service that enables you to analyze data in Amazon S3, such as your flow logs, using standard SQL. You can use Athena with VPC Flow Logs to quickly get actionable insights about the traffic flowing through your VPC. For example, you can identify which resources in your virtual private clouds (VPCs) are the top talkers or identify the IP addresses with the most rejected TCP connections.

You can streamline and automate the integration of your VPC flow logs with Athena by generating a CloudFormation template. The template creates the required AWS resources and a set of predefined queries that you can run to obtain insights about the traffic flowing through your VPC.

The CloudFormation template creates the following resources:

  • An Athena database. The database name is vpcflowlogsathenadatabase<flow-log-subscription-id>.

  • An Athena workgroup. The workgroup name is <flow-log-subscription-id><partition-load-frequency><start-date><end-date>workgroup.

  • A partitioned Athena table that corresponds to your flow log records. The table name is <flow-log-subscription-id><partition-load-frequency><start-date><end-date>.

  • A set of Athena named queries. For more information, see Predefined queries.

  • A Lambda function that loads new partitions to the table on the specified schedule (daily, weekly, or monthly).

  • An IAM role that grants permission to run the Lambda function.

Requirements

  • You must select a Region that supports AWS Lambda and Amazon Athena.

  • The Amazon S3 buckets must be in the selected Region.

Pricing

You incur standard Amazon Athena charges for running queries. You incur standard AWS Lambda charges for the Lambda function that loads new partitions on a recurring schedule (that is, when you specify a partition load frequency but do not specify a start and end date).

Generating the CloudFormation template using the console

After the first flow logs are delivered to your S3 bucket, you can integrate with Athena by generating a CloudFormation template and using the template to create a stack.

To generate the template using the console

  1. Do one of the following:

    • Open the Amazon VPC console. In the navigation pane, choose Your VPCs and then select your VPC.

    • Open the Amazon VPC console. In the navigation pane, choose Subnets and then select your subnet.

    • Open the Amazon EC2 console. In the navigation pane, choose Network Interfaces and then select your network interface.

  2. On the Flow logs tab, select a flow log that publishes to Amazon S3 and then choose Actions, Generate Athena integration.

  3. Specify the partition load frequency. If you choose None, you must specify the partition start and end dates, using dates that are in the past. If you choose Daily, Weekly, or Monthly, the partition start and end dates are optional. If you do not specify start and end dates, the CloudFormation template creates a Lambda function that loads new partitions on a recurring schedule.

  4. Select or create an S3 bucket for the generated template, and an S3 bucket for the query results.

  5. Choose Generate Athena integration.

  6. (Optional) In the success message, choose the link to navigate to the bucket that you specified for the CloudFormation template, and customize the template.

  7. In the success message, choose Create CloudFormation stack to open the Create Stack wizard in the AWS CloudFormation console. The URL for the generated CloudFormation template is specified in the Template section. Complete the wizard to create the resources that are specified in the template.
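The partition rules in step 3 can be summarized in a few lines of code. The following is a minimal sketch, not part of the console or any AWS SDK: the function name creates_partition_lambda is hypothetical, and the sketch assumes that supplying only one of the two dates behaves like supplying both, a case the procedure does not describe.

```python
from datetime import datetime, timezone

RECURRING_FREQUENCIES = {"daily", "weekly", "monthly"}


def creates_partition_lambda(frequency, start=None, end=None, now=None):
    """Return True if, per step 3, the template would create the Lambda function.

    Raises ValueError when the combination of settings is not allowed.
    `start`, `end`, and `now` are timezone-aware datetimes.
    """
    now = now or datetime.now(timezone.utc)
    freq = frequency.lower()

    if freq == "none":
        # With no load frequency, both dates are required and must be in the past.
        if start is None or end is None:
            raise ValueError("frequency None requires a start and an end date")
        if start >= now or end >= now:
            raise ValueError("partition dates must be in the past")
        return False

    if freq not in RECURRING_FREQUENCIES:
        raise ValueError(f"unknown partition load frequency: {frequency!r}")

    # Dates are optional here; the recurring Lambda function is created
    # only when no dates are given (assumption: a single date counts as "given").
    return start is None and end is None
```

For example, creates_partition_lambda("Monthly") returns True, which is also the case in which the Lambda charges described under Pricing apply.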

Generating the CloudFormation template using the AWS CLI

After the first flow logs are delivered to your S3 bucket, you can generate and use a CloudFormation template to integrate with Athena.

Use the following get-flow-logs-integration-template command to generate the CloudFormation template.

aws ec2 get-flow-logs-integration-template --cli-input-json file://config.json

The following is an example of the config.json file.

{
    "FlowLogId": "fl-12345678901234567",
    "ConfigDeliveryS3DestinationArn": "arn:aws:s3:::my-flow-logs-athena-integration/templates/",
    "IntegrateServices": {
        "AthenaIntegrations": [
            {
                "IntegrationResultS3DestinationArn": "arn:aws:s3:::my-flow-logs-analysis/athena-query-results/",
                "PartitionLoadFrequency": "monthly",
                "PartitionStartDate": "2021-01-01T00:00:00",
                "PartitionEndDate": "2021-12-31T00:00:00"
            }
        ]
    }
}
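If you create integrations for several flow logs, you might prefer to generate config.json from a script rather than edit it by hand. The following is a minimal sketch that reproduces the structure of the example above using only the Python standard library. The helper names build_integration_config and write_config are hypothetical, and the sketch assumes that the two date keys can be omitted when you do not specify partition dates.

```python
import json


def build_integration_config(flow_log_id, template_arn, results_arn,
                             frequency, start=None, end=None):
    """Build the input structure for get-flow-logs-integration-template."""
    athena = {
        "IntegrationResultS3DestinationArn": results_arn,
        "PartitionLoadFrequency": frequency,
    }
    # Assumption: the date keys are left out entirely when no dates are given.
    if start is not None:
        athena["PartitionStartDate"] = start
    if end is not None:
        athena["PartitionEndDate"] = end
    return {
        "FlowLogId": flow_log_id,
        "ConfigDeliveryS3DestinationArn": template_arn,
        "IntegrateServices": {"AthenaIntegrations": [athena]},
    }


def write_config(config, path="config.json"):
    """Write the configuration as indented JSON for use with --cli-input-json."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(config, handle, indent=4)
```

Calling write_config(build_integration_config(...)) with the values from the example produces a file that you can pass to the command as file://config.json.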

Use the following create-stack command to create a stack using the generated CloudFormation template.

aws cloudformation create-stack --stack-name my-vpc-flow-logs --template-body file://my-cloudformation-template.json
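Because the template creates an IAM role, CloudFormation typically requires you to acknowledge IAM capabilities when you create the stack. The following sketch assembles the create-stack command as an argument list so that it can be run from a script. The helper name create_stack_command is hypothetical, and the --capabilities CAPABILITY_IAM argument is an assumption that is not part of the command shown above; you might need CAPABILITY_NAMED_IAM instead if the role in your template has a custom name.

```python
def create_stack_command(stack_name, template_path,
                         capabilities=("CAPABILITY_IAM",)):
    """Return the argument list for `aws cloudformation create-stack`.

    `template_path` is a local copy of the generated template.
    """
    command = [
        "aws", "cloudformation", "create-stack",
        "--stack-name", stack_name,
        "--template-body", f"file://{template_path}",
    ]
    # Assumption: acknowledge IAM resource creation; pass an empty tuple to skip.
    if capabilities:
        command += ["--capabilities", *capabilities]
    return command
```

To run the command, pass the list to subprocess.run(command, check=True) in an environment where the AWS CLI is installed and configured.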

Running a predefined query

The generated CloudFormation template provides a set of predefined queries that you can run to quickly get meaningful insights about the traffic in your AWS network. After you create the stack and verify that all resources were created correctly, you can run one of the predefined queries.

To run a predefined query using the console

  1. Open the Athena console. In the Workgroups panel, select the workgroup created by the CloudFormation template.

  2. Select one of the predefined queries, modify the parameters as needed, and then run the query.

  3. Open the Amazon S3 console. Navigate to the bucket that you specified for the query results, and view the results of the query.
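You can also run a predefined query from a script instead of the console. The following is a minimal sketch, assuming an Athena client such as the one returned by boto3.client("athena"). The function name run_named_query is hypothetical. The sketch runs the query text as saved, without modifying any parameters, and assumes that the workgroup has a query result location configured.

```python
def run_named_query(athena, workgroup, query_name):
    """Start the named query `query_name` from `workgroup`.

    `athena` is an Athena client object. Returns the query execution ID.
    Raises LookupError if no named query with that name is found.
    """
    next_token = None
    while True:
        # List the IDs of the named queries in the workgroup, page by page.
        kwargs = {"WorkGroup": workgroup}
        if next_token:
            kwargs["NextToken"] = next_token
        page = athena.list_named_queries(**kwargs)

        for query_id in page.get("NamedQueryIds", []):
            named = athena.get_named_query(NamedQueryId=query_id)["NamedQuery"]
            if named["Name"] == query_name:
                started = athena.start_query_execution(
                    QueryString=named["QueryString"],
                    QueryExecutionContext={"Database": named["Database"]},
                    WorkGroup=workgroup,
                )
                return started["QueryExecutionId"]

        next_token = page.get("NextToken")
        if not next_token:
            raise LookupError(f"no named query {query_name!r} in {workgroup!r}")
```

The function takes the client as an argument rather than creating it, so you can choose the Region and credentials yourself and substitute a stub client when testing.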

Predefined queries

The following are the Athena named queries provided by the generated CloudFormation template:

  • VpcFlowLogsAcceptedTraffic – The TCP connections that were allowed based on your security groups and network ACLs.

  • VpcFlowLogsAdminPortTraffic – The traffic recorded on administrative web app ports.

  • VpcFlowLogsIPv4Traffic – The total bytes of IPv4 traffic recorded.

  • VpcFlowLogsIPv6Traffic – The total bytes of IPv6 traffic recorded.

  • VpcFlowLogsRejectedTCPTraffic – The TCP connections that were rejected based on your security groups or network ACLs.

  • VpcFlowLogsRejectedTraffic – The traffic that was rejected based on your security groups or network ACLs.

  • VpcFlowLogsSshRdpTraffic – The SSH and RDP traffic.

  • VpcFlowLogsTopTalkers – The 50 IP addresses with the most traffic recorded.

  • VpcFlowLogsTopTalkersPacketLevel – The 50 packet-level IP addresses with the most traffic recorded.

  • VpcFlowLogsTopTalkingInstances – The IDs of the 50 instances with the most traffic recorded.

  • VpcFlowLogsTopTalkingSubnets – The IDs of the 50 subnets with the most traffic recorded.

  • VpcFlowLogsTopTCPTraffic – All TCP traffic recorded for a source IP address.

  • VpcFlowLogsTotalBytesTransferred – The 50 pairs of source and destination IP addresses with the most bytes recorded.

  • VpcFlowLogsTotalBytesTransferredPacketLevel – The 50 pairs of packet-level source and destination IP addresses with the most bytes recorded.

  • VpcFlowLogsTrafficFrmSrcAddr – The traffic recorded for a specific source IP address.

  • VpcFlowLogsTrafficToDstAddr – The traffic recorded for a specific destination IP address.
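To give a sense of what such a query involves, the following sketch builds SQL that ranks source IP addresses by their number of rejected TCP flow records. This is an illustration only and is not the text of any named query in the generated template. The function name rejected_tcp_query is hypothetical, and the sketch assumes that the table exposes the default flow log fields srcaddr, protocol, and action as columns with those names.

```python
def rejected_tcp_query(table, limit=50):
    """Return SQL that lists the source addresses with the most rejected TCP records."""
    if '"' in table:
        raise ValueError("table name must not contain double quotes")
    return (
        "SELECT srcaddr, COUNT(*) AS rejected_count\n"
        f'FROM "{table}"\n'
        # Protocol 6 is TCP; REJECT marks traffic that was not allowed.
        "WHERE protocol = 6 AND action = 'REJECT'\n"
        "GROUP BY srcaddr\n"
        "ORDER BY rejected_count DESC\n"
        f"LIMIT {int(limit)}"
    )
```

You can paste the resulting string into the Athena query editor, using the name of the table that the template created.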