AWS CloudTrail is a service that records AWS API calls and events for AWS accounts. CloudTrail generates encrypted log files and stores them in Amazon S3. You can use Athena to query these log files directly from Amazon S3, specifying the
LOCATION of log files. For more information, see the AWS CloudTrail User Guide.
Using Athena with CloudTrail logs is a powerful way to enhance your analysis of AWS service activity. You can use queries to identify trends and further isolate activity by attribute, such as source IP address or user. A common application is to use CloudTrail logs to analyze operational activity for security and compliance. For a detailed example, see the AWS Big Data Blog post, Analyze Security, Compliance, and Operational Activity Using AWS CloudTrail and Amazon Athena.
AWS Cloudtrail saves logs as JSON text files in compressed gzip format (*.json.gzip). The location of log files depends on how you set up trails, the region or regions you are logging, and other factors. For more information about where logs are stored, the JSON structure, and the file format, see the following topics in the AWS CloudTrail User Guide:
Certain fields in CloudTrail logs are string values that may have a variable data format, which depends on the service. As a result, the CloudTrail SerDe is unable to predictably deserialize them. If you need to query the following fields, identify the data pattern and then use a different SerDe, such as the OpenX JSON SerDe.
The CloudTrail SerDe simplifies the process of deserializing data from CloudTrail log files to create Athena tables.
The following example uses the CloudTrail SerDe on a fictional set of log files to create a table based on all log files for an account with the ID
123456789012. Replace the
LOCATION clause with the path to the CloudTrail log location and set of objects with which to work. The example uses a
LOCATION value of logs for a particular account, but you can use the degree of specificity that suits your application. For example, to analyze data from multiple accounts, you can roll back the
LOCATION specifier to indicate all
AWSLogs by using
LOCATION 's3://MyLogFiles/AWSLogs/. To analyze data from a specific date, account, and region, you might use
LOCATION `s3://MyLogFiles/123456789012/CloudTrail/us-east-1/2016/03/14/'. Using the highest level in the object hierarchy gives you the greatest flexibility when you query using Athena.
CREATE EXTERNAL TABLE my_cloudtrail_table ( eventversion STRING, userIdentity STRUCT< type:STRING, principalid:STRING, arn:STRING, accountid:STRING, invokedby:STRING, accesskeyid:STRING, userName:String, sessioncontext:STRUCT< attributes:STRUCT< mfaauthenticated:STRING, creationdate:STRING>, sessionIssuer:STRUCT< type:STRING, principalId:STRING, arn:STRING, accountId:STRING, userName:STRING>>>, eventTime STRING, eventSource STRING, eventName STRING, awsRegion STRING, sourceIpAddress STRING, userAgent STRING, errorCode STRING, errorMessage STRING, requestId STRING, eventId STRING, resources ARRAY<STRUCT< ARN:STRING, accountId:STRING, type:STRING>>, eventType STRING, apiVersion STRING, readOnly BOOLEAN, recipientAccountId STRING, sharedEventID STRING, vpcEndpointId STRING ) ROW FORMAT SERDE 'com.amazon.emr.hive.serde.CloudTrailSerDe' STORED AS INPUTFORMAT 'com.amazon.emr.cloudtrail.CloudTrailInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' LOCATION 's3://MyLogFiles/AWSLogs/123456789012/';
The following query returns the logins that occurred over a 24-hour period.
select useridentity.username, sourceipaddress, eventtime, additionaleventdata from default.my_cloudtrail_table where eventname = 'ConsoleLogin' and eventtime >= '2017-02-17T00:00:00Z' and eventtime < '2017-02-18T00:00:00Z';