AWSSupport-AnalyzeEMRLogs
Description
This runbook helps identify errors while running a job on an Amazon EMR cluster. The runbook analyzes a list of defined logs on the file system and looks for a list of predefined keywords. These log entries are used to create Amazon CloudWatch Events events so you can take any needed actions based on the events. Optionally, the runbook publishes log entries to the Amazon CloudWatch Logs log group of your choosing. This runbook currently looks for the following errors and patterns in log files:
-
container_out_of_memory: YARN container ran out of memory; the running job may fail.
-
yarn_nodemanager_health: CORE or TASK node is running low on disk space and will not be able to run tasks.
-
node_state_change: CORE or TASK node is unreachable by the MASTER node.
-
step_failure: An EMR Step has failed.
-
no_core_nodes_running: No CORE nodes are currently running, cluster is unhealthy.
-
hdfs_missing_blocks: There are missing HDFS blocks which could lead to data loss.
-
hdfs_high_util: HDFS Utilization is high, which may affect jobs and cluster health.
-
instance_controller_restart: Instance-Controller process has restarted. This process is essential for cluster health.
-
instance_controller_restart_legacy: Instance-Controller process has restarted. This process is essential for cluster health.
-
high_load: High Load Average detected, may affect node health reporting or result in timeouts or slowdowns.
-
yarn_node_blacklisted: CORE or TASK node has been blacklisted by YARN from running tasks.
-
yarn_node_lost: CORE or TASK node has been marked as LOST by YARN, possible connectivity issues.
Instances associated with the ClusterID that you specify must be managed by AWS Systems Manager. You can run this automation once, schedule the automation to run at a specific time interval, or remove a schedule previously created by an automation. This runbook supports Amazon EMR release versions 5.20 to 6.30.
Document type
Automation
Owner
Amazon
Platforms
Linux, macOS, Windows
Parameters
-
AutomationAssumeRole
Type: String
Description: (Optional) The Amazon Resource Name (ARN) of the AWS Identity and Access Management (IAM) role that allows Systems Manager Automation to perform the actions on your behalf. If no role is specified, Systems Manager Automation uses the permissions of the user that starts this runbook.
-
ClusterID
Type: String
Description: (Required) The ID of the cluster whose node logs you want to analyze.
-
Operation
Type: String
Valid values: Run Once | Schedule | Remove Schedule
Description: (Required) The operation to perform on the cluster.
-
IntervalTime
Type: String
Valid values: 5 minutes | 10 minutes | 15 minutes
Description: (Optional) The interval between runs of the automation. This parameter is only applicable if you specify Schedule for the Operation parameter.
-
LogToCloudWatchLogs
Type: String
Valid values: yes | no
Description: (Optional) If you specify yes for the value of this parameter, the automation creates a CloudWatch Logs log group with the name specified in the CloudWatchLogGroup parameter to store any matched log entries.
-
CloudWatchLogGroup
Type: String
Description: (Optional) The name of the CloudWatch Logs log group where you want to store matched log entries. This parameter is only applicable if you specify yes for the LogToCloudWatchLogs parameter.
-
CreateLogInsightsDashboard
Type: String
Valid values: yes | no
Description: (Optional) If you specify yes, a CloudWatch dashboard is created if it does not already exist. This parameter is only applicable if you specify yes for the LogToCloudWatchLogs parameter.
-
CreateMetricFilters
Type: String
Valid values: yes | no
Description: (Optional) Specify yes if you want to create metric filters for the CloudWatch Logs log group. This parameter is only applicable if you specify yes for the LogToCloudWatchLogs parameter.
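The parameter rules above can be sketched in Python. This is a minimal illustration, not part of the runbook: the helper names `build_parameters` and `start_analysis` are hypothetical, and the sketch assumes the `boto3` SDK is installed and credentials are configured when `start_analysis` is called. Automation parameter values are passed as lists of strings.

```python
from typing import Dict, List, Optional

VALID_OPERATIONS = {"Run Once", "Schedule", "Remove Schedule"}
VALID_INTERVALS = {"5 minutes", "10 minutes", "15 minutes"}


def build_parameters(
    cluster_id: str,
    operation: str,
    interval_time: Optional[str] = None,
    log_group: Optional[str] = None,
    create_dashboard: bool = False,
    create_metric_filters: bool = False,
    assume_role_arn: Optional[str] = None,
) -> Dict[str, List[str]]:
    """Build the Parameters map for the runbook (hypothetical helper)."""
    if operation not in VALID_OPERATIONS:
        raise ValueError(f"Unsupported Operation: {operation!r}")

    params: Dict[str, List[str]] = {
        "ClusterID": [cluster_id],
        "Operation": [operation],
    }
    if assume_role_arn:
        params["AutomationAssumeRole"] = [assume_role_arn]

    # IntervalTime only applies when Operation is Schedule.
    if operation == "Schedule" and interval_time is not None:
        if interval_time not in VALID_INTERVALS:
            raise ValueError(f"Unsupported IntervalTime: {interval_time!r}")
        params["IntervalTime"] = [interval_time]

    # The log group, dashboard, and metric filter options only apply
    # when LogToCloudWatchLogs is yes.
    if log_group is not None:
        params["LogToCloudWatchLogs"] = ["yes"]
        params["CloudWatchLogGroup"] = [log_group]
        params["CreateLogInsightsDashboard"] = ["yes" if create_dashboard else "no"]
        params["CreateMetricFilters"] = ["yes" if create_metric_filters else "no"]
    return params


def start_analysis(parameters: Dict[str, List[str]]) -> str:
    """Start the automation and return its execution ID."""
    import boto3  # third-party SDK; imported here so the builder works without it

    ssm = boto3.client("ssm")
    response = ssm.start_automation_execution(
        DocumentName="AWSSupport-AnalyzeEMRLogs",
        Parameters=parameters,
    )
    return response["AutomationExecutionId"]
```

For example, `build_parameters("j-EXAMPLE", "Schedule", interval_time="5 minutes")` produces a map that schedules the analysis every five minutes; the cluster ID shown is a placeholder.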
Required IAM permissions
The AutomationAssumeRole parameter requires the following actions to use the runbook successfully.
-
ssm:StartAutomationExecution
-
ssm:GetDocument
-
ssm:ListDocuments
-
ssm:DescribeAutomationExecutions
-
ssm:DescribeAutomationStepExecutions
-
ssm:GetAutomationExecution
-
ssm:DescribeInstanceInformation
-
ssm:ListCommandInvocations
-
ssm:ListCommands
-
ssm:SendCommand
-
iam:CreateRole
-
iam:DeleteRole
-
iam:GetRolePolicy
-
iam:PutRolePolicy
-
iam:DeleteRolePolicy
-
iam:PassRole
-
cloudformation:DescribeStacks
-
cloudformation:DeleteStack
-
cloudformation:CreateStack
-
events:DeleteRule
-
events:RemoveTargets
-
events:PutTargets
-
events:PutRule
-
events:DescribeRule
-
logs:DescribeLogGroups
-
logs:CreateLogGroup
-
logs:PutMetricFilter
-
cloudwatch:PutDashboard
-
elasticmapreduce:ListInstances
-
elasticmapreduce:DescribeCluster
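The actions listed above could be assembled into an IAM policy document as in the following sketch. The function name `build_policy_document` is hypothetical, and the `"*"` resource is an assumption made for brevity; in practice you would likely scope resources more narrowly.

```python
import json

# Actions copied from the list above. IAM action names are case-insensitive,
# so iam:PassRole and iam:passrole refer to the same action.
REQUIRED_ACTIONS = [
    "ssm:StartAutomationExecution",
    "ssm:GetDocument",
    "ssm:ListDocuments",
    "ssm:DescribeAutomationExecutions",
    "ssm:DescribeAutomationStepExecutions",
    "ssm:GetAutomationExecution",
    "ssm:DescribeInstanceInformation",
    "ssm:ListCommandInvocations",
    "ssm:ListCommands",
    "ssm:SendCommand",
    "iam:CreateRole",
    "iam:DeleteRole",
    "iam:GetRolePolicy",
    "iam:PutRolePolicy",
    "iam:DeleteRolePolicy",
    "iam:PassRole",
    "cloudformation:DescribeStacks",
    "cloudformation:DeleteStack",
    "cloudformation:CreateStack",
    "events:DeleteRule",
    "events:RemoveTargets",
    "events:PutTargets",
    "events:PutRule",
    "events:DescribeRule",
    "logs:DescribeLogGroups",
    "logs:CreateLogGroup",
    "logs:PutMetricFilter",
    "cloudwatch:PutDashboard",
    "elasticmapreduce:ListInstances",
    "elasticmapreduce:DescribeCluster",
]


def build_policy_document(resource: str = "*") -> dict:
    """Return a policy document allowing the listed actions (hypothetical helper)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": REQUIRED_ACTIONS,
                "Resource": resource,  # assumption: unrestricted resource scope
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_policy_document(), indent=2))
```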
Document Steps
- aws:executeAwsApi - Gathers information about the Amazon EMR cluster specified in the ClusterID parameter.
- aws:branch - Branches based on input.
  - If the provided operation is Run Once or Schedule:
    - aws:assertAwsResourceProperty - Verifies the cluster is available.
    - aws:executeAwsApi - Gathers the IDs of all instances running in the cluster.
    - aws:assertAwsResourceProperty - Verifies the SSM Agent is running on all instances in the cluster.
    - aws:branch - Branches based on whether you specified to run the automation once or on a schedule.
      - If the provided operation is Run Once:
        - aws:branch - Branches based on the value specified in the LogToCloudWatchLogs parameter.
          - If the LogToCloudWatchLogs value is yes:
            - aws:executeScript - Checks whether a CloudWatch Logs log group with the name specified in the CloudWatchLogGroup parameter already exists. If not, the group is created with the name specified.
            - aws:branch - Branches based on the value specified in the CreateMetricFilters parameter.
              - If the CreateMetricFilters value is yes:
                - aws:executeAwsApi - 12 steps are run, one for each metric filter.
                - aws:branch - Branches based on the value specified in the CreateLogInsightsDashboard parameter.
                  - If the CreateLogInsightsDashboard value is yes:
                    - aws:executeAwsApi - Creates a CloudWatch dashboard with the same name specified in the CloudWatchLogGroup parameter, if it does not already exist.
                  - If the CreateLogInsightsDashboard value is no:
                    - aws:runCommand - Runs a shell script to find log patterns on each instance in the cluster.
              - If the CreateMetricFilters value is no:
                - aws:branch - Branches based on the value specified in the CreateLogInsightsDashboard parameter.
                  - If the CreateLogInsightsDashboard value is yes:
                    - aws:executeAwsApi - Creates a CloudWatch dashboard with the same name specified in the CloudWatchLogGroup parameter, if it does not already exist.
                  - If the CreateLogInsightsDashboard value is no:
                    - aws:runCommand - Runs a shell script to find log patterns on each instance in the cluster.
          - If the LogToCloudWatchLogs value is no:
            - aws:runCommand - Runs a shell script to find log patterns on each instance in the cluster.
      - If the provided operation is Schedule:
        - aws:createStack - Creates an Amazon EventBridge event that targets this runbook.
  - If the provided operation is Remove Schedule:
    - aws:executeAwsApi - Verifies a schedule exists for the cluster.
    - aws:deleteStack - Deletes the schedule.
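The branching in the step list above can be traced with a small Python sketch. It only models the list as written and does not call AWS. The function name `planned_steps` and the step labels are hypothetical, and the sketch treats the log-pattern script as an aws:runCommand step in every branch, which is an assumption based on the FindLogPatternOnEMRNode.CommandId output. The list does not say whether the script also runs after a dashboard is created, so the sketch does not add it there.

```python
from typing import List


def planned_steps(
    operation: str,
    log_to_cwl: str = "no",
    create_metric_filters: str = "no",
    create_dashboard: str = "no",
) -> List[str]:
    """Return the step actions the list above describes for the given inputs."""
    steps = ["aws:executeAwsApi (gather cluster information)"]

    if operation in ("Run Once", "Schedule"):
        steps += [
            "aws:assertAwsResourceProperty (verify cluster is available)",
            "aws:executeAwsApi (list cluster instance IDs)",
            "aws:assertAwsResourceProperty (verify SSM Agent is running)",
        ]
        if operation == "Schedule":
            steps.append("aws:createStack (create the schedule)")
            return steps

        # Operation is Run Once from here on.
        if log_to_cwl != "yes":
            steps.append("aws:runCommand (find log patterns)")
            return steps

        steps.append("aws:executeScript (ensure log group exists)")
        if create_metric_filters == "yes":
            steps.append("aws:executeAwsApi (create metric filters)")
        # Both CreateMetricFilters branches end in the same dashboard branch.
        if create_dashboard == "yes":
            steps.append("aws:executeAwsApi (create dashboard)")
        else:
            steps.append("aws:runCommand (find log patterns)")
        return steps

    if operation == "Remove Schedule":
        steps += [
            "aws:executeAwsApi (verify schedule exists)",
            "aws:deleteStack (delete the schedule)",
        ]
        return steps

    raise ValueError(f"Unsupported operation: {operation!r}")


if __name__ == "__main__":
    for step in planned_steps("Run Once", log_to_cwl="yes", create_metric_filters="yes"):
        print(step)
```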
Outputs
GetClusterInformation.ClusterName
GetClusterInformation.ClusterState
ListingClusterInstances.InstanceIDs
CreatingScheduleCloudFormationStack.StackStatus
RemovingScheduleByDeletingScheduleCloudFormationStack.StackStatus
CheckIfLogGroupExists.output
FindLogPatternOnEMRNode.CommandId
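One way to read these outputs after a run is through the Systems Manager GetAutomationExecution API, sketched below with `boto3`. The helper names `extract_outputs` and `fetch_outputs` are hypothetical, and the sketch assumes each output appears in the response's Outputs map as a list of strings; outputs from steps that did not run are returned as empty lists.

```python
from typing import Dict, List

OUTPUT_NAMES = [
    "GetClusterInformation.ClusterName",
    "GetClusterInformation.ClusterState",
    "ListingClusterInstances.InstanceIDs",
    "CreatingScheduleCloudFormationStack.StackStatus",
    "RemovingScheduleByDeletingScheduleCloudFormationStack.StackStatus",
    "CheckIfLogGroupExists.output",
    "FindLogPatternOnEMRNode.CommandId",
]


def extract_outputs(response: dict, names: List[str] = OUTPUT_NAMES) -> Dict[str, List[str]]:
    """Pick the listed outputs from a GetAutomationExecution response."""
    outputs = response.get("AutomationExecution", {}).get("Outputs", {})
    return {name: outputs.get(name, []) for name in names}


def fetch_outputs(execution_id: str) -> Dict[str, List[str]]:
    """Call the API and return the runbook outputs (requires boto3 and credentials)."""
    import boto3  # third-party SDK; imported lazily so extract_outputs works without it

    ssm = boto3.client("ssm")
    response = ssm.get_automation_execution(AutomationExecutionId=execution_id)
    return extract_outputs(response)
```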