Monitoring Amazon EMR events with CloudWatch

Amazon EMR tracks events and keeps information about them for up to seven days in the Amazon EMR console. Amazon EMR records events when there is a change in the state of clusters, instance groups, instance fleets, automatic scaling policies, or steps. Events capture the date and time the event occurred, details about the affected elements, and other critical data points.

The following tables list Amazon EMR events, along with the state or state change that each event indicates and the event's severity, event type, event code, and message. Amazon EMR represents events as JSON objects and automatically sends them to an event stream. The JSON object matters when you set up rules for event processing with CloudWatch Events, because a rule matches events by comparing its pattern against the fields of the JSON object. For more information, see Events and event patterns and Amazon EMR events in the Amazon CloudWatch Events User Guide.
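
As a rough illustration of how a rule pattern is compared against an event's JSON object, the following Python sketch implements a simplified matcher: every field named in the pattern must be present in the event, and its value must be one of the values listed in the pattern. The `matches` helper and the sample event are hypothetical. The `source`, `detail-type`, and `detail` field names and values are assumptions about the event envelope that are not spelled out on this page, and real event patterns support more operators than exact-value matching.

```python
def matches(pattern, event):
    """Return True if every field in `pattern` matches `event`.

    Simplified model: a list in the pattern means "any of these values";
    a nested dict means "match inside the nested object".
    """
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


# Hypothetical, trimmed-down event; real events carry more fields.
sample_event = {
    "source": "aws.emr",
    "detail-type": "EMR Cluster State Change",
    "detail": {
        "severity": "CRITICAL",
        "state": "TERMINATED_WITH_ERRORS",
        "clusterId": "j-EXAMPLE",
    },
}

# Pattern that selects clusters that terminated with errors.
pattern = {
    "source": ["aws.emr"],
    "detail-type": ["EMR Cluster State Change"],
    "detail": {"state": ["TERMINATED_WITH_ERRORS"]},
}

print(matches(pattern, sample_event))
```

Because the pattern names only the fields it cares about, extra fields in the event (such as `severity` or `clusterId` here) do not prevent a match.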

Note

To ensure that we provide you with the most pertinent information, we continuously refine our error messages. For that reason, we recommend that you don’t parse the text from the messages to initiate next actions in your workflow.
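
One way to follow this recommendation is to branch on structured fields and treat the message as display text only. The sketch below is a minimal, hypothetical handler: the `severity` and `state` keys are assumed field names in the event detail, and the action names it returns are placeholders for whatever your workflow does next.

```python
def next_action(detail):
    """Choose a follow-up action from structured fields, never from message text."""
    severity = detail.get("severity")
    state = detail.get("state")

    if severity == "CRITICAL":
        return "page-on-call"        # hypothetical action name
    if severity in ("WARN", "WARNING", "ERROR"):
        return "open-ticket"         # hypothetical action name
    if state == "WAITING":
        return "submit-next-step"    # hypothetical action name
    return "ignore"


# The message is carried along for humans to read, but it is never inspected.
detail = {
    "severity": "INFO",
    "state": "WAITING",
    "message": "Amazon EMR cluster ... is ready for use.",
}
print(next_action(detail))
```

If the wording of a message changes, a handler written this way keeps working because it never reads the `message` value.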

Cluster start events

State or state change Severity Event type Event code Message
CREATING WARN

Amazon EMR instance fleet provisioning

EC2 provisioning - Insufficient Instance Capacity

We are not able to create your Amazon EMR cluster ClusterId (ClusterName) for Instance Fleet InstanceFleetID Amazon EC2 has insufficient Spot capacity for Instance type [Instancetype1, Instancetype2] and insufficient On-Demand capacity for Instance type [Instancetype3, Instancetype4] in Availability Zone [AvailabilityZone1, AvailabilityZone2]. Check here documentation for more information on how to respond to this event.

CREATING WARN

Amazon EMR instance group provisioning

EC2 provisioning - Insufficient Instance Capacity

We are not able to create your Amazon EMR cluster ClusterId (ClusterName) for Instance Group InstancegroupID Amazon EC2 has insufficient [Spot or On-Demand] capacity for Instance type Instancetype in Availability Zone AvailabilityZone. Check here documentation for more information on how to respond to this event.

STARTING INFO

EMR cluster state change

none

Amazon EMR cluster ClusterId (ClusterName) was requested at Time and is being created.

STARTING INFO

EMR cluster state change

none

Note

Applies only to clusters with the instance fleets configuration and multiple Availability Zones selected within Amazon EC2.

Amazon EMR cluster ClusterId (ClusterName) is being created in zone (AvailabilityZoneID), which was chosen from the specified Availability Zone options.

STARTING INFO

EMR cluster state change

none

Amazon EMR cluster ClusterId (ClusterName) began running steps at Time.

WAITING INFO

EMR cluster state change

none

Amazon EMR cluster ClusterId (ClusterName) was created at Time and is ready for use.

- or -

Amazon EMR cluster ClusterId (ClusterName) finished running all pending steps at Time.

Note

A cluster in the WAITING state may still be processing jobs.

Note

Events with the event code EC2 provisioning - Insufficient Instance Capacity are emitted periodically when your EMR cluster encounters an insufficient capacity error from Amazon EC2 for your instance fleet or instance group during cluster creation or a resize operation. For information on how to respond to these events, see Responding to Amazon EMR cluster insufficient instance capacity events.

Cluster termination events

State or state change Severity Event type Event code Message
TERMINATED

The severity depends on the reason for the state change, as shown in the following:

  • CRITICAL if the cluster terminated with any of the following state change reasons: INTERNAL_ERROR, VALIDATION_ERROR, INSTANCE_FAILURE, BOOTSTRAP_FAILURE, or STEP_FAILURE.

  • INFO if the cluster terminated with any of the following state change reasons: USER_REQUEST or ALL_STEPS_COMPLETED.

EMR cluster state change

none

Amazon EMR Cluster ClusterId (ClusterName) has terminated at Time with a reason of StateChangeReason:Code.

TERMINATED_WITH_ERRORS CRITICAL

EMR cluster state change

none

Amazon EMR Cluster ClusterId (ClusterName) has terminated with errors at Time with a reason of StateChangeReason:Code.
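
The severity rules in the table above can be written as a small lookup. This is a sketch of the table's logic only, not Amazon EMR code: the function name is hypothetical, and returning `None` for a reason code that the table does not list is a choice made for this example.

```python
# State change reasons taken from the table above.
CRITICAL_REASONS = {
    "INTERNAL_ERROR",
    "VALIDATION_ERROR",
    "INSTANCE_FAILURE",
    "BOOTSTRAP_FAILURE",
    "STEP_FAILURE",
}
INFO_REASONS = {"USER_REQUEST", "ALL_STEPS_COMPLETED"}


def termination_severity(state, reason_code):
    """Return the severity the table assigns to a cluster termination event."""
    if state == "TERMINATED_WITH_ERRORS":
        return "CRITICAL"
    if state != "TERMINATED":
        raise ValueError(f"not a termination state: {state}")
    if reason_code in CRITICAL_REASONS:
        return "CRITICAL"
    if reason_code in INFO_REASONS:
        return "INFO"
    return None  # reason code not covered by the table


print(termination_severity("TERMINATED", "USER_REQUEST"))
print(termination_severity("TERMINATED", "BOOTSTRAP_FAILURE"))
```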

Instance fleet state-change events

Note

The instance fleets configuration is available only in Amazon EMR releases 4.8.0 and later, excluding 5.0.0 and 5.0.3.

State or state change Severity Event type Event code Message

From PROVISIONING to WAITING

INFO none

Provisioning for instance fleet InstanceFleetID in Amazon EMR cluster ClusterId (ClusterName) is complete. Provisioning started at Time and took Num minutes. The instance fleet now has On-Demand capacity of Num and Spot capacity of Num. Target On-Demand capacity was Num, and target Spot capacity was Num.

From WAITING to RESIZING

INFO none

A resize for instance fleet InstanceFleetID in Amazon EMR cluster ClusterId (ClusterName) started at Time. The instance fleet is resizing from an On-Demand capacity of Num to a target of Num, and from a Spot capacity of Num to a target of Num.

From RESIZING to WAITING

INFO none

The resizing operation for instance fleet InstanceFleetID in Amazon EMR cluster ClusterId (ClusterName) is complete. The resize started at Time and took Num minutes. The instance fleet now has On-Demand capacity of Num and Spot capacity of Num. Target On-Demand capacity was Num and target Spot capacity was Num.

From RESIZING to WAITING

INFO none

The resizing operation for instance fleet InstanceFleetID in Amazon EMR cluster ClusterId (ClusterName) has reached the timeout and stopped. The resize started at Time and stopped after Num minutes. The instance fleet now has On-Demand capacity of Num and Spot capacity of Num. Target On-Demand capacity was Num and target Spot capacity was Num.

SUSPENDED ERROR none

Instance fleet InstanceFleetID in Amazon EMR cluster ClusterId (ClusterName) was arrested at Time for the following reason: ReasonDesc.

RESIZING WARNING none

The resizing operation for instance fleet InstanceFleetID in Amazon EMR cluster ClusterId (ClusterName) is stuck for the following reason: ReasonDesc.

WAITING or RUNNING

INFO none

The resizing operation for instance fleet InstanceFleetID in Amazon EMR cluster ClusterId (ClusterName) couldn't complete while Amazon EMR added Spot capacity in availability zone AvailabilityZone. We've cancelled your request to provision additional Spot capacity. For recommended actions, check Best practices for instance and Availability Zone flexibility and try again.

WAITING or RUNNING

INFO none

A resizing operation for instance fleet InstanceFleetID in Amazon EMR cluster ClusterId (ClusterName) was initiated by Entity at Time.

Instance fleet resize events

Event type Severity Event code Message

Amazon EMR instance fleet resize

ERROR

Spot Provisioning timeout

The Resize operation for Instance Fleet InstanceFleetID in Amazon EMR cluster ClusterId (ClusterName) was not able to complete while acquiring Spot capacity in AZ AvailabilityZone. We have now cancelled your request and stopped trying to provision any additional Spot capacity and the Instance Fleet has provisioned Spot capacity of num. Target Spot capacity was num. For more information and recommended actions, please check the documentation page here and retry again.

Amazon EMR instance fleet resize

ERROR

On-Demand Provisioning timeout

The Resize operation for Instance Fleet InstanceFleetID in Amazon EMR cluster ClusterId (ClusterName) was not able to complete while acquiring On-Demand capacity in AZ AvailabilityZone. We have now cancelled your request and stopped trying to provision any additional On-Demand capacity and the Instance Fleet has provisioned On-Demand capacity of num. Target On-Demand capacity was num. For more information and recommended actions, please check the documentation page here and retry again.

Amazon EMR instance fleet resize

WARNING

EC2 provisioning - Insufficient Instance Capacity

We are not able to complete the resize operation for Instance Fleet InstanceFleetID in EMR cluster ClusterId (ClusterName) as Amazon EC2 has insufficient Spot capacity for Instance types [Instancetype1, Instancetype2] and insufficient On-Demand capacity for Instance types [Instancetype3, Instancetype4] in Availability Zone [AvailabilityZone1]. So far, the instance fleet has provisioned On-Demand capacity of num and target On-Demand capacity was num. Provisioned Spot capacity is num and target Spot capacity was num. Check here documentation for more information on how to respond to this event.

Amazon EMR instance fleet resize

WARNING

Spot Provisioning Timeout - Continuing Resize

We're still provisioning Spot capacity for the Instance Fleet resize operation that initiated at time for instance fleet ID InstanceFleetID in Amazon EMR cluster ClusterId (ClusterName) for [Instancetype1, Instancetype2] in AZ AvailabilityZone. For the previous resize operation that initiated at time, the timeout period expired, so Amazon EMR stopped provisioning Spot capacity after adding num of the requested num instances to your instance fleet. For more information, please check the documentation page here.

Amazon EMR instance fleet resize

WARNING

On-Demand Provisioning Timeout - Continuing Resize

We're still provisioning On-Demand capacity for the Instance Fleet resize operation that initiated at time for instance fleet ID InstanceFleetID in Amazon EMR cluster ClusterId (ClusterName) for [Instancetype1, Instancetype2] in AZ AvailabilityZone. For the previous resize operation that initiated at time, the timeout period expired, so Amazon EMR stopped provisioning On-Demand capacity after adding num of the requested num instances to your instance fleet. For more information, please check the documentation page here.

Note

The provisioning timeout events are emitted when Amazon EMR stops provisioning Spot or On-Demand capacity for the fleet after the timeout expires. For information on how to respond to these events, see Responding to Amazon EMR cluster instance fleet resize timeout events.

Instance group events

Event type Severity Event code Message

From RESIZING to RUNNING

INFO none

The resizing operation for instance group InstanceGroupID in Amazon EMR cluster ClusterId (ClusterName) is complete. It now has an instance count of Num. The resize started at Time and took Num minutes to complete.

From RUNNING to RESIZING

INFO none

A resize for instance group InstanceGroupID in Amazon EMR cluster ClusterId (ClusterName) started at Time. It is resizing from an instance count of Num to Num.

SUSPENDED ERROR none

Instance group InstanceGroupID in Amazon EMR cluster ClusterId (ClusterName) was arrested at Time for the following reason: ReasonDesc.

RESIZING WARNING none

The resizing operation for instance group InstanceGroupID in Amazon EMR cluster ClusterId (ClusterName) is stuck for the following reason: ReasonDesc.

Amazon EMR instance group resize

WARNING EC2 provisioning - Insufficient Instance Capacity

We are not able to complete the resize operation that started at time for Instance Group InstanceGroupID in EMR cluster ClusterId (ClusterName) as Amazon EC2 has insufficient Spot/On Demand capacity for Instance type [Instancetype] in Availability Zone [AvailabilityZone1]. So far, the instance group has a running instance count of num and requested instance count was num. Check here documentation for more information on how to respond to this event.

From RUNNING to RESIZING

INFO none

A resize for instance group InstanceGroupID in Amazon EMR cluster ClusterId (ClusterName) was initiated by Entity at Time.

Note

With Amazon EMR version 5.21.0 and later, you can override cluster configurations and specify additional configuration classifications for each instance group in a running cluster. You do this by using the Amazon EMR console, the AWS Command Line Interface (AWS CLI), or the AWS SDK. For more information, see Supplying a Configuration for an Instance Group in a Running Cluster.

The following table lists Amazon EMR events for the reconfiguration operation, along with the state or state change that the event indicates, the severity of the event, and event messages.

State or state change Severity Message
RUNNING INFO

A reconfiguration for instance group InstanceGroupID in the Amazon EMR cluster ClusterId (ClusterName) was initiated by user at Time. Version of requested configuration is Num.

From RECONFIGURING to RUNNING

INFO

The reconfiguration operation for instance group InstanceGroupID in the Amazon EMR cluster ClusterId (ClusterName) is complete. The reconfiguration started at Time and took Num minutes to complete. Current configuration version is Num.

From RUNNING to RECONFIGURING

INFO

A reconfiguration for instance group InstanceGroupID in the Amazon EMR cluster ClusterId (ClusterName) started at Time. It is configuring from version number Num to version number Num.

RESIZING INFO

Reconfiguring operation towards configuration version Num for instance group InstanceGroupID in the Amazon EMR cluster ClusterId (ClusterName) is temporarily blocked at Time because instance group is in State.

RECONFIGURING INFO

Resizing operation towards instance count Num for instance group InstanceGroupID in the Amazon EMR cluster ClusterId (ClusterName) is temporarily blocked at Time because the instance group is in State.

RECONFIGURING WARNING

The reconfiguration operation for instance group InstanceGroupID in the Amazon EMR cluster ClusterId (ClusterName) failed at Time and took Num minutes to fail. Failed configuration version is Num.

RECONFIGURING INFO

Configurations are reverting to the previous successful version number Num for instance group InstanceGroupID in the Amazon EMR cluster ClusterId (ClusterName) at Time. New configuration version is Num.

From RECONFIGURING to RUNNING

INFO

Configurations were successfully reverted to the previous successful version Num for instance group InstanceGroupID in the Amazon EMR cluster ClusterId (ClusterName) at Time. New configuration version is Num.

From RECONFIGURING to SUSPENDED

CRITICAL

Failed to revert to the previous successful version Num for Instance group InstanceGroupID in the Amazon EMR cluster ClusterId (ClusterName) at Time.

Automatic scaling policy events

State or state change Severity Message
PENDING INFO

An Auto Scaling policy was added to instance group InstanceGroupID in Amazon EMR cluster ClusterId (ClusterName) at Time. The policy is pending attachment.

- or -

The Auto Scaling policy for instance group InstanceGroupID in Amazon EMR cluster ClusterId (ClusterName) was updated at Time. The policy is pending attachment.

ATTACHED INFO

The Auto Scaling policy for instance group InstanceGroupID in Amazon EMR cluster ClusterId (ClusterName) was attached at Time.

DETACHED INFO

The Auto Scaling policy for instance group InstanceGroupID in Amazon EMR cluster ClusterId (ClusterName) was detached at Time.

FAILED ERROR

The Auto Scaling policy for instance group InstanceGroupID in Amazon EMR cluster ClusterId (ClusterName) could not attach and failed at Time.

- or -

The Auto Scaling policy for instance group InstanceGroupID in Amazon EMR cluster ClusterId (ClusterName) could not detach and failed at Time.

Step events

State or state change Severity Message
PENDING INFO

Step StepID (StepName) was added to Amazon EMR cluster ClusterId (ClusterName) at Time and is pending execution.

CANCEL_PENDING WARN

Step StepID (StepName) in Amazon EMR cluster ClusterId (ClusterName) was cancelled at Time and is pending cancellation.

RUNNING INFO

Step StepID (StepName) in Amazon EMR cluster ClusterId (ClusterName) started running at Time.

COMPLETED INFO

Step StepID (StepName) in Amazon EMR cluster ClusterId (ClusterName) completed execution at Time. The step started running at Time and took Num minutes to complete.

CANCELLED WARN

Cancellation request has succeeded for cluster step StepID (StepName) in Amazon EMR cluster ClusterId (ClusterName) at Time, and the step is now cancelled.

FAILED ERROR

Step StepID (StepName) in Amazon EMR cluster ClusterId (ClusterName) failed at Time.

Unhealthy node replacement events

Event type Severity Event code Message

Amazon EMR unhealthy node replacement

INFO

Unhealthy core node detected

Amazon EMR has identified that core instance [instanceID (InstanceName)] in InstanceGroup/Fleet in the Amazon EMR cluster clusterID (ClusterName) is UNHEALTHY. Amazon EMR will attempt to recover or gracefully replace the UNHEALTHY instance.

Amazon EMR unhealthy node replacement

INFO

Core node unhealthy - replacement disabled

Amazon EMR has identified that core instance [instanceID (InstanceName)] in InstanceGroup/Fleet in the Amazon EMR cluster clusterID (ClusterName) is UNHEALTHY. Turn on graceful unhealthy core node replacement in your cluster to let Amazon EMR gracefully replace the UNHEALTHY instances in the event that they can’t be recovered.

Amazon EMR unhealthy node replacement

WARN

Unhealthy core node not replaced

Amazon EMR can't replace your UNHEALTHY core instance [instanceID (InstanceName)] in InstanceGroup/Fleet in the Amazon EMR cluster clusterID (ClusterName) because of reason.

Note

The reason that Amazon EMR can't replace your core node depends on your scenario. For example, Amazon EMR can't delete a node if doing so would leave the cluster without any core nodes.

Amazon EMR unhealthy node replacement

INFO

Unhealthy core node recovered

Amazon EMR has recovered your UNHEALTHY core instances [instanceID (InstanceName)] in InstanceGroup/Fleet in the Amazon EMR cluster clusterID (ClusterName).

For more information about unhealthy node replacement, see Replacing unhealthy nodes.

Viewing events with the Amazon EMR console

For each cluster, you can view a simple list of events in the details pane, which lists events in descending order of occurrence. You can also view all events for all clusters in a Region in descending order of occurrence.

If you don't want a user to see all cluster events for a Region, add a statement that denies permission ("Effect": "Deny") for the elasticmapreduce:ViewEventsFromAllClustersInConsole action to a policy that is attached to the user.
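
A policy containing such a statement might look like the document that the following Python sketch builds and prints as JSON. The `Version` value and the `"Resource": "*"` element are assumptions added to make the policy document complete; only the effect and the action name come from the text above. Adapt the statement to your own policy before you attach it.

```python
import json

# Deny statement for the all-clusters event view in the console.
# "Version" and "Resource" are assumptions for this sketch.
deny_all_cluster_events = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Deny",
            "Action": "elasticmapreduce:ViewEventsFromAllClustersInConsole",
            "Resource": "*",
        }
    ],
}

policy_json = json.dumps(deny_all_cluster_events, indent=2)
print(policy_json)
```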

Note

We’ve redesigned the Amazon EMR console to make it easier to use. See What's new with the console? to learn about the differences between the old and new console experiences.

New console
To view events for all clusters in a Region with the new console
  1. Sign in to the AWS Management Console, and open the Amazon EMR console at https://console.aws.amazon.com/emr.

  2. Under EMR on EC2 in the left navigation pane, choose Events.

To view events for a particular cluster with the new console
  1. Sign in to the AWS Management Console, and open the Amazon EMR console at https://console.aws.amazon.com/emr.

  2. Under EMR on EC2 in the left navigation pane, choose Clusters, and then choose a cluster.

  3. To view all of your events, select the Events tab on the cluster details page.

Old console
To view events for all clusters in a Region with the old console
  1. Open the Amazon EMR console at https://console.aws.amazon.com/elasticmapreduce/.

  2. Choose Events.

To view events for a particular cluster with the old console
  1. Open the Amazon EMR console at https://console.aws.amazon.com/elasticmapreduce/.

  2. Choose Cluster List, select a cluster, and then choose View details.

  3. Choose Events in the cluster details pane.