Amazon Elastic MapReduce
Developer Guide (API Version 2009-03-31)

Monitor Metrics with CloudWatch

When you’re running a cluster, you often want to track its progress and health. Amazon EMR records metrics that can help you monitor your cluster. It makes these metrics available in the Amazon EMR console and in the CloudWatch console, where you can track them with your other AWS metrics. In CloudWatch, you can set alarms to warn you if a metric goes outside the parameters you specify.

Metrics are updated every five minutes. This interval is not configurable. Metrics are archived for two weeks; after that period, the data is discarded.

These metrics are automatically collected and pushed to CloudWatch for every Amazon EMR cluster. There is no charge for the Amazon EMR metrics reported in CloudWatch; they are provided as part of the Amazon EMR service.

Note

Viewing Amazon EMR metrics in CloudWatch is supported only for clusters launched with AMI 2.0.3 or later and running Hadoop 0.20.205 or later. For more information about selecting the AMI version for your cluster, see Choose an Amazon Machine Image (AMI).

How Do I Use Amazon EMR Metrics?

The metrics reported by Amazon EMR provide information that you can analyze in different ways. The table below shows some common uses for the metrics. These are suggestions to get you started, not a comprehensive list. For the complete list of metrics reported by Amazon EMR, see Metrics Reported by Amazon EMR in CloudWatch.

How do I? | Relevant Metrics
Track the progress of my cluster | Look at the RunningMapTasks, RemainingMapTasks, RunningReduceTasks, and RemainingReduceTasks metrics.
Detect clusters that are idle | The IsIdle metric tracks whether a cluster is live, but not currently running tasks. You can set an alarm to fire when the cluster has been idle for a given period of time, such as thirty minutes.
Detect when a node runs out of storage | The HDFSUtilization metric is the percentage of disk space currently used. If this rises above an acceptable level for your application, such as 80% of capacity used, you may need to resize your cluster and add more core nodes.
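
As a rough illustration of the idle-cluster check in the table above, the following sketch takes a series of IsIdle samples, one per five-minute update, and reports whether the newest samples cover thirty minutes of continuous idleness. The function name and the list-of-values input are assumptions made for this example; in practice a CloudWatch alarm would perform this evaluation for you.

```python
# Hypothetical helper for illustration; not part of Amazon EMR or CloudWatch.
PERIOD_MINUTES = 5  # Amazon EMR updates its metrics every five minutes


def idle_for_at_least(samples, minutes):
    """Return True if the newest IsIdle samples are all 1 for `minutes`.

    `samples` is a list of IsIdle values (1 = idle, 0 = busy), ordered
    oldest to newest, with one value per five-minute update.
    """
    needed = -(-minutes // PERIOD_MINUTES)  # ceiling division
    if needed <= 0 or len(samples) < needed:
        return False
    return all(value == 1 for value in samples[-needed:])


if __name__ == "__main__":
    recent = [0, 1, 1, 1, 1, 1, 1]  # busy once, then idle for six checks
    print(idle_for_at_least(recent, 30))
```

Requiring several consecutive idle samples, rather than a single one, avoids reacting to a cluster that merely happened to be between tasks when it was checked.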

Access CloudWatch Metrics

There are many ways to access the metrics that Amazon EMR pushes to CloudWatch. You can view them through either the Amazon EMR console or CloudWatch console, or you can retrieve them using the CloudWatch CLI or the CloudWatch API. The following procedures show you how to access the metrics using these various tools.

To view metrics in the Amazon EMR console

  1. Sign in to the AWS Management Console and open the Amazon EMR console at https://console.aws.amazon.com/elasticmapreduce/.

  2. To view metrics for a cluster, click a cluster to display the Summary pane.

  3. Select the Monitoring tab to view information about that cluster. Click any one of the tabs named Cluster Status, Map/Reduce, Node Status, IO, or HBase to load the reports about the progress and health of the cluster.

  4. After you choose a metric to view, click the Time range field to filter the metrics to a specific time frame.

To view metrics in the CloudWatch console

  1. Sign in to the AWS Management Console and open the CloudWatch console at https://console.aws.amazon.com/cloudwatch/.

  2. In the navigation pane, click EMR.

  3. Scroll down to the metric to graph. You can search on the cluster identifier of the cluster to monitor.

  4. Click a metric to display the graph.

To access metrics from the CloudWatch CLI

To access metrics from the CloudWatch API
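
One possible way to retrieve Amazon EMR metrics programmatically is sketched below in Python. It is an illustration under stated assumptions, not the procedure from this guide: it assumes the boto3 SDK and its `get_metric_statistics` call, the `AWS/ElasticMapReduce` namespace, and the ClusterId dimension name as listed in the dimensions table of this guide. The helper names and the cluster identifier are hypothetical. Building the request is kept separate from sending it, so you can inspect the parameters without AWS credentials.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the CloudWatch namespace used for Amazon EMR metrics.
EMR_NAMESPACE = "AWS/ElasticMapReduce"


def build_metric_request(cluster_id, metric_name, hours=1, now=None):
    """Build keyword arguments for a GetMetricStatistics request."""
    end = now or datetime.now(timezone.utc)
    return {
        "Namespace": EMR_NAMESPACE,
        "MetricName": metric_name,
        # Assumption: dimension name taken from this guide's dimensions
        # table; verify it against what your CloudWatch console shows.
        "Dimensions": [{"Name": "ClusterId", "Value": cluster_id}],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
        "Period": 300,  # seconds; matches the five-minute update interval
        "Statistics": ["Average"],
    }


def fetch_datapoints(request):
    """Send the request with boto3 (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder works without boto3

    client = boto3.client("cloudwatch")
    response = client.get_metric_statistics(**request)
    return sorted(response["Datapoints"], key=lambda point: point["Timestamp"])


if __name__ == "__main__":
    # "j-EXAMPLE" is a placeholder, not a real cluster identifier.
    request = build_metric_request("j-EXAMPLE", "HDFSUtilization", hours=2)
    print(request["StartTime"], request["EndTime"])
```

Because metrics are archived for two weeks, a StartTime older than that would not be expected to return data.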

Setting Alarms on Metrics

Amazon EMR pushes metrics to CloudWatch, which means you can use CloudWatch to set alarms on your Amazon EMR metrics. You can, for example, configure an alarm in CloudWatch to send you an email any time the HDFS utilization rises above 80%.

The following topics give you a high-level overview of how to set alarms using CloudWatch. For detailed instructions, see Using CloudWatch in the Amazon CloudWatch Developer Guide.

Set alarms using the CloudWatch console

  1. Sign in to the AWS Management Console and open the CloudWatch console at https://console.aws.amazon.com/cloudwatch/.

  2. Click the Create Alarm button. This launches the Create Alarm Wizard.

  3. Click EMR Metrics and scroll through the Amazon EMR metrics to locate the metric you want to place an alarm on. An easy way to display just the Amazon EMR metrics in this dialog box is to search on the cluster identifier of your cluster. Select the metric to create an alarm on and click Next.

  4. Fill in the Name, Description, Threshold, and Time values for the metric.

  5. If you want CloudWatch to send you an email when the alarm state is reached, in the Whenever this alarm: field, choose State is ALARM. In the Send notification to: field, choose an existing SNS topic. If you select Create topic, you can set the name and email addresses for a new email subscription list. This list is saved and appears in the field for future alarms.

    Note

    If you use Create topic to create a new Amazon SNS topic, the email addresses must be verified before they receive notifications. Emails are only sent when the alarm enters an alarm state. If this alarm state change happens before the email addresses are verified, they do not receive a notification.

  6. At this point, the Define Alarm screen gives you a chance to review the alarm you’re about to create. Click Create Alarm.

Note

For more information about how to set alarms using the CloudWatch console, see Create an Alarm that Sends Email in the Amazon CloudWatch Developer Guide.

To set an alarm using the CloudWatch CLI

To set an alarm using the CloudWatch API
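
The following Python sketch shows one way such an alarm might be defined through the API, using the HDFS utilization example from earlier in this section. It assumes the boto3 SDK and its `put_metric_alarm` call, the `AWS/ElasticMapReduce` namespace, and the ClusterId dimension name from this guide. The helper names, the alarm naming scheme, the cluster identifier, and the SNS topic ARN are all hypothetical placeholders.

```python
def build_hdfs_alarm(cluster_id, topic_arn, threshold_percent=80.0):
    """Build keyword arguments for a PutMetricAlarm request."""
    return {
        # Hypothetical naming scheme; any unique alarm name works.
        "AlarmName": f"emr-{cluster_id}-hdfs-utilization",
        "AlarmDescription": "HDFS utilization above threshold",
        "Namespace": "AWS/ElasticMapReduce",  # assumption, see lead-in
        "MetricName": "HDFSUtilization",
        # Assumption: dimension name as listed in this guide.
        "Dimensions": [{"Name": "ClusterId", "Value": cluster_id}],
        "Statistic": "Average",
        "Period": 300,  # seconds; one five-minute metric update
        "EvaluationPeriods": 1,
        "Threshold": float(threshold_percent),
        "ComparisonOperator": "GreaterThanThreshold",
        # The SNS topic that receives the notification.
        "AlarmActions": [topic_arn],
    }


def create_alarm(alarm):
    """Create the alarm with boto3 (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder works without boto3

    boto3.client("cloudwatch").put_metric_alarm(**alarm)


if __name__ == "__main__":
    alarm = build_hdfs_alarm(
        "j-EXAMPLE",  # placeholder cluster identifier
        "arn:aws:sns:us-east-1:123456789012:example-topic",  # placeholder
    )
    print(alarm["AlarmName"])
```

For an idle-cluster alarm on IsIdle, raising EvaluationPeriods (for example to 6, which covers thirty minutes) would follow the guidance about consecutive checks.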

Metrics Reported by Amazon EMR in CloudWatch

The following tables list all of the metrics that Amazon EMR reports in the console and pushes to CloudWatch.

Amazon EMR Metrics for Hadoop 1 AMIs

Amazon EMR sends data for several metrics to CloudWatch. All Amazon EMR clusters automatically send metrics in five-minute intervals. Metrics are archived for two weeks; after that period, the data is discarded.

Note

Amazon EMR pulls metrics from a cluster. If a cluster becomes unreachable, no metrics are reported until the cluster becomes available again.

Metric | Description
Cluster Status

Is Idle?

Indicates that a cluster is no longer performing work, but is still alive and accruing charges. It is set to 1 if no tasks are running and no jobs are running, and set to 0 otherwise. This value is checked at five-minute intervals and a value of 1 indicates only that the cluster was idle when checked, not that it was idle for the entire five minutes. To avoid false positives, you should raise an alarm when this value has been 1 for more than one consecutive five-minute check. For example, you might raise an alarm on this value if it has been 1 for thirty minutes or longer.

Use case: Monitor cluster performance

Units: Boolean

Jobs Running

The number of jobs in the cluster that are currently running.

Use case: Monitor cluster health

Units: Count

Jobs Failed

The number of jobs in the cluster that have failed.

Use case: Monitor cluster health

Units: Count

Map/Reduce

Map Tasks Running

The number of running map tasks for each job. If you have a scheduler installed and multiple jobs running, multiple graphs are generated.

Use case: Monitor cluster progress

Units: Count

Map Tasks Remaining

The number of remaining map tasks for each job. If you have a scheduler installed and multiple jobs running, multiple graphs are generated. A remaining map task is one that is not in any of the following states: Running, Killed, or Completed.

Use case: Monitor cluster progress

Units: Count

Map Slots Open

The unused map task capacity. This is calculated as the maximum number of map tasks for a given cluster, less the total number of map tasks currently running in that cluster.

Use case: Analyze cluster performance

Units: Count

Remaining Map Tasks Per Slot

The ratio of the total map tasks remaining to the total map slots available in the cluster.

Use case: Analyze cluster performance

Units: Ratio

Reduce Tasks Running

The number of running reduce tasks for each job. If you have a scheduler installed and multiple jobs running, multiple graphs are generated.

Use case: Monitor cluster progress

Units: Count

Reduce Tasks Remaining

The number of remaining reduce tasks for each job. If you have a scheduler installed and multiple jobs running, multiple graphs are generated.

Use case: Monitor cluster progress

Units: Count

Reduce Slots Open

Unused reduce task capacity. This is calculated as the maximum reduce task capacity for a given cluster, less the number of reduce tasks currently running in that cluster.

Use case: Analyze cluster performance

Units: Count

Node Status

Core Nodes Running

The number of core nodes working. Data points for this metric are reported only when a corresponding instance group exists.

Use case: Monitor cluster health

Units: Count

Core Nodes Pending

The number of core nodes waiting to be assigned. All of the core nodes requested may not be immediately available; this metric reports the pending requests. Data points for this metric are reported only when a corresponding instance group exists.

Use case: Monitor cluster health

Units: Count

Live Data Nodes

The percentage of data nodes that are receiving work from Hadoop.

Use case: Monitor cluster health

Units: Percent

Task Nodes Running

The number of task nodes working. Data points for this metric are reported only when a corresponding instance group exists.

Use case: Monitor cluster health

Units: Count

Task Nodes Pending

The number of task nodes waiting to be assigned. All of the task nodes requested may not be immediately available; this metric reports the pending requests. Data points for this metric are reported only when a corresponding instance group exists.

Use case: Monitor cluster health

Units: Count

Live Task Trackers

The percentage of task trackers that are functional.

Use case: Monitor cluster health

Units: Percent

IO

S3 Bytes Written

The number of bytes written to Amazon S3.

Use case: Analyze cluster performance, Monitor cluster progress

Units: Bytes

S3 Bytes Read

The number of bytes read from Amazon S3.

Use case: Analyze cluster performance, Monitor cluster progress

Units: Bytes

HDFS Utilization

The percentage of HDFS storage currently used.

Use case: Analyze cluster performance

Units: Percent

HDFS Bytes Read

The number of bytes read from HDFS.

Use case: Analyze cluster performance, Monitor cluster progress

Units: Bytes

HDFS Bytes Written

The number of bytes written to HDFS.

Use case: Analyze cluster performance, Monitor cluster progress

Units: Bytes

Missing Blocks

The number of blocks for which HDFS has no replicas. These might be corrupt blocks.

Use case: Monitor cluster health

Units: Count

Total Load

The total number of concurrent data transfers.

Use case: Monitor cluster health

Units: Count

HBase

Backup Failed

Whether the last backup failed. This is set to 0 by default and updated to 1 if the previous backup attempt failed. This metric is only reported for HBase clusters.

Use case: Monitor HBase backups

Units: Count

Most Recent Backup Duration

The amount of time it took the previous backup to complete. This metric is set regardless of whether the last completed backup succeeded or failed. While the backup is ongoing, this metric returns the number of minutes elapsed since the backup started. This metric is only reported for HBase clusters.

Use case: Monitor HBase backups

Units: Minutes

Time Since Last Successful Backup

The number of minutes elapsed since the last successful HBase backup started on your cluster. This metric is only reported for HBase clusters.

Use case: Monitor HBase backups

Units: Minutes
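
The Map Slots Open and Remaining Map Tasks Per Slot descriptions in the table above amount to simple arithmetic, which the following Python sketch works through. The function names and sample numbers are hypothetical. The sketch also assumes that "total map slots available" means the cluster's total map slot capacity, which the table does not state explicitly.

```python
# Hypothetical helpers that mirror the metric descriptions; Amazon EMR
# computes these values itself, so this is only a worked example.


def map_slots_open(max_map_tasks, running_map_tasks):
    """Unused map capacity: maximum map tasks less running map tasks."""
    return max_map_tasks - running_map_tasks


def remaining_map_tasks_per_slot(remaining_map_tasks, total_map_slots):
    """Ratio of map tasks remaining to map slots in the cluster."""
    if total_map_slots <= 0:
        raise ValueError("total_map_slots must be positive")
    return remaining_map_tasks / total_map_slots


if __name__ == "__main__":
    # Sample numbers: 40 map slots, 28 tasks running, 120 tasks remaining.
    print(map_slots_open(40, 28))                 # prints 12
    print(remaining_map_tasks_per_slot(120, 40))  # prints 3.0
```

A ratio well above 1 suggests the remaining map work will take several rounds of slot usage to finish, which is one signal that more capacity might shorten the job.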

Amazon EMR Metrics for Hadoop 2 AMIs

The following metrics are available for Hadoop 2 AMIs:

Metric | Description
Cluster Status

Is Idle?

Indicates that a cluster is no longer performing work, but is still alive and accruing charges. It is set to 1 if no tasks are running and no jobs are running, and set to 0 otherwise. This value is checked at five-minute intervals and a value of 1 indicates only that the cluster was idle when checked, not that it was idle for the entire five minutes. To avoid false positives, you should raise an alarm when this value has been 1 for more than one consecutive five-minute check. For example, you might raise an alarm on this value if it has been 1 for thirty minutes or longer.

Use case: Monitor cluster performance

Units: Boolean

Container Allocated

The number of resource containers allocated by the ResourceManager.

Use case: Monitor cluster progress

Units: Count

Container Reserved

The number of containers reserved.

Use case: Monitor cluster progress

Units: Count

Container Pending

The number of containers in the queue that have not yet been allocated.

Use case: Monitor cluster progress

Units: Count

Apps Completed

The number of applications submitted to YARN that have completed.

Use case: Monitor cluster progress

Units: Count

Apps Failed

The number of applications submitted to YARN that have failed to complete.

Use case: Monitor cluster progress, Monitor cluster health

Units: Count

Apps Killed

The number of applications submitted to YARN that have been killed.

Use case: Monitor cluster progress, Monitor cluster health

Units: Count

Apps Pending

The number of applications submitted to YARN that are in a pending state.

Use case: Monitor cluster progress

Units: Count

Apps Running

The number of applications submitted to YARN that are running.

Use case: Monitor cluster progress

Units: Count

Apps Submitted

The number of applications submitted to YARN.

Use case: Monitor cluster progress

Units: Count

Node Status

Core Nodes Running

The number of core nodes working. Data points for this metric are reported only when a corresponding instance group exists.

Use case: Monitor cluster health

Units: Count

Core Nodes Pending

The number of core nodes waiting to be assigned. All of the core nodes requested may not be immediately available; this metric reports the pending requests. Data points for this metric are reported only when a corresponding instance group exists.

Use case: Monitor cluster health

Units: Count

Live Data Nodes

The percentage of data nodes that are receiving work from Hadoop.

Use case: Monitor cluster health

Units: Percent

MR Total Nodes

The number of nodes presently available to MapReduce jobs.

Use case: Monitor cluster progress

Units: Count

MR Active Nodes

The number of nodes presently running MapReduce tasks or jobs.

Use case: Monitor cluster progress

Units: Count

MR Lost Nodes

The number of nodes allocated to MapReduce that have been marked in a LOST state.

Use case: Monitor cluster health, Monitor cluster progress

Units: Count

MR Unhealthy Nodes

The number of nodes available to MapReduce jobs marked in an UNHEALTHY state.

Use case: Monitor cluster progress

Units: Count

MR Decommissioned Nodes

The number of nodes allocated to MapReduce applications that have been marked in a DECOMMISSIONED state.

Use case: Monitor cluster health, Monitor cluster progress

Units: Count

MR Rebooted Nodes

The number of nodes available to MapReduce that have been rebooted and marked in a REBOOTED state.

Use case: Monitor cluster health, Monitor cluster progress

Units: Count

IO

S3 Bytes Written

The number of bytes written to Amazon S3.

Use case: Analyze cluster performance, Monitor cluster progress

Units: Bytes

S3 Bytes Read

The number of bytes read from Amazon S3.

Use case: Analyze cluster performance, Monitor cluster progress

Units: Bytes

HDFS Utilization

The percentage of HDFS storage currently used.

Use case: Analyze cluster performance

Units: Percent

HDFS Bytes Read

The number of bytes read from HDFS.

Use case: Analyze cluster performance, Monitor cluster progress

Units: Bytes

HDFS Bytes Written

The number of bytes written to HDFS.

Use case: Analyze cluster performance, Monitor cluster progress

Units: Bytes

Missing Blocks

The number of blocks for which HDFS has no replicas. These might be corrupt blocks.

Use case: Monitor cluster health

Units: Count

Total Load

The total number of concurrent data transfers.

Use case: Monitor cluster health

Units: Count

Memory Total MB

The total amount of memory in the cluster.

Use case: Monitor cluster progress

Units: Bytes

Memory Reserved MB

The amount of memory reserved.

Use case: Monitor cluster progress

Units: Bytes

Memory Available MB

The amount of memory available to be allocated.

Use case: Monitor cluster progress

Units: Bytes

Memory Allocated MB

The amount of memory allocated to the cluster.

Use case: Monitor cluster progress

Units: Bytes

Pending Deletion Blocks

The number of blocks marked for deletion.

Use case: Monitor cluster progress, Monitor cluster health

Units: Count

Under Replicated Blocks

The number of blocks that need to be replicated one or more times.

Use case: Monitor cluster progress, Monitor cluster health

Units: Count

Dfs Pending Replication Blocks

The status of block replication: blocks being replicated, age of replication requests, and unsuccessful replication requests.

Use case: Monitor cluster progress, Monitor cluster health

Units: Count

Capacity Remaining GB

The amount of remaining HDFS disk capacity.

Use case: Monitor cluster progress, Monitor cluster health

Units: Bytes

HBase

Backup Failed

Whether the last backup failed. This is set to 0 by default and updated to 1 if the previous backup attempt failed. This metric is only reported for HBase clusters.

Use case: Monitor HBase backups

Units: Count

Most Recent Backup Duration

The amount of time it took the previous backup to complete. This metric is set regardless of whether the last completed backup succeeded or failed. While the backup is ongoing, this metric returns the number of minutes elapsed since the backup started. This metric is only reported for HBase clusters.

Use case: Monitor HBase backups

Units: Minutes

Time Since Last Successful Backup

The number of minutes elapsed since the last successful HBase backup started on your cluster. This metric is only reported for HBase clusters.

Use case: Monitor HBase backups

Units: Minutes

Dimensions for Amazon EMR Metrics

Amazon EMR data can be filtered using any of the dimensions in the following table.

Dimension | Description
ClusterId | The identifier for a cluster. You can find this value by clicking the cluster in the Amazon EMR console. It takes the form j-XXXXXXXXXXXXX.
JobId | The identifier of a job within a cluster. You can use this to filter the metrics returned from a cluster down to those that apply to a single job within the cluster. JobId takes the form job_XXXXXXXXXXXX_XXXX.
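
To illustrate how the dimensions above might be combined into a filter, the following Python sketch builds a list of name and value pairs in the shape that CloudWatch SDK calls such as boto3's `get_metric_statistics` accept for their `Dimensions` parameter. The function name and the sample identifiers are hypothetical, and the prefix checks are a loose sanity test based on the forms given in the table, not a full validation.

```python
def emr_dimensions(cluster_id, job_id=None):
    """Build a CloudWatch dimension filter for an Amazon EMR cluster.

    Passing `job_id` narrows the filter to a single job in the cluster.
    """
    # Loose checks based on the forms j-... and job_... in the table.
    if not cluster_id.startswith("j-"):
        raise ValueError("cluster_id should look like j-XXXXXXXXXXXXX")
    dimensions = [{"Name": "ClusterId", "Value": cluster_id}]
    if job_id is not None:
        if not job_id.startswith("job_"):
            raise ValueError("job_id should look like job_XXXXXXXXXXXX_XXXX")
        dimensions.append({"Name": "JobId", "Value": job_id})
    return dimensions


if __name__ == "__main__":
    # Placeholder identifiers, not real clusters or jobs.
    print(emr_dimensions("j-ABCDEFGHIJKLM"))
    print(emr_dimensions("j-ABCDEFGHIJKLM", "job_201401011200_0001"))
```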