Amazon Elastic MapReduce
Developer Guide (API Version 2009-03-31)

Monitor Metrics with Amazon CloudWatch

When you’re running a cluster, you often want to track its progress and health. Amazon EMR records metrics that can help you monitor your cluster. It makes these metrics available in the Amazon EMR console and in the Amazon CloudWatch console, where you can track them with your other AWS metrics. In Amazon CloudWatch, you can set alarms to warn you if a metric goes outside parameters you specify.

Metrics are updated every five minutes. This interval is not configurable. Metrics are archived for two weeks; after that period, the data is discarded.

These metrics are automatically collected and pushed to Amazon CloudWatch for every Amazon EMR cluster. There is no charge for the Amazon EMR metrics reported in Amazon CloudWatch; they are provided as part of the Amazon EMR service.

Note

Viewing Amazon EMR metrics in Amazon CloudWatch is supported only for clusters launched with AMI 2.0.3 or later and running Hadoop 0.20.205 or later. For more information about selecting the AMI version for your cluster, see Choose a Machine Image.


How Do I Use Amazon EMR Metrics?

The metrics reported by Amazon EMR provide information that you can analyze in different ways. The table below shows some common uses for the metrics. These are suggestions to get you started, not a comprehensive list. For the complete list of metrics reported by Amazon EMR, see Metrics Reported by Amazon EMR in Amazon CloudWatch.

How do I? Relevant Metrics
Track the progress of my cluster Look at the RunningMapTasks, RemainingMapTasks, RunningReduceTasks, and RemainingReduceTasks metrics.
Detect clusters that are idle The IsIdle metric tracks whether a cluster is live, but not currently running tasks. You can set an alarm to fire when the cluster has been idle for a given period of time, such as thirty minutes.
Detect when a node runs out of storage The HDFSUtilization metric is the percentage of disk space currently used. If this rises above an acceptable level for your application, such as 80% of capacity used, you may need to resize your cluster and add more core nodes.
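
The idle-cluster check in the table above can be sketched in a few lines of Python. This is a minimal illustration, assuming the IsIdle values have already been retrieved as a list ordered oldest to newest, with one value per five-minute check. The function name `idle_long_enough` and the sample data are hypothetical and are not part of Amazon EMR or Amazon CloudWatch.

```python
def idle_long_enough(is_idle_values, minutes=30, check_interval_minutes=5):
    """Return True if the most recent checks all report IsIdle == 1.

    is_idle_values: IsIdle samples ordered oldest to newest, one per check.
    """
    # Number of consecutive checks needed to cover the requested time span.
    needed = minutes // check_interval_minutes
    if needed <= 0 or len(is_idle_values) < needed:
        return False
    # Only the trailing run matters: any 0 in it means work was observed.
    return all(value == 1 for value in is_idle_values[-needed:])


# Hypothetical samples: busy for two checks, then idle for six checks.
samples = [0, 0, 1, 1, 1, 1, 1, 1]
print(idle_long_enough(samples))              # True
print(idle_long_enough(samples, minutes=45))  # False
```

Requiring several consecutive idle checks, rather than a single one, reduces the chance of acting on a cluster that only happened to be between tasks when it was checked.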

Access Amazon CloudWatch Metrics

There are many ways to access the metrics that Amazon EMR pushes to Amazon CloudWatch. You can view them through either the Amazon EMR console or Amazon CloudWatch console, or you can retrieve them using the Amazon CloudWatch CLI or the Amazon CloudWatch API. The following procedures show you how to access the metrics using these various tools.

To view metrics in the Amazon EMR console

  1. Sign in to the AWS Management Console and open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/.

  2. To view metrics for a cluster, click on the cluster to display the Job Flow Details pane.


  3. Select the Monitoring tab to view information about that cluster. This loads the pane with reports about the progress and health of the cluster.


To view metrics in the Amazon CloudWatch console

  1. Sign in to the AWS Management Console and open the Amazon CloudWatch console at https://console.aws.amazon.com/cloudwatch/.

  2. In the navigation pane, click All Metrics.

  3. Scroll down to the metric to graph. You can search on the cluster identifier of the cluster to monitor.


  4. Click a metric to display the graph.


To access metrics from the Amazon CloudWatch CLI

To access metrics from the Amazon CloudWatch API
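
As a minimal sketch of this approach, the following Python code builds the parameters for a GetMetricStatistics request against the `AWS/ElasticMapReduce` namespace. It assumes the AWS SDK for Python (boto3), which this guide does not cover. The helper name `build_metric_query` and the placeholder cluster identifier are hypothetical. The period and the two-week limit on the start time follow the reporting interval and archive period described earlier in this section.

```python
from datetime import datetime, timedelta, timezone

EMR_NAMESPACE = "AWS/ElasticMapReduce"  # CloudWatch namespace for Amazon EMR metrics
PERIOD_SECONDS = 300                    # metrics are updated every five minutes
RETENTION = timedelta(days=14)          # metrics are archived for two weeks


def build_metric_query(job_flow_id, metric_name, hours=1, now=None):
    """Build keyword arguments for a GetMetricStatistics request."""
    end = now or datetime.now(timezone.utc)
    # Do not ask for data older than the two-week archive window.
    start = max(end - timedelta(hours=hours), end - RETENTION)
    return {
        "Namespace": EMR_NAMESPACE,
        "MetricName": metric_name,
        "Dimensions": [{"Name": "JobFlowId", "Value": job_flow_id}],
        "StartTime": start,
        "EndTime": end,
        "Period": PERIOD_SECONDS,
        "Statistics": ["Average"],
    }


if __name__ == "__main__":
    # Placeholder identifier in the form described in this guide.
    query = build_metric_query("j-XXXXXXXXXXXXX", "HDFSUtilization", hours=3)
    print(query)
    # With boto3 installed and credentials configured, the request could be sent as:
    #   import boto3
    #   response = boto3.client("cloudwatch").get_metric_statistics(**query)
    #   for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    #       print(point["Timestamp"], point["Average"])
```

Keeping the parameter construction separate from the network call makes the query easy to inspect before it is sent.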

Setting Alarms on Metrics

Amazon EMR pushes metrics to Amazon CloudWatch, which means you can use Amazon CloudWatch to set alarms on your Amazon EMR metrics. You can, for example, configure an alarm in Amazon CloudWatch to send you an email any time the HDFS utilization rises above 80%.

The following topics give you a high-level overview of how to set alarms using Amazon CloudWatch. For detailed instructions, see Using Amazon CloudWatch in the Amazon CloudWatch Developer Guide.


Set alarms using the Amazon CloudWatch console

  1. Sign in to the AWS Management Console and open the Amazon CloudWatch console at https://console.aws.amazon.com/cloudwatch/.

  2. Click the Create Alarm button. This launches the Create Alarm Wizard.


  3. Scroll through the Amazon EMR metrics to locate the metric you want to place an alarm on. An easy way to display just the Amazon EMR metrics in this dialog box is to search on the cluster identifier of your cluster. Select the metric to create an alarm on and click Continue.


  4. Fill in the Name, Description, Threshold, and Time values for the metric, and click Continue.


  5. Choose Alarm as the alarm state. If you want Amazon CloudWatch to send you an email when the alarm state is reached, choose either a pre-existing Amazon SNS email subscription list or Create New Email Topic. If you select Create New Email Topic, you can set the name and email addresses for a new email subscription list. This list is saved and appears in the drop-down box for future alarms. Click Continue.

    Note

    If you use Create New Email Topic to create a new Amazon SNS topic, the email addresses must be verified before they receive notifications. Emails are only sent when the alarm enters an alarm state. If this alarm state change happens before the email addresses are verified, they do not receive a notification.


  6. At this point, the Create Alarm Wizard gives you a chance to review the alarm you’re about to create. If you need to make any changes, you can use the Edit links on the right. Click Create Alarm.


Note

For more information about how to set alarms using the Amazon CloudWatch console, see Create an Alarm that Sends Email in the Amazon CloudWatch Developer Guide.

To set an alarm using the Amazon CloudWatch API
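
As a rough sketch of what such a request might contain, the following Python code builds the parameters for a PutMetricAlarm call that fires when HDFS utilization rises above 80%. It assumes the AWS SDK for Python (boto3), which this guide does not cover. The helper name `build_hdfs_alarm`, the alarm naming scheme, and the choice of a single five-minute evaluation period are illustrative assumptions.

```python
def build_hdfs_alarm(job_flow_id, threshold_percent=80.0, sns_topic_arn=None):
    """Build keyword arguments for a PutMetricAlarm request on HDFSUtilization."""
    params = {
        "AlarmName": f"emr-{job_flow_id}-hdfs-utilization",  # hypothetical naming scheme
        "AlarmDescription": f"HDFS utilization above {threshold_percent}%",
        "Namespace": "AWS/ElasticMapReduce",
        "MetricName": "HDFSUtilization",
        "Dimensions": [{"Name": "JobFlowId", "Value": job_flow_id}],
        "Statistic": "Average",
        "Period": 300,           # one five-minute reporting interval
        "EvaluationPeriods": 1,  # assumption: alarm after a single interval
        "Threshold": float(threshold_percent),
        "ComparisonOperator": "GreaterThanThreshold",
    }
    if sns_topic_arn:
        # Amazon SNS topic to notify when the alarm enters the alarm state.
        params["AlarmActions"] = [sns_topic_arn]
    return params


if __name__ == "__main__":
    # Placeholder identifier in the form described in this guide.
    alarm = build_hdfs_alarm("j-XXXXXXXXXXXXX")
    print(alarm["AlarmName"], alarm["Threshold"])
    # With boto3 installed and credentials configured, the alarm could be created as:
    #   import boto3
    #   boto3.client("cloudwatch").put_metric_alarm(**alarm)
```

If an Amazon SNS topic is supplied, its email subscriptions still have to be verified before they receive notifications, as noted in the console procedure.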


Metrics Reported by Amazon EMR in Amazon CloudWatch

The following table lists all of the metrics that Amazon EMR reports in the Amazon EMR console and pushes to Amazon CloudWatch.

Amazon EMR Metrics

Amazon EMR sends data for several metrics to Amazon CloudWatch. All Amazon EMR clusters automatically send metrics in five-minute intervals. Metrics are archived for two weeks; after that period, the data is discarded.

Note

Amazon EMR pulls metrics from a cluster. If a cluster becomes unreachable, no metrics will be reported until the cluster becomes available again.
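
Because an unreachable cluster simply produces no data points, an outage shows up as a gap in a metric's timestamps. The following Python sketch looks for such gaps. It assumes the timestamps have already been retrieved from Amazon CloudWatch. The function name `find_reporting_gaps` and the 1.5x tolerance are illustrative choices, not part of either service.

```python
from datetime import datetime, timedelta


def find_reporting_gaps(timestamps, interval=timedelta(minutes=5)):
    """Return (last_seen, next_seen) pairs where data points appear to be missing."""
    ordered = sorted(timestamps)
    gaps = []
    for earlier, later in zip(ordered, ordered[1:]):
        # Allow some slack so small timing jitter is not flagged as a gap.
        if later - earlier > interval * 1.5:
            gaps.append((earlier, later))
    return gaps


# Hypothetical data: points every five minutes, with nothing between 12:10 and 12:30.
base = datetime(2013, 1, 1, 12, 0)
points = [base + timedelta(minutes=m) for m in (0, 5, 10, 30, 35)]
for last_seen, next_seen in find_reporting_gaps(points):
    print("No data between", last_seen, "and", next_seen)
```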

Metric Description

CoreNodesPending

The number of core nodes waiting to be assigned. All of the core nodes requested may not be immediately available; this metric reports the pending requests. Data points for this metric are reported only when a corresponding instance group exists.

Use Case: Monitor cluster health

Units: Count

CoreNodesRunning

The number of core nodes working. Data points for this metric are reported only when a corresponding instance group exists.

Use Case: Monitor cluster health

Units: Count

HBaseBackupFailed

Whether the last backup failed. This is set to 0 by default and updated to 1 if the previous backup attempt failed. This metric is only reported for HBase clusters.

Use Case: Monitor HBase backups

Units: Count

HBaseMostRecentBackupDuration

The amount of time it took the previous backup to complete. This metric is set regardless of whether the last completed backup succeeded or failed. While the backup is ongoing, this metric returns the number of minutes since the backup started. This metric is only reported for HBase clusters.

Use Case: Monitor HBase backups

Units: Minutes

HBaseTimeSinceLastSuccessfulBackup

The number of elapsed minutes since the last successful HBase backup started on your cluster. This metric is only reported for HBase clusters.

Use Case: Monitor HBase backups

Units: Minutes

HDFSBytesRead

The number of bytes read from HDFS.

Use Case: Analyze cluster performance, Monitor cluster progress

Units: Count

HDFSBytesWritten

The number of bytes written to HDFS.

Use Case: Analyze cluster performance, Monitor cluster progress

Units: Count

HDFSUtilization

The percentage of HDFS storage currently used.

Use Case: Analyze cluster performance

Units: Percent

IsIdle

Indicates that a cluster is no longer performing work, but is still alive and accruing charges. It is set to 1 if no tasks are running and no jobs are running, and set to 0 otherwise. This value is checked at five-minute intervals, and a value of 1 indicates only that the cluster was idle when checked, not that it was idle for the entire five minutes. To avoid false positives, you should raise an alarm only when this value has been 1 for more than one consecutive five-minute check. For example, you might raise an alarm on this value if it has been 1 for thirty minutes or longer.

Use Case: Monitor cluster performance

Units: Count

JobsFailed

The number of jobs in the cluster that have failed.

Use Case: Monitor cluster health

Units: Count

JobsRunning

The number of jobs in the cluster that are currently running.

Use Case: Monitor cluster health

Units: Count

LiveDataNodes

The percentage of data nodes that are receiving work from Hadoop.

Use Case: Monitor cluster health

Units: Percent

LiveTaskTrackers

The percentage of task trackers that are functional.

Use Case: Monitor cluster health

Units: Percent

MapSlotsOpen

The unused map task capacity. This is calculated as the maximum number of map tasks for a given cluster, less the total number of map tasks currently running in that cluster.

Use Case: Analyze cluster performance

Units: Count

MissingBlocks

The number of blocks for which HDFS has no replicas. These might be corrupt blocks.

Use Case: Monitor cluster health

Units: Count

ReduceSlotsOpen

Unused reduce task capacity. This is calculated as the maximum reduce task capacity for a given cluster, less the number of reduce tasks currently running in that cluster.

Use Case: Analyze cluster performance

Units: Count

RemainingMapTasks

The number of remaining map tasks for each job. If you have a scheduler installed and multiple jobs running, multiple graphs are generated. A remaining map task is one that is not in any of the following states: Running, Killed, or Completed.

Use Case: Monitor cluster progress

Units: Count

RemainingMapTasksPerSlot

The ratio of the total map tasks remaining to the total map slots available in the cluster.

Use Case: Analyze cluster performance

Units: Ratio

RemainingReduceTasks

The number of remaining reduce tasks for each job. If you have a scheduler installed and multiple jobs running, multiple graphs are generated.

Use Case: Monitor cluster progress

Units: Count

RunningMapTasks

The number of running map tasks for each job. If you have a scheduler installed and multiple jobs running, multiple graphs are generated.

Use Case: Monitor cluster progress

Units: Count

RunningReduceTasks

The number of running reduce tasks for each job. If you have a scheduler installed and multiple jobs running, multiple graphs are generated.

Use Case: Monitor cluster progress

Units: Count

S3BytesRead

The number of bytes read from Amazon S3.

Use Case: Analyze cluster performance, Monitor cluster progress

Units: Count

S3BytesWritten

The number of bytes written to Amazon S3.

Use Case: Analyze cluster performance, Monitor cluster progress

Units: Count

TaskNodesPending

The number of task nodes waiting to be assigned. All of the task nodes requested may not be immediately available; this metric reports the pending requests. Data points for this metric are reported only when a corresponding instance group exists.

Use Case: Monitor cluster health

Units: Count

TaskNodesRunning

The number of task nodes working. Data points for this metric are reported only when a corresponding instance group exists.

Use Case: Monitor cluster health

Units: Count

TotalLoad

The total number of concurrent data transfers.

Use Case: Monitor cluster health

Units: Count
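
Several of the metrics in the table are defined as simple calculations over other quantities. The following Python sketch restates those definitions for MapSlotsOpen, ReduceSlotsOpen, and RemainingMapTasksPerSlot. The function names and sample numbers are hypothetical, and the code only illustrates the arithmetic described above; it is not how Amazon EMR computes the values internally.

```python
def map_slots_open(max_map_tasks, running_map_tasks):
    """Unused map capacity: maximum map tasks less map tasks currently running."""
    return max_map_tasks - running_map_tasks


def reduce_slots_open(max_reduce_tasks, running_reduce_tasks):
    """Unused reduce capacity: maximum reduce tasks less reduce tasks currently running."""
    return max_reduce_tasks - running_reduce_tasks


def remaining_map_tasks_per_slot(remaining_map_tasks, total_map_slots):
    """Ratio of map tasks remaining to map slots available in the cluster."""
    if total_map_slots == 0:
        # No slots available, so the ratio is undefined.
        return None
    return remaining_map_tasks / total_map_slots


# Hypothetical cluster: 20 map slots, 14 map tasks running, 50 map tasks remaining.
print(map_slots_open(20, 14))                # 6
print(remaining_map_tasks_per_slot(50, 20))  # 2.5
```

A RemainingMapTasksPerSlot value well above 1 suggests that the remaining map work will need several rounds of the available slots to finish.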

Dimensions for Amazon EMR Metrics

Amazon EMR data can be filtered using any of the dimensions in the following table.

Dimension Description
JobFlowId The identifier for a cluster. You can find this value by clicking on the cluster in the Amazon EMR console. It takes the form j-XXXXXXXXXXXXX.
JobId The identifier of a job within a cluster. You can use this to filter the metrics returned from a cluster down to those that apply to a single job within the cluster. JobId takes the form job_XXXXXXXXXXXX_XXXX.
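
The two dimensions can be combined into the list of name/value pairs that Amazon CloudWatch requests accept for filtering. The following Python sketch builds that list and does a loose format check first. The patterns are assumptions based only on the forms shown in the table (real identifiers may vary in length or character set), and `build_dimensions` and the sample identifiers are hypothetical.

```python
import re

# Loose patterns inferred from the forms j-XXXXXXXXXXXXX and job_XXXXXXXXXXXX_XXXX.
# These are assumptions for illustration, not an official specification.
JOB_FLOW_ID_PATTERN = re.compile(r"j-[A-Z0-9]+")
JOB_ID_PATTERN = re.compile(r"job_\d+_\d+")


def build_dimensions(job_flow_id, job_id=None):
    """Return CloudWatch dimension filters for a cluster and, optionally, one job."""
    if not JOB_FLOW_ID_PATTERN.fullmatch(job_flow_id):
        raise ValueError(f"unexpected JobFlowId format: {job_flow_id!r}")
    dimensions = [{"Name": "JobFlowId", "Value": job_flow_id}]
    if job_id is not None:
        if not JOB_ID_PATTERN.fullmatch(job_id):
            raise ValueError(f"unexpected JobId format: {job_id!r}")
        # Narrow the results to a single job within the cluster.
        dimensions.append({"Name": "JobId", "Value": job_id})
    return dimensions


# Hypothetical identifiers, used only to show the shape of the result.
print(build_dimensions("j-1ABCDEFGHIJKL"))
print(build_dimensions("j-1ABCDEFGHIJKL", "job_201301011200_0001"))
```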