Elastic Load Balancing
Developer Guide (API Version 2012-06-01)

Monitor Your Load Balancer Using Amazon CloudWatch

Elastic Load Balancing publishes data points to Amazon CloudWatch about your load balancers and your back-end application instances. CloudWatch lets you retrieve statistics about those data points as an ordered set of time-series data, known as metrics. Think of a metric as a variable to monitor; the data points are the values of that variable over time. Each data point has an associated time stamp and, optionally, a unit of measurement. For example, the total number of healthy EC2 instances behind a load balancer over a specified time period can be a metric.

Amazon CloudWatch provides statistics based on the metric data points published by Elastic Load Balancing. Statistics are metric data aggregations over specified periods of time. The following statistics are available: Minimum (min), Maximum (max), Sum, Average, and Count. When you request statistics, the returned data stream is identified by the metric name and a dimension. A dimension is a name/value pair that helps you to uniquely identify a metric. For example, you can request statistics of all the healthy EC2 instances behind a load balancer launched in a specific Availability Zone.
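To make the metric, statistic, and dimension relationship concrete, the following Python sketch builds the parameters for a CloudWatch GetMetricStatistics request for the HealthyHostCount metric, narrowed by both Elastic Load Balancing dimensions. The load balancer name and Availability Zone are illustrative assumptions, and the commented boto3 call is one possible client, not something this guide prescribes:

```python
from datetime import datetime, timedelta

def healthy_host_stats_params(lb_name, az, minutes=60):
    """Build GetMetricStatistics parameters for HealthyHostCount,
    narrowed by the LoadBalancerName and AvailabilityZone dimensions."""
    now = datetime.utcnow()
    return {
        "Namespace": "AWS/ELB",
        "MetricName": "HealthyHostCount",
        "Dimensions": [
            {"Name": "LoadBalancerName", "Value": lb_name},
            {"Name": "AvailabilityZone", "Value": az},
        ],
        "StartTime": now - timedelta(minutes=minutes),
        "EndTime": now,
        "Period": 60,               # Elastic Load Balancing reports every 60 seconds
        "Statistics": ["Average"],  # the preferred statistic for this metric
    }

# "my-load-balancer" and "us-east-1a" are hypothetical example values.
params = healthy_host_stats_params("my-load-balancer", "us-east-1a")
# With a client such as boto3 (an assumption here), the call would be:
# boto3.client("cloudwatch").get_metric_statistics(**params)
```

The same parameter shape works for any of the metrics in this section; only MetricName, Dimensions, and the chosen statistic change.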

One purpose for monitoring metrics in Amazon CloudWatch is to verify that your system is performing as expected. If a metric goes outside what you consider an acceptable range, you can create a CloudWatch alarm. A CloudWatch alarm watches a specified metric and initiates an action if the metric goes outside the specified range; the action can be, for example, a notification sent to you.

For more information on Amazon CloudWatch, see What is Amazon CloudWatch.

This section provides information about the metrics that Elastic Load Balancing sends to Amazon CloudWatch. This section also explains how to view and use the metrics, and how you can set alarms when a metric meets a condition that you specify.

Available Metrics

Elastic Load Balancing sends metrics for all of the load balancers associated with your AWS account to Amazon CloudWatch. By default, CloudWatch uses these metrics to provide detailed monitoring of your load balancers. You do not need to specifically enable detailed monitoring.

Elastic Load Balancing Metrics

The following Elastic Load Balancing metrics are available from Amazon CloudWatch.

Elastic Load Balancing reports metrics to CloudWatch only when requests are flowing through the load balancer. If there are no requests or no data for a given metric, the metric is not reported. While requests are flowing through the load balancer, Elastic Load Balancing measures and sends its metrics in 60-second intervals.

Note: The statistics available through Amazon CloudWatch, such as Min or Count, are not always applicable to every metric, even though all of them are available through the console, API, and command line interface (CLI). For each Elastic Load Balancing metric, use the preferred statistic noted below to track useful information.

Metric

Description
HealthyHostCount

The number of healthy instances in each Availability Zone. Hosts are declared healthy if they meet the threshold for the number of consecutive successful health checks. Hosts that have failed more health checks than the value of the unhealthy threshold are considered unhealthy.

Preferred statistic: average

UnHealthyHostCount

The number of unhealthy instances in each Availability Zone. Hosts that have failed more health checks than the value of the unhealthy threshold are considered unhealthy. Instances may become unhealthy due to connectivity issues, health checks returning non-200 responses (in the case of HTTP or HTTPS health checks), or timeouts when performing the health check.

Preferred statistic: average

RequestCount

The number of completed requests that were received and routed to the back-end instances.

Preferred statistic: sum

Latency

The time elapsed, in seconds, between a request leaving the load balancer and the response being received.

Preferred statistic: average

HTTPCode_ELB_4XX

The number of HTTP 4XX client error codes generated by the load balancer when the listener is configured to use the HTTP or HTTPS protocol. Client errors are generated when a request is malformed or incomplete.

Preferred statistic: sum

HTTPCode_ELB_5XX

The number of HTTP 5XX server error codes generated by the load balancer when the listener is configured to use the HTTP or HTTPS protocol. This metric does not include any responses generated by back-end instances.

The metric is reported if there are no healthy back-end instances registered to the load balancer, or if the request rate exceeds the capacity of the instances or of the load balancer.

Preferred statistic: sum

HTTPCode_Backend_2XX

HTTPCode_Backend_3XX

HTTPCode_Backend_4XX

HTTPCode_Backend_5XX

The number of HTTP response codes generated by back-end instances. This metric does not include any response codes generated by the load balancer.

The 2XX class status codes represent successful actions. The 3XX class status codes indicate that the user agent must take additional action to complete the request. The 4XX class status codes represent client errors. The 5XX class status codes represent back-end server errors.

Preferred statistic: sum

BackendConnectionErrors

The number of connections that were not successfully established between the load balancer and the registered instances. Because the load balancer retries when there are connection errors, this count can exceed the request rate.

Preferred statistic: sum

SurgeQueueLength

The total number of requests that are pending submission to a registered instance.

Preferred statistic: max

SpilloverCount

The total number of requests that were rejected because the surge queue was full.

Preferred statistic: sum

Dimensions for Elastic Load Balancing Metrics

To refine the metrics returned by a query, you can use the dimensions for Elastic Load Balancing that are listed in the table in this section. For example, with the HealthyHostCount metric, you can use the dimensions LoadBalancerName and AvailabilityZone to get the average number of healthy instances behind the specified load balancer within the specified Availability Zone for a given period of time. Alternatively, it may be useful to track the minimum number of healthy hosts or the maximum number of unhealthy hosts to better understand how the health and the count of backend instances change over time.

Elastic Load Balancing data can be aggregated along any of the dimensions shown in the following table.

Dimension

Description

LoadBalancerName

Limits the metric data to Amazon EC2 instances that are connected to the specified load balancer.

AvailabilityZone

Limits the metric data to load balancers in the specified Availability Zone.

How Elastic Load Balancing Statistics Are Measured

Elastic Load Balancing is made up of load balancer nodes that forward traffic to your back-end instances. Each load balancer node is designated to route traffic to instances within a single Availability Zone, and each node reports metrics for its designated Availability Zone.

For all CloudWatch metrics, min and max represent the min and max as reported by the individual load balancer nodes. For example, say there are 2 load balancer nodes. Node 1 reports a HealthyHostCount with a min of 2, a max of 10, and an average of 6. Node 2 reports a HealthyHostCount with a min of 1, a max of 5, and an average of 3. The min in CloudWatch will be 1, the max will be 10, and the average will be around 4 (the exact average depends on the number of samples each node reports).
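The aggregation described above can be sketched in a few lines of Python. The per-node sample counts are assumptions chosen for illustration; CloudWatch computes the combined average as the total sum divided by the total number of samples, so the result depends on how many samples each node reported:

```python
# Per-node reports: min, max, sum, and sample count, using the values from
# the example above. Sample counts are assumed for illustration only.
node1 = {"min": 2, "max": 10, "sum": 60, "samples": 10}   # average 6
node2 = {"min": 1, "max": 5,  "sum": 60, "samples": 20}   # average 3

# CloudWatch's min/max is the min/max across all node reports.
combined_min = min(node1["min"], node2["min"])            # 1
combined_max = max(node1["max"], node2["max"])            # 10

# CloudWatch's average is total sum / total samples, not the mean of averages.
combined_avg = (node1["sum"] + node2["sum"]) / (node1["samples"] + node2["samples"])

print(combined_min, combined_max, combined_avg)  # 1 10 4.0
```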

The sum is the aggregate value reported across all load balancer nodes during the given time period. Because the metrics include multiple reports per period, sum is applicable only to metrics that are aggregated across all load balancer nodes, such as RequestCount, HTTPCode_ELB_4XX, HTTPCode_ELB_5XX, the HTTPCode_Backend metrics (2XX, 3XX, 4XX, 5XX), BackendConnectionErrors, and SpilloverCount.

The count is the number of samples measured. Because the metrics are gathered based on sampling intervals and events, the count is typically not useful. For instance, in the healthy/unhealthy host metrics, count is based on the number of health check samples collected by the load balancer nodes, not the number of healthy/unhealthy hosts. For latency metrics, count is the number of samples that each load balancer node reports, not the actual value of latency reported.

The following table describes how to evaluate the statistics for the CloudWatch metrics sent by Elastic Load Balancing.

Statistics for Elastic Load Balancing Metrics

Metric Name

Statistics Details

HealthyHostCount

UnHealthyHostCount

This metric should be used with the AvailabilityZone dimension.

Preferred statistic: average

The average statistic indicates the average number of healthy or unhealthy instances as seen by the load balancer nodes. To get the total healthy or unhealthy instances, calculate the average value for each Availability Zone (AZ).

Min and max represent the fewest and the most instances that were in the healthy or unhealthy state during the specified interval. Because some load balancer nodes may see an instance as unhealthy for a brief period while other nodes see it as healthy, the min and max can be misleading.

Sum is not a meaningful statistic for this measure.

Count is the number of samples reported by all nodes and is not a useful measure for troubleshooting issues.

Example: Say that your load balancer has 4 back-end instances, with 2 instances in AZ-1 and 2 instances in AZ-2. AZ-1 has 1 unhealthy instance and 1 healthy instance, and AZ-2 has 2 healthy and 0 unhealthy instances. The AZ dimension will report an average of 1 healthy and 1 unhealthy instance in AZ-1, and an average of 2 healthy and 0 unhealthy instances in AZ-2. The regional (load balancer) dimension would report an average of 1.5 healthy instances and 0.5 unhealthy instances across AZ-1 and AZ-2.

RequestCount

This metric is typically used with the LoadBalancerName dimension to view the total requests for a load balancer. The metric can also be used to measure the number of requests that were routed to an Availability Zone (which may or may not be the same Availability Zone that serviced the request on the back end).

Preferred statistic: sum

Sum is the only meaningful statistic for this measure.

Min, max, and average are not meaningful because each sample has a value of 1, so they all return 1.

Count is the number of samples reported by all load balancer nodes and typically equals the sum for the period.

Example: Say your load balancer has 4 back-end instances with 2 instances in AZ-1 and 2 instances in AZ-2. 100 requests are sent to the load balancer; 60 requests are sent to AZ-1 with each instance receiving 30 requests. 40 requests are sent to AZ-2 with each instance receiving 20 requests. The AZ dimension will report a sum of 60 requests in AZ-1 and 40 requests in AZ-2. The regional (load balancer) dimension reports a sum of 100 requests. Instances in AZ-1 would each see a request count of 30; instances in AZ-2 would each see a request count of 20.
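The request accounting in this example can be sketched directly (pure Python, no AWS calls; the per-instance request counts are the ones given above):

```python
# Requests routed to each instance, keyed by Availability Zone,
# using the numbers from the example above.
requests = {
    "AZ-1": [30, 30],   # 2 instances, 30 requests each
    "AZ-2": [20, 20],   # 2 instances, 20 requests each
}

# The AZ dimension reports the sum per zone; the regional (load balancer)
# dimension reports the sum across all zones.
per_az_sum = {az: sum(counts) for az, counts in requests.items()}
regional_sum = sum(per_az_sum.values())

print(per_az_sum, regional_sum)  # {'AZ-1': 60, 'AZ-2': 40} 100
```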

Latency

Latency can be viewed for all requests or for requests routed to a single Availability Zone.

Preferred statistic: average

The average provides the most useful diagnostic because it is the average of all requests that were sent to the back end.

Max can be useful to determine if some requests are taking substantially longer than the average.

Min is typically not a useful measure, because it is the request/response with the lowest total time elapsed.

Count is approximately equal to the sum of the RequestCount metric, because it is the number of samples taken.

Example: Say your load balancer has 4 back-end instances with 2 instances in AZ-1 and 2 instances in AZ-2. Requests sent to 1 instance in AZ-1 have a higher latency. The latency metric reported for AZ-1 will have a higher value than the latency metric for AZ-2.

HTTPCode_ELB_4XX

This metric is typically used with the LoadBalancerName dimension to view the total number of HTTP 4XX errors generated by all the load balancer nodes. The metric can also be used to measure the number of errors for requests that were routed to an Availability Zone (which may or may not be the same Availability Zone that serviced the request on the back end).

Preferred statistic: sum

Sum is the only meaningful statistic for this measure.

Min, max, and average are not meaningful; they all return a value of 1.

Count is the number of samples reported by all load balancer nodes and typically equals the sum for the period.

Example: Say that your load balancer has AZ-1 and AZ-2 enabled. Client requests include a malformed request URL. Client HTTP errors would likely increase in all Availability Zones. The regional (load balancer) metric would be the sum of the values for each Availability Zone.

HTTPCode_ELB_5XX

This metric is typically used with the LoadBalancerName dimension to view the total number of HTTP 5XX errors generated by all the load balancer nodes. The metric can also be used to measure the number of errors for requests that were routed to an Availability Zone (which may or may not be the same Availability Zone that serviced the request on the back end).

Preferred statistic: sum

Sum is the only meaningful statistic for this measure.

Min, max, and average are not meaningful statistics for this measure and all return a value of 1.

Count is the number of samples reported by all load balancer nodes and typically equals the sum for the period.

Example: Say that your load balancer has AZ-1 and AZ-2 enabled. Instances in AZ-1 are experiencing high latency and are slow to respond to requests. As a result, the load balancer nodes' surge queues in AZ-1 fill up, resulting in spillovers, and the load balancer returns an HTTP 503 error to clients (see the SpilloverCount metric in this table). Assuming AZ-2 continues to respond normally, the regional (load balancer) sum of HTTPCode_ELB_5XX equals the sum for AZ-1.

HTTPCode_Backend_2XX

HTTPCode_Backend_3XX

HTTPCode_Backend_4XX

HTTPCode_Backend_5XX

This metric is typically used with the LoadBalancerName dimension to view the total number of HTTP response codes generated by back-end instances registered to a load balancer. The metric can also be used to measure the number of errors for requests that were routed to an Availability Zone (which may or may not be the same Availability Zone that serviced the request on the back end).

Preferred statistic: sum

Sum is the only meaningful statistic for this measure.

Min, max, and average are not meaningful statistics for this measure and all return a value of 1.

Count is the number of samples reported by all load balancer nodes and typically equals the sum for the period.

Example: Say that your load balancer has 4 back-end instances with 2 instances in AZ-1 and 2 instances in AZ-2. Requests sent to 1 instance in AZ-1 result in an HTTP 500 response. The metric reported for AZ-1 will include these error responses, while the value in AZ-2 will not include these error responses. The regional (load balancer) total would be equal to the total for AZ-1.

BackendConnectionErrors

This metric can be viewed for all back-end instances or the back-end instances in a single Availability Zone.

Preferred statistic: sum

Sum represents the total connection errors seen by all load balancer nodes for the given period of time.

Sum is the only meaningful statistic for this measure.

Average, min, and max are reported per load balancer node and are not typically useful. However, the difference between min and max, or between the average and its peak or trough, may be useful for determining whether a single load balancer node is an outlier.

Count is the number of samples reported by all load balancer nodes and will typically equal the sum for the period.

Example: Say that your load balancer has 4 back-end instances, with 2 instances in AZ-1 and 2 instances in AZ-2. Attempts to connect to 1 instance in AZ-1 result in an increase in back-end connection errors. The metric reported for AZ-1 will include these connection errors for the failed attempts, while the value in AZ-2 will not. The regional (load balancer) total would be equal to the total for AZ-1.

SurgeQueueLength

This metric can be used to monitor the surge queue size for a single Availability Zone or for the region (the overall load balancer).

Preferred statistic: max

Max is the most useful statistic because it represents the peak of requests that were queued. The maximum value is 1,024. If any load balancer node has a full queue of 1,024 requests, there will likely be spillovers (see the SpilloverCount metric in this table).

Average represents the number of requests that were in the queue on average across all load balancer nodes for a given period of time. Average can be useful in combination with min and max to determine the range of queuing by load balancer nodes.

Sum is not a useful statistic, because it is the total of all recorded samples across all load balancer nodes.

Min is not typically useful because it represents the lowest value observed for any load balancer node for any sample.

Count is not a meaningful statistic for this measure.

Example: Say that your load balancer has AZ-1 and AZ-2 enabled. Instances in AZ-1 are experiencing high latency and are slow to respond to requests. As a result, the load balancer's surge queues in AZ-1 fill up, and clients will likely experience increased response times. If this continues, the load balancer will have spillovers (see the SpilloverCount metric in this table). Assuming AZ-2 continues to respond normally, the regional (load balancer) max will be the same as the max for AZ-1, while AZ-2 will have a small value (or no metric) for the same period.

SpilloverCount

This metric is typically used with the LoadBalancerName dimension to view the total spillovers for the load balancer. This metric can also be used to measure the number of rejected requests that were routed to an Availability Zone.

Preferred statistic: sum

Sum represents the total across all load balancer node reports for the given period of time.

Sum is the only meaningful statistic for this measure.

Average, min, and max are reported per load balancer node and are not typically useful for the spillover metric.

Min is not typically useful because it represents the lowest value observed for any load balancer node for any sample.

Count is not a meaningful statistic for this measure.

Example: Say that your load balancer has AZ-1 and AZ-2 enabled. Instances in AZ-1 are experiencing high latency and are slow to respond to requests. As a result, the load balancer's surge queues in AZ-1 fill up, resulting in spillovers, and the spillover metric is incremented. Assuming AZ-2 continues to respond normally, the regional (load balancer) sum will be the same as the sum for AZ-1, while AZ-2 will have a small value (or no metric) for the same period.

View Metrics

You can view the CloudWatch metrics for your Elastic Load Balancing load balancers directly in the EC2 console’s Load Balancer page. The load balancer metrics are displayed as monitoring graphs. The following metrics are available in the EC2 console:

  • Average Latency

  • Sum Requests

  • Sum ELB HTTP 4XXs

  • Sum ELB HTTP 5XXs

  • Sum HTTP 2XXs

  • Sum HTTP 4XXs

  • Sum HTTP 5XXs

To view the metrics from your load balancer

  1. Sign in to the AWS Management Console and open the Amazon EC2 console at https://console.aws.amazon.com/ec2/.

  2. On the Amazon EC2 console Resources page, in the EC2 Dashboard pane, under NETWORK & SECURITY, click Load Balancers.

  3. On the Load Balancers page, select the check box next to your load balancer.

    The bottom pane displays the details of your load balancer.

  4. Click the Monitoring tab.

    The bottom pane displays the graphs of metrics sent by Elastic Load Balancing for the selected load balancer.

  5. If you want to filter the results by time, select the time range in the Showing data for drop-down box.

    Monitoring Tab

  6. Click on an individual graph to get a larger view of the individual metric.

The my-test-loadbalancer load balancer used in this example is inactive and is used for display purposes only. The monitoring graphs will show data points if the load balancer is active and receiving requests.

You can also view metrics for your load balancer using the Amazon CloudWatch console, command line interface (CLI), or the Query API. For information on using the CloudWatch console to view metrics, see Viewing Your AWS Metrics with Amazon CloudWatch. To view metrics using the command line interface, use the mon-list-metrics command. To view metrics using the Query API, use the ListMetrics action.
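As a sketch of the ListMetrics query shape, the following Python builds the parameters for listing Elastic Load Balancing metrics in the AWS/ELB namespace. The load balancer name is illustrative, and the commented boto3 call is one possible client for submitting the query, not part of this guide:

```python
def elb_metrics_query(lb_name=None):
    """Build ListMetrics parameters for the AWS/ELB namespace; optionally
    narrow the results to one load balancer via the LoadBalancerName dimension."""
    params = {"Namespace": "AWS/ELB"}
    if lb_name is not None:
        params["Dimensions"] = [{"Name": "LoadBalancerName", "Value": lb_name}]
    return params

# "my-load-balancer" is a hypothetical example value.
print(elb_metrics_query("my-load-balancer"))
# With a client such as boto3 (an assumption here), the call would be:
# boto3.client("cloudwatch").list_metrics(**elb_metrics_query("my-load-balancer"))
```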

Create Alarms

An alarm watches a single metric over a time period you specify. Depending on the value of the metric relative to a threshold that you define, the alarm can send one or more notifications to an Amazon Simple Notification Service (Amazon SNS) topic. Amazon SNS is a web service that enables applications, end users, and devices to instantly send and receive notifications. For more information, see Get Started with Amazon SNS.

An alarm will send notifications to Amazon SNS when the specified metric reaches the defined range and remains in that range for a specified period of time. An alarm has three possible states:

  • OK—This is the state the alarm is in when the value of the metric remains within the range you’ve specified.

  • ALARM—This is the state the alarm goes to when the value of the metric goes out of the range you’ve specified and remains outside of the range for a specified time duration.

  • INSUFFICIENT_DATA—When the alarm is in this state, it either means that the metric is not yet available or not enough data is available for the metric to determine the alarm state.

Whenever the state of an alarm changes, Amazon CloudWatch uses Amazon SNS to send a notification to the email addresses that you specify.

You can create alarms for your load balancer using either the Elastic Load Balancing wizard in the EC2 console or the Amazon CloudWatch console. This section walks you through the steps for creating an alarm using the Elastic Load Balancing wizard. For information on creating a load balancer alarm using the Amazon CloudWatch console, see Send Email Based on Load Balancer Alarm.

To create an alarm for your load balancer

  1. Sign in to the AWS Management Console and open the Amazon EC2 console at https://console.aws.amazon.com/ec2/.

  2. On the Amazon EC2 console page, in the EC2 Dashboard pane, under NETWORK & SECURITY, click Load Balancers.

  3. On the Load Balancers page, select the check box next to the load balancer for which you want to create an alarm.

    The bottom pane displays the details of your load balancer.

  4. On the Monitoring tab in the bottom pane, click Create Alarm.

  5. In the Create Alarm dialog box, set the criteria for your alarm. In this example, we’ll set an alarm if the load balancer's latency is above 120 seconds for 1 consecutive period of 5 minutes.

  6. The check box next to Send a notification to is selected by default.

    If you've already created an SNS topic and want to use it, select your topic from the drop-down box, and skip the next step.

    If you do not have an SNS topic, click create topic.

  7. To create an SNS topic:

    1. In the Send a notification to: box, enter a name for your topic.

    2. In the With these recipients box, enter the email addresses of the recipients you want to notify. You can enter up to 10 email addresses, each separated by a comma.

      Note

      If your email recipients have not yet subscribed to the SNS topic, each recipient will receive an email from Amazon SNS with a link to confirm the subscription. Recipients who do not confirm will not receive future alarm notification emails.

  8. Configure the threshold for your alarm.

    1. In the Whenever boxes, select Average and Average Latency.

    2. In the Is boxes, define the threshold for the alarm by selecting > and entering 120.

    3. In the For at least boxes, enter 1 and then select 5 minutes, or you can define your own evaluation period.

      Note

      A shorter period creates a more sensitive alarm. A longer period can mitigate brief spikes in a metric.

    4. In Name this alarm, a name is automatically generated for you. Enter a new name if you want to change it.

      Important

      You cannot modify the name after you create the alarm.

      Create Alarm

  9. Click Create Alarm.

  10. The Alarm created successfully dialog box opens confirming successful creation of your alarm. Click Close.

After you create the alarm, you can use the Monitoring tab to view a summary of alarms that have been set for that load balancer. From there, you can also edit the alarm.
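The console walkthrough above can also be expressed through the CloudWatch API's PutMetricAlarm action. The sketch below builds the equivalent parameters for the example alarm (average Latency above 120 seconds for 1 consecutive period of 5 minutes). The alarm name, load balancer name, and SNS topic ARN are illustrative assumptions, and the commented boto3 call is one possible client, not something this guide prescribes:

```python
def latency_alarm_params(lb_name, topic_arn):
    """Build PutMetricAlarm parameters mirroring the console example:
    alarm when average Latency exceeds 120 seconds for 1 period of 5 minutes."""
    return {
        "AlarmName": "my-load-balancer-high-latency",  # illustrative name
        "Namespace": "AWS/ELB",
        "MetricName": "Latency",
        "Dimensions": [{"Name": "LoadBalancerName", "Value": lb_name}],
        "Statistic": "Average",
        "ComparisonOperator": "GreaterThanThreshold",
        "Threshold": 120.0,        # seconds
        "Period": 300,             # 5 minutes
        "EvaluationPeriods": 1,    # 1 consecutive period
        "AlarmActions": [topic_arn],  # SNS topic notified on ALARM
    }

# Both arguments are hypothetical example values.
params = latency_alarm_params(
    "my-load-balancer",
    "arn:aws:sns:us-east-1:123456789012:my-topic",
)
# With a client such as boto3 (an assumption here), the call would be:
# boto3.client("cloudwatch").put_metric_alarm(**params)
```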