Elastic Load Balancing
Developer Guide (API Version 2012-06-01)

Monitor Your Load Balancer Using Amazon CloudWatch

Elastic Load Balancing publishes data points to Amazon CloudWatch about your load balancers and your back-end instances. CloudWatch enables you to retrieve statistics about those data points as an ordered set of time-series data, known as metrics. Think of a metric as a variable to monitor, and the data points as the values of that variable over time. Each data point has an associated time stamp and (optionally) a unit of measurement. For example, the total number of healthy EC2 instances behind a load balancer over a specified time period can be a metric.

CloudWatch provides statistics based on the metric data points published by Elastic Load Balancing. Statistics are metric data aggregations over specified periods of time. The following statistics are available: Minimum (min), Maximum (max), Sum, Average, and Count. When you request statistics, the returned data stream is identified by the metric name and a dimension. A dimension is a name/value pair that uniquely identifies a metric. For example, you can request statistics for all the healthy EC2 instances behind a load balancer launched in a specific Availability Zone.

One purpose for monitoring metrics is to verify that your system is performing as expected. You can create a CloudWatch alarm to watch a specified metric and initiate an action (for example, send a notification in email) if the metric goes outside what you consider an acceptable range.

For more information about Amazon CloudWatch, see the Amazon CloudWatch Developer Guide.

CloudWatch Metrics for Elastic Load Balancing

Elastic Load Balancing sends metrics for all load balancers associated with your AWS account to Amazon CloudWatch. By default, CloudWatch uses these metrics to provide detailed monitoring for your load balancers; you do not need to explicitly enable detailed monitoring.

Elastic Load Balancing Metrics

Elastic Load Balancing reports metrics to CloudWatch only when requests are flowing through the load balancer. If there are requests flowing through the load balancer, Elastic Load Balancing measures and sends its metrics in 60-second intervals. If there are no requests flowing through the load balancer or no data for a metric, the metric is not reported.
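Because a metric is not reported when there is no data, a retrieved time series can have gaps. The following is a minimal Python sketch of one way to handle this on the client side, assuming that the data points have already been retrieved into a dictionary keyed by period start time. The function name, the fill value, and the sample values are hypothetical and are not part of Elastic Load Balancing or CloudWatch.

```python
from datetime import datetime, timedelta, timezone


def fill_missing_periods(datapoints, start, end, period_seconds=60, fill_value=0.0):
    """Return (timestamp, value) pairs for every period in [start, end).

    `datapoints` maps period start times to reported values. Periods with no
    reported data point are given `fill_value` (an assumption of this sketch).
    """
    step = timedelta(seconds=period_seconds)
    filled = []
    current = start
    while current < end:
        filled.append((current, datapoints.get(current, fill_value)))
        current += step
    return filled


# Hypothetical RequestCount sums: data was reported for minutes 0 and 2 only.
start = datetime(2015, 1, 1, 12, 0, tzinfo=timezone.utc)
reported = {
    start: 42.0,
    start + timedelta(minutes=2): 17.0,
}
series = fill_missing_periods(reported, start, start + timedelta(minutes=4))
values = [value for _, value in series]
```

Treating a missing period as zero is reasonable for counters such as RequestCount. For metrics such as Latency or HealthyHostCount, a missing data point means that nothing was measured, so it is probably better to leave the gap in place.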

Note that although every CloudWatch statistic is available for every Elastic Load Balancing metric, not every statistic is meaningful for every metric. For each metric, be aware of its preferred statistic so that you can track the most useful information.

The following CloudWatch metrics are available for Elastic Load Balancing.

Metric

Description

HealthyHostCount

The number of healthy instances in each Availability Zone. An instance is considered healthy if it meets the healthy threshold configured for the health checks. If cross-zone load balancing is enabled, the number of healthy instances is calculated across all Availability Zones.

Preferred statistic: average

UnHealthyHostCount

The number of unhealthy instances in each Availability Zone. An instance is considered unhealthy if it exceeds the unhealthy threshold configured for the health checks. If cross-zone load balancing is enabled, the number of unhealthy instances is calculated across all Availability Zones.

Preferred statistic: average

RequestCount

The number of completed requests that were received and routed to the registered instances.

Preferred statistic: sum

Latency

The time elapsed, in seconds, after the request leaves the load balancer until a response is received.

Preferred statistic: average

HTTPCode_ELB_4XX

The number of HTTP 4XX client error codes generated by the load balancer when the listener is configured to use the HTTP or HTTPS protocol. Client errors are generated when a request is malformed or incomplete.

Preferred statistic: sum

HTTPCode_ELB_5XX

The number of HTTP 5XX server error codes generated by the load balancer when the listener is configured to use the HTTP or HTTPS protocol. This does not include any response codes generated by registered instances.

The metric is reported if there are no healthy instances registered to the load balancer, or if the request rate exceeds the capacity of the instances or the load balancer.

Preferred statistic: sum

HTTPCode_Backend_2XX

HTTPCode_Backend_3XX

HTTPCode_Backend_4XX

HTTPCode_Backend_5XX

The number of HTTP response codes generated by registered instances. This does not include any response codes generated by the load balancer.

Preferred statistic: sum

BackendConnectionErrors

The number of connections that were not successfully established between the load balancer and the registered instances. Because the load balancer retries the connection when there are errors, this count can exceed the request rate.

Preferred statistic: sum

SurgeQueueLength

The total number of requests that are pending submission to a registered instance.

Preferred statistic: max

SpilloverCount

The total number of requests that were rejected because the queue was full.

Preferred statistic: sum
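The preferred statistics listed above can be captured in a small lookup table so that scripts request the most useful statistic for each metric. The following Python sketch is illustrative only. The dictionary and function names are hypothetical, and the statistic names use the spellings Average, Sum, and Maximum that the CloudWatch API accepts.

```python
# Preferred statistic for each Elastic Load Balancing metric, as listed above.
PREFERRED_STATISTIC = {
    "HealthyHostCount": "Average",
    "UnHealthyHostCount": "Average",
    "RequestCount": "Sum",
    "Latency": "Average",
    "HTTPCode_ELB_4XX": "Sum",
    "HTTPCode_ELB_5XX": "Sum",
    "HTTPCode_Backend_2XX": "Sum",
    "HTTPCode_Backend_3XX": "Sum",
    "HTTPCode_Backend_4XX": "Sum",
    "HTTPCode_Backend_5XX": "Sum",
    "BackendConnectionErrors": "Sum",
    "SurgeQueueLength": "Maximum",
    "SpilloverCount": "Sum",
}


def preferred_statistic(metric_name):
    """Return the preferred statistic for a metric, or raise ValueError."""
    try:
        return PREFERRED_STATISTIC[metric_name]
    except KeyError:
        raise ValueError(
            f"unknown Elastic Load Balancing metric: {metric_name}"
        ) from None
```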

Dimensions for Elastic Load Balancing Metrics

To refine the metrics returned by a request, you can use the following dimensions supported by Elastic Load Balancing.

Dimension

Description

LoadBalancerName

Limits the metric data to instances that are registered to the specified load balancer.

AvailabilityZone

Limits the metric data to load balancers in the specified Availability Zone.

For example, with the HealthyHostCount metric, you can use the LoadBalancerName and AvailabilityZone dimensions to get the average number of healthy instances for the specified load balancer in the specified Availability Zone for a specific period of time. Alternatively, you can track the minimum number of healthy hosts or the maximum number of unhealthy hosts in order to better understand how the health and the count of your back-end instances change over time.
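As an illustration of combining both dimensions, the following Python sketch builds the parameters for a CloudWatch GetMetricStatistics request and, optionally, sends it with the AWS SDK for Python (boto3). It assumes the AWS/ELB namespace, which comes from the CloudWatch documentation rather than from this page. The load balancer name, the time range, and the helper function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def healthy_host_request(load_balancer_name, availability_zone, end_time, hours=1):
    """Build GetMetricStatistics parameters for HealthyHostCount."""
    return {
        "Namespace": "AWS/ELB",  # assumed namespace for Elastic Load Balancing
        "MetricName": "HealthyHostCount",
        "Dimensions": [
            {"Name": "LoadBalancerName", "Value": load_balancer_name},
            {"Name": "AvailabilityZone", "Value": availability_zone},
        ],
        "StartTime": end_time - timedelta(hours=hours),
        "EndTime": end_time,
        "Period": 60,  # seconds; matches the 60-second reporting interval
        "Statistics": ["Average", "Minimum"],
    }


def fetch_datapoints(params):
    """Send the request. Requires boto3 and configured AWS credentials."""
    import boto3  # imported here so the builder above works without boto3

    cloudwatch = boto3.client("cloudwatch")
    response = cloudwatch.get_metric_statistics(**params)
    return sorted(response["Datapoints"], key=lambda point: point["Timestamp"])


# Hypothetical load balancer name and time range.
params = healthy_host_request(
    "my-load-balancer",
    "us-west-2a",
    datetime(2015, 1, 1, 13, 0, tzinfo=timezone.utc),
)
```

Dropping the AvailabilityZone entry from Dimensions would return the data for the load balancer as a whole instead of for a single Availability Zone.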

Statistics for Elastic Load Balancing Metrics

Your load balancer is made up of load balancer nodes that forward traffic to your back-end instances. Each load balancer node routes traffic to instances within a single Availability Zone and reports metrics for the Availability Zone.

For all CloudWatch metrics, min and max represent the min and max as reported by individual load balancer nodes. For example, suppose there are 2 load balancer nodes. Load balancer node 1 reports a HealthyHostCount with a min of 2, a max of 10, and an average of 6. Load balancer node 2 reports a HealthyHostCount with a min of 1, a max of 5, and an average of 3. This means that the min in CloudWatch is 1, the max is 10, and the average is around 4.
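The arithmetic in this example can be sketched in Python as follows. The sketch assumes that the combined average is weighted by the number of samples each node reports. The sample counts below are hypothetical, because the example does not state them, and the class and function names are illustrative only.

```python
from dataclasses import dataclass


@dataclass
class NodeReport:
    minimum: float
    maximum: float
    average: float
    sample_count: int  # hypothetical; not given in the example above


def combine(reports):
    """Combine per-node statistics into a single set of statistics."""
    total_samples = sum(r.sample_count for r in reports)
    weighted_sum = sum(r.average * r.sample_count for r in reports)
    return {
        "Minimum": min(r.minimum for r in reports),
        "Maximum": max(r.maximum for r in reports),
        # Average weighted by how many samples each node contributed.
        "Average": weighted_sum / total_samples,
    }


node_1 = NodeReport(minimum=2, maximum=10, average=6, sample_count=10)
node_2 = NodeReport(minimum=1, maximum=5, average=3, sample_count=10)
combined = combine([node_1, node_2])
```

With equal sample counts, the combined average is 4.5. If node 2 contributes more samples than node 1, the average moves closer to 3, which is one way to arrive at an average of around 4.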

The sum is the aggregate value reported across all load balancer nodes during the given time period. Because the metrics include multiple reports per period, sum is only applicable to metrics that are aggregated across all load balancer nodes, such as RequestCount, HTTPCode_ELB_4XX, HTTPCode_ELB_5XX, HTTPCode_Backend_XXX, BackendConnectionErrors, and SpilloverCount.

The count is the number of samples measured. Because the metrics are gathered based on sampling intervals and events, the count is typically not useful. For example, in the healthy/unhealthy host metrics, count is based on the number of health check samples collected by the load balancer nodes, not the number of healthy/unhealthy hosts. For latency metrics, count is the number of samples that each load balancer node reports, not the actual value of latency reported.

The following table describes how to evaluate the statistics for the CloudWatch metrics sent by Elastic Load Balancing.

Statistics for Elastic Load Balancing Metrics

Metric Name

Statistics Details

HealthyHostCount

UnHealthyHostCount

This metric should be used with the AvailabilityZone dimension.

Preferred statistic: average

The average statistic indicates the average number of healthy or unhealthy instances seen by the load balancer nodes. To get the total healthy or unhealthy instances, calculate the average value for each Availability Zone.

The min and max statistics represent the smallest and largest number of instances that were in the healthy or unhealthy state during the specified interval. Note that some load balancer nodes might see an instance as unhealthy for a brief period while other nodes see it as healthy, so min and max can be misleading.

The sum statistic is not meaningful.

The count statistic is the number of samples reported by all load balancer nodes, and is not a useful measure for troubleshooting issues.

Example: Suppose that your load balancer has 4 back-end instances with 2 instances in us-west-2a and 2 instances in us-west-2b, that us-west-2a has 1 unhealthy instance and 1 healthy instance, and that us-west-2b has 2 healthy and 0 unhealthy instances. The AZ dimension reports an average of 1 healthy and 1 unhealthy instance in us-west-2a, and an average of 2 healthy and 0 unhealthy instances in us-west-2b. The regional dimension reports an average of 1.5 healthy instances and 0.5 unhealthy instances.
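The numbers in this example can be reproduced with a short Python sketch. It assumes that each Availability Zone contributes the same number of samples, so the regional average is the plain mean of the per-zone averages, and it reads the guidance above on total instances as adding up the per-zone averages. The function names are hypothetical.

```python
from statistics import mean

# Per-zone averages taken from the example above.
healthy_by_zone = {"us-west-2a": 1, "us-west-2b": 2}
unhealthy_by_zone = {"us-west-2a": 1, "us-west-2b": 0}


def regional_average(per_zone_average):
    """Average reported without the AvailabilityZone dimension (equal samples assumed)."""
    return mean(per_zone_average.values())


def total_instances(per_zone_average):
    """Total across zones, obtained by adding the per-zone averages."""
    return sum(per_zone_average.values())


regional_healthy = regional_average(healthy_by_zone)
regional_unhealthy = regional_average(unhealthy_by_zone)
total_healthy = total_instances(healthy_by_zone)
total_unhealthy = total_instances(unhealthy_by_zone)
```

The regional averages come out to 1.5 healthy and 0.5 unhealthy, matching the example, while the per-zone totals add up to 3 healthy and 1 unhealthy instances out of 4.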

RequestCount

This metric is typically used with the LoadBalancerName dimension to view the total requests for a load balancer. It can also be used to measure the number of requests that were routed to an Availability Zone (which might not be the same Availability Zone that serviced the request for the back end).

Preferred statistic: sum

The sum statistic is the only meaningful statistic for this measure.

The min, max, and average statistics are not meaningful and all return a value of 1.

The count statistic is the number of samples reported by all load balancer nodes and typically equals the sum for the period.

Example: Suppose that your load balancer has 4 back-end instances with 2 instances in us-west-2a and 2 instances in us-west-2b, and that 100 requests are sent to the load balancer. There are 60 requests sent to us-west-2a, with each instance receiving 30 requests, and 40 requests sent to us-west-2b, with each instance receiving 20 requests. The AZ dimension reports a sum of 60 requests in us-west-2a and 40 requests in us-west-2b. The regional dimension reports a sum of 100 requests.

Latency

Latency can be viewed for all requests or for requests routed to a single Availability Zone.

Preferred statistic: average

The average statistic is the most useful diagnostic because it is the average of all requests sent to the back end.

The max statistic can be useful to determine whether some requests are taking substantially longer than the average.

The min statistic is typically not a useful measure, because it is the request/response with the lowest total time elapsed.

The count statistic is approximately equal to the sum for the RequestCount metric, because it is the number of samples taken.

Example: Suppose that your load balancer has 4 back-end instances with 2 instances in us-west-2a and 2 instances in us-west-2b. Requests sent to 1 instance in us-west-2a have a higher latency. The latency metric reported for us-west-2a has a higher value than the latency metric for us-west-2b.

HTTPCode_ELB_4XX

This metric is typically used with the LoadBalancerName dimension to view the total number of HTTP 4XX errors generated by the load balancer nodes. It can also be used to measure the number of errors for requests that were routed to an Availability Zone (which might not be the same Availability Zone that serviced the request for the back end).

Preferred statistic: sum

The sum statistic is the only meaningful statistic for this measure.

The min, max, and average statistics are not meaningful and all return a value of 1.

The count statistic is the number of samples reported by all load balancer nodes and typically equals the sum for the period.

Example: Suppose that your load balancer has us-west-2a and us-west-2b enabled. Client requests include a malformed request URL. Client HTTP errors would likely increase in all Availability Zones. The regional metric would be the sum of the values for each Availability Zone.

HTTPCode_ELB_5XX

This metric is typically used with the LoadBalancerName dimension to view the total number of HTTP 5XX errors generated by all the load balancer nodes. It can also be used to measure the number of errors for requests that were routed to an Availability Zone (which might not be the same Availability Zone that serviced the request for the back end).

Preferred statistic: sum

The sum statistic is the only meaningful statistic for this measure.

The min, max, and average statistics are not meaningful and all return a value of 1.

The count statistic is the number of samples reported by all load balancer nodes and typically equals the sum for the period.

Example: Suppose that your load balancer has us-west-2a and us-west-2b enabled. Instances in us-west-2a are experiencing high latency and are slow to respond to requests. As a result, the load balancer nodes' surge queue in us-west-2a fills up, resulting in spillovers, and the load balancer returns a 503 error to clients (see the SpilloverCount metric). If us-west-2b continues to respond normally, the regional sum of HTTPCode_ELB_5XX equals the sum for us-west-2a.

HTTPCode_Backend_2XX

HTTPCode_Backend_3XX

HTTPCode_Backend_4XX

HTTPCode_Backend_5XX

This metric is typically used with the LoadBalancerName dimension to view the total number of HTTP response codes generated by back-end instances registered to a load balancer. It can also be used to measure the number of errors for requests that were routed to an Availability Zone (which might not be the same Availability Zone that serviced the request for the back end).

Preferred statistic: sum

The sum statistic is the only meaningful statistic for this measure.

The min, max, and average statistics are not meaningful and all return a value of 1.

The count statistic is the number of samples reported by all load balancer nodes and typically equals the sum for the period.

Example: Suppose that your load balancer has 4 back-end instances with 2 instances in us-west-2a and 2 instances in us-west-2b. Requests sent to 1 instance in us-west-2a result in an HTTP 500 response. The metric reported for us-west-2a includes these error responses, while the metric in us-west-2b does not include these error responses. The regional total is equal to the total for us-west-2a.

BackendConnectionErrors

This metric can be viewed for all back-end instances or the back-end instances in a single Availability Zone.

Preferred statistic: sum

The sum statistic represents the total connection errors seen by all load balancer nodes for the given period of time.

The sum statistic is the only meaningful statistic for this measure.

The average, min, and max statistics are reported per load balancer node and are not typically useful. However, the difference between min and max, between peak and average, or between average and trough might be useful to determine whether a single load balancer node is an outlier.

The count statistic is the number of samples reported by all load balancer nodes and typically equals the sum for the period.

Example: Suppose that your load balancer has 4 back-end instances with 2 instances in us-west-2a and 2 instances in us-west-2b. Attempts to connect to 1 instance in us-west-2a result in an increase in the back-end connection errors. The metric reported for us-west-2a includes these connection errors for the failed attempts, while the metric for us-west-2b does not include these errors. The regional total is equal to the total for us-west-2a.

SurgeQueueLength

This metric can be used to monitor the surge queue size for a single Availability Zone or for the region (the overall load balancer).

Preferred statistic: max

The max statistic is the most useful statistic because it represents the peak of requests that were queued. The maximum value is 1,024. If any load balancer node has a full queue of 1,024 requests, there will likely be spillovers (see the SpilloverCount metric).

The average statistic represents the number of requests that were in the queue on average across all load balancer nodes for a given period of time. It can be useful in combination with min and max to determine the range of queuing by load balancer nodes.

The sum statistic is not a useful statistic, because it is the total of all recorded samples across all load balancer nodes.

The min statistic is not typically useful because it represents the lowest value observed for any load balancer node for any sample.

The count statistic is not a meaningful statistic for this measure.

Example: Suppose that your load balancer has us-west-2a and us-west-2b enabled. Instances in us-west-2a are experiencing high latency and are slow to respond to requests. As a result, the load balancer nodes' surge queue in us-west-2a fills up, with clients likely experiencing increased response times. If this continues, the load balancer will likely have spillovers (see the SpilloverCount metric). If us-west-2b continues to respond normally, the regional max will be the same as the max for us-west-2a, while us-west-2b will have a small value (or no metric) for the same period.
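A check against the 1,024-request limit might look like the following Python sketch. It assumes that the max statistic for SurgeQueueLength has already been retrieved for each Availability Zone. The warning ratio, the status labels, and the sample values are hypothetical choices for illustration.

```python
SURGE_QUEUE_CAPACITY = 1024  # maximum surge queue length stated above


def surge_queue_status(max_by_zone, warn_ratio=0.8):
    """Classify each zone by its peak surge queue length.

    `warn_ratio` is a hypothetical early-warning level, not an
    Elastic Load Balancing setting.
    """
    status = {}
    for zone, peak in max_by_zone.items():
        if peak >= SURGE_QUEUE_CAPACITY:
            status[zone] = "full"  # spillovers are likely
        elif peak >= warn_ratio * SURGE_QUEUE_CAPACITY:
            status[zone] = "near capacity"
        else:
            status[zone] = "ok"
    return status


# Hypothetical peaks: us-west-2a is saturated, us-west-2b is healthy.
peaks = {"us-west-2a": 1024, "us-west-2b": 3}
status = surge_queue_status(peaks)
regional_peak = max(peaks.values())
```

As in the example, the regional peak equals the peak for us-west-2a, because the max statistic reflects the most heavily queued load balancer node.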

SpilloverCount

This metric is typically used with the LoadBalancerName dimension to view the total spillovers for the load balancer. It can also be used to measure the number of rejected requests that were routed to an Availability Zone.

Preferred statistic: sum

The sum statistic represents the total of all load balancer node reports for the given period of time.

The sum statistic is the only meaningful statistic for this measure.

The average, min, and max statistics are reported per load balancer node and are not typically useful for the spillover metric.

The min statistic is not typically useful because it represents the lowest value observed for any load balancer node for any sample.

The count statistic is not a meaningful statistic for this measure.

Example: Suppose that your load balancer has us-west-2a and us-west-2b enabled. Instances in us-west-2a are experiencing high latency and are slow to respond to requests. As a result, the load balancer nodes' surge queue in us-west-2a fills up, resulting in load balancer spillovers. The spillover metric will be incremented. If us-west-2b continues to respond normally, the regional sum will be the same as the sum for us-west-2a, while us-west-2b will have a small value (or no metric) for the same period.

View CloudWatch Metrics for Your Load Balancer

You can view the CloudWatch metrics for your load balancers using the Amazon EC2 console. These metrics are displayed as monitoring graphs. The monitoring graphs show data points if the load balancer is active and receiving requests.

Note

Alternatively, you can view metrics for your load balancer using the CloudWatch console. For more information, see Viewing Your AWS Metrics with Amazon CloudWatch in the Amazon CloudWatch Developer Guide.

To view the metrics from your load balancer

  1. Open the Amazon EC2 console at https://console.aws.amazon.com/ec2/.

  2. In the navigation pane, under NETWORK & SECURITY, click Load Balancers.

  3. Select your load balancer.

  4. In the bottom pane, click the Monitoring tab.

  5. To filter the results by time, select a time range from Showing data for.

  6. Select an individual graph to get a larger view of the individual metric. The following metrics are available:

    • Sum HTTP 2XXs

    • Sum HTTP 4XXs

    • Sum HTTP 5XXs

    • Sum ELB HTTP 4XXs

    • Sum ELB HTTP 5XXs

    • Healthy Hosts

    • Unhealthy Hosts

    • Average Latency

    • Sum Requests

    • Backend Connection Errors

    • Surge Queue Length

    • Spillover Count

Create CloudWatch Alarms for Your Load Balancer

An alarm watches a single metric over the time period that you specify. Depending on the value of the metric relative to a threshold that you define, the alarm can send one or more notifications using Amazon SNS, a service that enables applications, end users, and devices to instantly send and receive notifications. For more information, see Get Started with Amazon SNS.

An alarm sends notifications to Amazon SNS when the specified metric reaches the defined range and remains in that range for a specified period of time. An alarm has three possible states:

  • OK—The value of the metric is within the range you've specified.

  • ALARM—The value of the metric is outside the range that you've specified for the specified period of time.

  • INSUFFICIENT_DATA—Either the metric is not yet available or not enough data is available for the metric to determine the alarm state.

Whenever the state of an alarm changes, CloudWatch uses Amazon SNS to send a notification to the email addresses that you specified.
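The three alarm states can be illustrated with a simplified Python model. This is only a sketch of the idea for a "greater than threshold" alarm, not how CloudWatch evaluates alarms internally. The function name and the use of None for a period with no data are assumptions made for the example.

```python
def alarm_state(values, threshold, evaluation_periods):
    """Return the state of a simplified "greater than threshold" alarm.

    `values` holds one statistic value per period, oldest first, with None
    standing for a period that has no data.
    """
    if evaluation_periods < 1:
        raise ValueError("evaluation_periods must be at least 1")
    recent = values[-evaluation_periods:]
    if len(recent) < evaluation_periods or any(v is None for v in recent):
        return "INSUFFICIENT_DATA"
    if all(v > threshold for v in recent):
        return "ALARM"
    return "OK"


# Hypothetical average latency values, in seconds, for 5-minute periods.
state_high = alarm_state([0.2, 130.0], threshold=120, evaluation_periods=1)
state_normal = alarm_state([130.0, 0.2], threshold=120, evaluation_periods=1)
state_no_data = alarm_state([], threshold=120, evaluation_periods=1)
```

Increasing evaluation_periods in this model requires the threshold to be breached in every one of the most recent periods, which is why a longer evaluation window smooths over brief spikes.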

Use the following procedure to create an alarm for your load balancer using the Amazon EC2 console. The alarm sends notifications to an SNS topic whenever the load balancer's latency is above 120 seconds for 1 consecutive period of 5 minutes. Note that a short period creates a more sensitive alarm, while a longer period can mitigate brief spikes in a metric.

Note

Alternatively, you can create an alarm for your load balancer using the CloudWatch console. For more information, see Send Email Based on Load Balancer Alarm in the Amazon CloudWatch Developer Guide.

To create an alarm for your load balancer

  1. Open the Amazon EC2 console at https://console.aws.amazon.com/ec2/.

  2. In the navigation pane, under NETWORK & SECURITY, click Load Balancers.

  3. Select your load balancer.

  4. From the Monitoring tab, click Create Alarm.

  5. If you have an SNS topic that you want to use, select it from Send a notification to. Otherwise, create an SNS topic as follows:

    1. Click create topic.

    2. In Send a notification to, enter a name for your topic.

    3. In With these recipients, enter the email addresses of the recipients to notify, separated by commas. You can enter up to 10 email addresses. Each recipient receives an email from Amazon SNS with a link to subscribe to the SNS topic in order to receive notifications.

  6. Define the threshold for your alarm as follows. For Whenever, select Average and Average Latency. For Is, select > and enter 120. For For at least, specify 1 consecutive period of 5 minutes.

  7. In Name of alarm, a name is automatically generated for you. If you prefer, you can specify a different name.

  8. Click Create Alarm.
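The same alarm can also be defined programmatically. The following Python sketch builds the parameters for the CloudWatch PutMetricAlarm action using the values from the procedure above (average latency greater than 120 for 1 consecutive period of 5 minutes) and, optionally, sends them with the AWS SDK for Python (boto3). It assumes the AWS/ELB namespace. The alarm name, load balancer name, topic ARN, and helper function names are hypothetical placeholders.

```python
def latency_alarm_request(load_balancer_name, topic_arn):
    """Build PutMetricAlarm parameters that mirror the console procedure."""
    return {
        "AlarmName": f"{load_balancer_name}-high-latency",  # hypothetical name
        "Namespace": "AWS/ELB",  # assumed namespace for Elastic Load Balancing
        "MetricName": "Latency",
        "Dimensions": [{"Name": "LoadBalancerName", "Value": load_balancer_name}],
        "Statistic": "Average",
        "Period": 300,  # 5 minutes, in seconds
        "EvaluationPeriods": 1,  # 1 consecutive period
        "Threshold": 120.0,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [topic_arn],  # SNS topic to notify
    }


def create_alarm(params):
    """Create the alarm. Requires boto3 and configured AWS credentials."""
    import boto3  # imported here so the builder above works without boto3

    boto3.client("cloudwatch").put_metric_alarm(**params)


# Placeholder load balancer name and SNS topic ARN.
params = latency_alarm_request(
    "my-load-balancer",
    "arn:aws:sns:us-west-2:123456789012:my-topic",
)
```

Unlike the console procedure, this sketch does not create the SNS topic or subscribe any recipients. The topic is assumed to exist already.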