Using Amazon CloudWatch Alarms
You can create a CloudWatch alarm that watches a single CloudWatch metric or the result of a math expression based on CloudWatch metrics. The alarm performs one or more actions based on the value of the metric or expression relative to a threshold over a number of time periods. The action can be an Amazon EC2 action, an Amazon EC2 Auto Scaling action, or a notification sent to an Amazon SNS topic.
You can also add alarms to CloudWatch dashboards and monitor them visually. When an alarm is on a dashboard, it turns red when it is in the ALARM state, making it easier for you to monitor its status proactively.
Alarms invoke actions for sustained state changes only. CloudWatch alarms don't invoke actions simply because they are in a particular state; the state must have changed and been maintained for a specified number of periods.
After an alarm invokes an action due to a change in state, its subsequent behavior depends on the type of action that you have associated with the alarm. For Amazon EC2 Auto Scaling actions, the alarm continues to invoke the action for every period that the alarm remains in the new state. For Amazon SNS notifications, no additional actions are invoked.
Note
CloudWatch doesn't test or validate the actions that you specify, nor does it detect any Amazon EC2 Auto Scaling or Amazon SNS errors resulting from an attempt to invoke nonexistent actions. Make sure that your actions exist.
Alarm States
An alarm has the following possible states:
-
OK – The metric or expression is within the defined threshold.
-
ALARM – The metric or expression is outside of the defined threshold.
-
INSUFFICIENT_DATA – The alarm has just started, the metric is not available, or not enough data is available for the metric to determine the alarm state.
Evaluating an Alarm
When you create an alarm, you specify three settings to enable CloudWatch to evaluate when to change the alarm state:
-
Period is the length of time to evaluate the metric or expression to create each individual data point for an alarm. It is expressed in seconds. If you choose one minute as the period, there is one data point every minute.
-
Evaluation Period is the number of the most recent periods, or data points, to evaluate when determining alarm state.
-
Datapoints to Alarm is the number of data points within the evaluation period that must be breaching to cause the alarm to go to the ALARM state. The breaching data points don't have to be consecutive; they just must all be within the last number of data points equal to Evaluation Period.
In the following figure, the alarm threshold is set to three units, and both Evaluation Period and Datapoints to Alarm are 3. That is, the alarm goes to the ALARM state when all existing data points in the most recent three consecutive periods are above the threshold. In the figure, this happens in the third through fifth time periods. At period six, the value dips below the threshold, so one of the periods being evaluated is not breaching, and the alarm state changes to OK. During the ninth time period, the threshold is breached again, but for only one period. Consequently, the alarm state remains OK.
When you configure Evaluation Period and Datapoints to Alarm as different values, you're setting an "M out of N" alarm. Datapoints to Alarm is "M" and Evaluation Period is "N". The evaluation interval is the number of data points (N) multiplied by the period. For example, if you configure 4 out of 5 data points with a period of 1 minute, the evaluation interval is 5 minutes. If you configure 3 out of 3 data points with a period of 10 minutes, the evaluation interval is 30 minutes.
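The M out of N rule and the evaluation interval arithmetic can be sketched in a few lines of Python. This is an illustration only, not CloudWatch's implementation: the function names are hypothetical, and the sketch assumes a greater-than-threshold comparison with no missing data points.

```python
from typing import Sequence


def m_out_of_n_state(values: Sequence[float], threshold: float,
                     evaluation_periods: int, datapoints_to_alarm: int) -> str:
    """Return "ALARM" or "OK" for a series of data points (oldest first)."""
    # Only the most recent N data points are considered.
    window = values[-evaluation_periods:]
    # Breaching points do not need to be consecutive, so a count is enough.
    breaching = sum(1 for v in window if v > threshold)
    return "ALARM" if breaching >= datapoints_to_alarm else "OK"


def evaluation_interval_seconds(evaluation_periods: int, period_seconds: int) -> int:
    """Evaluation interval = number of data points (N) x period."""
    return evaluation_periods * period_seconds


# 4 out of 5 alarm with a threshold of 3: four non-consecutive breaches.
print(m_out_of_n_state([4, 5, 2, 4, 6], threshold=3,
                       evaluation_periods=5, datapoints_to_alarm=4))  # ALARM
print(evaluation_interval_seconds(5, 60))    # 300 seconds (5 minutes)
print(evaluation_interval_seconds(3, 600))   # 1800 seconds (30 minutes)
```

The two interval calculations correspond to the 4 out of 5 and 3 out of 3 examples in the text.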
Configuring How CloudWatch Alarms Treat Missing Data
Sometimes some data points for a metric with an alarm don't get reported to CloudWatch. For example, this can happen when a connection is lost, a server goes down, or when a metric reports data only intermittently by design.
CloudWatch enables you to specify how to treat missing data points when evaluating an alarm. This can help you configure your alarm to go to the ALARM state when appropriate for the type of data being monitored. You can avoid false positives when missing data doesn't indicate a problem.
Similar to how each alarm is always in one of three states, each specific data point reported to CloudWatch falls under one of three categories:
-
Not breaching (within the threshold)
-
Breaching (violating the threshold)
-
Missing
For each alarm, you can specify that CloudWatch treats missing data points as any of the following:
-
notBreaching – Missing data points are treated as "good" and within the threshold.
-
breaching – Missing data points are treated as "bad" and breaching the threshold.
-
ignore – The current alarm state is maintained.
-
missing – The alarm doesn't consider missing data points when evaluating whether to change state.
The best choice depends on the type of metric. For a metric that continually reports data, such as CPUUtilization of an instance, you might want to treat missing data points as breaching, because they might indicate that something is wrong. But for a metric that generates data points only when an error occurs, such as ThrottledRequests in Amazon DynamoDB, you would want to treat missing data as notBreaching. The default behavior is missing.
Choosing the best option for your alarm prevents unnecessary and misleading alarm condition changes, and also more accurately indicates the health of your system.
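As a rough illustration, the following Python sketch builds the request parameters for the two cases discussed above. The helper `build_alarm_params`, the alarm names, the thresholds, and the statistic and period values are hypothetical choices made for this example; the dictionary keys are PutMetricAlarm request parameters. Dimensions are omitted for brevity, and the sketch makes no AWS call.

```python
from typing import Any, Dict

VALID_TREATMENTS = {"notBreaching", "breaching", "ignore", "missing"}


def build_alarm_params(alarm_name: str, namespace: str, metric_name: str,
                       statistic: str, threshold: float,
                       treat_missing_data: str = "missing") -> Dict[str, Any]:
    """Build a PutMetricAlarm parameter dictionary (illustrative values only)."""
    if treat_missing_data not in VALID_TREATMENTS:
        raise ValueError(f"unknown missing-data treatment: {treat_missing_data}")
    return {
        "AlarmName": alarm_name,
        "Namespace": namespace,
        "MetricName": metric_name,
        "Statistic": statistic,
        "Period": 300,                      # assumed 5-minute period
        "EvaluationPeriods": 3,
        "DatapointsToAlarm": 3,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": treat_missing_data,
    }


# A continually reported metric: missing data may signal a problem.
cpu_alarm = build_alarm_params("high-cpu", "AWS/EC2", "CPUUtilization",
                               "Average", 80.0, "breaching")

# A metric reported only when errors occur: missing data is expected.
throttle_alarm = build_alarm_params("throttled-requests", "AWS/DynamoDB",
                                    "ThrottledRequests", "Sum", 0.0, "notBreaching")

# With credentials configured, a dictionary could be passed to boto3, for example:
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**cpu_alarm)
print(cpu_alarm["TreatMissingData"], throttle_alarm["TreatMissingData"])
```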
How Alarm State Is Evaluated When Data Is Missing
No matter what value you set for how to treat missing data, when an alarm evaluates whether to change state, CloudWatch attempts to retrieve a higher number of data points than specified by Evaluation Periods. The exact number of data points it attempts to retrieve depends on the length of the alarm period and whether it is based on a metric with standard resolution or high resolution. The time frame of the data points that it attempts to retrieve is the evaluation range.
Once CloudWatch retrieves these data points, the following happens:
-
If no data points in the evaluation range are missing, CloudWatch evaluates the alarm based on the most recent data points collected.
-
If some data points in the evaluation range are missing, but the number of existing data points retrieved is equal to or more than the alarm's Evaluation Periods, CloudWatch evaluates the alarm state based on the most recent existing data points that were successfully retrieved. In this case, the value you set for how to treat missing data is not needed and is ignored.
-
If some data points in the evaluation range are missing, and the number of existing data points that were retrieved is lower than the alarm's number of evaluation periods, CloudWatch fills in the missing data points with the result you specified for how to treat missing data, and then evaluates the alarm. However, any real data points in the evaluation range, no matter when they were reported, are included in the evaluation. CloudWatch fills in as few missing data points as possible.
In all of these situations, the number of data points evaluated is equal to the value of Evaluation Periods. If fewer than the value of Datapoints to Alarm are breaching, the alarm state is set to OK. Otherwise, the state is set to ALARM.
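The steps above can be sketched in Python for the breaching and notBreaching settings. This is a simplified illustration under stated assumptions, not CloudWatch's implementation: the function name is hypothetical, the comparison is assumed to be greater-than-threshold, the evaluation range is passed in oldest first with `None` for a missing data point, and the missing and ignore settings are left out because the text above does not fully specify how they resolve.

```python
from typing import List, Optional, Sequence


def evaluate_with_missing(points: Sequence[Optional[float]], threshold: float,
                          evaluation_periods: int, datapoints_to_alarm: int,
                          treat_missing: str) -> str:
    """Evaluate an evaluation range (oldest first, None = missing data point)."""
    if treat_missing not in ("breaching", "notBreaching"):
        raise ValueError("this sketch covers only breaching and notBreaching")

    existing = [p for p in points if p is not None]
    if len(existing) >= evaluation_periods:
        # Enough real data: use the most recent existing points and
        # ignore the missing-data setting.
        window: List[bool] = [p > threshold for p in existing[-evaluation_periods:]]
    else:
        # Too few real points: keep every real point, then fill in only as
        # many missing points as needed to reach Evaluation Periods.
        fill = treat_missing == "breaching"
        needed = evaluation_periods - len(existing)
        window = [p > threshold for p in existing] + [fill] * needed

    return "ALARM" if sum(window) >= datapoints_to_alarm else "OK"


# One real, non-breaching point plus two filled-in breaching points: 2 of 3 breach.
print(evaluate_with_missing([1, None, None, None, None], 3, 3, 3, "breaching"))  # OK
# Four real points; the three most recent all breach, so the setting is ignored.
print(evaluate_with_missing([1, 5, 5, None, 5], 3, 3, 3, "notBreaching"))        # ALARM
```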
Note
A particular case of this behavior is that CloudWatch alarms might repeatedly re-evaluate the last set of data points for a period of time after the metric has stopped flowing. This re-evaluation might cause the alarm to change state and re-execute actions, if it had changed state immediately prior to the metric stream stopping. To mitigate this behavior, use shorter periods.
The following tables illustrate examples of the alarm evaluation behavior. In the first table, Datapoints to Alarm and Evaluation Periods are both 3. CloudWatch retrieves the 5 most recent data points when evaluating the alarm.
Column 2 shows how many of the 3 necessary data points are missing. Even though the most recent 5 data points are evaluated, only 3 (the setting for Evaluation Periods) are necessary to evaluate the alarm state. The number of data points in Column 2 is the number of data points that must be "filled in", using the setting for how missing data is being treated.
Columns 3-6 show the alarm state that would be set for each setting of how missing data should be treated, shown at the top of each column. In the data points column, 0 is a non-breaching data point, X is a breaching data point, and - is a missing data point.
| Data points | # of missing data points | MISSING | IGNORE | BREACHING | NOT BREACHING |
|---|---|---|---|---|---|
| 0 - X - X | 0 | | | | |
| 0 - - - - | 2 | | | | |
| - - - - - | 3 | | Retain current state | | |
| 0 X X - X | 0 | | | | |
| - - X - - | 2 | | | | |
In the second row of the preceding table, the alarm stays OK even if missing data is treated as breaching, because the one existing data point is not breaching, and this is evaluated along with two missing data points which are treated as breaching. The next time this alarm is evaluated, if the data is still missing, it will go to ALARM, as that non-breaching data point will no longer be among the 5 most recent data points retrieved. In the fourth row, the alarm goes to ALARM state in all cases because there are enough real data points so that the setting for how to treat missing data doesn't need to be considered.
In the next table, the Period is again set to 5 minutes, and Datapoints to Alarm is only 2 while Evaluation Periods is 3. This is a 2 out of 3 (M out of N) alarm.
| Data points | # of missing data points | MISSING | IGNORE | BREACHING | NOT BREACHING |
|---|---|---|---|---|---|
| 0 - X - X | 0 | | | | |
| 0 0 X 0 X | 0 | | | | |
| 0 - X - - | 1 | | | | |
| - - - - 0 | 2 | | | | |
| - - X - - | 2 | | Retain current state | | |
If data points are missing soon after you create an alarm, and the metric was being reported to CloudWatch before you created the alarm, CloudWatch retrieves the most recent data points from before the alarm was created when evaluating the alarm.
High-Resolution Alarms
If you set an alarm on a high-resolution metric, you can specify a high-resolution alarm with a period of 10 seconds or 30 seconds, or you can set a regular alarm with a period of any multiple of 60 seconds. There is a higher charge for high-resolution alarms. For more information about high-resolution metrics, see Publishing Custom Metrics.
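The period rule can be expressed as a small check. The function below is a hypothetical helper written for illustration; it only encodes the rule stated above and does not call any AWS API.

```python
def classify_alarm_period(period_seconds: int) -> str:
    """Classify an alarm period as described above, or reject it."""
    if period_seconds in (10, 30):
        return "high-resolution"
    if period_seconds > 0 and period_seconds % 60 == 0:
        return "regular"
    raise ValueError("period must be 10, 30, or a multiple of 60 seconds")


print(classify_alarm_period(10))    # high-resolution
print(classify_alarm_period(300))   # regular
```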
Alarms on Math Expressions
You can set an alarm on the result of a math expression that is based on one or more CloudWatch metrics. A math expression used for an alarm can include as many as 10 metrics. Each metric must be using the same period.
For an alarm based on a math expression, you can specify how you want CloudWatch to treat missing data points for the underlying metrics when evaluating the alarm.
Alarms based on math expressions can't perform Amazon EC2 actions.
For more information about metric math expressions and syntax, see Using Metric Math.
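As a sketch of what such an alarm definition might look like, the following Python builds the list of metric queries for a math expression and checks the two constraints stated above (at most 10 metrics, all with the same period). The helper names, the `MyApp` namespace, the metric names, and the expression are hypothetical; the dictionary layout follows the `Metrics` parameter of PutMetricAlarm. No AWS call is made.

```python
from typing import Any, Dict, List


def metric_query(query_id: str, namespace: str, metric_name: str,
                 period: int, stat: str) -> Dict[str, Any]:
    """Describe one underlying metric used by the expression."""
    return {
        "Id": query_id,
        "MetricStat": {
            "Metric": {"Namespace": namespace, "MetricName": metric_name},
            "Period": period,
            "Stat": stat,
        },
        "ReturnData": False,   # only the expression result drives the alarm
    }


def validate_math_alarm(metrics: List[Dict[str, Any]]) -> None:
    """Check the limits described above for a math-expression alarm."""
    stats = [m["MetricStat"] for m in metrics if "MetricStat" in m]
    if len(stats) > 10:
        raise ValueError("a math expression alarm can include at most 10 metrics")
    if len({s["Period"] for s in stats}) > 1:
        raise ValueError("all metrics must use the same period")


metrics = [
    {"Id": "error_rate", "Expression": "100 * errors / requests",
     "Label": "Error rate (%)", "ReturnData": True},
    metric_query("errors", "MyApp", "Errors", 300, "Sum"),
    metric_query("requests", "MyApp", "Requests", 300, "Sum"),
]
validate_math_alarm(metrics)   # raises ValueError if a constraint is violated
print(len(metrics))
```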
Percentile-Based CloudWatch Alarms and Low Data Samples
When you set a percentile as the statistic for an alarm, you can specify what to do when there is not enough data for a good statistical assessment. You can choose to have the alarm evaluate the statistic anyway and possibly change the alarm state. Or, you can have the alarm ignore the metric while the sample size is low, and wait to evaluate it until there is enough data to be statistically significant.
For percentiles between 0.5 and 1.00, this setting is used when there are fewer than 10/(1-percentile) data points during the evaluation period. For example, this setting would be used if there were fewer than 1000 samples for an alarm on a p99 percentile. For percentiles between 0 and 0.5, the setting is used when there are fewer than 10/percentile data points.
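The sample-size rule can be worked through numerically. The function below is a hypothetical helper that applies the two formulas above; it assumes the percentile is given as a fraction strictly between 0 and 1 (for example, 0.99 for p99).

```python
def low_sample_threshold(percentile: float) -> float:
    """Number of data points below which the low-sample setting applies."""
    if not 0 < percentile < 1:
        raise ValueError("percentile must be strictly between 0 and 1")
    if percentile >= 0.5:
        return 10 / (1 - percentile)
    return 10 / percentile


# p99: 10 / (1 - 0.99), about 1000 samples (rounded to absorb floating-point error).
print(round(low_sample_threshold(0.99)))
# p10: 10 / 0.10, about 100 samples.
print(round(low_sample_threshold(0.10)))
```

At a percentile of exactly 0.5, both formulas give the same value (20), so it does not matter which branch handles the boundary.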
Common Features of CloudWatch Alarms
The following features apply to all CloudWatch alarms:
-
You can create up to 5000 alarms per Region per AWS account. To create or update an alarm, you use the PutMetricAlarm API action (mon-put-metric-alarm command).
-
Alarm names must contain only ASCII characters.
-
You can list any or all of the currently configured alarms, and list any alarms in a particular state using DescribeAlarms (mon-describe-alarms). You can further filter the list by time range.
-
You can disable and enable alarms by using DisableAlarmActions and EnableAlarmActions (mon-disable-alarm-actions and mon-enable-alarm-actions).
-
You can test an alarm by setting it to any state using SetAlarmState (mon-set-alarm-state). This temporary state change lasts only until the next alarm comparison occurs.
-
You can create an alarm using PutMetricAlarm (mon-put-metric-alarm) before you've created a custom metric. For the alarm to be valid, you must include all of the dimensions for the custom metric in addition to the metric namespace and metric name in the alarm definition.
-
You can view an alarm's history using DescribeAlarmHistory (mon-describe-alarm-history). CloudWatch preserves alarm history for two weeks. Each state transition is marked with a unique timestamp. In rare cases, your history might show more than one notification for a state change. The timestamp enables you to confirm unique state changes.
-
The number of evaluation periods for an alarm multiplied by the length of each evaluation period can't exceed one day.
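Two of the constraints in this list (ASCII-only names and the one-day limit) are easy to check before submitting an alarm definition. The following is a minimal sketch with a hypothetical helper name; it performs local checks only and does not reflect any validation that CloudWatch itself performs.

```python
from typing import List

SECONDS_PER_DAY = 24 * 60 * 60


def check_alarm_definition(alarm_name: str, period_seconds: int,
                           evaluation_periods: int) -> List[str]:
    """Return a list of problems found; an empty list means both checks pass."""
    problems = []
    if not alarm_name.isascii():
        problems.append("alarm name contains non-ASCII characters")
    if period_seconds * evaluation_periods > SECONDS_PER_DAY:
        problems.append("evaluation periods x period exceeds one day")
    return problems


print(check_alarm_definition("high-cpu", 300, 3))    # []
print(check_alarm_definition("höhe-cpu", 3600, 25))  # both problems reported
```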
Note
Some AWS resources don't send metric data to CloudWatch under certain conditions. For example, Amazon EBS might not send metric data for an available volume that is not attached to an Amazon EC2 instance, because there is no metric activity to be monitored for that volume. If you have an alarm set for such a metric, you might notice its state change to INSUFFICIENT_DATA. This might indicate that your resource is inactive, and might not necessarily mean that there is a problem.
