Using CloudWatch Alarms with Amazon Kinesis Data Analytics for Apache Flink
Using Amazon CloudWatch metric alarms, you watch a CloudWatch metric over a time period that you specify. The alarm performs one or more actions based on the value of the metric or expression relative to a threshold over a number of time periods. An example of an action is sending a notification to an Amazon Simple Notification Service (Amazon SNS) topic.
For more information about CloudWatch alarms, see Using Amazon CloudWatch Alarms.
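As an example of putting an alarm and an action together, the following is a minimal sketch that uses the AWS SDK for Python (Boto3) to create an alarm on the downtime metric from the table that follows, and to notify an Amazon SNS topic when it fires. It assumes the AWS/KinesisAnalytics namespace and the Application dimension for your application's metrics; the application name, Region, and SNS topic ARN are placeholders that you would replace with your own values.

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")  # placeholder Region

cloudwatch.put_metric_alarm(
    AlarmName="flink-app-downtime",
    Namespace="AWS/KinesisAnalytics",              # assumed namespace for the application metrics
    MetricName="downtime",
    Dimensions=[
        {"Name": "Application", "Value": "my-flink-application"}  # placeholder application name
    ],
    Statistic="Average",
    Period=60,                                     # evaluate the metric over 1-minute periods
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",     # downtime > 0 indicates the application has failed
    AlarmActions=[
        "arn:aws:sns:us-east-1:123456789012:my-alerts-topic"      # placeholder SNS topic ARN
    ],
)
```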
Recommended Alarms
This section contains the recommended alarms for monitoring Kinesis Data Analytics applications.
The table describes the recommended alarms and has the following columns:
- Metric Expression: The metric or metric expression to test against the threshold.
- Statistic: The statistic used to check the metric, for example, Average.
- Threshold: Using this alarm requires you to determine a threshold that defines the limit of expected application performance. You need to determine this threshold by monitoring your application under normal conditions.
- Description: Causes that might trigger this alarm, and possible solutions for the condition.
Metric Expression | Statistic | Threshold | Description |
---|---|---|---|
downtime > 0 | Average | 0 | Recommended for all applications. The Downtime metric measures the duration of an outage. A downtime greater than zero indicates that the application has failed. For troubleshooting, see Application is Restarting. |
RATE(numberOfFailedCheckpoints) > 0 | Average | 0 | Recommended for all applications. Use this metric to monitor application health and checkpointing progress. The application saves state data to checkpoints when it's healthy. Checkpointing can fail due to timeouts if the application isn't making progress in processing the input data. For troubleshooting, see Checkpointing is timing out. |
Operator.numRecordsOutPerSecond < threshold | Average | The minimum number of records emitted from the application during normal conditions. | Recommended for all applications. Falling below this threshold can indicate that the application isn't making expected progress on the input data. For troubleshooting, see Throughput is Too Slow. |
records_lag_max/millisBehindLatest > threshold | Maximum | The maximum expected latency during normal conditions. | Recommended for all applications. Use the records_lag_max metric for a Kafka source, or the millisBehindLatest metric for a Kinesis stream source. Rising above this threshold can indicate that the application isn't making expected progress on the input data. For troubleshooting, see Throughput is Too Slow. |
lastCheckpointDuration > threshold | Maximum | The maximum expected checkpoint duration during normal conditions. | If lastCheckpointDuration continuously increases, rising above this threshold can indicate that the application isn't making expected progress on the input data, or that there are problems with application health, such as backpressure. For troubleshooting, see Application State Data is Accumulating. |
lastCheckpointSize > threshold | Maximum | The maximum expected checkpoint size during normal conditions. | If lastCheckpointSize continuously increases, rising above this threshold can indicate that the application is accumulating state data. If the state data becomes too large, the application can run out of memory when recovering from a checkpoint, or recovering from a checkpoint might take too long. For troubleshooting, see Application State Data is Accumulating. |
heapMemoryUtilization > threshold | Maximum | The maximum expected heapMemoryUtilization during normal conditions, with a recommended value of 90 percent. | You can use this metric to monitor the maximum memory utilization of task managers across the application. If the application reaches this threshold, you need to provision more resources. You can do this by enabling automatic scaling or by increasing the application parallelism. For more information about increasing resources, see Scaling. |
cpuUtilization > threshold | Maximum | The maximum expected cpuUtilization during normal conditions, with a recommended value of 80 percent. | You can use this metric to monitor the maximum CPU utilization of task managers across the application. If the application reaches this threshold, you need to provision more resources. You can do this by enabling automatic scaling or by increasing the application parallelism. For more information about increasing resources, see Scaling. |
threadsCount > threshold | Maximum | The maximum expected threadsCount during normal conditions. | You can use this metric to watch for thread leaks in task managers across the application. If this metric reaches the threshold, check your application code for threads that are created but never closed. |
(oldGarbageCollectionTime * 100)/60_000 over a 1-minute period > threshold | Maximum | The maximum expected oldGarbageCollectionTime duration. We recommend setting a threshold such that typical garbage collection time is 60 percent of the specified threshold, but the correct threshold for your application will vary. | If this metric is continually increasing, this can indicate that there is a memory leak in task managers across the application. |
RATE(oldGarbageCollectionCount) > threshold | Maximum | The maximum expected oldGarbageCollectionCount under normal conditions. The correct threshold for your application will vary. | If this metric is continually increasing, this can indicate that there is a memory leak in task managers across the application. |
Operator.currentOutputWatermark - Operator.currentInputWatermark > threshold | Minimum | The minimum expected watermark increment under normal conditions. The correct threshold for your application will vary. | If this metric is continually increasing, this can indicate that either the application is processing increasingly older events, or that an upstream subtask has not sent a watermark in an increasingly long time. |
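The expression-based rows in the preceding table, such as RATE(numberOfFailedCheckpoints) > 0, can be implemented as metric math alarms. The following is a minimal Boto3 sketch under the same assumptions as the earlier example (the AWS/KinesisAnalytics namespace, the Application dimension, and placeholder application name, Region, and SNS topic ARN).

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")  # placeholder Region

cloudwatch.put_metric_alarm(
    AlarmName="flink-app-failed-checkpoints",
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",     # alarm when the checkpoint failure rate rises above 0
    AlarmActions=[
        "arn:aws:sns:us-east-1:123456789012:my-alerts-topic"      # placeholder SNS topic ARN
    ],
    Metrics=[
        {
            # The raw metric; ReturnData=False because only the expression is evaluated.
            "Id": "m1",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/KinesisAnalytics",          # assumed namespace
                    "MetricName": "numberOfFailedCheckpoints",
                    "Dimensions": [
                        {"Name": "Application", "Value": "my-flink-application"}  # placeholder
                    ],
                },
                "Period": 60,
                "Stat": "Average",
            },
            "ReturnData": False,
        },
        {
            # Metric math expression corresponding to the table row.
            "Id": "e1",
            "Expression": "RATE(m1)",
            "Label": "FailedCheckpointRate",
            "ReturnData": True,
        },
    ],
)
```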