Amazon Elasticsearch Service
Developer Guide (API Version 2015-01-01)

Recommended CloudWatch Alarms

CloudWatch alarms perform an action when a CloudWatch metric crosses a specified threshold for a specified amount of time. For example, you might want AWS to email you if your cluster health status is red for longer than one minute. This section lists some recommended alarms and explains how to respond to them.

For more information about setting alarms, see Creating Amazon CloudWatch Alarms in the Amazon CloudWatch User Guide.
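
As a rough illustration of the email example above, the following Python sketch builds the parameters for a CloudWatch PutMetricAlarm call that watches ClusterStatus.red. It assumes the boto3 SDK, the AWS/ES namespace with DomainName and ClientId dimensions, and an existing SNS topic with an email subscription. The function names, domain name, account ID, and topic ARN are hypothetical placeholders, not values from this guide.

```python
def red_cluster_alarm_params(domain_name, account_id, sns_topic_arn):
    """Build keyword arguments for a CloudWatch PutMetricAlarm request."""
    return {
        "AlarmName": f"{domain_name}-ClusterStatus-red",
        "AlarmDescription": "At least one primary shard and its replicas "
                            "are not allocated to a node.",
        "Namespace": "AWS/ES",               # namespace for Amazon ES metrics
        "MetricName": "ClusterStatus.red",
        "Dimensions": [
            {"Name": "DomainName", "Value": domain_name},
            {"Name": "ClientId", "Value": account_id},  # AWS account ID
        ],
        "Statistic": "Maximum",
        "Period": 60,                        # seconds, i.e. "for 1 minute"
        "EvaluationPeriods": 1,              # "1 consecutive time"
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": [sns_topic_arn],     # SNS topic that sends the email
    }


def create_alarm(params):
    """Send the alarm definition to CloudWatch (requires AWS credentials)."""
    # Imported here so the parameter builder works without boto3 installed.
    import boto3

    boto3.client("cloudwatch").put_metric_alarm(**params)
```

To create the alarm, you might call `create_alarm(red_cluster_alarm_params("my-domain", "123456789012", "arn:aws:sns:us-east-1:123456789012:es-alarms"))`. The other recommended alarms could follow the same pattern with a different metric name, statistic, period, threshold, and comparison operator.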

| Alarm | Issue |
| --- | --- |
| ClusterStatus.red maximum is >= 1 for 1 minute, 1 consecutive time | At least one primary shard and its replicas are not allocated to a node. See Red Cluster Status. |
| ClusterStatus.yellow maximum is >= 1 for 1 minute, 1 consecutive time | At least one replica shard is not allocated to a node. See Yellow Cluster Status. |
| FreeStorageSpace minimum is <= 20000 for 1 minute, 1 consecutive time | A node in your cluster is down to 20 GB of free storage space. See Lack of Available Storage Space. The threshold is in MB. Rather than a fixed value of 20000, we recommend setting it to 25% of the storage space for each node. |
| ClusterIndexWritesBlocked is >= 1 for 5 minutes, 1 consecutive time | Your cluster is blocking write requests. See ClusterBlockException. |
| Nodes minimum is < x for 1 day, 1 consecutive time | x is the number of nodes in your cluster. This alarm indicates that at least one node in your cluster has been unreachable for one day. See Failed Cluster Nodes. |
| AutomatedSnapshotFailure maximum is >= 1 for 1 minute, 1 consecutive time | An automated snapshot failed. This failure is often the result of a red cluster health status. See Red Cluster Status. For a summary of all automated snapshots and some information about failures, you can also try the following request: GET domain_endpoint/_snapshot/cs-automated/_all |
| CPUUtilization average is >= 80% for 15 minutes, 3 consecutive times | 100% CPU utilization isn't uncommon, but sustained high averages are problematic. Consider using larger instance types or adding instances. |
| JVMMemoryPressure maximum is >= 80% for 15 minutes, 1 consecutive time | The cluster could encounter out-of-memory errors if usage increases. Consider scaling vertically. Amazon ES uses half of an instance's RAM for the Java heap, up to a heap size of 32 GB. You can scale instances vertically up to 64 GB of RAM, at which point you can scale horizontally by adding instances. |
| MasterCPUUtilization average is >= 50% for 15 minutes, 3 consecutive times | Consider using larger instance types for your dedicated master nodes. Because of their role in cluster stability and blue/green deployments, dedicated master nodes should have lower average CPU usage than data nodes. |
| MasterJVMMemoryPressure maximum is >= 80% for 15 minutes, 1 consecutive time | Consider using larger instance types for your dedicated master nodes. |
| KMSKeyError is >= 1 for 1 minute, 1 consecutive time | The KMS encryption key that is used to encrypt data at rest in your domain is disabled. Re-enable it to restore normal operations. To learn more, see Encryption of Data at Rest for Amazon Elasticsearch Service. |
| KMSKeyInaccessible is >= 1 for 1 minute, 1 consecutive time | The KMS encryption key that is used to encrypt data at rest in your domain has been deleted, or its grants to Amazon ES have been revoked. You can't recover domains that are in this state, but if you have a manual snapshot, you can use it to migrate to a new domain. To learn more, see Encryption of Data at Rest for Amazon Elasticsearch Service. |
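
The FreeStorageSpace and JVMMemoryPressure guidance comes down to simple arithmetic, which the following Python sketch works through. The function names are hypothetical, and the conversion assumes 1 GB = 1000 MB to match the pairing of 20 GB with 20000 in this section. Treat it as an estimate, not an official sizing formula.

```python
def free_storage_threshold_mb(node_storage_gb, fraction=0.25):
    """Return a FreeStorageSpace alarm threshold in MB for one node.

    Applies the recommendation of 25% of each node's storage space.
    """
    # Assumption: 1 GB = 1000 MB, matching "20 GB" for a value of 20000.
    return int(node_storage_gb * 1000 * fraction)


def jvm_heap_gb(instance_ram_gb):
    """Estimate the Java heap size: half of the instance RAM, capped at 32 GB."""
    return min(instance_ram_gb / 2, 32.0)


# Threshold for a cluster whose nodes each have 100 GB of storage.
print(free_storage_threshold_mb(100))

# Heap estimates for instances with 16, 64, and 128 GB of RAM.
for ram_gb in (16, 64, 128):
    print(ram_gb, jvm_heap_gb(ram_gb))
```

Under these assumptions, an instance with 64 GB of RAM already reaches the 32 GB heap limit, so a larger instance adds no heap. This is consistent with the advice to scale horizontally beyond that point.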

Note

If you just want to view metrics, see Monitoring CloudWatch Metrics.