
Recommended CloudWatch alarms for Amazon OpenSearch Service

CloudWatch alarms perform an action when a CloudWatch metric exceeds a specified value for some amount of time. For example, you might want AWS to email you if your cluster health status is red for longer than one minute. This section includes some recommended alarms for Amazon OpenSearch Service and how to respond to them.

For more information about setting alarms, see Creating Amazon CloudWatch Alarms in the Amazon CloudWatch User Guide.
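As a minimal sketch of what the email-on-red-cluster example looks like in code, here is how the first recommended alarm below could be expressed as parameters for boto3's `put_metric_alarm`. The domain name, account ID, and SNS topic ARN are placeholders, and this assumes the `AWS/ES` metric namespace that OpenSearch Service domains publish to.

```python
def red_cluster_alarm_params(domain_name, account_id, sns_topic_arn):
    """Build kwargs for CloudWatch put_metric_alarm matching
    'ClusterStatus.red maximum is >= 1 for 1 minute, 1 consecutive time'.
    All argument values are placeholders you substitute for your domain."""
    return {
        "AlarmName": f"{domain_name}-cluster-status-red",
        "Namespace": "AWS/ES",  # namespace used by OpenSearch Service domains
        "MetricName": "ClusterStatus.red",
        "Dimensions": [
            {"Name": "DomainName", "Value": domain_name},
            {"Name": "ClientId", "Value": account_id},
        ],
        "Statistic": "Maximum",
        "Period": 60,            # 1-minute period
        "EvaluationPeriods": 1,  # 1 consecutive time
        "Threshold": 1,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": [sns_topic_arn],  # e.g. an SNS topic that emails you
    }

# To actually create the alarm (requires AWS credentials):
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(
#       **red_cluster_alarm_params("my-domain", "123456789012", "arn:aws:sns:..."))
```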

Alarm: ClusterStatus.red maximum is >= 1 for 1 minute, 1 consecutive time
Issue: At least one primary shard and its replicas are not allocated to a node. See Red cluster status.

Alarm: ClusterStatus.yellow maximum is >= 1 for 1 minute, 1 consecutive time
Issue: At least one replica shard is not allocated to a node. See Yellow cluster status.

Alarm: FreeStorageSpace minimum is <= 20480 for 1 minute, 1 consecutive time
Issue: A node in your cluster is down to 20 GiB of free storage space. See Lack of available storage space. This value is in MiB, so rather than 20480, we recommend setting it to 25% of the storage space for each node.

Alarm: ClusterIndexWritesBlocked is >= 1 for 5 minutes, 1 consecutive time
Issue: Your cluster is blocking write requests. See ClusterBlockException.

Alarm: Nodes minimum is < x for 1 day, 1 consecutive time
Issue: x is the number of nodes in your cluster. This alarm indicates that at least one node in your cluster has been unreachable for one day. See Failed cluster nodes.

Alarm: AutomatedSnapshotFailure maximum is >= 1 for 1 minute, 1 consecutive time
Issue: An automated snapshot failed. This failure is often the result of a red cluster health status. See Red cluster status.

For a summary of all automated snapshots and some information about failures, try one of the following requests:

GET domain_endpoint/_snapshot/cs-automated/_all
GET domain_endpoint/_snapshot/cs-automated-enc/_all
Alarm: CPUUtilization or WarmCPUUtilization maximum is >= 80% for 15 minutes, 3 consecutive times
Issue: 100% CPU utilization isn't uncommon, but sustained high usage is problematic. Consider using larger instance types or adding instances.

Alarm: JVMMemoryPressure maximum is >= 80% for 5 minutes, 3 consecutive times
Issue: The cluster could encounter out-of-memory errors if usage increases. Consider scaling vertically. OpenSearch Service uses half of an instance's RAM for the Java heap, up to a heap size of 32 GiB. You can scale instances vertically up to 64 GiB of RAM, at which point you can scale horizontally by adding instances.

Alarm: MasterCPUUtilization maximum is >= 50% for 15 minutes, 3 consecutive times
Alarm: MasterJVMMemoryPressure maximum is >= 80% for 15 minutes, 1 consecutive time
Issue: Consider using larger instance types for your dedicated master nodes. Because of their role in cluster stability and blue/green deployments, dedicated master nodes should have lower CPU usage than data nodes.

Alarm: KMSKeyError is >= 1 for 1 minute, 1 consecutive time
Issue: The KMS encryption key that is used to encrypt data at rest in your domain is disabled. Re-enable it to restore normal operations. For more information, see Encryption of data at rest for Amazon OpenSearch Service.

Alarm: KMSKeyInaccessible is >= 1 for 1 minute, 1 consecutive time
Issue: The KMS encryption key that is used to encrypt data at rest in your domain has been deleted, or its grants to OpenSearch Service have been revoked. You can't recover domains that are in this state, but if you have a manual snapshot, you can use it to migrate to a new domain. To learn more, see Encryption of data at rest for Amazon OpenSearch Service.
Alarm: shards.active is >= 30000 for 1 minute, 1 consecutive time
Issue: The total number of active primary and replica shards is greater than 30,000. You might be rotating your indices too frequently. Consider using ISM to remove indices once they reach a specific age.

Alarm: 5xx maximum is >= 10% of OpenSearchRequests
Issue: One or more data nodes might be overloaded, or requests are failing to complete within the idle timeout period. Consider switching to larger instance types or adding more nodes to the cluster. Confirm that you're following best practices for shard and cluster architecture.

Alarm: MasterReachableFromNode maximum is < 1 for 1 day, 1 consecutive time
Issue: This alarm indicates that the master node stopped or is unreachable. These failures are usually the result of a network connectivity issue or an AWS dependency problem.

Alarm: ThreadpoolIndexQueue average is >= 100 for 1 minute, 1 consecutive time
Issue: The cluster is experiencing high indexing concurrency. Review and control indexing requests, or increase cluster resources.

Alarm: ThreadpoolSearchQueue average is >= 500 for 1 minute, 1 consecutive time
Alarm: ThreadpoolSearchQueue maximum is >= 5000 for 1 minute, 1 consecutive time
Issue: The cluster is experiencing high search concurrency. Consider scaling your cluster. You can also increase the search queue size, but increasing it excessively can cause out-of-memory errors.
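The FreeStorageSpace threshold above is a raw MiB value, so the "25% of each node's storage" recommendation needs a small unit conversion. A minimal sketch of that arithmetic (the node sizes in the examples are illustrative):

```python
def free_storage_threshold_mib(node_storage_gib, fraction=0.25):
    """Return a FreeStorageSpace alarm threshold in MiB:
    the given fraction of one node's storage, converted from GiB."""
    return int(node_storage_gib * 1024 * fraction)

# For example, 80 GiB of storage per node gives the 20480 MiB threshold
# used in the table above; pass fraction=0.10 for the UltraWarm
# recommendation in the next section.
print(free_storage_threshold_mib(80))         # 20480
print(free_storage_threshold_mib(100, 0.10))  # 10240
```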

Note

If you just want to view metrics, see Monitoring OpenSearch cluster metrics with Amazon CloudWatch.

Other alarms you might consider

Consider configuring the following alarms depending on which OpenSearch Service features you regularly use.

Alarm: WarmFreeStorageSpace minimum is <= 10240 for 1 minute, 1 consecutive time
Issue: An UltraWarm node in your cluster is down to 10 GiB of free storage space. See Lack of available storage space. This value is in MiB, so rather than 10240, we recommend setting it to 10% of the storage space for each UltraWarm node.

Alarm: HotToWarmMigrationQueueSize is >= 20 for 1 minute, 3 consecutive times
Issue: A high number of indices are concurrently moving from hot to UltraWarm storage. Consider scaling your cluster.

Alarm: HotToWarmMigrationSuccessLatency is >= 1 day, 1 consecutive time
Issue: If you're rolling daily indices, configure this alarm so that you're notified when HotToWarmMigrationSuccessCount multiplied by the latency is greater than 24 hours.

Alarm: WarmJVMMemoryPressure maximum is >= 80% for 5 minutes, 3 consecutive times
Issue: The cluster could encounter out-of-memory errors if usage increases. Consider scaling vertically. OpenSearch Service uses half of an instance's RAM for the Java heap, up to a heap size of 32 GiB. You can scale instances vertically up to 64 GiB of RAM, at which point you can scale horizontally by adding instances.
Alarm: WarmToColdMigrationQueueSize is >= 20 for 1 minute, 3 consecutive times
Issue: A high number of indices are concurrently moving from UltraWarm to cold storage. Consider scaling your cluster.

Alarm: HotToWarmMigrationFailureCount is >= 1 for 1 minute, 1 consecutive time
Issue: Migrations might fail during snapshots, shard relocations, or force merges. Failures during snapshots or shard relocations are typically due to node failures or S3 connectivity issues. Lack of disk space is usually the underlying cause of force merge failures.

Alarm: WarmToColdMigrationFailureCount is >= 1 for 1 minute, 1 consecutive time
Issue: Migrations usually fail when attempts to migrate index metadata to cold storage fail. Failures can also happen when the warm index cluster state is being removed.

Alarm: WarmToColdMigrationLatency is >= 1 day, 1 consecutive time
Issue: If you're rolling daily indices, configure this alarm so that you're notified when WarmToColdMigrationSuccessCount multiplied by the latency is greater than 24 hours.

Alarm: AlertingDegraded is >= 1 for 1 minute, 1 consecutive time
Issue: Either the alerting index is red, or one or more nodes are not on schedule.

Alarm: ADPluginUnhealthy is >= 1 for 1 minute, 1 consecutive time
Issue: The anomaly detection plugin isn't functioning properly, either because of high failure rates or because one of the indices being used is red.

Alarm: AsynchronousSearchFailureRate is >= 1 for 1 minute, 1 consecutive time
Issue: At least one asynchronous search failed in the last minute, which likely means the coordinator node failed. The lifecycle of an asynchronous search request is managed solely on the coordinator node, so if the coordinator goes down, the request fails.

Alarm: AsynchronousSearchStoreHealth is >= 1 for 1 minute, 1 consecutive time
Issue: The health of the asynchronous search response store in the persisted index is red. You might be storing large asynchronous responses, which can destabilize a cluster. Try to limit your asynchronous search responses to 10 MB or less.

Alarm: SQLUnhealthy is >= 1 for 1 minute, 3 consecutive times
Issue: The SQL plugin is returning 5xx response codes or passing invalid query DSL to OpenSearch. Troubleshoot the requests your clients are making to the plugin.

Alarm: LTRStatus.red is >= 1 for 1 minute, 1 consecutive time
Issue: At least one of the indices needed to run the Learning to Rank plugin has missing primary shards and isn't functional.
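These feature-specific alarms can be created programmatically in the same way as the core set. As one sketch, here are `put_metric_alarm` parameters for the WarmFreeStorageSpace alarm, with the threshold computed as 10% of each UltraWarm node's storage as recommended above. The 100 GiB node size, names, and ARN are placeholders, and the `AWS/ES` namespace is assumed.

```python
def warm_free_storage_alarm_params(domain_name, account_id, sns_topic_arn,
                                   warm_storage_gib=100):
    """Build kwargs for 'WarmFreeStorageSpace minimum is <= 10240 for
    1 minute, 1 consecutive time'. The threshold is 10% of one UltraWarm
    node's storage, converted from GiB to MiB; 100 GiB is illustrative."""
    return {
        "AlarmName": f"{domain_name}-warm-free-storage",
        "Namespace": "AWS/ES",  # assumed namespace for OpenSearch Service
        "MetricName": "WarmFreeStorageSpace",
        "Dimensions": [
            {"Name": "DomainName", "Value": domain_name},
            {"Name": "ClientId", "Value": account_id},
        ],
        "Statistic": "Minimum",
        "Period": 60,
        "EvaluationPeriods": 1,
        "Threshold": int(warm_storage_gib * 1024 * 0.10),  # GiB -> MiB, 10%
        "ComparisonOperator": "LessThanOrEqualToThreshold",
        "AlarmActions": [sns_topic_arn],
    }

# As with the earlier sketch, pass the result to
# boto3.client("cloudwatch").put_metric_alarm(**params) to create the alarm.
```

Note that this alarm watches the metric's minimum with a "less than or equal" comparison, the opposite direction from the red-cluster alarm.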