
Recommended CloudWatch alarms for Amazon OpenSearch Service

CloudWatch alarms perform an action when a CloudWatch metric exceeds a specified value for some amount of time. For example, you might want AWS to email you if your cluster health status is red for longer than one minute. This section includes some recommended alarms for Amazon OpenSearch Service and how to respond to them.

For more information about setting alarms, see Creating Amazon CloudWatch Alarms in the Amazon CloudWatch User Guide.
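As a minimal sketch of what the email-on-red-cluster example looks like in code, here is how the first recommended alarm below could be expressed as parameters for boto3's `put_metric_alarm`. The domain name, account ID, and SNS topic ARN are placeholders, and this assumes the `AWS/ES` metric namespace that OpenSearch Service domains publish to.

```python
def red_cluster_alarm_params(domain_name, account_id, sns_topic_arn):
    """Build kwargs for CloudWatch put_metric_alarm matching
    'ClusterStatus.red maximum is >= 1 for 1 minute, 1 consecutive time'.
    All argument values are placeholders you substitute for your domain."""
    return {
        "AlarmName": f"{domain_name}-cluster-status-red",
        "Namespace": "AWS/ES",  # namespace used by OpenSearch Service domains
        "MetricName": "ClusterStatus.red",
        "Dimensions": [
            {"Name": "DomainName", "Value": domain_name},
            {"Name": "ClientId", "Value": account_id},
        ],
        "Statistic": "Maximum",
        "Period": 60,            # 1-minute period
        "EvaluationPeriods": 1,  # 1 consecutive time
        "Threshold": 1,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": [sns_topic_arn],  # e.g. an SNS topic that emails you
    }

# To actually create the alarm (requires AWS credentials):
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(
#       **red_cluster_alarm_params("my-domain", "123456789012", "arn:aws:sns:..."))
```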

Alarm: ClusterStatus.red maximum is >= 1 for 1 minute, 1 consecutive time
Issue: At least one primary shard and its replicas are not allocated to a node. See Red cluster status.

Alarm: ClusterStatus.yellow maximum is >= 1 for 1 minute, 1 consecutive time
Issue: At least one replica shard is not allocated to a node. See Yellow cluster status.

Alarm: FreeStorageSpace minimum is <= 20480 for 1 minute, 1 consecutive time
Issue: A node in your cluster is down to 20 GiB of free storage space. See Lack of available storage space. This value is in MiB, so rather than 20480, we recommend setting it to 25% of the storage space for each node.

Alarm: ClusterIndexWritesBlocked is >= 1 for 5 minutes, 1 consecutive time
Issue: Your cluster is blocking write requests. See ClusterBlockException.

Alarm: Nodes minimum is < x for 1 day, 1 consecutive time
Issue: x is the number of nodes in your cluster. This alarm indicates that at least one node in your cluster has been unreachable for one day. See Failed cluster nodes.

Alarm: AutomatedSnapshotFailure maximum is >= 1 for 1 minute, 1 consecutive time
Issue: An automated snapshot failed. This failure is often the result of a red cluster health status. See Red cluster status.

For a summary of all automated snapshots and some information about failures, try one of the following requests:

GET domain_endpoint/_snapshot/cs-automated/_all
GET domain_endpoint/_snapshot/cs-automated-enc/_all
Alarm: CPUUtilization or WarmCPUUtilization maximum is >= 80% for 15 minutes, 3 consecutive times
Issue: 100% CPU utilization isn't uncommon, but sustained high usage is problematic. Consider using larger instance types or adding instances.

Alarm: JVMMemoryPressure maximum is >= 80% for 5 minutes, 3 consecutive times
Issue: The cluster could encounter out-of-memory errors if usage increases. Consider scaling vertically. OpenSearch Service uses half of an instance's RAM for the Java heap, up to a heap size of 32 GiB. You can scale instances vertically up to 64 GiB of RAM, at which point you can scale horizontally by adding instances.

Alarm: MasterCPUUtilization maximum is >= 50% for 15 minutes, 3 consecutive times
Alarm: MasterJVMMemoryPressure maximum is >= 80% for 15 minutes, 1 consecutive time
Issue: Consider using larger instance types for your dedicated master nodes. Because of their role in cluster stability and blue/green deployments, dedicated master nodes should have lower CPU usage than data nodes.

Alarm: KMSKeyError is >= 1 for 1 minute, 1 consecutive time
Issue: The KMS encryption key that is used to encrypt data at rest in your domain is disabled. Re-enable it to restore normal operations. For more information, see Encryption of data at rest for Amazon OpenSearch Service.

Alarm: KMSKeyInaccessible is >= 1 for 1 minute, 1 consecutive time
Issue: The KMS encryption key that is used to encrypt data at rest in your domain has been deleted, or its grants to OpenSearch Service have been revoked. You can't recover domains that are in this state, but if you have a manual snapshot, you can use it to migrate to a new domain. To learn more, see Encryption of data at rest for Amazon OpenSearch Service.
Alarm: shards.active is >= 30000 for 1 minute, 1 consecutive time
Issue: The total number of active primary and replica shards is greater than 30,000. You might be rotating your indices too frequently. Consider using ISM to remove indices once they reach a specific age.

Alarm: 5xx maximum is >= 10% of OpenSearchRequests
Issue: One or more data nodes might be overloaded, or requests are failing to complete within the idle timeout period. Consider switching to larger instance types or adding more nodes to the cluster. Confirm that you're following best practices for shard and cluster architecture.

Alarm: MasterReachableFromNode maximum is < 1 for 1 day, 1 consecutive time
Issue: This alarm indicates that the master node stopped or is unreachable. These failures are usually the result of a network connectivity issue or an AWS dependency problem.

Alarm: ThreadpoolIndexQueue average is >= 100 for 1 minute, 1 consecutive time
Issue: The cluster is experiencing high indexing concurrency. Review and control indexing requests, or increase cluster resources.

Alarm: ThreadpoolSearchQueue average is >= 500 for 1 minute, 1 consecutive time
Alarm: ThreadpoolSearchQueue maximum is >= 5000 for 1 minute, 1 consecutive time
Issue: The cluster is experiencing high search concurrency. Consider scaling your cluster. You can also increase the search queue size, but increasing it excessively can cause out-of-memory errors.
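The FreeStorageSpace threshold above is a raw MiB value, so the "25% of each node's storage" recommendation needs a small unit conversion. A minimal sketch of that arithmetic (the node sizes in the examples are illustrative):

```python
def free_storage_threshold_mib(node_storage_gib, fraction=0.25):
    """Return a FreeStorageSpace alarm threshold in MiB:
    the given fraction of one node's storage, converted from GiB."""
    return int(node_storage_gib * 1024 * fraction)

# For example, 80 GiB of storage per node gives the 20480 MiB threshold
# used in the table above; pass fraction=0.10 for the UltraWarm
# recommendation in the next section.
print(free_storage_threshold_mib(80))         # 20480
print(free_storage_threshold_mib(100, 0.10))  # 10240
```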

Note

If you just want to view metrics, see Monitoring OpenSearch cluster metrics with Amazon CloudWatch.

Other alarms you might consider

Consider configuring the following alarms depending on which OpenSearch Service features you regularly use.

Alarm: WarmFreeStorageSpace minimum is <= 10240 for 1 minute, 1 consecutive time
Issue: An UltraWarm node in your cluster is down to 10 GiB of free storage space. See Lack of available storage space. This value is in MiB, so rather than 10240, we recommend setting it to 10% of the storage space for each UltraWarm node.

Alarm: HotToWarmMigrationQueueSize is >= 20 for 1 minute, 3 consecutive times
Issue: A high number of indices are concurrently moving from hot to UltraWarm storage. Consider scaling your cluster.

Alarm: HotToWarmMigrationSuccessLatency is >= 1 day, 1 consecutive time
Issue: If you're rolling daily indices, configure this alarm so that you're notified when HotToWarmMigrationSuccessCount multiplied by the latency is greater than 24 hours.

Alarm: WarmJVMMemoryPressure maximum is >= 80% for 5 minutes, 3 consecutive times
Issue: The cluster could encounter out-of-memory errors if usage increases. Consider scaling vertically. OpenSearch Service uses half of an instance's RAM for the Java heap, up to a heap size of 32 GiB. You can scale instances vertically up to 64 GiB of RAM, at which point you can scale horizontally by adding instances.
Alarm: WarmToColdMigrationQueueSize is >= 20 for 1 minute, 3 consecutive times
Issue: A high number of indices are concurrently moving from UltraWarm to cold storage. Consider scaling your cluster.

Alarm: HotToWarmMigrationFailureCount is >= 1 for 1 minute, 1 consecutive time
Issue: Migrations might fail during snapshots, shard relocations, or force merges. Failures during snapshots or shard relocations are typically due to node failures or S3 connectivity issues. Lack of disk space is usually the underlying cause of force merge failures.

Alarm: WarmToColdMigrationFailureCount is >= 1 for 1 minute, 1 consecutive time
Issue: Migrations usually fail when attempts to migrate index metadata to cold storage fail. Failures can also happen when the warm index cluster state is being removed.

Alarm: WarmToColdMigrationLatency is >= 1 day, 1 consecutive time
Issue: If you're rolling daily indices, configure this alarm so that you're notified when WarmToColdMigrationSuccessCount multiplied by the latency is greater than 24 hours.

Alarm: AlertingDegraded is >= 1 for 1 minute, 1 consecutive time
Issue: Either the alerting index is red, or one or more nodes are not on schedule.

Alarm: ADPluginUnhealthy is >= 1 for 1 minute, 1 consecutive time
Issue: The anomaly detection plugin isn't functioning properly, either because of high failure rates or because one of the indices being used is red.

Alarm: AsynchronousSearchFailureRate is >= 1 for 1 minute, 1 consecutive time
Issue: At least one asynchronous search failed in the last minute, which likely means the coordinator node failed. The lifecycle of an asynchronous search request is managed solely on the coordinator node, so if the coordinator goes down, the request fails.

Alarm: AsynchronousSearchStoreHealth is >= 1 for 1 minute, 1 consecutive time
Issue: The health of the asynchronous search response store in the persisted index is red. You might be storing large asynchronous responses, which can destabilize a cluster. Try to limit your asynchronous search responses to 10 MB or less.

Alarm: SQLUnhealthy is >= 1 for 1 minute, 3 consecutive times
Issue: The SQL plugin is returning 5xx response codes or passing invalid query DSL to OpenSearch. Troubleshoot the requests your clients are making to the plugin.

Alarm: LTRStatus.red is >= 1 for 1 minute, 1 consecutive time
Issue: At least one of the indices needed to run the Learning to Rank plugin has missing primary shards and isn't functional.
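These feature-specific alarms can be created programmatically in the same way as the core set. As one sketch, here are `put_metric_alarm` parameters for the WarmFreeStorageSpace alarm, with the threshold computed as 10% of each UltraWarm node's storage as recommended above. The 100 GiB node size, names, and ARN are placeholders, and the `AWS/ES` namespace is assumed.

```python
def warm_free_storage_alarm_params(domain_name, account_id, sns_topic_arn,
                                   warm_storage_gib=100):
    """Build kwargs for 'WarmFreeStorageSpace minimum is <= 10240 for
    1 minute, 1 consecutive time'. The threshold is 10% of one UltraWarm
    node's storage, converted from GiB to MiB; 100 GiB is illustrative."""
    return {
        "AlarmName": f"{domain_name}-warm-free-storage",
        "Namespace": "AWS/ES",  # assumed namespace for OpenSearch Service
        "MetricName": "WarmFreeStorageSpace",
        "Dimensions": [
            {"Name": "DomainName", "Value": domain_name},
            {"Name": "ClientId", "Value": account_id},
        ],
        "Statistic": "Minimum",
        "Period": 60,
        "EvaluationPeriods": 1,
        "Threshold": int(warm_storage_gib * 1024 * 0.10),  # GiB -> MiB, 10%
        "ComparisonOperator": "LessThanOrEqualToThreshold",
        "AlarmActions": [sns_topic_arn],
    }

# As with the earlier sketch, pass the result to
# boto3.client("cloudwatch").put_metric_alarm(**params) to create the alarm.
```

Note that this alarm watches the metric's minimum with a "less than or equal" comparison, the opposite direction from the red-cluster alarm.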