Alerts from baseline monitoring in AMS - AMS Accelerate Operations Plan

This section describes AMS Accelerate monitoring defaults; for more information, see Monitoring and event management in AMS Accelerate.

The following table shows what is monitored and the default alerting thresholds. You can change the alerting thresholds with a custom configuration document or by submitting a service request. For instructions on changing your custom alarm configuration, see Changing the configuration. To be notified directly when alarms cross their threshold, in addition to AMS's standard alerting process, overwrite the alarm configurations as described in Tag-based Alarm Manager.
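Each trigger condition in the table follows the pattern "comparison and threshold, for a period, N consecutive times". As a rough illustration of how that notation lines up with standard CloudWatch alarm settings (Period in seconds, EvaluationPeriods, Threshold, ComparisonOperator), the sketch below translates the RDS "Average CPU utilization > 75% for 15 mins, 2 consecutive times" row. The helper name `alarm_parameters` is hypothetical, and the alarm definitions that AMS actually deploys may differ in detail.

```python
# Maps the comparison symbols used in the table to CloudWatch
# ComparisonOperator values.
COMPARISON_OPERATORS = {
    ">": "GreaterThanThreshold",
    ">=": "GreaterThanOrEqualToThreshold",
    "<": "LessThanThreshold",
    "<=": "LessThanOrEqualToThreshold",
}


def alarm_parameters(namespace, metric_name, statistic, comparison,
                     threshold, period_minutes, consecutive_periods):
    """Translate one table row into CloudWatch-style alarm settings.

    Hypothetical helper for illustration only; it builds a plain dict
    and does not call any AWS API.
    """
    return {
        "Namespace": namespace,
        "MetricName": metric_name,
        "Statistic": statistic,
        "ComparisonOperator": COMPARISON_OPERATORS[comparison],
        "Threshold": float(threshold),
        "Period": period_minutes * 60,             # CloudWatch periods are in seconds
        "EvaluationPeriods": consecutive_periods,  # "N consecutive times"
    }


# RDS row: Average CPU utilization > 75% for 15 mins, 2 consecutive times.
rds_cpu = alarm_parameters("AWS/RDS", "CPUUtilization", "Average",
                           ">", 75, 15, 2)
print(rds_cpu)
```

A dictionary like this, together with an alarm name and dimensions, has the shape that the CloudWatch `PutMetricAlarm` API expects. In AMS Accelerate, however, baseline alarms are managed through the alarm configuration referenced above, so treat this only as a reading aid for the table.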

Amazon CloudWatch provides extended retention of metrics. For more information, see CloudWatch Limits.

Note

AMS Accelerate calibrates its baseline monitoring on a periodic basis. New accounts are always onboarded with the latest baseline monitoring and the table describes the baseline monitoring for an account that is newly onboarded. AMS Accelerate updates the baseline monitoring in existing accounts on a periodic basis and you may experience a delay before the updates are in place.

Alerts from baseline monitoring

Resource

Alert name and trigger condition

Notes

For starred (*) alerts, AMS proactively assesses impact and remediates when possible; if remediation is not possible, AMS creates an incident. Where automation fails to remediate the issue, AMS informs you of the incident case and an AMS engineer is engaged. In addition, these alerts can be sent directly to your email (if you have opted in to the Direct-Customer-Alerts SNS topic).

ALB instance

HTTPCode_Target_5XX_Count

sum > 0% for 1 min, 5 consecutive times.

CloudWatch alarm on excess number of HTTP 5XX response codes generated by the targets.

RejectedConnectionCount

sum > 0% for 1 min, 5 consecutive times.

CloudWatch alarm on the number of connections that were rejected because the load balancer reached its maximum number of connections.

ALB target

TargetConnectionErrorCount

sum > 0% for 1 min, 5 consecutive times.

CloudWatch alarm on the number of connections that were not successfully established between the load balancer and the registered instances.

Aurora

Average CPU utilization

> 90% for 20 mins, 5 consecutive times.

CloudWatch Alarm.

AWS Site-to-Site VPN

VPNTunnelDown

TunnelState <= 0 for 1 min, 20 consecutive times.

TunnelState is 0 when both tunnels are down, 0.5 when one tunnel is up, and 1.0 when both tunnels are up.

EC2 instance - all OSs

CPUUtilization*

> 95% for 5 mins, 6 consecutive times.

CloudWatch alarm. High CPU utilization is an indicator of a change in application state such as deadlocks, infinite loops, malicious attacks, and other anomalies.

This is a Direct-Customer-Alerts alarm.

StatusCheckFailed

> 0% for 5 minutes, 3 consecutive times.

EC2 instance - Linux

Minimum mem_used_percent

>= 95% for 5 minutes, 6 consecutive times.

Average swap_used_percent

>= 95% for 5 minutes, 6 consecutive times.

Maximum disk_used_percent

>= 95% for 5 minutes, 6 consecutive times.

EC2 instance - Windows

Minimum Memory % Committed Bytes in Use

>= 95% for 5 minutes, 6 consecutive times.

Maximum LogicalDisk % Free Space

<= 5% for 5 minutes, 6 consecutive times.

OpenSearch cluster

ClusterStatus

red maximum is >= 1 for 1 minute, 1 consecutive time.

CloudWatch alarm. At least one primary shard and its replicas are not allocated to a node. To learn more, see Red Cluster Status.

OpenSearch domain

KMSKeyError

>= 1 for 1 minute, 1 consecutive time.

CloudWatch alarm. The KMS encryption key that is used to encrypt data at rest in your domain is disabled. Re-enable it to restore normal operations. To learn more, see Encryption of Data at Rest for Amazon OpenSearch Service.

KMSKeyInaccessible

>= 1 for 1 minute, 1 consecutive time.

ClusterStatus

yellow maximum is >= 1 for 1 minute, 1 consecutive time.

At least one replica shard is not allocated to a node. To learn more, see Yellow Cluster Status.

FreeStorageSpace

minimum is <= 20480 for 1 minute, 1 consecutive time.

A node in your cluster is down to 20 GiB of free storage space. To learn more, see Lack of Available Storage Space.

ClusterIndexWritesBlocked

>= 1 for 5 minutes, 1 consecutive time.

The cluster is blocking write requests. To learn more, see ClusterBlockException.

Nodes

minimum < x for 1 day, 1 consecutive time.

x is the number of nodes in your cluster. This alarm indicates that at least one node in your cluster has been unreachable for one day. To learn more, see Failed Cluster Nodes.

CPUUtilization

average >= 80% for 15 minutes, 3 consecutive times.

100% CPU utilization isn't uncommon, but sustained high averages are problematic. Consider right-sizing existing instance types or adding instances.

JVMMemoryPressure

maximum >= 80% for 5 minutes, 3 consecutive times.

The cluster could encounter out-of-memory errors if usage increases. Consider scaling vertically. Amazon OpenSearch Service uses half of an instance's RAM for the Java heap, up to a heap size of 32 GiB. You can scale instances vertically up to 64 GiB of RAM, at which point you can scale horizontally by adding instances.

MasterCPUUtilization

average >= 50% for 15 minutes, 3 consecutive times.

Consider using larger instance types for your dedicated master nodes. Because of their role in cluster stability and blue/green deployments, dedicated master nodes should have lower average CPU usage than data nodes.

MasterJVMMemoryPressure

maximum >= 80% for 15 minutes, 1 consecutive time.

OpenSearch instance

AutomatedSnapshotFailure

maximum is >= 1 for 1 minute, 1 consecutive time.

CloudWatch alarm. An automated snapshot failed. This failure is often the result of a red cluster health status. To learn more, see Red Cluster Status.

ELB instance

SpilloverCountBackendConnectionErrors

> 1 for 1 minute, 15 consecutive times.

CloudWatch alarm if an excess number of requests were rejected because the surge queue is full.

SurgeQueueLength

> 100 for 1 minute, 15 consecutive times.

CloudWatch alarm if an excess number of requests are pending routing.

GuardDuty Service

Not applicable; all findings (threat purposes) are monitored. Each finding corresponds to an alert.

Changes in the GuardDuty findings. These changes include newly generated findings or subsequent occurrences of existing findings.

For a list of supported GuardDuty finding types, see GuardDuty Active Finding Types.

Health

AWS Personal Health Dashboard

Notifications sent when there are changes in the status of AWS Personal Health Dashboard (AWS Health) events. Service event. Example: Scheduled EC2 instance store retirement.

Macie

Newly generated alerts and updates to existing alerts.

Macie finds any changes in the findings. These changes include newly generated findings or subsequent occurrences of existing findings.

Amazon Macie alert. For a list of supported Amazon Macie alert types, see Analyzing Amazon Macie findings. Note that Macie is not enabled for all accounts.

NATGateways

PacketsDropCount: Alarm if PacketsDropCount is > 0 over a 15-minute period.

A value greater than zero may indicate an ongoing transient issue with the NAT gateway.

ErrorPortAllocation: Alarm if the NAT gateway could not allocate a port over a 15-minute evaluation period.

The number of times the NAT gateway could not allocate a source port. A value greater than zero indicates that too many concurrent connections are open.

RDS

Average CPU utilization

> 75% for 15 mins, 2 consecutive times.

CloudWatch alarms.

Sum of DiskQueueDepth

> 75% for 1 min, 2 consecutive times.

Average FreeStorageSpace

< 1,073,741,824 bytes for 5 mins, 2 consecutive times.

Average ReadLatency

>= 1.001 seconds for 5 mins, 2 consecutive times.

Average WriteLatency

>= 1.005 seconds for 5 mins, 2 consecutive times.

Low Storage alert

Triggers when the allocated storage for the DB instance has been exhausted.

RDS-EVENT-0007, see details at Using Amazon RDS event notification.

DB instance fail

The DB instance has failed due to an incompatible configuration or an underlying storage issue. Begin a point-in-time-restore for the DB instance.

RDS-EVENT-0031, see details at Amazon RDS Event Categories and Event Messages.

RDS-0034 failover not attempted

RDS is not attempting a requested failover because a failover recently occurred on the DB instance.

RDS-EVENT-0034, see details at Amazon RDS Event Categories and Event Messages.

RDS-0035 DB instance invalid parameters

For example, MySQL could not start because a memory-related parameter is set too high for this instance class, so your action would be to modify the memory parameter and reboot the DB instance.

RDS-EVENT-0035, see details at Amazon RDS Event Categories and Event Messages.

Invalid subnet IDs DB instance

The DB instance is in an incompatible network. Some of the specified subnet IDs are invalid or do not exist.

Service event. RDS-EVENT-0036, see details at Amazon RDS Event Categories and Event Messages.

RDS-0045 DB instance read replica error

An error has occurred in the read replication process. For more information, see the event message. For information on troubleshooting Read Replica errors, see Troubleshooting a MySQL Read Replica Problem.

RDS-EVENT-0045, see details at Amazon RDS Event Categories and Event Messages.

RDS-0057 DB instance read replication ended

Replication on the Read Replica was ended.

Service event. RDS-EVENT-0057, see details at Amazon RDS Event Categories and Event Messages.

RDS-0058 Error creating Statspack user account

Error while creating Statspack user account PERFSTAT. Drop the account before adding the Statspack option.

Service event. RDS-EVENT-0058, see details at Amazon RDS Event Categories and Event Messages.

DB instance partial failover recovery complete

The instance has recovered from a partial failover.

Service event. RDS-EVENT-0065, see details at Amazon RDS Event Categories and Event Messages.

DB instance recovery start

The SQL Server DB instance is re-establishing its mirror. Performance will be degraded until the mirror is reestablished. A database was found with non-FULL recovery model. The recovery model was changed back to FULL and mirroring recovery was started. (<dbname>: <recovery model found>[,…])

Service event. RDS-EVENT-0066, see details at Amazon RDS Event Categories and Event Messages.

DB instance without enhanced monitoring

Enhanced Monitoring can't be enabled without the enhanced monitoring IAM role. For information about creating the enhanced monitoring IAM role, see To create an IAM role for Amazon RDS Enhanced Monitoring.

Service event. RDS-EVENT-0079, see details at Amazon RDS Event Categories and Event Messages.

DB instance enhanced monitoring disabled

Enhanced Monitoring was disabled due to an error making the configuration change. It's likely that the enhanced monitoring IAM role is configured incorrectly. For information about creating the enhanced monitoring IAM role, see To create an IAM role for Amazon RDS Enhanced Monitoring.

Service event. RDS-EVENT-0080, see details at Amazon RDS Event Categories and Event Messages.

Invalid permissions recovery S3 bucket

The IAM role that you use to access your Amazon S3 bucket for SQL Server native backup and restore is configured incorrectly. For more information, see Setting Up for Native Backup and Restore.

Service event. RDS-EVENT-0081, see details at Amazon RDS Event Categories and Event Messages.

Low storage alert when the DB instance has consumed more than 90% of its allocated storage.

Service event. RDS-EVENT-0089, see details at Amazon RDS Event Categories and Event Messages.

Notification service when scaling failed for the Aurora Serverless DB cluster.

Service event. RDS-EVENT-0143, see details at Amazon RDS Event Categories and Event Messages.

Redshift cluster

The health of the cluster

<= 0 for 5 min, 2 consecutive times

For more information, see Monitoring Amazon Redshift using CloudWatch metrics.

The maintenance mode of the cluster

>= 1 for 5 min, 1 consecutive time

The average amount of time taken for disk read

>= 1 for 5 min, 1 consecutive time

The average amount of time taken for disk write

>= 1 for 5 min, 1 consecutive time

For information on remediation efforts, see AMS automatic remediation of alerts.
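To make the "N consecutive times" wording in the trigger conditions concrete, here is a minimal sketch of the evaluation logic, assuming one datapoint per period and no missing data. The function name `is_in_alarm` is hypothetical, and CloudWatch's real evaluation (for example, its handling of missing datapoints) is more involved than this.

```python
import operator

# Comparison symbols as they appear in the trigger conditions.
_OPS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}


def is_in_alarm(datapoints, comparison, threshold, consecutive):
    """Return True if the most recent `consecutive` datapoints all breach.

    `datapoints` holds one aggregated value per period, oldest first.
    Simplified illustration: assumes no gaps in the data.
    """
    if len(datapoints) < consecutive:
        return False
    breaches = _OPS[comparison]
    return all(breaches(value, threshold) for value in datapoints[-consecutive:])


# EC2 CPUUtilization row: > 95% for 5 mins, 6 consecutive times.
cpu = [97.0, 96.5, 99.1, 98.0, 97.2, 96.1]
print(is_in_alarm(cpu, ">", 95, 6))                # True: all six periods breach
print(is_in_alarm(cpu[:-1] + [80.0], ">", 95, 6))  # False: the latest period recovered
```

Under this reading, a single period back below the threshold resets the count, which is why a short spike does not raise an alert.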