AMS automatic remediation of alerts - AMS Accelerate User Guide

AMS automatic remediation of alerts

Some alerts are automatically remediated by AWS Managed Services (AMS). This section describes how this remediation works and the conditions that must be met for the remediation to take place.

Alert name: Status Check Failed
Description: This alarm indicates that the instance is running on degraded hardware or has entered a fault state.
Remediation: The remediation first validates instance accessibility. If it confirms that accessibility is impacted, it stops the instance and starts it again so that the instance is migrated to new underlying hardware.

Alert name: AMSLinuxDiskUsage
Description: This alarm indicates that a mount point of your Linux EC2 instance is filling up.
Remediation: The remediation first deletes temporary files. If this does not free up enough space, it extends the volume to prevent the downtime that a full volume would cause.

Alert name: AMSWindowsDiskUsage
Description: This alarm indicates that a drive of your Windows EC2 instance is filling up.
Remediation: The remediation first deletes temporary files. If this does not free up enough space, it extends the volume to prevent the downtime that a full volume would cause.

Alert name: RDS-EVENT-0089
Description: This alarm indicates that the DB instance has consumed more than 90% of its allocated storage.
Remediation: The remediation first validates that the DB instance is modifiable and that its status is available or storage-full. It then attempts to increase the allocated storage, IOPS, and storage throughput through a CloudFormation change set. If stack drift is already detected, it falls back to the Amazon RDS API to prevent downtime.

Alert name: RDS-EVENT-0007
Description: This alarm indicates that the allocated storage for the DB instance has been exhausted.
Remediation: The remediation first validates that the DB instance is modifiable and that its status is available or storage-full. It then attempts to increase the allocated storage, IOPS, and storage throughput through a CloudFormation change set. If stack drift is already detected, it falls back to the Amazon RDS API to prevent downtime.

EC2 status check failure remediation automation

These are some notes about how AWS Managed Services (AMS) auto-remediation works with EC2 status check failure issues.

  • When your EC2 instance becomes unreachable, it must be stopped and started again to recover it, so that it's migrated to new hardware.

  • The automation can't recover your instance if the root of the problem is within the OS, for example, missing devices in fstab or kernel corruption.

  • If your instance belongs to an Auto Scaling group, the automation takes no action. The Auto Scaling group replaces the instance.

  • The remediation doesn't take action if EC2 Auto Recovery is enabled for this instance.
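
The conditions above can be read as a small decision function. The following is a minimal sketch of that reading, not the AMS implementation: the `InstanceState` class, its field names, the returned strings, and the order of the checks are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class InstanceState:
    """Hypothetical snapshot of the facts the notes above refer to."""
    reachable: bool
    in_auto_scaling_group: bool
    auto_recovery_enabled: bool


def status_check_action(state: InstanceState) -> str:
    """Return the action suggested by the notes above for a failed status check."""
    if state.in_auto_scaling_group:
        # The Auto Scaling group is expected to replace the instance.
        return "skip: instance belongs to an Auto Scaling group"
    if state.auto_recovery_enabled:
        return "skip: EC2 Auto Recovery is enabled"
    if state.reachable:
        # Accessibility is validated first; no impact means no stop and start.
        return "skip: instance accessibility is not impacted"
    # Stopping and starting lets the instance move to new underlying hardware.
    return "stop_start"
```

For example, `status_check_action(InstanceState(reachable=False, in_auto_scaling_group=False, auto_recovery_enabled=False))` returns `"stop_start"`. A stop and start in this sketch would still not help when the fault is inside the OS.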

EC2 volume usage remediation automation

These are some notes about how AWS Managed Services (AMS) auto-remediation works with EC2 volume usage issues.

  • Before extending a volume, the automation performs cleanup tasks on the instance to try to free up space (Windows: Disk Cleaner; Linux: Logrotate and AWS Systems Manager Agent (SSM Agent) log removal).

    Note

    The cleanup tasks are not run on EC2 "T" family instances due to their reliance on CPU credits for continued functionality.

  • On Linux, the automation only supports extending file systems of type EXT2, EXT3, EXT4 and XFS.

  • On Windows, the automation only supports New Technology File System (NTFS) and Resilient File System (ReFS).

  • The automation doesn't extend volumes that are part of Logical Volume Manager (LVM) or a RAID array.

  • The automation does not extend instance store volumes.

  • The automation does not take action if the affected volume is already bigger than 2 TiB.

  • The capacity expansion portion of the automation runs at most once every 6 hours, and each volume is limited to 3 expansions per week and 5 expansions over its lifetime.

When these rules prevent the automation from taking action, AMS reaches out to you through an outbound service request to determine the next actions to take.
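
The rules above can be summarized as a single eligibility check. The following is a minimal sketch under stated assumptions, not the AMS implementation: the `Volume` class, its fields, and the reason strings are hypothetical, and the weekly limit is interpreted here as a rolling 7-day window.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional

TIB = 1024 ** 4

# File systems the notes above list as supported, per platform.
SUPPORTED_FS = {
    "linux": {"ext2", "ext3", "ext4", "xfs"},
    "windows": {"ntfs", "refs"},
}


@dataclass
class Volume:
    """Hypothetical description of a volume and its expansion history."""
    platform: str
    filesystem: str
    size_bytes: int
    is_lvm_or_raid: bool = False
    is_instance_store: bool = False
    expansions: List[datetime] = field(default_factory=list)


def expansion_blocker(vol: Volume, now: datetime) -> Optional[str]:
    """Return why the volume would not be extended, or None if no rule blocks it."""
    if vol.filesystem.lower() not in SUPPORTED_FS.get(vol.platform.lower(), set()):
        return "unsupported file system"
    if vol.is_lvm_or_raid:
        return "volume is part of LVM or a RAID array"
    if vol.is_instance_store:
        return "instance store volume"
    if vol.size_bytes > 2 * TIB:
        return "volume is larger than 2 TiB"
    if len(vol.expansions) >= 5:
        return "lifetime expansion limit reached"
    # Assumption: "weekly" means the 7 days preceding `now`.
    in_last_week = [t for t in vol.expansions if now - t < timedelta(days=7)]
    if len(in_last_week) >= 3:
        return "weekly expansion limit reached"
    if any(now - t < timedelta(hours=6) for t in vol.expansions):
        return "expanded within the last 6 hours"
    return None
```

A non-`None` result corresponds to the case where the automation takes no action and AMS follows up with you instead.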

Amazon RDS low storage event remediation automation

These are some notes about how AWS Managed Services (AMS) auto-remediation works with Amazon RDS low storage event issues.

  • Before trying to extend the Amazon RDS instance storage, the automation performs several checks to ensure the Amazon RDS instance is in a modifiable and available, or storage-full, state.

  • Where CloudFormation stack drift is detected, remediation occurs through the Amazon RDS API instead of a CloudFormation change set.

  • The remediation action does not run in the following scenarios:

    • The Amazon RDS instance status is not "available" or "storage-full".

    • The Amazon RDS instance storage is not currently modifiable (such as when the storage has been modified in the last 6 hours).

    • The Amazon RDS instance has storage autoscaling enabled.

    • The Amazon RDS instance is not a resource within a CloudFormation stack.

  • Remediation is limited to 1 expansion per 6 hours and no more than 3 expansions within a rolling 14-day period.

  • When these conditions prevent the remediation from running, AMS reaches out to you through an outbound incident to determine the next actions.
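
The checks and limits above can be sketched as one planning function. This is an illustration under stated assumptions rather than the AMS implementation: the `DbInstance` class, its fields, and the returned strings are hypothetical, and the order in which the checks run is a guess.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List


@dataclass
class DbInstance:
    """Hypothetical view of an Amazon RDS instance and its expansion history."""
    status: str
    storage_autoscaling_enabled: bool
    in_cloudformation_stack: bool
    stack_drift_detected: bool
    expansions: List[datetime] = field(default_factory=list)


def rds_remediation_plan(db: DbInstance, now: datetime) -> str:
    """Return how storage would be extended, or a 'skip: ...' reason."""
    if db.status not in {"available", "storage-full"}:
        return "skip: status is not available or storage-full"
    if db.storage_autoscaling_enabled:
        return "skip: storage autoscaling is enabled"
    if not db.in_cloudformation_stack:
        return "skip: instance is not part of a CloudFormation stack"
    # Rolling 14-day window of earlier expansions.
    recent = [t for t in db.expansions if now - t < timedelta(days=14)]
    if any(now - t < timedelta(hours=6) for t in recent):
        return "skip: storage was modified in the last 6 hours"
    if len(recent) >= 3:
        return "skip: 3 expansions already made in the last 14 days"
    # With stack drift, fall back to the RDS API; otherwise use a change set.
    return "rds_api" if db.stack_drift_detected else "cloudformation_change_set"
```

In this sketch, any result that starts with `skip:` stands for the case where AMS contacts you instead of extending the storage.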