AMS automatic remediation of alerts - AMS Accelerate Operations Plan

Some alerts are automatically remediated by AMS. This section describes how this remediation works and the conditions that must be met for the remediation to take place.

Alert name: Status Check Failed
Description: This alarm indicates that the instance is running on degraded hardware or has entered a fault state.
Remediation: The remediation first validates instance accessibility. If accessibility is confirmed to be impacted, it stops the instance and starts it again so that it can be migrated to new underlying hardware.

Alert name: Root Volume Usage
Description: This alarm indicates that the root volume (C: drive in Windows) of your EC2 instance is filling up.
Remediation: The remediation first deletes temporary files. If this does not free up the required space, it extends the volume to prevent the downtime that would occur if the volume filled up.

Alert name: Non-Root Volume Usage
Description: This alarm indicates that an attached volume (not root or C:) is filling up.
Remediation: The remediation first deletes temporary files. If this does not free up the required space, it extends the volume to prevent the downtime that would occur if the volume filled up.

Alert name: RDS-EVENT-0089
Description: This alarm indicates that the DB instance has consumed more than 90% of its allocated storage.
Remediation: The remediation first validates that the DB is in a modifiable and available, or storage-full, state. It then attempts to increase the allocated storage through a CloudFormation change set. If stack drift is already detected, it falls back to the Amazon RDS API to prevent downtime.

Alert name: RDS-EVENT-0007
Description: This alarm indicates that the allocated storage for the DB instance has been exhausted.
Remediation: The remediation first validates that the DB is in a modifiable and available, or storage-full, state. It then attempts to increase the allocated storage through a CloudFormation change set. If stack drift is already detected, it falls back to the Amazon RDS API to prevent downtime.

EC2 status check failure remediation automation

The following notes describe how AMS auto-remediation handles EC2 status check failures.

  • If your EC2 instance has become unreachable, it must be stopped and started again so that it is migrated to new hardware.

  • The automation is not able to recover your instance if the root of the problem is within the OS; for example, missing devices in fstab, kernel corruption, and so on.

  • If your instance belongs to an Auto Scaling group, the automation takes no action. The Auto Scaling group replaces the instance.

  • The remediation doesn't take action if EC2 Auto Recovery is enabled for this instance.
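
The rules above can be read as a small decision function. The following is a minimal sketch, not the AMS implementation: the `InstanceState` class, its field names, the returned action strings, and the order of the checks are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class InstanceState:
    """Hypothetical snapshot of the facts the automation checks."""
    reachable: bool              # outcome of the accessibility validation
    in_auto_scaling_group: bool  # instance is a member of an Auto Scaling group
    auto_recovery_enabled: bool  # EC2 Auto Recovery is enabled for the instance


def plan_status_check_remediation(state: InstanceState) -> str:
    """Return the action implied by the notes above (labels are illustrative)."""
    if state.in_auto_scaling_group:
        # The Auto Scaling group replaces the instance, so nothing to do.
        return "no_action_asg_replaces_instance"
    if state.auto_recovery_enabled:
        return "no_action_auto_recovery_enabled"
    if state.reachable:
        # Accessibility is not impacted, so a stop/start is not warranted.
        return "no_action_instance_reachable"
    # Stop and start so the instance is migrated to new hardware.
    return "stop_and_start_instance"
```

A stop and start cannot fix problems inside the OS (for example, a broken fstab), so a real workflow would still need a follow-up check after the restart.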

EC2 volume usage remediation automation

The following notes describe how AMS auto-remediation handles EC2 volume usage issues.

  • Before trying to extend the volume, the automation performs cleanup tasks on the instance to try to free up space (Windows: Disk Cleaner; Linux: logrotate and AWS Systems Manager Agent (SSM Agent) log removal).

  • This cleanup step is not run on EC2 "T" family instances because they rely on CPU credits for continued functionality.

  • The automation doesn't take action if the affected volume is already bigger than 2 TiB.

  • The automation doesn't extend volumes that are part of Logical Volume Manager (LVM) or RAID.

  • On Linux, the automation only supports extending file systems of type EXT2, EXT3, EXT4 and XFS.

  • On Windows, the automation only supports New Technology File System (NTFS) and Resilient File System (ReFS).

  • The automation doesn't extend instance store-backed volumes.

  • The capacity expansion portion of the automation runs at most once every 6 hours, with a lifetime limit of 3 expansions per volume.

When any of these conditions prevents the automation from resolving an EC2 volume usage issue, AMS reaches out to you through an outbound service request to determine the next actions to take.
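
The eligibility conditions in this section can be sketched as a pre-check that runs before any expansion. This is an illustrative sketch only: the `Volume` class, its fields, the helper names, and the reason strings are hypothetical, and detecting the "T" family by a `t` followed by a digit in the instance type is an assumption about naming, not something AMS documents.

```python
import re
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional

TIB = 1024 ** 4
MAX_VOLUME_BYTES = 2 * TIB            # volumes bigger than 2 TiB are skipped
MIN_INTERVAL = timedelta(hours=6)     # at most one expansion every 6 hours
LIFETIME_LIMIT = 3                    # at most 3 expansions per volume
SUPPORTED_FS = {
    "linux": {"ext2", "ext3", "ext4", "xfs"},
    "windows": {"ntfs", "refs"},
}


@dataclass
class Volume:
    """Hypothetical description of the affected volume."""
    size_bytes: int
    platform: str                     # "linux" or "windows"
    filesystem: str
    uses_lvm_or_raid: bool = False
    instance_store: bool = False
    past_expansions: List[datetime] = field(default_factory=list)


def cleanup_allowed(instance_type: str) -> bool:
    """Cleanup is skipped on "T" family instances (assumed: t2, t3, t3a, t4g, ...)."""
    family = instance_type.lower().split(".")[0]
    return re.match(r"t\d", family) is None


def expansion_blocker(vol: Volume, now: datetime) -> Optional[str]:
    """Return the reason expansion is not attempted, or None if it may proceed."""
    if vol.instance_store:
        return "instance store-backed volume"
    if vol.uses_lvm_or_raid:
        return "volume is part of LVM or RAID"
    if vol.size_bytes > MAX_VOLUME_BYTES:
        return "volume is larger than 2 TiB"
    if vol.filesystem.lower() not in SUPPORTED_FS.get(vol.platform.lower(), set()):
        return "unsupported file system"
    if len(vol.past_expansions) >= LIFETIME_LIMIT:
        return "lifetime expansion limit reached"
    if vol.past_expansions and now - max(vol.past_expansions) < MIN_INTERVAL:
        return "expanded within the last 6 hours"
    return None
```

Returning the blocking reason rather than a plain boolean makes it easy to include the cause in the outbound service request.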

Amazon RDS low storage event remediation automation

The following notes describe how AMS auto-remediation handles Amazon RDS low storage events.

  • Before trying to extend the Amazon RDS instance storage, the automation performs several checks to ensure the Amazon RDS instance is in a modifiable and available, or storage-full, state.

  • Where CloudFormation stack drift is detected, remediation occurs through the Amazon RDS API.

  • The remediation action does not run in the following scenarios:

    • The Amazon RDS instance status is not "available" or "storage-full".

    • The Amazon RDS instance storage is not currently modifiable (such as when the storage has been modified in the last 6 hours).

    • The Amazon RDS instance has storage autoscaling enabled.

    • The Amazon RDS instance is not a resource within a CloudFormation stack.

  • Remediation is limited to 1 expansion per 6 hours and no more than 3 expansions within a rolling 14-day period.

  • When any of the above scenarios applies, AMS reaches out to you with an outbound incident to determine the next actions.
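
The checks above, including the rolling 14-day window, can be sketched as follows. This is a hedged illustration rather than the AMS workflow: the function name, its parameters, and the returned strings are hypothetical, and the expansion history is assumed to be available as a list of timestamps.

```python
from datetime import datetime, timedelta
from typing import List

ALLOWED_STATUSES = {"available", "storage-full"}
MIN_INTERVAL = timedelta(hours=6)   # 1 expansion per 6 hours
WINDOW = timedelta(days=14)         # rolling 14-day period
MAX_PER_WINDOW = 3                  # no more than 3 expansions per window


def plan_rds_storage_remediation(
    status: str,
    storage_modifiable: bool,
    storage_autoscaling: bool,
    in_cfn_stack: bool,
    stack_drifted: bool,
    past_expansions: List[datetime],
    now: datetime,
) -> str:
    """Return "skip: <reason>" or the path used to increase allocated storage."""
    if status not in ALLOWED_STATUSES:
        return "skip: instance status is not available or storage-full"
    if not storage_modifiable:
        return "skip: storage is not currently modifiable"
    if storage_autoscaling:
        return "skip: storage autoscaling is enabled"
    if not in_cfn_stack:
        return "skip: instance is not part of a CloudFormation stack"

    # Only expansions inside the rolling window count toward the limit.
    recent = [t for t in past_expansions if now - t <= WINDOW]
    if len(recent) >= MAX_PER_WINDOW:
        return "skip: 3 expansions already made in the last 14 days"
    if recent and now - max(recent) < MIN_INTERVAL:
        return "skip: expanded within the last 6 hours"

    # Drifted stacks are remediated through the RDS API instead of a change set.
    return "rds_api" if stack_drifted else "cloudformation_change_set"
```

Any result that starts with `skip:` corresponds to a case where AMS would contact you with an outbound incident instead of changing the instance.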