Amazon ECS managed instance draining - Amazon Elastic Container Service

Amazon ECS managed instance draining

Managed instance draining facilitates graceful termination of Amazon EC2 instances. This allows your workloads to stop safely and be rescheduled to non-terminating instances. You can perform infrastructure maintenance and updates without worrying about disruption to workloads. Managed instance draining simplifies infrastructure management workflows that require replacing Amazon EC2 instances, while preserving the resilience and availability of your applications.

Amazon ECS managed instance draining works with Auto Scaling group instance replacements. By using instance refresh and maximum instance lifetime, you can keep your capacity compliant with the latest OS and security mandates.

Managed instance draining can only be used with Amazon ECS capacity providers and is enabled when creating or updating your Auto Scaling group capacity providers using the Amazon ECS console, AWS CLI, or SDK.

The following events are covered by Amazon ECS managed instance draining.

  • Auto Scaling group instance refresh ‐ Use instance refresh to perform rolling replacement of your Amazon EC2 instances in your Auto Scaling group instead of manually doing it in batches. This is especially useful when you need to replace a large number of instances. An instance refresh is initiated through the Amazon EC2 console or the StartInstanceRefresh API. Make sure you select Replace for Scale-in protection when calling StartInstanceRefresh if you're using managed termination protection.

  • Maximum instance lifetime ‐ You can define a maximum lifetime after which Auto Scaling group instances are replaced. This is helpful for scheduling replacement instances based on internal security policies or compliance.

  • Auto Scaling group scale-in ‐ Based on scaling policies and scheduled scaling actions, an Auto Scaling group supports automatic scaling of instances. By using an Auto Scaling group with Amazon ECS capacity providers, which facilitate cluster auto scaling, you can scale in Auto Scaling group instances when no tasks are running on them.

  • Auto Scaling group health checks ‐ Auto Scaling group supports many health checks to manage termination of unhealthy instances.

  • AWS CloudFormation stack updates ‐ You can add an UpdatePolicy attribute to your AWS CloudFormation stack to perform rolling updates when the group changes.

  • Spot capacity rebalancing ‐ Auto Scaling group tries to proactively replace Spot instances that have a higher risk of interruption based on Amazon EC2 capacity rebalance notice. The Auto Scaling group terminates the old instance when the replacement is launched and healthy. Amazon ECS-managed instance draining drains the Spot instance that's terminating just like it drains a non-Spot instance.

  • Spot interruption ‐ Spot instances are terminated with a two-minute notice. Amazon ECS-managed instance draining puts the instance in the draining state in response.
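
As a rough illustration of the first two events in the list above, the following sketch wraps two AWS CLI calls in shell functions. The function names and the MinHealthyPercentage value are illustrative assumptions. To the best of our understanding, the StartInstanceRefresh API expresses the console's Replace choice as the Refresh value of the ScaleInProtectedInstances preference; verify this against the current API reference before relying on it.

```shell
# Start a rolling replacement of the instances in an Auto Scaling group.
# "Refresh" asks the instance refresh to replace instances that are
# protected from scale-in (assumed to match the console's "Replace" option).
start_instance_refresh() {
  asg_name="$1"
  aws autoscaling start-instance-refresh \
    --auto-scaling-group-name "$asg_name" \
    --preferences '{"MinHealthyPercentage": 90, "ScaleInProtectedInstances": "Refresh"}'
}

# Cap how long any instance may run before it is replaced (in seconds).
set_max_instance_lifetime() {
  asg_name="$1"
  lifetime_seconds="$2"   # Amazon EC2 Auto Scaling expects at least 86400 (one day)
  aws autoscaling update-auto-scaling-group \
    --auto-scaling-group-name "$asg_name" \
    --max-instance-lifetime "$lifetime_seconds"
}
```

For example, `start_instance_refresh my-asg` followed by `set_max_instance_lifetime my-asg 604800` would start a refresh and then limit instances to seven days, where my-asg is a placeholder group name.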

Amazon EC2 Auto Scaling lifecycle hooks with managed instance draining

Auto Scaling group lifecycle hooks enable customers to create solutions that are triggered by certain events in the instance lifecycle and perform a custom action when that event occurs. An Auto Scaling group allows for up to 50 hooks. Multiple termination hooks can exist and are executed in parallel, and the Auto Scaling group waits for all hooks to finish before terminating an instance.

In addition to the Amazon ECS-managed termination hook, you can configure your own termination lifecycle hooks. If you've already configured an Auto Scaling group termination lifecycle hook and also enabled Amazon ECS managed instance draining, both lifecycle hooks execute. The relative timings, however, are not guaranteed.

Lifecycle hooks have a default result setting that specifies the action to take when the timeout elapses or a failure occurs. We recommend using continue as the default result in your custom hook. This ensures that other hooks, particularly the Amazon ECS-managed hook, aren't impacted by any errors in your custom lifecycle hook. The alternative result of abandon causes all other hooks to be skipped and should be avoided.
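
The recommendation to use continue as the default result can be sketched with the AWS CLI as follows. The function names and the 300-second heartbeat timeout are illustrative assumptions; adjust them to your own hook.

```shell
# Add a custom termination hook whose default result is CONTINUE, so that a
# timeout or failure in this hook does not cause the other hooks to be skipped.
put_custom_termination_hook() {
  asg_name="$1"
  hook_name="$2"
  timeout_seconds="${3:-300}"   # hypothetical default heartbeat timeout
  aws autoscaling put-lifecycle-hook \
    --auto-scaling-group-name "$asg_name" \
    --lifecycle-hook-name "$hook_name" \
    --lifecycle-transition autoscaling:EC2_INSTANCE_TERMINATING \
    --heartbeat-timeout "$timeout_seconds" \
    --default-result CONTINUE
}

# List the hooks on the group, including the Amazon ECS-managed one.
list_lifecycle_hooks() {
  aws autoscaling describe-lifecycle-hooks --auto-scaling-group-name "$1"
}
```

Running `list_lifecycle_hooks my-asg` after creating the hook is one way to confirm that both your hook and the Amazon ECS-managed hook are present on the group.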

Tasks and managed instance draining

Amazon ECS managed instance draining uses the existing draining feature of container instances. Container instance draining stops replica tasks that belong to an Amazon ECS service and starts replacements for them. A standalone task, such as one started by RunTask, that is in the pending or running state is not affected. You must either wait for these tasks to complete or stop them manually. The container instance remains in the draining state until either all tasks are stopped or 48 hours have passed. Daemon tasks are the last to stop, after all replica tasks have stopped.
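
Because standalone tasks keep an instance in the draining state, it can help to find them. The following is a minimal sketch, not a definitive procedure. It assumes that service tasks (replica and daemon) carry a group of the form service:name while RunTask tasks default to family:name, and the function names are hypothetical.

```shell
# Print the ARNs of container instances currently in the DRAINING state.
list_draining_instances() {
  aws ecs list-container-instances \
    --cluster "$1" --status DRAINING \
    --query 'containerInstanceArns[]' --output text
}

# Print tasks on one container instance that do not belong to a service.
# Assumption: every task has a group, and service tasks use "service:<name>".
list_standalone_tasks() {
  cluster="$1"
  instance_arn="$2"
  task_arns=$(aws ecs list-tasks --cluster "$cluster" \
    --container-instance "$instance_arn" \
    --query 'taskArns[]' --output text)
  if [ -z "$task_arns" ] || [ "$task_arns" = "None" ]; then
    return 0   # nothing is running on this instance
  fi
  # $task_arns is left unquoted on purpose so each ARN becomes its own argument.
  aws ecs describe-tasks --cluster "$cluster" --tasks $task_arns \
    --query "tasks[?!starts_with(group, 'service:')].taskArn" --output text
}

# Stop one standalone task so that the instance can finish draining.
stop_standalone_task() {
  aws ecs stop-task --cluster "$1" --task "$2" \
    --reason "Stopped manually so the container instance can drain"
}
```

Stopping a task is disruptive, so review the output of `list_standalone_tasks` before passing any ARN to `stop_standalone_task`.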

Managed instance draining and managed termination protection

Because Amazon ECS managed instance draining facilitates graceful draining of Amazon EC2 instances, it keeps your applications from being disrupted by termination events. Managed instance draining facilitates graceful shutdown of Amazon ECS service tasks for instances being scaled in, even if managed termination protection is disabled.

The following table summarizes the behavior for different combinations of managed termination protection and managed draining.

| Managed termination protection | Managed draining | Outcome |
|---|---|---|
| Enabled | Enabled | Amazon ECS protects Amazon EC2 instances that are running tasks from being terminated by scale-in events. Any instances undergoing termination, such as those that don't have termination protection set, have received a Spot interruption, or are forced by instance refresh, are gracefully drained. |
| Disabled | Enabled | Amazon ECS doesn't protect Amazon EC2 instances running tasks from being scaled in. However, any instances that are being terminated are gracefully drained. |
| Enabled | Disabled | Amazon ECS protects Amazon EC2 instances that are running tasks from being terminated by scale-in events. However, instances can still be terminated by Spot interruption or forced instance refresh, or if they aren't running any tasks. Amazon ECS doesn't perform graceful draining for these instances, and launches replacement service tasks after they stop. |
| Disabled | Disabled | Amazon EC2 instances can be scaled in or terminated at any time, even if they are running Amazon ECS tasks. Amazon ECS launches replacement service tasks after they stop. |

Managed instance draining and Spot instance draining

With Amazon ECS Spot instance draining, you set the ECS_ENABLE_SPOT_INSTANCE_DRAINING environment variable on the Amazon ECS agent, which enables Amazon ECS to place an instance in the draining status in response to the two-minute Spot interruption notice. Amazon ECS managed instance draining facilitates graceful shutdown of Amazon EC2 instances that are terminating for many reasons, not just Spot interruption. For instance, you can use Amazon EC2 Auto Scaling capacity rebalancing to proactively replace Spot instances at elevated risk of interruption, and managed instance draining performs graceful shutdown of the Spot instances being replaced. When you use managed instance draining, you don't need to enable Spot instance draining separately, so setting ECS_ENABLE_SPOT_INSTANCE_DRAINING in the Auto Scaling group user data is redundant.
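
If you want to find leftover settings, the following small sketch checks whether a file still sets the flag. The function name is hypothetical, and the file you pass is an assumption: it could be a local copy of your launch template user data, or the agent configuration file on an instance.

```shell
# Succeed (exit status 0) if the given file still enables Spot instance
# draining on the Amazon ECS agent; fail (non-zero status) otherwise.
has_spot_draining_flag() {
  grep -q 'ECS_ENABLE_SPOT_INSTANCE_DRAINING=true' "$1"
}
```

A typical use would be `has_spot_draining_flag user-data.sh && echo "flag is redundant with managed instance draining"`, where user-data.sh is a placeholder file name.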

Amazon ECS managed instance draining troubleshooting

Amazon ECS managed instance draining events are published to Amazon EventBridge, and Amazon ECS creates an EventBridge managed rule in your account's default bus to support managed instance draining. You can route these events to other AWS services such as Lambda, Amazon SNS, and Amazon SQS for monitoring and troubleshooting.

  • Amazon EC2 Auto Scaling sends an event to EventBridge when a lifecycle hook is invoked.

  • Spot interruption notices are published to EventBridge.

  • Amazon ECS generates error messages that are retrievable on the Amazon ECS console and API if there are any failures when provisioning managed instance draining resources.

  • EventBridge has retry mechanisms built in as mitigations for temporary failures.
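
One possible way to observe the first two event sources in the list above is to add your own EventBridge rule next to the managed one. The sketch below is illustrative only: the function names and the target ID are assumptions, and the detail-type strings are the ones we understand Amazon EC2 Auto Scaling and Amazon EC2 to publish for termination lifecycle actions and Spot interruption warnings.

```shell
# Create a rule on the default event bus that matches termination lifecycle
# actions from Amazon EC2 Auto Scaling and Spot interruption warnings.
create_draining_monitor_rule() {
  rule_name="$1"
  aws events put-rule \
    --name "$rule_name" \
    --event-pattern '{
      "source": ["aws.autoscaling", "aws.ec2"],
      "detail-type": [
        "EC2 Instance-terminate Lifecycle Action",
        "EC2 Spot Instance Interruption Warning"
      ]
    }'
}

# Send matched events to a target such as an Amazon SNS topic or Amazon SQS queue.
add_rule_target() {
  rule_name="$1"
  target_arn="$2"
  aws events put-targets \
    --rule "$rule_name" \
    --targets "Id=draining-monitor,Arn=$target_arn"
}
```

The target also needs a resource policy that allows EventBridge to deliver to it, which this sketch does not set up.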

Using Amazon ECS managed instance draining

You enable managed instance draining when you create or update your Auto Scaling group capacity providers using the Amazon ECS console or the AWS CLI.

Note

Managed instance draining is enabled by default when you create a capacity provider.

The following are examples using the AWS CLI for creating a capacity provider with managed instance draining enabled and enabling managed instance draining for a cluster's existing capacity provider.

Create a capacity provider with managed instance draining enabled

To create a capacity provider with managed instance draining enabled, use the create-capacity-provider command.

    aws ecs create-capacity-provider \
        --name capacity-provider \
        --auto-scaling-group-provider '{
            "autoScalingGroupArn": "asg-arn",
            "managedScaling": {
                "status": "ENABLED",
                "targetCapacity": 100,
                "minimumScalingStepSize": 1,
                "maximumScalingStepSize": 1
            },
            "managedDraining": "ENABLED",
            "managedTerminationProtection": "ENABLED"
        }'

Response:

    {
        "capacityProvider": {
            "capacityProviderArn": "capacity-provider-arn",
            "name": "capacity-provider",
            "status": "ACTIVE",
            "autoScalingGroupProvider": {
                "autoScalingGroupArn": "asg-arn",
                "managedScaling": {
                    "status": "ENABLED",
                    "targetCapacity": 100,
                    "minimumScalingStepSize": 1,
                    "maximumScalingStepSize": 1
                },
                "managedTerminationProtection": "ENABLED",
                "managedDraining": "ENABLED"
            }
        }
    }

Enable managed instance draining for a cluster's existing capacity provider

To enable managed instance draining for a cluster's existing capacity provider, use the update-capacity-provider command. In the response, managedDraining still shows DISABLED and updateStatus shows UPDATE_IN_PROGRESS, because the update has not finished yet.

    aws ecs update-capacity-provider \
        --name cp-draining \
        --auto-scaling-group-provider '{
            "managedDraining": "ENABLED"
        }'

Response:

    {
        "capacityProvider": {
            "capacityProviderArn": "cp-draining-arn",
            "name": "cp-draining",
            "status": "ACTIVE",
            "autoScalingGroupProvider": {
                "autoScalingGroupArn": "asg-draining-arn",
                "managedScaling": {
                    "status": "ENABLED",
                    "targetCapacity": 100,
                    "minimumScalingStepSize": 1,
                    "maximumScalingStepSize": 1,
                    "instanceWarmupPeriod": 300
                },
                "managedTerminationProtection": "DISABLED",
                "managedDraining": "DISABLED"
            },
            "updateStatus": "UPDATE_IN_PROGRESS",
            "tags": []
        }
    }

Use the describe-clusters command and include ATTACHMENTS. The status of the managed instance draining attachment is PRECREATED, and the overall attachmentsStatus is UPDATING.

aws ecs describe-clusters --clusters cluster-name --include ATTACHMENTS

Response:

    {
        "clusters": [
            {
                ...
                "capacityProviders": [
                    "cp-draining"
                ],
                "defaultCapacityProviderStrategy": [],
                "attachments": [
                    {
                        "id": "a1b2c3d4-5678-90ab-cdef-EXAMPLE11111",
                        "type": "managed_draining",
                        "status": "PRECREATED",
                        "details": [
                            {
                                "name": "capacityProviderName",
                                "value": "cp-draining"
                            },
                            {
                                "name": "autoScalingLifecycleHookName",
                                "value": "ecs-managed-draining-termination-hook"
                            }
                        ]
                    },
                    ...
                ],
                "attachmentsStatus": "UPDATING"
            }
        ],
        "failures": []
    }
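
To read only the status of the managed draining attachment instead of scanning the full response, a --query filter can help. This is a sketch under the assumption that the attachment type is managed_draining, as in the sample response; the function name is hypothetical.

```shell
# Print the status of each managed_draining attachment on a cluster,
# for example PRECREATED while the update is still in progress.
managed_draining_attachment_status() {
  aws ecs describe-clusters \
    --clusters "$1" \
    --include ATTACHMENTS \
    --query "clusters[0].attachments[?type=='managed_draining'].status" \
    --output text
}
```

Call it as `managed_draining_attachment_status cluster-name`, replacing cluster-name with your cluster.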

When the update is finished, use describe-capacity-providers, and you see managedDraining is now ENABLED.

aws ecs describe-capacity-providers --capacity-providers cp-draining

Response:

    {
        "capacityProviders": [
            {
                "capacityProviderArn": "cp-draining-arn",
                "name": "cp-draining",
                "status": "ACTIVE",
                "autoScalingGroupProvider": {
                    "autoScalingGroupArn": "asg-draining-arn",
                    "managedScaling": {
                        "status": "ENABLED",
                        "targetCapacity": 100,
                        "minimumScalingStepSize": 1,
                        "maximumScalingStepSize": 1,
                        "instanceWarmupPeriod": 300
                    },
                    "managedTerminationProtection": "DISABLED",
                    "managedDraining": "ENABLED"
                },
                "updateStatus": "UPDATE_COMPLETE",
                "tags": []
            }
        ]
    }
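
Rather than re-running describe-capacity-providers by hand, you could poll updateStatus until the update settles. The following is a minimal sketch, not an official waiter: the function name, the attempt limit, and the delay are all assumptions.

```shell
# Poll a capacity provider until its update completes or fails.
# Arguments: capacity provider name, max attempts (default 30),
# delay between attempts in seconds (default 10).
wait_for_capacity_provider_update() {
  cp_name="$1"
  max_attempts="${2:-30}"
  delay_seconds="${3:-10}"
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    status=$(aws ecs describe-capacity-providers \
      --capacity-providers "$cp_name" \
      --query 'capacityProviders[0].updateStatus' --output text)
    case "$status" in
      UPDATE_COMPLETE) echo "$status"; return 0 ;;
      UPDATE_FAILED)   echo "$status" >&2; return 1 ;;
    esac
    attempt=$((attempt + 1))
    sleep "$delay_seconds"
  done
  echo "timed out waiting for $cp_name" >&2
  return 1
}
```

For example, `wait_for_capacity_provider_update cp-draining` returns successfully once updateStatus reports UPDATE_COMPLETE, after which managedDraining should show ENABLED.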