Best Practices - Speeding up task launch

There are several improvements that you can make to shorten the time that it takes Amazon ECS to launch your tasks.

Amazon ECS task launch workflow

Understanding how Amazon ECS provisions your tasks is helpful when reasoning about optimizations that speed up your task launches. When you launch Amazon ECS tasks (either standalone or through an Amazon ECS service), a task is created and initially put into the PROVISIONING state before it is successfully launched into the RUNNING state (for details, see Task lifecycle in the Amazon ECS Developer Guide). In the PROVISIONING state, neither the task nor its containers exist yet, because Amazon ECS first needs to find compute capacity for placing the task.

Amazon ECS selects the appropriate compute capacity for your task based on your launch type or capacity provider configuration. The launch types are AWS Fargate (Fargate), Amazon EC2, and the EXTERNAL launch type used with Amazon ECS Anywhere. Capacity providers and capacity provider strategies can be used with both the Fargate and Amazon EC2 launch types. With Fargate, you don’t have to think about provisioning, configuring, and scaling your cluster capacity; Fargate takes care of all infrastructure management for your tasks. For Amazon ECS on Amazon EC2, you can either manage your cluster capacity yourself by registering Amazon EC2 instances to your cluster, or you can use Amazon ECS Cluster Auto Scaling (CAS) to simplify your compute capacity management. CAS dynamically scales your cluster capacity so that you can focus on just running tasks. Amazon ECS determines where to place a task based on the requirements you specify in the task definition, such as CPU and memory, as well as your placement constraints and placement strategies. For more details on task placement, see Amazon ECS task placement.

After finding capacity for placing your task, Amazon ECS provisions the necessary attachments (for example, Elastic Network Interfaces (ENIs) for tasks that use the awsvpc network mode), and uses the Amazon ECS container agent to pull your container images and start your containers. After all of this completes and the relevant containers have launched, Amazon ECS moves the task into the RUNNING state.
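
You can observe this lifecycle directly. The following is a minimal sketch, using the AWS SDK for Python (Boto3), that launches a standalone Fargate task and polls it until the scheduler reports it as RUNNING. The cluster, task definition, subnet, and security group identifiers are placeholders, not values from this guide.

```python
import time

import boto3

ecs = boto3.client("ecs")

response = ecs.run_task(
    cluster="my-cluster",                # hypothetical cluster name
    taskDefinition="my-task:1",          # hypothetical task definition revision
    launchType="FARGATE",
    networkConfiguration={
        "awsvpcConfiguration": {
            "subnets": ["subnet-0123456789abcdef0"],         # placeholder
            "securityGroups": ["sg-0123456789abcdef0"],      # placeholder
            "assignPublicIp": "ENABLED",
        }
    },
)
task_arn = response["tasks"][0]["taskArn"]

# Poll the task until it reaches RUNNING (or stops because of an error).
while True:
    task = ecs.describe_tasks(cluster="my-cluster", tasks=[task_arn])["tasks"][0]
    print(task["lastStatus"])            # e.g. PROVISIONING, PENDING, RUNNING
    if task["lastStatus"] in ("RUNNING", "STOPPED"):
        break
    time.sleep(5)
```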

Amazon ECS service scheduler workflow

Amazon ECS provides a service scheduler for managing the state of your services. The service scheduler ensures that the scheduling strategy that you specify is followed and reschedules failing tasks, for example when the underlying infrastructure fails. A key responsibility of the service scheduler is to ensure that your application is always running the desired number of tasks, based either on the desired count that you specify in the service configuration or on the auto scaled count of tasks if you use service auto scaling. The service scheduler uses asynchronous workflows to launch tasks in batches.

To understand how the service scheduler functions, imagine that you create an Amazon ECS service for a web API that you expect to receive heavy traffic, and you determine that the appropriate desired count for the service is 1,000 tasks. When you deploy this service, the Amazon ECS service scheduler does not launch all 1,000 tasks at once. Instead, it runs workflow cycles to bring the current state (0 tasks) toward the desired state (1,000 tasks), with each workflow cycle launching a batch of new tasks. The service scheduler can provision up to 500 tasks per service per minute on Fargate and up to 250 tasks per service per minute on Amazon EC2. For more information about the allowed rates and quotas in Amazon ECS, see Amazon ECS service quotas.
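
As a minimal sketch of the state that the service scheduler works from, the following Boto3 calls create a service with a desired count and then read back the current state that the scheduler converges toward. The cluster, service, and network identifiers are placeholders.

```python
import boto3

ecs = boto3.client("ecs")

ecs.create_service(
    cluster="my-cluster",                 # hypothetical cluster
    serviceName="web-api",                # hypothetical service name
    taskDefinition="web-api:1",           # hypothetical task definition revision
    desiredCount=1000,                    # the target state the scheduler converges to
    launchType="FARGATE",
    networkConfiguration={
        "awsvpcConfiguration": {
            "subnets": ["subnet-0123456789abcdef0"],      # placeholder
            "securityGroups": ["sg-0123456789abcdef0"],   # placeholder
        }
    },
)

# The desired, running, and pending counts show the scheduler's progress.
svc = ecs.describe_services(cluster="my-cluster", services=["web-api"])["services"][0]
print(svc["desiredCount"], svc["runningCount"], svc["pendingCount"])
```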

Now that you have an understanding of the Amazon ECS task launch workflow, let’s discuss how you can use some of this knowledge to speed up your task launches.

Recommendations to speed up task launch

As discussed in the previous section, the time between the triggering of a task launch (through the Amazon ECS APIs or the service scheduler) and the successful start-up of your containers is affected by a variety of factors within Amazon ECS, your configuration, and the containers themselves. To speed up your task launches, consider the following recommendations.

  • Cache container images and binpack instances.

    If you are running Amazon ECS on Amazon EC2, you can configure the Amazon ECS container agent to cache previously used container images, which reduces image pull time for subsequent launches. The effect of caching is even greater when you have a high task density on your container instances, which you can achieve with the binpack placement strategy. Caching container images is especially beneficial for Windows-based workloads, which typically have large container images (tens of GBs). When using the binpack placement strategy, you can also consider Elastic Network Interface (ENI) trunking to place more tasks that use the awsvpc network mode on each container instance. ENI trunking increases the number of awsvpc-mode tasks that you can run per instance. For example, a c5.large instance that might otherwise support only 2 concurrent awsvpc tasks can run up to 10 tasks with ENI trunking. For a minimal configuration sketch, see the binpack placement example after this list.

  • Choose an optimal network mode.

    Although there are many cases where the awsvpc network mode is ideal, it can inherently increase task launch latency. For each task in awsvpc mode, Amazon ECS workflows need to provision and attach an ENI by invoking Amazon EC2 APIs, which adds an overhead of several seconds to your task launches. However, a key advantage of the awsvpc network mode is that each task has its own security group to allow or deny traffic, which gives you more granular control over communication between tasks and services. If faster deployments outweigh the benefits of awsvpc mode, you can consider using bridge mode to speed up task launches (see the bridge mode example after this list). For further reading on the relative advantages of each network mode, see AWSVPC mode and Bridge mode.

  • Track your task launch lifecycle to find optimization opportunities.

    It is often difficult to know how much time your application takes to start up. Pulling and launching your container image, running start-up scripts, and performing other initialization during application start-up can take a surprising amount of time. You can use the Amazon ECS container agent metadata endpoint to post metrics that track application start-up time from ContainerStartTime to the point when your application is ready to serve traffic (see the metadata endpoint example after this list). With this data, you can understand how your application contributes to the total launch time, find areas where you can reduce unnecessary application-specific overhead, and optimize your container images.

  • Choose an optimal instance type (when using Amazon ECS on Amazon EC2).

    The right instance type depends on the resource reservations (CPU, memory, ENIs, and GPUs) that you configure for your task. When sizing the instance, you can therefore calculate how many tasks can be placed on a single instance. A simple example of well-placed tasks is hosting 4 tasks that each reserve 0.5 vCPU and 2 GB of memory on an m5.large instance (2 vCPU and 8 GB of memory). The reservations of this task definition take full advantage of the instance’s resources (see the sizing calculation after this list).

  • Use Amazon ECS service scheduler to concurrently launch services.

    As discussed in the previous section, the service scheduler can concurrently launch tasks for multiple services using asynchronous workflows. You can therefore achieve faster deployments by designing your application as several smaller services with fewer tasks each, rather than one large service with a large number of tasks. For instance, instead of a single service with 1,000 tasks, 10 services with 100 tasks each will deploy much faster, because the service scheduler initiates task provisioning for all of the services in parallel (see the service sharding sketch after this list).
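
The following Boto3 sketch relates to the caching and binpack recommendation above. Image caching itself is controlled on the container instance through the container agent configuration (for example, ECS_IMAGE_PULL_BEHAVIOR=prefer-cached in /etc/ecs/ecs.config); the calls below only cover the binpack placement strategy and the ENI trunking account setting, and all names are placeholders.

```python
import boto3

ecs = boto3.client("ecs")

# Opt in to ENI trunking so that more awsvpc-mode tasks can be packed onto each
# supported container instance.
ecs.put_account_setting_default(name="awsvpcTrunking", value="enabled")

ecs.create_service(
    cluster="my-ec2-cluster",           # hypothetical cluster backed by EC2 instances
    serviceName="batch-worker",         # hypothetical service name
    taskDefinition="batch-worker:1",    # hypothetical task definition revision
    desiredCount=20,
    launchType="EC2",
    # Binpack on memory: fill each container instance before using the next one,
    # which raises the chance that the image is already cached on that instance.
    placementStrategy=[{"type": "binpack", "field": "memory"}],
)
```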
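
If you decide that bridge mode fits a workload, the following is a minimal Boto3 sketch of registering a task definition with networkMode set to bridge, which avoids the per-task ENI attachment step that awsvpc mode requires. The family name, image URI, and ports are placeholders.

```python
import boto3

ecs = boto3.client("ecs")

ecs.register_task_definition(
    family="web-api-bridge",            # hypothetical task definition family
    networkMode="bridge",               # no ENI is provisioned per task
    requiresCompatibilities=["EC2"],
    containerDefinitions=[
        {
            "name": "web",
            "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/web-api:latest",  # placeholder
            "memory": 512,
            # hostPort 0 lets Docker assign a dynamic host port for each task.
            "portMappings": [{"containerPort": 8080, "hostPort": 0}],
        }
    ],
)
```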
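
The following sketch illustrates the lifecycle-tracking recommendation. From inside a running container, it reads the task metadata endpoint (exposed through the ECS_CONTAINER_METADATA_URI_V4 environment variable) and compares the image pull and container start times with the moment the application is ready. The exact fields available (such as PullStartedAt and StartedAt) depend on your agent and platform version, so treat this as an illustration rather than a guaranteed schema.

```python
import json
import os
import urllib.request
from datetime import datetime, timezone


def parse_ts(ts: str) -> datetime:
    # Metadata timestamps look like 2024-01-01T00:00:00.123456789Z; trim the
    # fractional part to microseconds so that datetime.fromisoformat() can parse it.
    ts = ts.rstrip("Z")
    if "." in ts:
        head, frac = ts.split(".", 1)
        ts = f"{head}.{frac[:6]}"
    return datetime.fromisoformat(ts).replace(tzinfo=timezone.utc)


def report_startup_latency() -> None:
    # Assumes this runs inside an ECS task; the field names below are what the v4
    # task metadata endpoint typically returns, but verify them for your platform.
    metadata_uri = os.environ["ECS_CONTAINER_METADATA_URI_V4"]
    with urllib.request.urlopen(f"{metadata_uri}/task") as resp:
        task = json.load(resp)

    ready_at = datetime.now(timezone.utc)  # call this once your app can serve traffic
    pull_started = parse_ts(task["PullStartedAt"])
    pull_stopped = parse_ts(task["PullStoppedAt"])
    container_started = parse_ts(task["Containers"][0]["StartedAt"])

    print("image pull took      :", pull_stopped - pull_started)
    print("app start-up overhead:", ready_at - container_started)


# Call report_startup_latency() at the end of your application's initialization, or
# publish the two durations as custom metrics (for example, to CloudWatch) instead.
```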
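
The following back-of-the-envelope calculation illustrates the instance sizing recommendation using the m5.large example above. In practice the container agent and operating system reserve some capacity, so the real number can be slightly lower.

```python
# Task reservation from the example: 0.5 vCPU and 2 GB of memory.
task_cpu_units = 512            # 1024 CPU units = 1 vCPU
task_memory_mib = 2048

# m5.large: 2 vCPU and 8 GB of memory.
instance_cpu_units = 2 * 1024
instance_memory_mib = 8 * 1024

# The binding resource (CPU or memory) determines how many tasks fit.
tasks_per_instance = min(
    instance_cpu_units // task_cpu_units,
    instance_memory_mib // task_memory_mib,
)
print(tasks_per_instance)       # 4 -- both CPU and memory are fully reserved
```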
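
Finally, the following Boto3 sketch illustrates splitting one large service into several smaller ones that the scheduler can bring up in parallel. The cluster name, naming scheme, and shard count are placeholders chosen for illustration.

```python
import boto3

ecs = boto3.client("ecs")

# Ten services of 100 tasks each instead of one service of 1,000 tasks; the service
# scheduler provisions tasks for all of these services in parallel.
for shard in range(10):
    ecs.create_service(
        cluster="my-cluster",                  # hypothetical cluster
        serviceName=f"web-api-shard-{shard}",  # hypothetical naming scheme
        taskDefinition="web-api:1",            # same task definition for every shard
        desiredCount=100,
        launchType="FARGATE",
        networkConfiguration={
            "awsvpcConfiguration": {
                "subnets": ["subnet-0123456789abcdef0"],      # placeholder
                "securityGroups": ["sg-0123456789abcdef0"],   # placeholder
            }
        },
    )
```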