Lambda function scaling - AWS Lambda

Lambda function scaling

Concurrency is the number of in-flight requests your AWS Lambda function is handling at the same time. For each concurrent request, Lambda provisions a separate instance of your execution environment. As your functions receive more requests, Lambda automatically handles scaling the number of execution environments until you reach your account's concurrency limit. By default, Lambda provides your account with a total concurrency limit of 1,000 across all functions in a region. To support your specific account needs, you can request a quota increase and configure function-level concurrency controls so that your critical functions don't experience throttling.

This topic explains concurrency and function scaling in Lambda. By the end of this topic, you'll be able to calculate concurrency, visualize the two main concurrency control options (reserved and provisioned), estimate appropriate concurrency control settings, and view metrics for further optimization.

Understanding and visualizing concurrency

Lambda invokes your function in a secure and isolated execution environment. To handle a request, Lambda must first initialize an execution environment (the Init phase), before using it to invoke your function (the Invoke phase):


        Typical lifecycle of an execution environment, showing Init and Invoke phases.
Note

Actual Init and Invoke durations can vary depending on many factors, such as the runtime you choose and the Lambda function code. The previous diagram isn't meant to represent the exact proportions of Init and Invoke phase durations.

The previous diagram uses a rectangle to represent a single execution environment. When your function receives its very first request (represented by the yellow circle with label 1), Lambda creates a new execution environment and runs the code outside your main handler during the Init phase. Then, Lambda runs your function's main handler code during the Invoke phase. During this entire process, this execution environment is busy and cannot process other requests.

When Lambda finishes processing the first request, this execution environment can then process additional requests for the same function. For subsequent requests, Lambda doesn't need to re-initialize the environment.


        An execution environment handling two requests in succession.

In the previous diagram, Lambda reuses the execution environment to handle the second request (represented by the yellow circle with label 2).

So far, we've focused on just a single instance of your execution environment (i.e. a concurrency of 1). In practice, Lambda may need to provision multiple execution environment instances in parallel to handle all incoming requests. When your function receives a new request, one of two things can happen:

  • If a pre-initialized execution environment instance is available, Lambda uses it to process the request.

  • Otherwise, Lambda creates a new execution environment instance to process the request.
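The reuse-or-create rule above can be sketched as a small simulation. This is an illustrative model, not Lambda's actual scheduler: the function name `assign_environments`, the choice of which idle environment to reuse, and the decision to ignore Init time are all assumptions made for the example.

```python
import heapq

def assign_environments(requests):
    """Assign each (arrival_time, duration) request to an environment.

    Assumption: an idle environment is reused if one exists; otherwise a
    new environment is created. Init time is ignored for simplicity.
    Returns a list of (environment_id, newly_created) in arrival order.
    """
    busy = []      # min-heap of (finish_time, env_id) for busy environments
    idle = []      # ids of initialized environments that are currently free
    next_id = 0
    result = []
    for arrival, duration in sorted(requests):
        # Free every environment that has finished by this arrival time.
        while busy and busy[0][0] <= arrival:
            _, env = heapq.heappop(busy)
            idle.append(env)
        if idle:
            env, created = idle.pop(0), False   # reuse an available environment
        else:
            env, created = next_id, True        # provision a new environment
            next_id += 1
        heapq.heappush(busy, (arrival + duration, env))
        result.append((env, created))
    return result
```

For example, calling `assign_environments([(0, 5), (1, 5), (2, 5), (6, 1), (7, 1)])` creates three environments for the first three overlapping requests and reuses existing ones for the last two.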

For example, let's explore what happens when your function receives 10 requests:


        A Lambda function handling 10 requests. It must provision multiple environments to handle all requests.

In the previous diagram, each horizontal plane represents a single execution environment instance (labeled from A through F). Here's how Lambda handles each request:

Lambda behavior for requests 1 through 10

| Request | Lambda behavior | Reasoning |
|---|---|---|
| 1 | Provisions new environment A | This is the first request; no execution environment instances available |
| 2 | Provisions new environment B | Existing execution environment instance A is busy |
| 3 | Provisions new environment C | Existing execution environment instances A and B are both busy |
| 4 | Provisions new environment D | Existing execution environment instances A, B, and C are all busy |
| 5 | Provisions new environment E | Existing execution environment instances A, B, C, and D are all busy |
| 6 | Reuses environment A | Execution environment instance A has finished processing request 1 and is now available |
| 7 | Reuses environment B | Execution environment instance B has finished processing request 2 and is now available |
| 8 | Reuses environment C | Execution environment instance C has finished processing request 3 and is now available |
| 9 | Provisions new environment F | Existing execution environment instances A, B, C, D, and E are all busy |
| 10 | Reuses environment D | Execution environment instance D has finished processing request 4 and is now available |

As your function receives more concurrent requests, Lambda scales up the number of execution environment instances in response. The following animation tracks the number of concurrent requests over time:


        An animation illustrating concurrent requests over time.

By freezing the previous animation at six distinct points in time, we get the following diagram:


        Function concurrency at six distinct points in time.

In the previous diagram, we can draw a vertical line at any point in time and count the number of environments that intersect this line. This gives us the number of concurrent requests at that point in time. For example, at time t1, there are 3 active environments serving 3 concurrent requests. The maximum number of concurrent requests in this simulation occurs at time t4, when there are 6 active environments serving 6 concurrent requests.

To summarize, your function's concurrency is the number of concurrent requests that it's handling at the same time. In response to an increase in your function's concurrency, Lambda provisions more execution environment instances to meet request demand.
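The "draw a vertical line and count" idea can be expressed in a few lines of code. The sketch below assumes each request is described by a hypothetical `(start, end)` time pair; the function names are illustrative and not part of any Lambda API.

```python
def concurrency_at(intervals, t):
    """Count the requests in flight at time t (start <= t < end)."""
    return sum(1 for start, end in intervals if start <= t < end)

def peak_concurrency(intervals):
    """Return the maximum number of requests in flight at any time."""
    events = []
    for start, end in intervals:
        events.append((start, 1))    # a request begins
        events.append((end, -1))     # a request ends
    # Tuples sort by time first; at equal times, ends (-1) sort before
    # starts (+1), so a request that ends at t frees its slot first.
    events.sort()
    current = peak = 0
    for _, delta in events:
        current += delta
        peak = max(peak, current)
    return peak
```

The peak value is the number of execution environment instances that would be needed to serve the workload without throttling, under the assumption that one environment handles one request at a time.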

How to calculate concurrency

In general, concurrency of a system is the ability to process more than one task simultaneously. In Lambda, concurrency is the number of in-flight requests that your function is handling at the same time. A quick and practical way of measuring concurrency of a Lambda function is to use the following formula:

Concurrency = (average requests per second) * (average request duration in seconds)

Concurrency differs from requests per second. For example, suppose your function receives 100 requests per second on average. If the average request duration is 1 second, then it's true that the concurrency is also 100:

Concurrency = (100 requests/second) * (1 second/request) = 100

However, if the average request duration is 500 ms, the concurrency is 50:

Concurrency = (100 requests/second) * (0.5 second/request) = 50

What does a concurrency of 50 mean in practice? If the average request duration is 500 ms, you can think of an instance of your function as being able to handle 2 requests per second. Then, it takes 50 instances of your function to handle a load of 100 requests per second. A concurrency of 50 means that Lambda must provision 50 execution environment instances to efficiently handle this workload without any throttling. Here's how to express this in equation form:

Concurrency = (100 requests/second) / (2 requests/second) = 50

If your function receives double the number of requests (200 requests per second), but only requires half the time to process each request (250 ms), the concurrency is still 50:

Concurrency = (200 requests/second) * (0.25 second/request) = 50

Suppose you have a function that takes, on average, 20 ms to run. During peak load, you observe 5,000 requests per second. What is the concurrency of your function during peak load?

The average function duration is 20 ms, or 0.020 seconds. Using the concurrency formula, you can plug in the numbers to get a concurrency of 100:

Concurrency = (5,000 requests/second) * (0.020 seconds/request) = 100

Alternatively, an average function duration of 20 ms means that your function can process 50 requests per second. To handle the 5,000 request per second workload, you need 100 execution environment instances. Thus, the concurrency is 100:

Concurrency = (5,000 requests/second) / (50 requests/second) = 100
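The two equivalent forms of the formula used in this section can be checked with a short script. This is a minimal sketch; the function names are chosen for illustration only.

```python
import math

def concurrency(requests_per_second, avg_duration_seconds):
    """Concurrency = (average requests per second) * (average duration in seconds)."""
    return requests_per_second * avg_duration_seconds

def concurrency_from_throughput(requests_per_second, avg_duration_seconds):
    """Same quantity, expressed as load divided by per-instance throughput."""
    per_instance_rps = 1.0 / avg_duration_seconds   # requests one instance can serve per second
    return requests_per_second / per_instance_rps

# The worked examples from this section: (requests/second, duration in seconds).
examples = [(100, 1.0), (100, 0.5), (200, 0.25), (5000, 0.020)]
for rps, duration in examples:
    a = concurrency(rps, duration)
    b = concurrency_from_throughput(rps, duration)
    # Both forms agree up to floating-point rounding.
    assert math.isclose(a, b)
```

Because the durations are floating-point values, compare results with a tolerance (as `math.isclose` does) rather than with exact equality.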

Reserved concurrency and provisioned concurrency

By default, your account has a concurrency limit of 1,000 across all functions in a region. Your functions share this pool of 1,000 concurrency on an on-demand basis. Your function experiences throttling (i.e. it starts to drop requests) if you run out of available concurrency.

Some of your functions might be more critical than others. As a result, you might want to configure concurrency settings to ensure that critical functions get the concurrency they need. There are two types of concurrency controls available: reserved concurrency and provisioned concurrency.

  • Use reserved concurrency to reserve a portion of your account's concurrency for a function. This is useful if you don't want other functions taking up all the available unreserved concurrency.

  • Use provisioned concurrency to pre-initialize a number of environment instances for a function. This is useful for reducing cold start latencies.

Reserved concurrency

If you want to guarantee that a certain amount of concurrency is available for your function at any time, use reserved concurrency.

Reserved concurrency is the maximum number of concurrent instances you want to allocate to your function. When you dedicate reserved concurrency to a function, no other function can use that concurrency. In other words, setting reserved concurrency can impact the concurrency pool that's available to other functions. Functions that don't have reserved concurrency share the remaining pool of unreserved concurrency.

Configuring reserved concurrency counts towards your overall account concurrency limit. There is no charge for configuring reserved concurrency for a function.

To better understand reserved concurrency, consider the following diagram:


          Function scaling behavior when the user configures reserved concurrency on critical functions.

In this diagram, your account concurrency limit for all the functions in this region is at the default limit of 1,000. Suppose you have two critical functions, function-blue and function-orange, that routinely expect to get high invocation volumes. You decide to give 400 units of reserved concurrency to function-blue, and 400 units of reserved concurrency to function-orange. In this example, all other functions in your account must share the remaining 200 units of unreserved concurrency.

The diagram has 5 points of interest:

  • At t1, both function-orange and function-blue begin receiving requests. Each function begins to use up its allocated portion of reserved concurrency units.

  • At t2, function-orange and function-blue are steadily receiving more requests. At the same time, you deploy some other Lambda functions, which begin receiving requests. You do not allocate reserved concurrency to these other functions. They begin using the remaining 200 units of unreserved concurrency.

  • At t3, function-orange hits the max concurrency of 400. Although there is unused concurrency elsewhere in your account, function-orange cannot access it. The red line indicates that function-orange is experiencing throttling, and Lambda may drop requests.

  • At t4, function-orange starts to receive fewer requests and is no longer throttling. However, your other functions experience a spike in traffic and begin throttling. Although there is unused concurrency elsewhere in your account, these other functions cannot access it. The red line indicates that your other functions are experiencing throttling.

  • At t5, other functions start to receive fewer requests and are no longer throttling.

From this example, notice that reserving concurrency has the following effects:

  • Your function can scale independently of other functions in your account. All of your account's functions in the same region that don't have reserved concurrency share the pool of unreserved concurrency. Without reserved concurrency, other functions can potentially use up all of your available concurrency, which prevents critical functions from scaling up when needed.

  • Your function can't scale out of control. Reserved concurrency puts a cap on your function's maximum concurrency. This means that your function can't use concurrency reserved for other functions, or concurrency from the unreserved pool. You can reserve concurrency to prevent your function from using all the available concurrency in your account, or from overloading downstream resources.

  • You may not be able to use all of your account's available concurrency. Reserving concurrency counts towards your account concurrency limit, but this also means that other functions cannot use that chunk of reserved concurrency. If your function doesn't use up all of the concurrency that you reserve for it, you're effectively wasting that concurrency. This isn't an issue unless other functions in your account could benefit from the wasted concurrency.
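The accounting behind the function-blue and function-orange example can be modeled as follows. This is a simplified sketch of the rules described in this section, not Lambda's implementation; the function names and the dictionary-based inputs are assumptions made for illustration.

```python
def unreserved_pool(account_limit, reserved):
    """Units left for functions without reserved concurrency.

    reserved: dict mapping function name -> reserved concurrency units.
    """
    return account_limit - sum(reserved.values())

def is_throttled(function_name, in_flight, account_limit, reserved):
    """Would one more request to function_name be throttled?

    in_flight: dict mapping function name -> current concurrent requests.
    """
    if function_name in reserved:
        # A function with reserved concurrency is capped at its own
        # reservation and cannot borrow from any other pool.
        return in_flight.get(function_name, 0) >= reserved[function_name]
    # All other functions share whatever is left of the account limit.
    used = sum(n for name, n in in_flight.items() if name not in reserved)
    return used >= unreserved_pool(account_limit, reserved)

reserved = {"function-blue": 400, "function-orange": 400}
shared = unreserved_pool(1000, reserved)   # 1000 - 400 - 400 = 200
```

In this model, function-orange is throttled at 400 concurrent requests even when function-blue is idle, which mirrors the situation at t3 in the diagram.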

To manage reserved concurrency settings for your functions, see Configuring reserved concurrency.

Provisioned concurrency

You use reserved concurrency to define the maximum number of execution environments reserved for a Lambda function. However, none of these environments come pre-initialized. As a result, your function invocations may take longer because Lambda must first initialize a new environment before it can use it to invoke your function. This added initialization latency is known as a cold start. To mitigate cold starts, you can use provisioned concurrency.

Provisioned concurrency is the number of pre-initialized execution environments you want to allocate to your function. If you set provisioned concurrency on a function, Lambda initializes that number of execution environments so that they are prepared to respond immediately to function requests.

When using provisioned concurrency, Lambda still recycles execution environments in the background. However, at any given time, Lambda always ensures that the number of pre-initialized environments is equal to the value of your function's provisioned concurrency setting. This behavior differs from reserved concurrency, where Lambda may completely terminate an environment after a period of inactivity. The following diagram illustrates this by comparing the lifecycle of a single execution environment when you configure your function using reserved concurrency compared to provisioned concurrency.


          How function environment behavior differs under a reserved concurrency vs provisioned concurrency model.

The diagram has four points of interest:

| Time | Reserved concurrency | Provisioned concurrency |
|---|---|---|
| t1 | Nothing happens. | Lambda pre-initializes an execution environment instance. |
| t2 | Request 1 comes in. Lambda must initialize a new execution environment instance. | Request 1 comes in. Lambda uses the pre-initialized environment instance. |
| t3 | After some inactivity, Lambda terminates the active environment instance. | Nothing happens. |
| t4 | Request 2 comes in. Lambda must initialize a new execution environment instance. | Request 2 comes in. Lambda uses the pre-initialized environment instance. |

To better understand provisioned concurrency, consider the following diagram:


          Function scaling behavior when the user configures provisioned concurrency on a critical function.

In this diagram, you have an account concurrency limit of 1,000. You decide to give 400 units of provisioned concurrency to function-orange. All functions in your account, including function-orange, can use the remaining 600 units of unreserved concurrency.

The diagram has 5 points of interest:

  • At t1, function-orange begins receiving requests. Since Lambda has pre-initialized 400 execution environment instances, function-orange is ready for immediate invocation.

  • At t2, function-orange reaches 400 concurrent requests. As a result, function-orange runs out of provisioned concurrency. However, since there's still unreserved concurrency available, Lambda can use this to handle additional requests to function-orange (there's no throttling). Lambda must create new instances to serve these requests, and your function may experience cold start latencies.

  • At t3, function-orange returns to 400 concurrent requests after a brief spike in traffic. Lambda is again able to handle all requests without cold start latencies.

  • At t4, functions in your account experience a burst in traffic. This burst can come from function-orange or any other function in your account. Lambda uses unreserved concurrency to handle these requests.

  • At t5, functions in your account reach the maximum concurrency limit of 1,000, and experience throttling.

The previous example only considered provisioned concurrency. In practice, you can set both provisioned concurrency and reserved concurrency on a function. You might do this if you had a function that handles a consistent load of invocations, but routinely sees spikes of traffic during the weekends. In this case, you could use provisioned concurrency to set a baseline number of environments to handle requests during weekdays, and use reserved concurrency to handle the weekend spikes. Consider the following diagram:


          Function scaling behavior when you use both reserved and provisioned concurrency.

In this diagram, suppose that you configure 200 units of provisioned concurrency and 400 units of reserved concurrency for function-orange. Because you configured reserved concurrency, function-orange cannot use any of the 600 units of unreserved concurrency.

This diagram has 5 points of interest:

  • At t1, function-orange begins receiving requests. Since Lambda has pre-initialized 200 execution environment instances, function-orange is ready for immediate invocation.

  • At t2, function-orange uses up all its provisioned concurrency. function-orange can continue serving requests using reserved concurrency, but these requests may experience cold start latencies.

  • At t3, function-orange reaches 400 concurrent requests. As a result, function-orange uses up all its reserved concurrency. Since function-orange cannot use unreserved concurrency, requests begin to throttle.

  • At t4, function-orange starts to receive fewer requests, and no longer throttles.

  • At t5, function-orange drops down to 200 concurrent requests, so all requests are again able to use provisioned concurrency (i.e. no cold start latencies).
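The two provisioned concurrency scenarios above (with and without reserved concurrency) follow the same decision order, which can be sketched as a small classifier. The function name, parameters, and the labels "provisioned", "on-demand", and "throttled" are illustrative assumptions, not Lambda terminology or API values.

```python
def classify_request(in_flight, provisioned, reserved=None, unreserved_available=0):
    """Classify the next request to a single function.

    in_flight: concurrent requests the function is already handling.
    provisioned: the function's provisioned concurrency.
    reserved: the function's reserved concurrency, or None if not set.
    unreserved_available: free units in the shared unreserved pool
        (only consulted when reserved is None).
    """
    if in_flight < provisioned:
        return "provisioned"     # served by a pre-initialized environment
    if reserved is not None:
        # With reserved concurrency set, the reservation is a hard cap.
        return "on-demand" if in_flight < reserved else "throttled"
    # Without a reservation, the function spills into the shared pool.
    return "on-demand" if unreserved_available > 0 else "throttled"
```

Requests classified as "on-demand" are the ones that may experience cold start latencies, because Lambda may need to create a new environment to serve them.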

Both reserved concurrency and provisioned concurrency count towards your account concurrency limit and Regional quotas. In other words, allocating reserved and provisioned concurrency can impact the concurrency pool that's available to other functions. Configuring provisioned concurrency incurs charges to your AWS account.

Note

If the amount of provisioned concurrency on a function's versions and aliases adds up to the function's reserved concurrency, all invocations run on provisioned concurrency. This configuration also has the effect of throttling the unpublished version of the function ($LATEST), which prevents it from executing. You can't allocate more provisioned concurrency than reserved concurrency for a function.
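The constraint in the preceding note can be checked before applying a configuration. The following is a minimal sketch; the function name, the return shape, and the alias names used in the example (`prod`, `beta`) are hypothetical.

```python
def check_provisioned_allocation(reserved, provisioned_by_qualifier):
    """Validate provisioned concurrency against a function's reserved concurrency.

    reserved: the function's reserved concurrency.
    provisioned_by_qualifier: dict mapping version or alias -> provisioned units.
    """
    total = sum(provisioned_by_qualifier.values())
    if total > reserved:
        # You can't allocate more provisioned than reserved concurrency.
        raise ValueError(
            f"provisioned total {total} exceeds reserved concurrency {reserved}"
        )
    return {
        "total_provisioned": total,
        # When the totals match, nothing is left for $LATEST to run on.
        "latest_throttled": total == reserved,
    }

# Hypothetical aliases: 300 + 100 uses up the whole reservation of 400.
summary = check_provisioned_allocation(400, {"prod": 300, "beta": 100})
```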

To manage provisioned concurrency settings for your functions, see configuring provisioned concurrency. To automate provisioned concurrency scaling based on a schedule or application utilization, see managing provisioned concurrency with Application Auto Scaling.
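As a rough illustration of what such configuration could look like in code, the sketch below assumes the AWS SDK for Python (boto3) and its Lambda client operations `put_function_concurrency` and `put_provisioned_concurrency_config`. The wrapper function names are hypothetical, and the client is passed in as a parameter so that the block itself has no dependencies; treat this as a starting point and check the SDK reference before relying on it.

```python
def set_reserved_concurrency(lambda_client, function_name, units):
    """Reserve `units` of concurrency for a function.

    lambda_client is assumed to be a boto3 Lambda client,
    for example one created with boto3.client("lambda").
    """
    return lambda_client.put_function_concurrency(
        FunctionName=function_name,
        ReservedConcurrentExecutions=units,
    )

def set_provisioned_concurrency(lambda_client, function_name, qualifier, units):
    """Pre-initialize `units` environments for a version or alias.

    qualifier must name a published version or an alias; the unpublished
    version ($LATEST) is not a valid target in this sketch.
    """
    if qualifier == "$LATEST":
        raise ValueError("use a published version or an alias")
    return lambda_client.put_provisioned_concurrency_config(
        FunctionName=function_name,
        Qualifier=qualifier,
        ProvisionedConcurrentExecutions=units,
    )
```

With boto3 installed and credentials configured, a call such as `set_provisioned_concurrency(boto3.client("lambda"), "function-orange", "prod", 200)` would request 200 pre-initialized environments for a hypothetical alias named `prod`.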

Comparing reserved concurrency and provisioned concurrency

The following table summarizes and compares reserved and provisioned concurrency.

| Topic | Reserved concurrency | Provisioned concurrency |
|---|---|---|
| Definition | Maximum number of execution environment instances for your function. | Set number of pre-provisioned execution environment instances for your function. |
| Provisioning behavior | Lambda provisions new instances on an on-demand basis. | Lambda pre-provisions instances (i.e. before your function starts receiving requests). |
| Cold start behavior | Cold start latency possible, since Lambda must create new instances on-demand. | Cold start latency avoided for requests served by pre-provisioned instances, since Lambda doesn't have to create instances on-demand. |
| Throttling behavior | Function throttled when reserved concurrency limit reached. | If reserved concurrency not set: function uses unreserved concurrency when provisioned concurrency limit reached. If reserved concurrency set: function throttled when reserved concurrency limit reached. |
| Default behavior if not set | Function uses unreserved concurrency available in your account. | Lambda doesn't pre-provision any instances. Instead, if reserved concurrency not set: function uses unreserved concurrency available in your account. If reserved concurrency set: function uses reserved concurrency. |
| Pricing | No additional charge. | Incurs additional charges. |

Concurrency quotas

Lambda sets quotas for the total amount of concurrency you can use across all functions in a region. These quotas exist on two levels:

  • At the account level, your functions can have up to 1,000 units of concurrency by default. To increase this limit, see Requesting a quota increase in the Service Quotas User Guide.

  • At the function level, you can reserve up to 900 units of concurrency across all your functions by default. 100 units of concurrency are always reserved for functions that don't explicitly reserve concurrency. For example, if you increased your account concurrency limit to 2,000, you can reserve up to 1,900 units of concurrency at the function level.
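The relationship between the account limit and the reservable amount is simple arithmetic. A minimal sketch, with a hypothetical function and constant name:

```python
UNRESERVED_MINIMUM = 100   # units always kept for functions without a reservation

def max_reservable(account_limit):
    """Maximum total concurrency that can be reserved across all functions."""
    return max(account_limit - UNRESERVED_MINIMUM, 0)

default_limit = max_reservable(1000)   # 900, the default described above
raised_limit = max_reservable(2000)    # 1,900 after a quota increase to 2,000
```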

For an initial burst of traffic, your functions' cumulative concurrency in a Region can reach an initial level of between 500 and 3000:

Burst concurrency quotas

  • 3000 – US West (Oregon), US East (N. Virginia), Europe (Ireland)

  • 1000 – Asia Pacific (Tokyo), Europe (Frankfurt), US East (Ohio)

  • 500 – Other Regions

If your account concurrency limit is lower than the burst concurrency limit, Lambda limits your burst concurrency according to your account concurrency limit. For example, if your account limit is 1,000, then functions in US East (N. Virginia) get a burst concurrency of 1,000.

After the initial burst, your functions' concurrency can scale by an additional 500 instances per minute. For more information on burst concurrency, see Burst concurrency.
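The burst rules in this section can be combined into one estimate of how much concurrency is reachable a given number of minutes after a traffic burst begins. This sketch assumes the values quoted above (a Regional burst quota, then 500 additional instances per minute) and models scaling in whole-minute steps, which is a simplification; the function name is hypothetical.

```python
def reachable_concurrency(minutes_since_burst, burst_quota, account_limit, per_minute=500):
    """Estimate the concurrency reachable after a burst starts.

    burst_quota: the Regional burst concurrency quota (for example 3000).
    account_limit: the account concurrency limit in that Region.
    """
    if minutes_since_burst < 0:
        raise ValueError("minutes_since_burst must be non-negative")
    # The initial burst is limited by the account limit if that is lower.
    initial = min(burst_quota, account_limit)
    # After the initial burst, capacity grows per minute up to the account limit.
    scaled = initial + per_minute * int(minutes_since_burst)
    return min(account_limit, scaled)
```

For example, with the default account limit of 1,000 in a Region whose burst quota is 3000, the estimate is 1,000 from the start, because the account limit is the binding constraint.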

Accurately estimating required concurrency

If your function is currently serving traffic, you can easily view its concurrency metrics using CloudWatch metrics. Specifically, the ConcurrentExecutions metric shows you the number of concurrent invocations for each function in your account.


        Graph showing concurrency for a function over time.

From the graph, this function serves an average of 5 to 10 concurrent requests, and peaks at 20 requests on a typical day. Suppose that there are many other functions in your account. If this function is critical to your application and you don't want any dropped requests, you might use 20 as your reserved concurrency setting.

Recall that you can also calculate concurrency using the following formula:

Concurrency = (average requests per second) * (average request duration in seconds)

You can estimate average requests per second using the Invocations metric, and the average request duration in seconds using the Duration metric. See Working with Lambda function metrics for more details.
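As a sketch of how metric values might feed the formula: the function below assumes you have already retrieved the sum of invocations over a period and the average duration for the same period, and that the duration is reported in milliseconds. The function and parameter names are hypothetical.

```python
def estimate_concurrency(invocations_sum, period_seconds, avg_duration_ms):
    """Estimate average concurrency from invocation count and duration.

    invocations_sum: total invocations observed during the period.
    period_seconds: length of the observation period in seconds.
    avg_duration_ms: average request duration in milliseconds (assumed unit).
    """
    if period_seconds <= 0:
        raise ValueError("period_seconds must be positive")
    requests_per_second = invocations_sum / period_seconds
    avg_duration_seconds = avg_duration_ms / 1000.0
    return requests_per_second * avg_duration_seconds

# 300,000 invocations in one minute is 5,000 requests per second;
# at 20 ms each this matches the earlier peak-load example.
estimate = estimate_concurrency(300_000, 60, 20)
```

An average over a long period can hide short peaks, so an estimate like this is best compared against the observed concurrency graph for the same function.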

Note

If you choose provisioned concurrency, Lambda suggests including a 10% buffer on top of the amount of concurrency your function typically needs. Over-provisioning by 10% ensures that your function can always handle incoming requests using provisioned concurrency, even if you get slightly more traffic than expected. For example, if your function usually peaks at 200 concurrent requests, you might want to set your provisioned concurrency at 220 instead (200 concurrent requests + 10% = 220 provisioned concurrency).
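The buffer calculation in the preceding note can be written with integer arithmetic, which avoids floating-point rounding surprises (for example, `200 * 1.1` is not exactly 220 in binary floating point). The function name is hypothetical, and rounding the buffer up to a whole unit is an assumption of this sketch.

```python
def provisioned_with_buffer(peak_concurrency, buffer_percent=10):
    """Return peak concurrency plus a percentage buffer, rounded up."""
    # Integer ceiling of peak_concurrency * buffer_percent / 100.
    extra = (peak_concurrency * buffer_percent + 99) // 100
    return peak_concurrency + extra

suggested = provisioned_with_buffer(200)   # 200 + 10% = 220
```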

Concurrency metrics

You can use the following metrics to monitor concurrency for your Lambda functions.

  • ConcurrentExecutions – The number of currently active concurrent invocations.

  • UnreservedConcurrentExecutions – The number of currently active concurrent invocations that are using unreserved concurrency.

  • ProvisionedConcurrentExecutions – The number of execution environment instances that are processing events on provisioned concurrency. For each invocation of an alias or version with provisioned concurrency, Lambda emits the current count.

  • ProvisionedConcurrencyInvocations – The number of times Lambda invokes your function code using provisioned concurrency.

  • ProvisionedConcurrencySpilloverInvocations – The number of times Lambda invokes your function code on standard (reserved or unreserved) concurrency when all provisioned concurrency is in use.

  • ProvisionedConcurrencyUtilization – For a version or alias, the value of ProvisionedConcurrentExecutions divided by the total amount of provisioned concurrency allocated. For example, 0.5 indicates that 50 percent of allocated provisioned concurrency is in use.
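The utilization metric is a simple ratio, shown here as a sketch for reasoning about the values you might see. The function name is hypothetical and this code does not query CloudWatch.

```python
def provisioned_concurrency_utilization(provisioned_concurrent_executions, allocated):
    """ProvisionedConcurrentExecutions divided by allocated provisioned concurrency."""
    if allocated <= 0:
        raise ValueError("allocated provisioned concurrency must be positive")
    return provisioned_concurrent_executions / allocated

# 100 environments busy out of 200 allocated: half of the allocation is in use.
utilization = provisioned_concurrency_utilization(100, 200)
```

Sustained values close to 1.0 suggest that requests may soon spill over to standard concurrency, which is what ProvisionedConcurrencySpilloverInvocations counts.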