Configuring provisioned concurrency - AWS Lambda

Configuring provisioned concurrency

In Lambda, concurrency is the number of in-flight requests that your function is handling at the same time. There are two types of concurrency controls available:

  • Reserved concurrency – Reserved concurrency is the maximum number of concurrent instances that you want to allocate to your function. When a function has reserved concurrency, no other function can use that concurrency. There is no charge for configuring reserved concurrency for a function.

  • Provisioned concurrency – Provisioned concurrency is the number of pre-initialized execution environments that you want to allocate to your function. These execution environments are prepared to respond immediately to incoming function requests. Configuring provisioned concurrency incurs charges to your AWS account.

This topic details how to manage and configure provisioned concurrency. For a conceptual overview of these two types of concurrency controls, see Reserved concurrency and provisioned concurrency. For more information on configuring reserved concurrency, see Configuring reserved concurrency.

Note

Lambda functions that an Amazon MQ event source mapping invokes have a default maximum concurrency. For Apache ActiveMQ, the maximum number of concurrent instances is 5. For RabbitMQ, the maximum number of concurrent instances is 1. Setting reserved or provisioned concurrency for your function doesn't change these limits. To request an increase in the default maximum concurrency when using Amazon MQ, contact AWS Support.

Configuring provisioned concurrency

You can configure provisioned concurrency settings for a function using the Lambda console or the Lambda API.

To allocate provisioned concurrency for a function (console)
  1. Open the Functions page of the Lambda console.

  2. Choose the function you want to allocate provisioned concurrency for.

  3. Choose Configuration and then choose Concurrency.

  4. Under Provisioned concurrency configurations, choose Add configuration.

  5. Choose Reserve concurrency. Enter the amount of concurrency to reserve for the function.

  6. Choose the qualifier type, and alias or version.

    Note

    You cannot use provisioned concurrency with the $LATEST version of any function.

    In addition, if you're using an event source with your Lambda function, make sure that event source points to the correct alias or version. Otherwise, your function won't use provisioned concurrency environments.

  7. Enter a number under Provisioned concurrency. Lambda provides an estimate of monthly costs.

  8. Choose Save.

You can configure up to the Unreserved account concurrency in your account, minus 100. The remaining 100 units of concurrency are for functions that aren't using reserved concurrency. For example, if your account has a concurrency limit of 1,000, and you haven't assigned any reserved or provisioned concurrency to any of your other functions, you can configure a maximum of 900 provisioned concurrency units for a single function.


Figure: An error occurs if you try to allocate too much provisioned concurrency.

Configuring provisioned concurrency for a function reduces the concurrency pool that's available to other functions. For example, if your account has a concurrency limit of 1,000 and you configure 100 units of provisioned concurrency for function-a, other functions in your account must share the remaining 900 units of concurrency, even if function-a doesn't use all 100 provisioned concurrency units.

You can allocate both reserved concurrency and provisioned concurrency for the same function. If you do so, the amount of provisioned concurrency cannot exceed the amount of reserved concurrency.

This limit also applies to function versions. The maximum amount of provisioned concurrency you can allocate to a specific function version is equal to the function's reserved concurrency minus the provisioned concurrency on other function versions.
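The two limits described above can be sketched as simple arithmetic. This is only an illustration of the rules as stated in this topic, not a Lambda API; the function and parameter names are hypothetical.

```python
def account_level_limit(unreserved_account_concurrency):
    """Maximum provisioned concurrency the account can still hand out.

    Lambda keeps 100 units of unreserved concurrency free for functions
    that don't use reserved concurrency.
    """
    return max(unreserved_account_concurrency - 100, 0)


def version_level_limit(reserved_concurrency, provisioned_on_other_versions):
    """Maximum provisioned concurrency for one version of a function that
    has reserved concurrency: the reservation minus what other versions
    of the same function already hold.
    """
    return max(reserved_concurrency - provisioned_on_other_versions, 0)


# Account with 1,000 units of unreserved concurrency.
print(account_level_limit(1000))
# Function with 200 reserved units, 50 already provisioned on another version.
print(version_level_limit(200, 50))
```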

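To configure provisioned concurrency with the Lambda API, use the following API operations.

  • PutProvisionedConcurrencyConfig

  • GetProvisionedConcurrencyConfig

  • ListProvisionedConcurrencyConfigs

  • DeleteProvisionedConcurrencyConfig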

For example, to configure provisioned concurrency with the AWS Command Line Interface (CLI), use the put-provisioned-concurrency-config command. The following command allocates 100 units of provisioned concurrency for the BLUE alias of a function named my-function:

aws lambda put-provisioned-concurrency-config --function-name my-function \
    --qualifier BLUE \
    --provisioned-concurrent-executions 100

You should see output that looks like the following:

{
  "RequestedProvisionedConcurrentExecutions": 100,
  "AllocatedProvisionedConcurrentExecutions": 0,
  "Status": "IN_PROGRESS",
  "LastModified": "2023-01-21T11:30:00+0000"
}
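Because the allocation starts in the IN_PROGRESS state, a script usually has to wait before it can rely on the new environments. The following is a minimal sketch of the same request made through an SDK client, assuming a boto3 Lambda client (for example, boto3.client("lambda")) is passed in. The helper name, polling interval, and retry count are illustrative choices, not part of the Lambda API.

```python
import time


def allocate_provisioned_concurrency(client, function_name, qualifier, units,
                                     poll_seconds=5.0, max_polls=60):
    """Request provisioned concurrency, then poll until it leaves IN_PROGRESS.

    `client` is assumed to be a boto3 Lambda client.
    """
    client.put_provisioned_concurrency_config(
        FunctionName=function_name,
        Qualifier=qualifier,
        ProvisionedConcurrentExecutions=units,
    )
    for _ in range(max_polls):
        config = client.get_provisioned_concurrency_config(
            FunctionName=function_name,
            Qualifier=qualifier,
        )
        if config["Status"] != "IN_PROGRESS":
            return config  # the caller should check for READY or FAILED
        time.sleep(poll_seconds)
    raise TimeoutError("provisioned concurrency allocation is still IN_PROGRESS")
```

For the CLI example above, the call would be allocate_provisioned_concurrency(client, "my-function", "BLUE", 100).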

Accurately estimating required provisioned concurrency

If your function is currently serving traffic, you can view its concurrency in CloudWatch metrics. Specifically, the ConcurrentExecutions metric shows you the number of concurrent invocations for each function in your account.


Figure: Graph showing concurrency for a function over time.

The previous graph suggests that this function serves an average of 5 to 10 concurrent requests at any given time, and peaks at 20 requests on a typical day. Suppose that there are many other functions in your account. If this function is critical to your application and you need a low-latency response on every invocation, use a number greater than or equal to 20 as your provisioned concurrency setting.
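If you prefer to read the peak from the metric rather than from a graph, the following sketch queries ConcurrentExecutions for one function. It assumes a boto3 CloudWatch client (for example, boto3.client("cloudwatch")) is passed in. The helper name, the seven-day window, and the one-hour period are illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone


def peak_concurrency(cloudwatch, function_name, days=7, period_seconds=3600):
    """Return the highest ConcurrentExecutions value over the last `days` days.

    `cloudwatch` is assumed to be a boto3 CloudWatch client.
    """
    end = datetime.now(timezone.utc)
    start = end - timedelta(days=days)
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/Lambda",
        MetricName="ConcurrentExecutions",
        Dimensions=[{"Name": "FunctionName", "Value": function_name}],
        StartTime=start,
        EndTime=end,
        Period=period_seconds,
        Statistics=["Maximum"],
    )
    datapoints = response.get("Datapoints", [])
    # No datapoints means the function served no traffic in the window.
    return max((point["Maximum"] for point in datapoints), default=0)
```

A result of 20 for the function in the graph would match the peak described above.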

Alternatively, recall that you can also calculate concurrency using the following formula:

Concurrency = (average requests per second) * (average request duration in seconds)

Multiplying average requests per second by the average request duration in seconds gives you a rough estimate of how much concurrency you need to reserve. You can estimate average requests per second using the Invocations metric, and the average request duration in seconds using the Duration metric. See Working with Lambda function metrics for more details.

When working with provisioned concurrency, Lambda suggests including a 10% buffer on top of the amount of concurrency your function typically needs. For example, if your function usually peaks at 200 concurrent requests, set your provisioned concurrency at 220 instead (200 concurrent requests + 10% = 220 provisioned concurrency).
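The formula and the 10% buffer can be combined in a short calculation. This is a sketch with hypothetical function names. It rounds up to a whole number of environments, and rounds to six decimal places first so that floating-point noise doesn't add a spurious extra unit.

```python
import math


def estimate_concurrency(avg_requests_per_second, avg_duration_seconds):
    """Concurrency = (average requests per second) * (average duration in seconds)."""
    return avg_requests_per_second * avg_duration_seconds


def add_buffer(concurrency, buffer_percent=10):
    """Add a percentage buffer and round up to a whole number of environments."""
    buffered = concurrency * (100 + buffer_percent) / 100
    return math.ceil(round(buffered, 6))


# 100 requests per second at 500 ms each needs about 50 concurrent environments.
typical = estimate_concurrency(100, 0.5)
print(typical, add_buffer(typical))

# The example from the text: a peak of 200 concurrent requests plus 10%.
print(add_buffer(200))
```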

Optimizing latency with provisioned concurrency

The way you structure your function code to optimize for latency can depend on whether you choose provisioned concurrency or on-demand environments. For functions running on provisioned concurrency, Lambda runs any initialization code (for example, loading libraries and instantiating clients) at allocation time. So, putting as much initialization as possible outside of the main function handler is a good idea, since doing so won't impact latency during actual function invocations. In contrast, if you initialize libraries or instantiate clients within your main handler code, your function has to run this code each time you invoke it, regardless of whether you're using provisioned concurrency.

If you're using on-demand instances, Lambda may have to re-run your initialization code whenever a request requires a new execution environment (a cold start). Depending on what your function needs to achieve, you may choose to defer initialization for a specific capability until the function needs that capability. For example, consider the following control flow for a Lambda handler:

def handler(event, context):
    ...
    if some_condition:
        # Initialize CLIENT_A to perform a task
        ...
    else:
        # Do nothing
        pass

In the previous example, instead of initializing CLIENT_A outside of the main handler, the function author chose to initialize it within the if statement. By doing this, Lambda only runs this code if some_condition is satisfied. If the author initializes CLIENT_A outside the main handler, Lambda runs that code on every cold start, increasing overall latency.
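One way to defer initialization without repeating it on every invocation is to cache the client after first use. The following sketch assumes a hypothetical create_client_a constructor and a hypothetical needs_client_a event field. Neither comes from a real library.

```python
_client_a = None  # cached for the lifetime of the execution environment


def create_client_a():
    """Hypothetical stand-in for an expensive client constructor."""
    return {"name": "CLIENT_A"}


def get_client_a():
    """Create CLIENT_A on first use, then reuse it on later invocations."""
    global _client_a
    if _client_a is None:
        _client_a = create_client_a()
    return _client_a


def handler(event, context):
    if event.get("needs_client_a"):
        # Only requests that take this branch pay the initialization cost,
        # and only the first one in each execution environment.
        client = get_client_a()
        return {"used": client["name"]}
    return {"used": None}
```

With this structure, a cold start that never needs CLIENT_A skips its initialization, while warm invocations that do need it reuse the cached instance.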

It's possible for your function to use up all of its provisioned concurrency. To handle excess traffic, your function has to use on-demand instances. To help you determine what type of initialization Lambda used for a particular environment, check the value of the AWS_LAMBDA_INITIALIZATION_TYPE environment variable. This variable can have two possible values: provisioned-concurrency or on-demand. The value of AWS_LAMBDA_INITIALIZATION_TYPE is immutable and does not change over the lifetime of the execution environment.
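A function can read this variable to decide how much work to do during initialization. The following is a minimal sketch. The helper names and the "unknown" fallback (for code that runs outside Lambda, such as local tests) are assumptions made for illustration.

```python
import os


def initialization_type(environ=os.environ):
    """Return the value Lambda set for this environment, or "unknown"."""
    return environ.get("AWS_LAMBDA_INITIALIZATION_TYPE", "unknown")


def should_initialize_eagerly(environ=os.environ):
    """Initialize everything up front only when the environment is
    pre-initialized, where the cost doesn't add to request latency."""
    return initialization_type(environ) == "provisioned-concurrency"


if should_initialize_eagerly():
    # Load optional libraries and create clients here, at allocation time.
    pass
```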

If you use the .NET 6 or .NET 7 runtimes, you can configure the AWS_LAMBDA_DOTNET_PREJIT environment variable to improve the latency for functions, even if they don't use provisioned concurrency. The .NET runtime lazily compiles and initializes each library that your code calls for the first time. As a result, the first invocation of a Lambda function can take longer than subsequent invocations. To mitigate this, you can choose one of three values for AWS_LAMBDA_DOTNET_PREJIT:

  • ProvisionedConcurrency: Lambda performs ahead-of-time JIT compilation for all environments using provisioned concurrency. This is the default value.

  • Always: Lambda performs ahead-of-time JIT compilation for every environment, even if the function doesn't use provisioned concurrency.

  • Never: Lambda disables ahead-of-time JIT compilation for all environments.

For provisioned concurrency environments, your function's initialization code runs during allocation, and every few hours as Lambda recycles active instances of your environment. You can see the initialization time in logs and traces after an environment instance processes a request. However, Lambda bills you for initialization even if the environment instance never processes a request. Provisioned concurrency runs continually and is billed separately from initialization and invocation costs. For details, see AWS Lambda Pricing.

For additional guidance on optimizing functions using provisioned concurrency, see Lambda execution environments in Serverless Land.

Managing provisioned concurrency with Application Auto Scaling

You can use Application Auto Scaling to manage provisioned concurrency on a schedule or based on utilization. If you observe predictable patterns of traffic to your function, use scheduled scaling. If you want your function to maintain a specific utilization percentage, use a target tracking scaling policy.

Scheduled scaling

With Application Auto Scaling, you can set your own scaling schedule according to predictable load changes. For more information and examples, see Scheduled scaling for Application Auto Scaling in the Application Auto Scaling User Guide, and Scheduling AWS Lambda Provisioned Concurrency for recurring peak usage on the AWS Compute Blog.
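As a rough sketch, a scheduled action can be created through the Application Auto Scaling PutScheduledAction operation. The example assumes a boto3 client (for example, boto3.client("application-autoscaling")) is passed in and that the alias is already registered as a scalable target. The helper name, action name, and schedule shown in the usage note are hypothetical.

```python
def schedule_provisioned_concurrency(autoscaling, function_name, alias,
                                     action_name, schedule,
                                     min_capacity, max_capacity):
    """Create or update a scheduled action for an alias.

    `autoscaling` is assumed to be a boto3 Application Auto Scaling client.
    """
    return autoscaling.put_scheduled_action(
        ServiceNamespace="lambda",
        ScheduledActionName=action_name,
        ResourceId=f"function:{function_name}:{alias}",
        ScalableDimension="lambda:function:ProvisionedConcurrency",
        Schedule=schedule,
        ScalableTargetAction={
            "MinCapacity": min_capacity,
            "MaxCapacity": max_capacity,
        },
    )
```

For example, passing schedule="cron(0 8 ? * MON-FRI *)" with min_capacity=50 and max_capacity=100 would raise the floor to 50 units at 08:00 UTC on weekdays. A second action with a later schedule would be needed to lower it again.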

Target tracking

With target tracking, Application Auto Scaling creates and manages a set of CloudWatch alarms based on how you define your scaling policy. When these alarms activate, Application Auto Scaling automatically adjusts the number of environments allocated using provisioned concurrency. Target tracking is ideal for applications that don't have predictable traffic patterns.

To scale provisioned concurrency using target tracking, use the RegisterScalableTarget and PutScalingPolicy Application Auto Scaling API operations. For example, if you're using the AWS Command Line Interface (CLI), follow these steps:

  1. Register a function's alias as a scaling target. The following example registers the BLUE alias of a function named my-function:

    aws application-autoscaling register-scalable-target --service-namespace lambda \
        --resource-id function:my-function:BLUE --min-capacity 1 --max-capacity 100 \
        --scalable-dimension lambda:function:ProvisionedConcurrency

  2. Apply a scaling policy to the target. The following example configures Application Auto Scaling to adjust the provisioned concurrency configuration for an alias to keep utilization near 70 percent.

    aws application-autoscaling put-scaling-policy \
        --service-namespace lambda \
        --scalable-dimension lambda:function:ProvisionedConcurrency \
        --resource-id function:my-function:BLUE \
        --policy-name my-policy \
        --policy-type TargetTrackingScaling \
        --target-tracking-scaling-policy-configuration '{ "TargetValue": 0.7, "PredefinedMetricSpecification": { "PredefinedMetricType": "LambdaProvisionedConcurrencyUtilization" }}'

You should see output that looks like this:

{
  "PolicyARN": "arn:aws:autoscaling:us-east-2:123456789012:scalingPolicy:12266dbb-1524-xmpl-a64e-9a0a34b996fa:resource/lambda/function:my-function:BLUE:policyName/my-policy",
  "Alarms": [
    {
      "AlarmName": "TargetTracking-function:my-function:BLUE-AlarmHigh-aed0e274-xmpl-40fe-8cba-2e78f000c0a7",
      "AlarmARN": "arn:aws:cloudwatch:us-east-2:123456789012:alarm:TargetTracking-function:my-function:BLUE-AlarmHigh-aed0e274-xmpl-40fe-8cba-2e78f000c0a7"
    },
    {
      "AlarmName": "TargetTracking-function:my-function:BLUE-AlarmLow-7e1a928e-xmpl-4d2b-8c01-782321bc6f66",
      "AlarmARN": "arn:aws:cloudwatch:us-east-2:123456789012:alarm:TargetTracking-function:my-function:BLUE-AlarmLow-7e1a928e-xmpl-4d2b-8c01-782321bc6f66"
    }
  ]
}

Application Auto Scaling creates two alarms in CloudWatch. The first alarm triggers when the utilization of provisioned concurrency consistently exceeds 70%. When this happens, Application Auto Scaling allocates more provisioned concurrency to reduce utilization. The second alarm triggers when utilization is consistently less than 63% (90% of the 70% target). When this happens, Application Auto Scaling reduces the alias's provisioned concurrency.
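The same two steps can be made through an SDK. The following sketch mirrors the CLI commands above, assuming a boto3 Application Auto Scaling client (for example, boto3.client("application-autoscaling")) is passed in. The helper name and default values are illustrative.

```python
def configure_target_tracking(autoscaling, function_name, alias,
                              min_capacity, max_capacity,
                              target_utilization=0.7, policy_name="my-policy"):
    """Register an alias as a scalable target and attach a target tracking policy.

    `autoscaling` is assumed to be a boto3 Application Auto Scaling client.
    """
    resource_id = f"function:{function_name}:{alias}"
    dimension = "lambda:function:ProvisionedConcurrency"

    # Step 1: register the alias as a scaling target.
    autoscaling.register_scalable_target(
        ServiceNamespace="lambda",
        ResourceId=resource_id,
        ScalableDimension=dimension,
        MinCapacity=min_capacity,
        MaxCapacity=max_capacity,
    )

    # Step 2: keep provisioned concurrency utilization near the target.
    return autoscaling.put_scaling_policy(
        PolicyName=policy_name,
        ServiceNamespace="lambda",
        ResourceId=resource_id,
        ScalableDimension=dimension,
        PolicyType="TargetTrackingScaling",
        TargetTrackingScalingPolicyConfiguration={
            "TargetValue": target_utilization,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "LambdaProvisionedConcurrencyUtilization"
            },
        },
    )
```

The return value of the second call carries the policy ARN and the alarms that Application Auto Scaling created.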

In the following example, a function scales between a minimum and maximum amount of provisioned concurrency based on utilization.


Figure: Autoscaling provisioned concurrency with Application Auto Scaling target tracking. Legend: function instances, open requests, provisioned concurrency, standard concurrency.

When the number of open requests increases, Application Auto Scaling increases provisioned concurrency in large steps until it reaches the configured maximum. Once it reaches the maximum, the function can continue to scale on standard, unreserved concurrency if your account hasn't reached its account concurrency limit. When utilization drops and stays consistently low, Application Auto Scaling decreases provisioned concurrency in smaller periodic steps.

Both of the alarms that Application Auto Scaling manages use the average statistic by default. Functions whose traffic arrives in quick bursts may not trigger these alarms. For example, suppose your Lambda function executes quickly (in roughly 20-100 ms) and your traffic comes in quick bursts. In this case, the number of requests may exceed the allocated provisioned concurrency during a burst, but the burst load must be sustained for at least 3 minutes before Application Auto Scaling provisions additional environments. Additionally, both CloudWatch alarms require 3 data points that hit the target average before activating the auto scaling policy.
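The effect of averaging can be illustrated with a simplified model. The sketch below assumes one average-utilization data point per minute and treats the scale-out alarm as firing after 3 consecutive data points above the target. It models only the behavior described above, not how CloudWatch evaluates alarms internally, and the function names are hypothetical.

```python
def alarm_thresholds(target, low_ratio=0.9):
    """Return (scale-out threshold, scale-in threshold) for a target utilization."""
    return target, target * low_ratio


def scale_out_alarm_fires(avg_utilization_per_minute, target=0.7, datapoints=3):
    """True if `datapoints` consecutive per-minute averages exceed the target."""
    consecutive = 0
    for value in avg_utilization_per_minute:
        consecutive = consecutive + 1 if value > target else 0
        if consecutive >= datapoints:
            return True
    return False


# Short bursts: individual minutes spike, but never 3 in a row.
bursty = [0.10, 0.95, 0.10, 0.10, 0.95, 0.10]
# Sustained load: utilization stays above the target for 3 minutes.
sustained = [0.50, 0.80, 0.85, 0.90]

print(scale_out_alarm_fires(bursty), scale_out_alarm_fires(sustained))
```

In this model the bursty series never triggers scale-out even though some minutes approach full utilization, which is why bursty functions may need a higher fixed allocation or a lower target value.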

For more information on target tracking scaling policies, see Target tracking scaling policies for Application Auto Scaling.