Troubleshooting integration errors - AWS Lambda

Running an unintended function version or alias

When you deploy a new Lambda function in the console or with AWS SAM, the latest code is represented by $LATEST, which is the unpublished version of the function rather than an alias. Once deployed, any event that invokes the function without a qualifier, or with the $LATEST qualifier, automatically runs the latest code.

There are many situations where you may use other versions and aliases of Lambda functions. For example, you may use canary deployments to shift traffic between versions behind an alias as part of a deployment process. Or you may use Provisioned Concurrency, which can only be applied to a published version or alias of a function, not to the $LATEST version. If you use Serverless Framework, deploying code publishes a new function version by default.

In any of these cases, there are immutable published versions of a function in addition to the $LATEST version. When troubleshooting these functions, first confirm that the caller has invoked the intended version or alias. Next, when checking the logs, verify which version of the function was invoked. The version is always shown in the START log line:

debugging ops figure 1
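
As a rough illustration of the check described above, the sketch below pulls the version out of a START log line and, inside a handler, prints the version reported by the Python runtime's context object. The helper name `invoked_version` is hypothetical, and the request IDs you pass in would come from your own logs.

```python
import re

# Matches the START line that Lambda writes at the beginning of each invocation,
# for example: "START RequestId: <id> Version: $LATEST"
START_LINE = re.compile(
    r"^START RequestId: (?P<request_id>\S+) Version: (?P<version>\S+)"
)


def invoked_version(log_line):
    """Return the function version named in a START log line, or None."""
    match = START_LINE.match(log_line.strip())
    return match.group("version") if match else None


def handler(event, context):
    # The context object also exposes the version that is running and the ARN
    # (including any alias qualifier) that was used for this invocation.
    print(f"version={context.function_version} arn={context.invoked_function_arn}")
    return {"version": context.function_version}
```

Logging the version from inside the handler can make it easier to correlate application log lines with the version that produced them, without searching for the matching START line.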

Triggering infinite loops

There are two types of infinite loops in Lambda functions. The first is within the function itself: a loop in the code never exits, so the invocation only ends when the function times out. You can identify these loops by monitoring timeouts (see chapter 5), and the remediation is to fix the looping behavior in the code.

The second type of loop is between Lambda functions and other AWS resources. These occur when an event from a resource like an S3 bucket invokes a Lambda function, which then interacts with the same source resource to trigger another event. This invokes the function again, which in turn creates another interaction with the same S3 bucket, and so on. These types of loops can be caused by a number of different AWS event sources, including SQS queues and DynamoDB tables.

debugging ops figure 2
  1. An SQS queue triggers a Lambda function, which creates new messages for the same SQS queue.

  2. A DynamoDB table triggers a Lambda function via a stream, which writes new items back to the same DynamoDB table.

  3. An S3 bucket invokes a Lambda function, which then interacts with the same S3 bucket to produce another invocation.

Unlike a typical in-process loop, the scaling behaviors of the services can cause such loops to consume considerable resources. Lambda quickly scales up functions as more traffic arrives, which causes more interaction with the triggering resource. The loop continues to grow rapidly until Lambda reaches its concurrency limit or the resource producing the events is throttled by a service quota.

You can avoid these loops by ensuring that Lambda functions write to a different resource from the one they consume events from, and by implementing circuit breakers for more complex loop patterns. If you must publish data back to the consuming resource, ensure that the new data does not trigger the same event, or that the Lambda function can filter out the events it caused. For example:

  • If you write back to the same S3 bucket, use a different prefix or suffix from the event trigger, or use an object meta tag that the consuming function can filter on. For a detailed example, see this repo.

  • If you write items to the same DynamoDB table, include an attribute that a consuming Lambda function can filter on, and exit if the attribute is found.
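
The two filtering ideas above can be sketched as guard functions in a handler. This is a minimal illustration, not a complete implementation: the `processed/` prefix and the `processedBy` attribute are hypothetical names, and the DynamoDB check assumes the stream is configured to include the new item image.

```python
PROCESSED_PREFIX = "processed/"      # hypothetical prefix for objects this function writes
PROCESSED_ATTRIBUTE = "processedBy"  # hypothetical marker attribute on written items


def should_process_s3_record(record):
    """Skip objects that sit under the prefix this function writes to."""
    key = record["s3"]["object"]["key"]
    return not key.startswith(PROCESSED_PREFIX)


def should_process_dynamodb_record(record):
    """Skip stream records for items that already carry the marker attribute."""
    new_image = record.get("dynamodb", {}).get("NewImage", {})
    return PROCESSED_ATTRIBUTE not in new_image


def handler(event, context):
    handled = 0
    for record in event.get("Records", []):
        source = record.get("eventSource")
        if source == "aws:s3" and should_process_s3_record(record):
            handled += 1  # process the object, then write output under PROCESSED_PREFIX
        elif source == "aws:dynamodb" and should_process_dynamodb_record(record):
            handled += 1  # process the item, then write it back with PROCESSED_ATTRIBUTE set
    return {"handled": handled}
```

Filtering inside the function still costs one invocation per event. Where the event source supports it, such as prefix and suffix filters on S3 event notifications, filtering at the source avoids the extra invocation altogether.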

Downstream unavailability

For Lambda functions that call out to third-party endpoints or other downstream resources, you should ensure they can handle service errors and timeouts. These downstream resources may have variable response times or may become unavailable due to service disruptions. Depending on the implementation, downstream errors that are not handled in the function's code may surface as Lambda function timeouts or exceptions.

Any time a function depends on a downstream service, such as an API call, you should implement appropriate error handling and retry logic. For critical services, the Lambda function should publish metrics or logs to CloudWatch so that you can create alarms on them. For example, if a third-party payment API becomes unavailable, a Lambda function can log this information, and a CloudWatch alarm can then send you a notification.
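
One possible shape for the error handling and retry logic described above is a small wrapper that retries with exponential backoff and logs each failure, so that a CloudWatch metric filter or alarm can pick the failures up. The names `call_with_retries` and `DownstreamUnavailable`, and the default attempt count and delay, are illustrative assumptions rather than recommended values.

```python
import logging
import time

logger = logging.getLogger(__name__)


class DownstreamUnavailable(Exception):
    """Raised when the downstream call still fails after all attempts."""


def call_with_retries(call, attempts=3, base_delay=0.2, sleep=time.sleep):
    """Run call(); on failure, retry with exponential backoff, then give up."""
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except Exception as error:  # narrow this to the client's error types in real code
            # Each failure is logged so it can be counted and alarmed on.
            logger.warning(
                "downstream call failed (attempt %d/%d): %s", attempt, attempts, error
            )
            if attempt == attempts:
                raise DownstreamUnavailable(str(error)) from error
            sleep(base_delay * 2 ** (attempt - 1))
```

The total time spent in retries and backoff needs to stay well inside the function's configured timeout. Otherwise the invocation ends as a timeout before the final error is logged.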

Since the Lambda service can scale invocations quickly, non-serverless downstream services may struggle to handle spikes in traffic. There are three common approaches to handling this:

  • Caching: where data in the third-party service does not change frequently or may return the same values across many requests, consider caching the result in either a global object in your function, or another service. For example, the results for a product list query from an Amazon RDS instance could be saved for a period of time within the function to prevent redundant queries.

  • Queuing: when saving or updating data, adding an SQS queue between the Lambda function and the resource can smooth out traffic. The queue durably persists data while the downstream service processes messages. For an example, see the Serverless Document Repository application.

  • Proxies: where long-lived connections are typically used, such as for RDS instances, use a proxy layer to pool and reuse those connections. For relational databases, Amazon RDS Proxy is a service designed to help improve scalability and resiliency in serverless applications.