Important metrics for CloudWatch - AWS Lambda

Lambda reports some metrics directly to the CloudWatch service, and these do not appear in the logs. With CloudWatch, you can create alarms that monitor metrics and notify you when a metric exceeds its typical values. You can also create composite alarms that combine multiple alarms to produce more useful notifications. You can create alarms either manually in the console or in an AWS SAM template, so that the alarm is defined together with your application's resources.
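As an illustration of what such an alarm definition contains, the following minimal sketch builds the parameters for an alarm on a single Lambda metric. It assumes the alarm is created programmatically rather than in the console or a SAM template. The function name `orders-handler`, the helper `build_alarm_params`, and the default period and threshold are hypothetical values chosen for the example. The dictionary keys follow the CloudWatch PutMetricAlarm API.

```python
def build_alarm_params(function_name, metric_name, threshold=0, period_seconds=60):
    """Return keyword arguments shaped for the CloudWatch PutMetricAlarm API.

    The alarm name format, threshold, and period are illustrative assumptions.
    """
    return {
        "AlarmName": f"{function_name}-{metric_name}",
        "Namespace": "AWS/Lambda",
        "MetricName": metric_name,
        # Scope the alarm to one function rather than the whole account.
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Statistic": "Sum",
        "Period": period_seconds,
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        # With a threshold of 0, any non-zero sum triggers the alarm.
        "ComparisonOperator": "GreaterThanThreshold",
        # Periods with no data points are not treated as failures.
        "TreatMissingData": "notBreaching",
    }


params = build_alarm_params("orders-handler", "Throttles")

# With boto3 installed and AWS credentials configured, the alarm could then
# be created with something like:
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```

Building the parameters separately from the API call keeps the definition easy to inspect and test without AWS access.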

There are a number of important built-in Lambda metrics that you can monitor to understand the performance of your workload:

  • Invocations: Monitor this value as a general barometer for the amount of traffic flowing through your serverless application. Also, monitor for functions with zero invocations over a given period of time, since this can indicate that there are upstream problems in the application. It can also indicate that a function is no longer being used.

  • Duration: This is the amount of time taken for a Lambda invocation. Apart from the impact on cost, it’s also important to monitor any functions that are running close to their timeout value.

  • Errors: This counts the number of errors thrown by a function. Divide it by the Invocations metric for the same period to calculate the error rate as a percentage.

  • Throttles: Set alarms on this metric for any non-zero value, since throttling only occurs when the number of invocations exceeds the concurrency available in your account. You can set Reserved Concurrency on critical functions, as discussed in Application design, and also request Service Quota increases, if needed.

  • DeadLetterErrors: An error is triggered if Lambda cannot write to the designated dead-letter queue, so you should alarm on any non-zero values for this metric.

  • IteratorAge: For Lambda functions that poll streaming sources, such as Kinesis or DynamoDB streams, a growing value indicates that events are being produced faster than Lambda is consuming them. IteratorAge is the difference between the current time and the time when the last record of the GetRecords call was written to the stream.

  • ConcurrentExecutions: Monitor this value to ensure that your functions are not running close to the total concurrency limit for your AWS account. You can request a Service Quota increase, if needed.

  • UnreservedConcurrentExecutions: Similar to the previous metric, but excludes functions that use reserved concurrency.
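Several of the checks in the list above are simple arithmetic on metric values. The following sketch shows three of them, assuming the values have already been retrieved from CloudWatch. The helper names and the 80% warning fraction are hypothetical. Duration is reported in milliseconds, while a function's timeout is configured in seconds.

```python
def error_rate_percent(invocations, errors):
    """Percentage of invocations that ended in a function error."""
    if invocations == 0:
        return 0.0  # no traffic in the period, so there is no error rate to report
    return 100.0 * errors / invocations


def near_timeout(max_duration_ms, timeout_seconds, fraction=0.8):
    """True if the slowest invocation used at least `fraction` of the timeout.

    The 0.8 default is an illustrative assumption, not a recommended value.
    """
    return max_duration_ms >= fraction * timeout_seconds * 1000


def concurrency_utilization(concurrent_executions, account_limit):
    """Fraction of the account concurrency limit currently in use."""
    return concurrent_executions / account_limit


print(error_rate_percent(200, 5))          # errors as a share of invocations
print(near_timeout(2700, 3))               # 2.7 s observed against a 3 s timeout
print(concurrency_utilization(900, 1000))  # share of the account limit in use
```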

There are also additional metrics to monitor the performance of asynchronous event processing:

  • AsyncEventsReceived: A measure of the total number of events that Lambda successfully queued for processing. Alarm on an unexpected number of events sent by an event source to diagnose trigger misconfigurations or runaway functions.

  • AsyncEventAge: A measure of time between Lambda successfully queuing the event and invoking the function. This metric increases if events are getting retried due to invocation failures or throttles. You can use this metric to alarm on processing delays.

  • AsyncEventsDropped: A measure of the total number of events that were dropped without successfully invoking the function. If an OnFailure destination or dead-letter queue (DLQ) is configured, events are sent to it before being dropped.
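The asynchronous metrics above can be combined into a simple health check. The sketch below assumes the metric values for a period have already been fetched and that AsyncEventAge is expressed in milliseconds. The helper name, the finding strings, and the 60-second age threshold are hypothetical and would need tuning for a real workload.

```python
def async_health(events_received, events_dropped, max_event_age_ms,
                 max_age_threshold_ms=60_000):
    """Return a list of findings for one period of asynchronous metrics.

    An empty list means nothing needs attention under these assumed rules.
    """
    findings = []
    if events_dropped > 0:
        # Dropped events never invoked the function successfully.
        findings.append("events dropped")
    if max_event_age_ms > max_age_threshold_ms:
        # A high age suggests retries caused by invocation failures or throttles.
        findings.append("processing delayed")
    if events_received == 0:
        # No queued events may point to an upstream or trigger problem.
        findings.append("no events received")
    return findings


print(async_health(100, 3, 120_000))
```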

Common Lambda errors often have distinct patterns in standard Lambda metrics. For example, a function that was running normally but starts to experience timeouts may generate Duration and Errors charts like those shown below. The Duration values are tightly grouped just above a whole number of seconds (1.00–1.02 seconds), which is consistent with the function hitting its timeout, and the Errors count matches the Invocations count, indicating that every invocation is failing.

[Figure 4: monitoring and observability, Duration and error count charts]
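The timeout pattern described above can also be checked programmatically. This is a minimal sketch that assumes the Duration data points (in milliseconds) and the Invocations and Errors counts for a period are already available. The helper name and the 50 ms tolerance are hypothetical.

```python
def looks_like_timeout(durations_ms, invocations, errors, timeout_seconds,
                       tolerance_ms=50):
    """Heuristic: every invocation fails and every duration sits at the timeout."""
    if invocations == 0 or not durations_ms:
        return False  # nothing to judge
    if errors != invocations:
        return False  # at least one invocation succeeded
    timeout_ms = timeout_seconds * 1000
    # All durations must be clustered within the tolerance of the timeout value.
    return all(abs(d - timeout_ms) <= tolerance_ms for d in durations_ms)


# Durations grouped between 1.00 and 1.02 seconds with a 1-second timeout.
print(looks_like_timeout([1000.5, 1012.3, 1019.8], invocations=3, errors=3,
                         timeout_seconds=1))
```

A check like this only flags a candidate pattern. Confirming the cause still requires looking at the function's logs.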