Monitoring concepts in Lambda-based applications - AWS Lambda

The main concepts that apply to monitoring any distributed system also apply to Lambda-based applications. Observability relies on several key concepts:

  • Metrics: Numeric data measured at various time intervals (time series data) and service-level indicators (request rate, error rate, duration, CPU, etc.). The Lambda service automatically publishes a number of metrics for Lambda functions, and you can create new metrics for your specific use case. The process of exposing new metrics from your code is called instrumentation.

  • Logs: Timestamped records of discrete events that happened within an application or system, such as a failure, an error, or a state transformation. In Lambda, the default logging service is Amazon CloudWatch Logs. You can also use third-party logging systems, if preferred.

  • Alerts: Monitoring systems observe metrics in relation to thresholds, and can provide alerts if metrics fall outside expected bounds or become anomalous. This enables operators to receive notifications instead of constantly monitoring the system. For Lambda, CloudWatch alarms are used for this purpose.

  • Visualization: Converting metrics to a visual format that enables fast, intuitive interpretation of performance. These visualizations can then be grouped in dashboards.

  • Tracing: Following a single request end-to-end throughout a system composed of multiple microservices.
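As an illustration of instrumentation, the following is a minimal sketch of a Lambda handler that emits a custom metric by printing a log line in the CloudWatch embedded metric format (EMF). The namespace `ExampleApp`, the metric name `OrderItems`, the dimension `Service`, and the shape of the incoming event are all hypothetical and are not taken from this page.

```python
import json
import time


def build_emf_record(namespace, metric_name, value, unit, dimensions):
    """Return a dict laid out in CloudWatch embedded metric format (EMF)."""
    record = {
        "_aws": {
            "Timestamp": int(time.time() * 1000),  # milliseconds since epoch
            "CloudWatchMetrics": [
                {
                    "Namespace": namespace,
                    # One dimension set containing every dimension name.
                    "Dimensions": [list(dimensions.keys())],
                    "Metrics": [{"Name": metric_name, "Unit": unit}],
                }
            ],
        },
        # The metric value is a top-level key named after the metric.
        metric_name: value,
    }
    # Dimension values are also top-level keys.
    record.update(dimensions)
    return record


def lambda_handler(event, context):
    # Hypothetical business metric: number of items in the incoming event.
    item_count = len(event.get("items", []))
    record = build_emf_record(
        namespace="ExampleApp",            # assumed namespace
        metric_name="OrderItems",          # assumed metric name
        value=item_count,
        unit="Count",
        dimensions={"Service": "orders"},  # assumed dimension
    )
    # Anything printed to stdout in Lambda is sent to CloudWatch Logs.
    print(json.dumps(record))
    return {"itemCount": item_count}
```

Because the metric travels as a structured log line, the same record serves as both a log entry and a metric data point, and the handler makes no extra API call.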

Application monitoring can also help you detect performance problems, outages, and errors in your workloads. Since Lambda-based applications often combine multiple services, it’s important that you monitor each service endpoint. AWS provides integrated tools that let you view the performance, throughput, and errors of the event sources for Lambda functions, together with the code in the functions.

Many existing application performance management (APM) concepts still apply to monitoring and managing serverless applications to maintain an expected level of service. Broadly, there are two sets of performance metrics. The first set is end-user experience metrics, which focus on load-based values such as average response time. The performance profile of an application can change under load and reveal issues that were not found during development. End-to-end response times for an end-user action also provide a measurement of application performance.

The second set is resource management metrics, which can highlight whether systems are running close to resource limits. These can help ensure that a workload is making efficient use of the available resources. In traditional server-based applications, you measure whether adequate compute resources are available under peak load. In serverless applications, you typically measure the concurrency of services instead, and how close your application is running to its Service Quotas.
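To make the concurrency measurement concrete, here is a small sketch that estimates concurrency as request rate multiplied by average duration and compares it with a quota. The traffic figures and the quota of 1,000 concurrent executions are assumptions chosen for illustration; actual quotas vary by account and Region.

```python
def estimated_concurrency(requests_per_second, average_duration_seconds):
    """Estimate concurrent executions as request rate times average duration."""
    return requests_per_second * average_duration_seconds


def quota_utilization(concurrency, concurrency_quota):
    """Return the fraction of the concurrency quota currently in use."""
    if concurrency_quota <= 0:
        raise ValueError("concurrency_quota must be positive")
    return concurrency / concurrency_quota


# Assumed peak load: 200 requests per second, 0.5 seconds average duration.
peak = estimated_concurrency(200, 0.5)
# Assumed quota of 1,000 concurrent executions (check your own account).
utilization = quota_utilization(peak, 1000)
print(f"Estimated peak concurrency: {peak:.0f} ({utilization:.0%} of quota)")
```

With these assumed numbers the function would need about 100 concurrent executions at peak, or 10% of the assumed quota. A sustained rise in either request rate or duration moves the workload closer to the limit.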

For monitoring and observing Lambda functions, the most important broad metrics are:

  • Errors: these may be caused by logic or runtime errors in the code, or by interactions with the Lambda service or other services. They may also be caused by other factors, such as a lack of permissions or exceeding the assigned resources.

  • Execution time: measuring average response times provides only a limited view of performance in distributed applications. It’s important to capture and monitor performance at percentiles (such as the 95th and 99th) to measure the performance of the slowest 5% and 1% of requests.

  • Throttling: serverless applications use scalable resources with Service Quotas to protect customers. Throttling may indicate that quotas are set incorrectly, there is an error in the application architecture, or traffic levels are beyond the expected limits.
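The point about percentiles can be seen in a short worked example. The sketch below uses the nearest-rank definition of a percentile, which is one of several common definitions and is not necessarily how any particular monitoring service computes it. The duration samples are invented for illustration.

```python
import math


def percentile(values, pct):
    """Return the nearest-rank percentile of values for 0 < pct <= 100."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(values)
    # Nearest rank: the smallest rank covering at least pct percent of samples.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Invented sample: 90 fast requests, 8 slower ones, and 2 very slow ones.
durations_ms = [100] * 90 + [400] * 8 + [3000] * 2

average = sum(durations_ms) / len(durations_ms)
p95 = percentile(durations_ms, 95)
p99 = percentile(durations_ms, 99)
print(f"average={average} ms, p95={p95} ms, p99={p99} ms")
```

For this sample the average is 182 ms, while the 95th percentile is 400 ms and the 99th percentile is 3000 ms. The average alone hides the fact that some requests take 30 times longer than the typical request.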