Durable execution SDK - AWS Lambda

The durable execution SDK is the foundation for building durable functions. It provides primitives to checkpoint progress, handle retries, and manage execution flow. The SDK abstracts the complexity of checkpoint management and replay, letting you write sequential code that automatically becomes fault-tolerant.

The SDK is available for JavaScript, TypeScript, Python, and Java. For complete API documentation, quickstart tutorials, and language-specific guides, see the AWS Durable Execution SDK Developer Guide.

What the SDK does

Checkpoint management: The SDK automatically creates checkpoints as your function executes durable operations. Each checkpoint records the operation type, inputs, and results. When your function completes a step, the SDK persists the checkpoint before continuing. This ensures your function can resume from any completed operation if interrupted.

Replay coordination: When your function resumes after a pause or interruption, the SDK performs replay. It runs your code from the beginning but skips completed operations, using stored checkpoint results instead of re-executing them. The SDK ensures replay is deterministic. Given the same inputs and checkpoint log, your function produces the same results.

State isolation: The SDK maintains execution state separately from your business logic. Each durable execution has its own checkpoint log that other executions cannot access. The SDK encrypts checkpoint data at rest and ensures state remains consistent across replays.

For a detailed explanation of how checkpointing and replay work, see Key concepts in the AWS Durable Execution SDK Developer Guide.
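The checkpoint-and-replay behavior described in this section can be illustrated with a small, self-contained simulation. This is a sketch only: `ToyDurableContext`, its `step` method, and the `handler` function are hypothetical names invented for illustration and are not the SDK's API. An in-memory dictionary stands in for the checkpoint log.

```python
from typing import Any, Callable, Dict, List


class ToyDurableContext:
    """Hypothetical stand-in for a durable context (not the real SDK API)."""

    def __init__(self, checkpoint_log: Dict[str, Any]) -> None:
        # The log outlives a single invocation, like a persisted checkpoint log.
        self.checkpoint_log = checkpoint_log
        # Names of the steps that actually ran during this invocation.
        self.executed: List[str] = []

    def step(self, name: str, fn: Callable[[], Any]) -> Any:
        # Replay: a completed step returns its stored result without re-running.
        if name in self.checkpoint_log:
            return self.checkpoint_log[name]
        result = fn()
        self.executed.append(name)
        # Record the checkpoint before moving on to the next operation.
        self.checkpoint_log[name] = result
        return result


def handler(ctx: ToyDurableContext, order_id: str, interrupt: bool = False) -> str:
    reserved = ctx.step("reserve", lambda: f"reserved:{order_id}")
    if interrupt:
        raise RuntimeError("simulated interruption")
    charged = ctx.step("charge", lambda: f"charged:{order_id}")
    return f"{reserved}|{charged}"


log: Dict[str, Any] = {}

# First invocation: interrupted after the "reserve" step is checkpointed.
first = ToyDurableContext(log)
try:
    handler(first, "o-1", interrupt=True)
except RuntimeError:
    pass

# Second invocation: the code runs from the beginning against the same log.
second = ToyDurableContext(log)
final_result = handler(second, "o-1")
```

In this toy run, the first invocation executes only `reserve`. The second invocation reads the `reserve` result from the log and executes only `charge`, which is the skip-completed-operations behavior that replay relies on.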

Durable operations

The SDK provides your function with a DurableContext object. This context replaces the standard Lambda context and provides methods for creating checkpoints, managing execution flow, and coordinating with external systems.

The DurableContext provides the following operations for building durable workflows:

| Operation | Description |
| --- | --- |
| Step | Execute and checkpoint a unit of work with configurable retry strategies and execution semantics. |
| Wait | Pause execution for a specified duration without consuming compute resources. |
| Wait for Condition | Poll for a condition with automatic checkpointing between attempts. |
| Callback | Pause execution and wait for an external system to provide input through the Lambda API. |
| Invoke | Call another Lambda function and wait for its result, with automatic checkpointing. |
| Parallel | Execute multiple operations concurrently with configurable completion policies. |
| Map | Process each item in a collection concurrently with optional concurrency control. |
| Child Context | Create an isolated execution context for grouping multiple operations. |

Each durable operation creates checkpoints automatically, ensuring your function can resume from any point. For detailed API reference, code examples, and language-specific usage, see SDK Reference in the AWS Durable Execution SDK Developer Guide.

How durable operations are metered

Each durable operation you call through DurableContext creates checkpoints to track execution progress and store state data. These operations incur charges based on their usage, and the checkpoints may contain data that contributes to your data write and retention costs. Stored data includes invocation event data, payloads returned from steps, and data passed when completing callbacks. Understanding how durable operations are metered helps you estimate execution costs and optimize your workflows. For details on pricing, see the Lambda pricing page.

Payload size refers to the size of the serialized data that a durable operation persists. It is measured in bytes and can vary depending on the serializer the operation uses. The payload is the operation's result when it completes successfully, or the serialized error object when it fails.
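As a rough illustration of why payload size depends on the serializer, the following sketch measures the same value under two JSON encodings. It assumes a JSON serializer and UTF-8 encoding, which may not match what your operation uses. `payload_size_bytes` is a hypothetical helper, not an SDK function.

```python
import json
from typing import Any


def payload_size_bytes(value: Any) -> int:
    """Size in bytes of the value as compact, UTF-8 encoded JSON (assumed serializer)."""
    serialized = json.dumps(value, separators=(",", ":"))
    return len(serialized.encode("utf-8"))


step_result = {"orderId": "o-1", "total": 42}

# Compact encoding: {"orderId":"o-1","total":42}
compact_size = payload_size_bytes(step_result)

# The same value, pretty-printed, serializes to more bytes.
pretty_size = len(json.dumps(step_result, indent=2).encode("utf-8"))
```

Here `compact_size` is 28 bytes, while `pretty_size` is larger because of the added whitespace, even though both encodings represent the same result.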

Basic operations

Basic operations are the fundamental building blocks for durable functions:

| Operation | Checkpoint timing | Number of operations | Data persisted |
| --- | --- | --- | --- |
| Execution | Started | 1 | Input payload size |
| Execution | Completed (Succeeded/Failed/Stopped) | 0 | Output payload size |
| Step | Retry/Succeeded/Failed | 1 + N retries | Returned payload size from each attempt |
| Wait | Started | 1 | N/A |
| WaitForCondition | Each poll attempt | 1 + N polls | Returned payload size from each poll attempt |
| Invocation-level Retry | Started | 1 | Payload for error object |
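To estimate how many operations a workflow built from basic operations would be metered for, you can apply the "Number of operations" column directly. The following sketch does that arithmetic. `BasicUsage` and `count_basic_operations` are hypothetical names used only for this example, and the result is an estimate of operation counts, not of charges.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BasicUsage:
    # Retries per Step, for example [0, 2] means two steps, the second retried twice.
    step_retries: List[int] = field(default_factory=list)
    # Number of Wait operations.
    waits: int = 0
    # N per WaitForCondition, counted as 1 + N polls.
    condition_polls: List[int] = field(default_factory=list)
    # Number of invocation-level retries.
    invocation_retries: int = 0


def count_basic_operations(usage: BasicUsage) -> int:
    total = 1  # Execution started; execution completion counts as 0
    total += sum(1 + retries for retries in usage.step_retries)  # 1 + N retries
    total += usage.waits  # 1 per Wait
    total += sum(1 + polls for polls in usage.condition_polls)  # 1 + N polls
    total += usage.invocation_retries  # 1 per invocation-level retry
    return total


usage = BasicUsage(step_retries=[0, 2], waits=1, condition_polls=[3])
estimated_operations = count_basic_operations(usage)
```

For this example the estimate is 10 operations: 1 for the execution start, 4 for the two steps (1 + 0 and 1 + 2), 1 for the wait, and 4 for the condition (1 + 3).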

Callback operations

Callback operations enable your function to pause and wait for external systems to provide input. These operations create checkpoints when the callback is created and when it's completed:

| Operation | Checkpoint timing | Number of operations | Data persisted |
| --- | --- | --- | --- |
| CreateCallback | Started | 1 | N/A |
| Callback completion via API call | Completed | 0 | Callback payload |
| WaitForCallback | Started | 3 + N retries (context + callback + step) | Payloads returned by submitter step attempts, plus two copies of the callback payload |

Compound operations

Compound operations combine multiple durable operations to handle complex coordination patterns like parallel execution, array processing, and nested contexts:

| Operation | Checkpoint timing | Number of operations | Data persisted |
| --- | --- | --- | --- |
| Parallel | Started | 1 + N branches (1 parent context + N child contexts) | Up to two copies of the returned payload size from each branch, plus the statuses of each branch |
| Map | Started | 1 + N branches (1 parent context + N child contexts) | Up to two copies of the returned payload size from each iteration, plus the statuses of each iteration |
| Promise helpers | Completed | 1 | Returned payload size from the promise |
| RunInChildContext | Succeeded/Failed | 1 | Returned payload size from the child context |

For contexts, such as those created by runInChildContext or used internally by compound operations, results smaller than 256 KB are checkpointed directly. Larger results aren't stored. Instead, they're reconstructed during replay by re-processing the context's operations.
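The size rule for context results can be sketched as a simple threshold check. This example assumes JSON serialization and that 1 KB means 1,024 bytes. Neither assumption is confirmed here, and `is_checkpointed_directly` is a hypothetical helper rather than part of the SDK.

```python
import json
from typing import Any

# Assumption: 256 KB is interpreted as 256 * 1024 bytes.
CONTEXT_RESULT_LIMIT_BYTES = 256 * 1024


def is_checkpointed_directly(result: Any) -> bool:
    """True if the serialized result is smaller than the limit (assumed JSON + UTF-8)."""
    size = len(json.dumps(result).encode("utf-8"))
    return size < CONTEXT_RESULT_LIMIT_BYTES


small_result = {"status": "ok"}
large_result = {"blob": "x" * (300 * 1024)}

small_is_stored = is_checkpointed_directly(small_result)  # under the limit
large_is_stored = is_checkpointed_directly(large_result)  # over the limit
```

In this sketch the small result would be checkpointed directly, while the large one would be rebuilt during replay by re-processing the context's operations.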