Key Concepts¶
Durable execution¶
A durable execution is the complete lifecycle of an AWS Lambda durable function. It uses a checkpoint and replay mechanism to track progress, suspend execution, and recover from failures. When functions resume after suspension or interruptions, previously completed checkpoints replay and the function continues execution.
A single durable execution can span multiple invocations of the Lambda function, particularly after suspensions or failure recovery. Because each invocation replays the completed checkpoints, the execution can run for extended periods (up to one year) while maintaining reliable progress despite interruptions.
Timeouts¶
The execution timeout and the Lambda function timeout are different settings. The Lambda function timeout controls how long each individual invocation can run (maximum 15 minutes). The execution timeout controls the total elapsed time for the entire durable execution (maximum one year).
Durable functions¶
A durable function is a Lambda function configured with the DurableConfig object at creation time. At invocation time, Lambda applies the checkpoint and replay mechanism to the function's execution to make it durable.
DurableContext¶
DurableContext is the context object your durable function receives instead of the
standard Lambda Context. It exposes all durable operations and provides methods for
creating checkpoints, managing execution flow, and coordinating with external systems.
The following examples show the same minimal handler in TypeScript, Python, and Java:
```typescript
import { DurableContext, withDurableExecution } from "@aws/durable-execution-sdk-js";

export const handler = withDurableExecution(
  async (event: any, context: DurableContext) => {
    // Your function receives DurableContext instead of Lambda context
    // Use context.step(), context.wait(), etc.
    const result = await context.step("my-step", async () => {
      return "step completed";
    });
    return result;
  },
);
```
```python
from aws_durable_execution_sdk_python import DurableContext, durable_execution, durable_step, StepContext


@durable_step
def my_step(ctx: StepContext, data: dict) -> str:
    # Your business logic
    return f"step completed with {data}"


@durable_execution
def handler(event: dict, context: DurableContext):
    # Your function receives DurableContext instead of Lambda context
    # Use context.step(), context.wait(), etc.
    result = context.step(my_step(event["data"]))
    return result
```
```java
import software.amazon.lambda.durable.DurableContext;
import software.amazon.lambda.durable.DurableHandler;

public class DurableContextExample extends DurableHandler<Object, String> {
    @Override
    public String handleRequest(Object input, DurableContext context) {
        // Your function receives DurableContext instead of Lambda context
        // Use context.step(), context.wait(), etc.
        return context.step("my-step", String.class, ctx -> "step completed");
    }
}
```
Operations¶
Operations are units of work in a durable execution. Each operation type serves a specific purpose:
- Steps: Execute business logic with automatic checkpointing and configurable retry
- Waits: Suspend execution for a duration without consuming compute resources
- Callbacks: Suspend execution and wait for an external system to submit a result
- Invoke: Invoke another Lambda function and checkpoint the result
- Parallel: Execute multiple independent operations concurrently
- Map: Execute an operation on each item in an array concurrently with optional concurrency control
- Child context: Group operations into an isolated context for sub-workflow organization and concurrent determinism
- Wait for condition: Poll for a condition with automatic checkpointing between attempts
Checkpoints¶
A checkpoint is a saved record of a completed durable operation: its type, name, inputs, result, and timestamp. The SDK creates checkpoints automatically as your function executes operations. Together, the checkpoints form a log that Lambda uses to resume execution after a suspension or interruption.
When your code calls a durable operation, the SDK follows this sequence:
- Check for an existing checkpoint: if this operation already completed in a previous invocation, the SDK returns the stored result without re-executing
- Execute the operation: if no checkpoint exists, the SDK runs the operation code
- Serialize the result: the SDK serializes the result for storage
- Persist the checkpoint: the SDK calls the Lambda checkpoint API to durably store the result before continuing
- Return the result: execution continues to the next operation
Once the SDK persists a checkpoint, that operation's result is safe. If your function is interrupted at any point, the SDK can replay up to the last persisted checkpoint on the next invocation.
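The sequence above can be sketched with a toy in-memory store. This is only an illustration of the check, execute, serialize, persist, return order; `ToyCheckpointStore`, `run_step`, and `charge_card` are hypothetical names, not part of the SDK, and the real SDK persists through the Lambda checkpoint API rather than a dictionary.

```python
import json
from typing import Any, Callable, Dict, List, Optional


class ToyCheckpointStore:
    """In-memory stand-in for durable checkpoint storage (hypothetical)."""

    def __init__(self) -> None:
        self.records: Dict[str, str] = {}

    def get(self, name: str) -> Optional[str]:
        return self.records.get(name)

    def put(self, name: str, payload: str) -> None:
        self.records[name] = payload


def run_step(store: ToyCheckpointStore, name: str, fn: Callable[[], Any]) -> Any:
    # 1. Check for an existing checkpoint and reuse its stored result.
    saved = store.get(name)
    if saved is not None:
        return json.loads(saved)
    # 2. No checkpoint exists, so execute the operation code.
    result = fn()
    # 3. Serialize the result for storage.
    payload = json.dumps(result)
    # 4. Persist the checkpoint before continuing.
    store.put(name, payload)
    # 5. Return the result to the caller.
    return result


store = ToyCheckpointStore()
calls: List[str] = []


def charge_card() -> dict:
    calls.append("charge")  # records that the operation code really ran
    return {"status": "charged"}


first = run_step(store, "charge-card", charge_card)
second = run_step(store, "charge-card", charge_card)  # served from the checkpoint
```

In this sketch the second call returns the stored result, so `charge_card` runs only once even though the step is requested twice.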
Replay¶
Lambda keeps a running log of all durable operations as your function executes. When your function needs to pause or encounters an interruption, Lambda saves this checkpoint log and stops the execution. When it's time to resume, Lambda invokes your function again from the beginning and replays the checkpoint log:
- Load checkpoint log: the SDK retrieves the checkpoint log for the execution from Lambda
- Run from beginning: your handler runs from the start, not from where it paused
- Skip completed operations: as your code calls durable operations, the SDK checks each against the checkpoint log and returns stored results without re-executing the operation code
- Resume at interruption point: when the SDK reaches an operation without a checkpoint, it executes normally and creates new checkpoints from that point forward
The SDK enforces determinism by validating that operation names and types match the checkpoint log during replay. Your orchestration code must make the same sequence of durable operation calls on every invocation.
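A minimal sketch of this replay loop follows, assuming the checkpoint log is an ordered list of (name, result) pairs. `ToyReplayContext` and `NonDeterministicError` are hypothetical names for illustration, and the toy validates operation names only, whereas the text above says the SDK also validates operation types.

```python
from typing import Any, Callable, List, Tuple


class NonDeterministicError(Exception):
    """Raised when replay sees a different operation than the log recorded."""


class ToyReplayContext:
    def __init__(self, log: List[Tuple[str, Any]]) -> None:
        self.log = log                 # checkpoint log loaded for this invocation
        self.position = 0              # index of the next durable operation call
        self.executed: List[str] = []  # operations that actually ran this time

    def step(self, name: str, fn: Callable[[], Any]) -> Any:
        if self.position < len(self.log):
            recorded_name, recorded_result = self.log[self.position]
            if recorded_name != name:
                raise NonDeterministicError(
                    f"expected {recorded_name!r}, got {name!r}"
                )
            self.position += 1
            return recorded_result     # skip: return the stored result
        result = fn()                  # resume: no checkpoint for this operation
        self.log.append((name, result))
        self.position += 1
        self.executed.append(name)
        return result


def handler(ctx: ToyReplayContext) -> str:
    order = ctx.step("load-order", lambda: "order-42")
    return ctx.step("ship-order", lambda: f"shipped-{order}")


# A previous invocation completed only the first step.
log = [("load-order", "order-42")]
ctx = ToyReplayContext(log)
outcome = handler(ctx)  # replays "load-order", then executes "ship-order"
```

If the handler called a differently named operation at the first position, the toy would raise `NonDeterministicError` instead of silently returning a result recorded for another operation.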
Determinism¶
Because your code runs again on replay, it must be deterministic: given the same inputs and the same checkpoint log, it must follow the same execution path and make the same sequence of durable operation calls as the original run. Avoid non-deterministic code and side effects outside of steps, because they can produce different values during replay and send the function down a different path.
These are some examples of non-deterministic code:
- Random number generation and UUIDs
- Current time or timestamps
- External API calls and database queries
- File system operations
Wrap such non-deterministic code in steps, so that its result is checkpointed once and reused on every replay.
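The difference can be seen in a small sketch. `toy_step` is a hypothetical stand-in for `context.step()` that runs a function once and then reuses the stored value; it is not the SDK implementation.

```python
import uuid
from typing import Any, Callable, Dict


def toy_step(log: Dict[str, Any], name: str, fn: Callable[[], Any]) -> Any:
    """Hypothetical stand-in for context.step(): run once, then reuse the result."""
    if name not in log:
        log[name] = fn()
    return log[name]


def handler(log: Dict[str, Any]) -> Dict[str, str]:
    # Non-deterministic: a new value is generated on every replay.
    unsafe_id = str(uuid.uuid4())
    # Stable on replay: the value is generated once and checkpointed.
    safe_id = toy_step(log, "generate-id", lambda: str(uuid.uuid4()))
    return {"unsafe_id": unsafe_id, "safe_id": safe_id}


log: Dict[str, Any] = {}
original = handler(log)  # first invocation
replayed = handler(log)  # replay with the same checkpoint log
```

Here `safe_id` is identical across both runs, while `unsafe_id` changes. Any branch that depended on `unsafe_id` could take a different path on replay.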
Rules for deterministic durable operations¶
- All durable operations in a context must start sequentially.
- To run durable operations concurrently, wrap each set of operations in its own child context and then run the child contexts concurrently.
- Only use the child DurableContext in the child context scope. Do not use any parent's context in a child context scope.
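One way to see why these rules help is a toy model in which each operation is identified by its position within its context. That identification scheme is an assumption made for this sketch, and `ToyContext`, `child`, and `branch` are hypothetical names rather than SDK APIs. Each child context gets its own ID namespace, so the IDs stay the same no matter how the concurrent branches interleave.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, List


class ToyContext:
    """Hypothetical context that identifies operations by position."""

    def __init__(self, prefix: str, log: Dict[str, Any]) -> None:
        self.prefix = prefix
        self.log = log
        self.counter = 0

    def _next_id(self) -> str:
        self.counter += 1
        return f"{self.prefix}{self.counter}"

    def step(self, fn: Callable[[], Any]) -> Any:
        op_id = self._next_id()  # IDs are assigned sequentially within one context
        if op_id not in self.log:
            self.log[op_id] = fn()
        return self.log[op_id]

    def child(self) -> "ToyContext":
        # The child gets its own ID namespace, independent of scheduling order.
        return ToyContext(f"{self._next_id()}-", self.log)


def branch(ctx: ToyContext, label: str) -> List[str]:
    # Use only the child context inside the child scope.
    first = ctx.step(lambda: f"{label}-validated")
    second = ctx.step(lambda: f"{label}-stored")
    return [first, second]


log: Dict[str, Any] = {}
root = ToyContext("", log)
# Child contexts are created sequentially, then run concurrently.
children = [(root.child(), "a"), (root.child(), "b")]
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(lambda pair: branch(*pair), children))
```

If both branches called `root.step()` directly from separate threads, the counter values they received would depend on thread timing, which is the kind of ordering ambiguity the rules above rule out.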
Replay Walkthrough¶
Let's trace through a simple workflow:
```typescript
import {
  DurableContext,
  withDurableExecution,
} from "@aws/durable-execution-sdk-js";

export const handler = withDurableExecution(
  async (event: { id: string }, context: DurableContext) => {
    // Step 1: Fetch data (the result is checkpointed)
    const data = await context.step("fetch-data", async () => {
      return fetchData(event.id);
    });
    // Step 2: Wait 30 seconds without consuming compute resources
    await context.wait({ seconds: 30 });
    // Step 3: Process the data (only runs after the wait completes)
    const result = await context.step("process-data", async () => {
      return processData(data);
    });
    return result;
  },
);

async function fetchData(id: string): Promise<string> {
  return `data-for-${id}`;
}

async function processData(data: string): Promise<string> {
  return `processed-${data}`;
}
```
```python
from aws_durable_execution_sdk_python import DurableContext, StepContext, durable_execution, durable_step
from aws_durable_execution_sdk_python.config import Duration


@durable_step
def fetch_data(ctx: StepContext, id: str) -> str:
    return f"data-for-{id}"


@durable_step
def process_data(ctx: StepContext, data: str) -> str:
    return f"processed-{data}"


@durable_execution
def handler(event: dict, context: DurableContext) -> dict:
    # Step 1: Fetch data (the result is checkpointed)
    data = context.step(fetch_data(event["id"]))
    # Step 2: Wait 30 seconds without consuming compute resources
    context.wait(Duration.from_seconds(30))
    # Step 3: Process the data (only runs after the wait completes)
    result = context.step(process_data(data))
    return result
```
```java
import java.time.Duration;
import software.amazon.lambda.durable.DurableContext;
import software.amazon.lambda.durable.DurableHandler;

public class ExecutionModelExample extends DurableHandler<java.util.Map<String, String>, String> {
    @Override
    public String handleRequest(java.util.Map<String, String> event, DurableContext context) {
        // Step 1: Fetch data (the result is checkpointed)
        String data = context.step("fetch-data", String.class,
            stepCtx -> fetchData(event.get("id")));
        // Step 2: Wait 30 seconds without consuming compute resources
        context.wait("wait-30s", Duration.ofSeconds(30));
        // Step 3: Process the data (only runs after the wait completes)
        String result = context.step("process-data", String.class,
            stepCtx -> processData(data));
        return result;
    }

    private String fetchData(String id) {
        return "data-for-" + id;
    }

    private String processData(String data) {
        return "processed-" + data;
    }
}
```
First invocation (t=0s):
- You start a durable execution by invoking a durable function
- The durable functions service invokes your durable function handler
- The fetch step runs and calls an external API
- The SDK checkpoints the result of the fetch step
- Execution reaches context.wait() and the SDK checkpoints the wait operation
- The SDK terminates the current Lambda invocation, but the durable execution is still active
Second invocation (t=30s):
- The durable functions service invokes your function again
- The function runs from the beginning
- The fetch step returns its checkpointed result instantly; it does not re-execute the API call
- The wait has already elapsed, so execution continues
- The process step runs for the first time
- The SDK checkpoints the result of the process step
- The function returns naturally and the invocation ends
- The durable execution ends
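The two invocations traced above can be simulated with a toy runtime. This is a sketch under stated assumptions: `ToyContext`, `Suspend`, and `invoke` are hypothetical, the clock is a plain number passed in by the caller, and the wait is checkpointed as a wake-up time, which may differ from how the service records it.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple


class Suspend(Exception):
    """Signals that the invocation ends while the execution stays active."""


class ToyContext:
    def __init__(self, log: Dict[str, Any], now: float) -> None:
        self.log = log
        self.now = now            # simulated clock, in seconds
        self.ran: List[str] = []  # operations that executed in this invocation

    def step(self, name: str, fn: Callable[[], Any]) -> Any:
        if name not in self.log:
            self.log[name] = fn()
            self.ran.append(name)
        return self.log[name]

    def wait(self, name: str, seconds: float) -> None:
        # Checkpoint the wake-up time the first time the wait is reached.
        wake_at = self.log.setdefault(name, self.now + seconds)
        if self.now < wake_at:
            raise Suspend()


def handler(event: Dict[str, str], ctx: ToyContext) -> str:
    data = ctx.step("fetch-data", lambda: f"data-for-{event['id']}")
    ctx.wait("wait-30s", 30)
    return ctx.step("process-data", lambda: f"processed-{data}")


def invoke(log: Dict[str, Any], now: float) -> Tuple[Optional[str], List[str]]:
    ctx = ToyContext(log, now)
    try:
        return handler({"id": "7"}, ctx), ctx.ran
    except Suspend:
        return None, ctx.ran


log: Dict[str, Any] = {}
first_result, first_ran = invoke(log, now=0)     # suspends at the wait
second_result, second_ran = invoke(log, now=30)  # replays, then finishes
```

In the first call only `fetch-data` executes before the invocation suspends. In the second call the handler runs from the top, `fetch-data` is served from the log, the wait has elapsed, and only `process-data` executes.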