Key Concepts

Durable execution

A durable execution is the complete lifecycle of an AWS Lambda durable function. It uses a checkpoint and replay mechanism to track progress, suspend execution, and recover from failures. When a function resumes after a suspension or interruption, its previously completed checkpoints are replayed and the function continues from where it left off.

A single durable execution can span multiple invocations of the Lambda function, particularly after suspensions or failure recovery. Because each new invocation replays the completed checkpoints, the execution can run for extended periods (up to one year) while making reliable progress despite interruptions.

Timeouts

The execution timeout and the Lambda function timeout are different settings. The Lambda function timeout controls how long each individual invocation can run (maximum 15 minutes). The execution timeout controls the total elapsed time for the entire durable execution (maximum one year).

Durable functions

A durable function is a Lambda function configured with the DurableConfig object at creation time. Lambda will then apply the checkpoint and replay mechanism to the function's execution to make it durable at invocation time.

DurableContext

DurableContext is the context object your durable function receives instead of the standard Lambda Context. It exposes all durable operations and provides methods for creating checkpoints, managing execution flow, and coordinating with external systems.

The following examples show a handler that receives a DurableContext instead of the default Lambda context, in TypeScript, Python, and Java, in that order:

import { DurableContext, withDurableExecution } from "@aws/durable-execution-sdk-js";

export const handler = withDurableExecution(
  async (event: any, context: DurableContext) => {
    // Your function receives DurableContext instead of Lambda context
    // Use context.step(), context.wait(), etc.
    const result = await context.step("my-step", async () => {
      return "step completed";
    });
    return result;
  },
);

from aws_durable_execution_sdk_python import DurableContext, durable_execution, durable_step, StepContext


@durable_step
def my_step(ctx: StepContext, data: dict) -> str:
    # Your business logic
    return f"step completed with {data}"


@durable_execution
def handler(event: dict, context: DurableContext):
    # Your function receives DurableContext instead of Lambda context
    # Use context.step(), context.wait(), etc.
    result = context.step(my_step(event["data"]))
    return result

import software.amazon.lambda.durable.DurableContext;
import software.amazon.lambda.durable.DurableHandler;

public class DurableContextExample extends DurableHandler<Object, String> {

    @Override
    public String handleRequest(Object input, DurableContext context) {
        // Your function receives DurableContext instead of Lambda context
        // Use context.step(), context.wait(), etc.
        return context.step("my-step", String.class, ctx -> "step completed");
    }
}

Operations

Operations are units of work in a durable execution. Each operation type serves a specific purpose:

Steps: Execute business logic with automatic checkpointing and configurable retry
Waits: Suspend execution for a duration without consuming compute resources
Callbacks: Suspend execution and wait for an external system to submit a result
Invoke: Invoke another Lambda function and checkpoint the result
Parallel: Execute multiple independent operations concurrently
Map: Execute an operation on each item in an array concurrently, with optional concurrency control
Child context: Group operations into an isolated context for sub-workflow organization and deterministic concurrency
Wait for condition: Poll for a condition with automatic checkpointing between attempts

Checkpoints

A checkpoint is a saved record of a completed durable operation: its type, name, inputs, result, and timestamp. The SDK creates checkpoints automatically as your function executes operations. Together, the checkpoints form a log that Lambda uses to resume execution after a suspension or interruption.

When your code calls a durable operation, the SDK follows this sequence:

1. Check for an existing checkpoint: if this operation already completed in a previous invocation, the SDK returns the stored result without re-executing
2. Execute the operation: if no checkpoint exists, the SDK runs the operation code
3. Serialize the result: the SDK serializes the result for storage
4. Persist the checkpoint: the SDK calls the Lambda checkpoint API to durably store the result before continuing
5. Return the result: execution continues to the next operation

Once the SDK persists a checkpoint, that operation's result is safe. If your function is interrupted at any point, the SDK can replay up to the last persisted checkpoint on the next invocation.
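
The sequence above can be sketched with a toy in-memory model. This is not the SDK's implementation: `ToyCheckpointLog` and `run_step` are hypothetical names, a dict stands in for the Lambda checkpoint API, and JSON is assumed as the serialization format.

```python
import json
from typing import Any, Callable, Dict, List


class ToyCheckpointLog:
    """In-memory stand-in for durable checkpoint storage (hypothetical)."""

    def __init__(self) -> None:
        self.records: Dict[str, str] = {}

    def run_step(self, name: str, fn: Callable[[], Any]) -> Any:
        # 1. Check for an existing checkpoint and reuse its stored result.
        if name in self.records:
            return json.loads(self.records[name])
        # 2. No checkpoint exists, so run the operation code.
        result = fn()
        # 3. Serialize the result for storage.
        payload = json.dumps(result)
        # 4. Persist the checkpoint before continuing.
        self.records[name] = payload
        # 5. Return the result to the caller.
        return result


calls: List[str] = []


def charge_card() -> dict:
    calls.append("charge")  # records each real execution
    return {"status": "charged"}


log = ToyCheckpointLog()
first = log.run_step("charge-card", charge_card)   # executes the function
second = log.run_step("charge-card", charge_card)  # served from the checkpoint
```

In this sketch `charge_card` runs only once; the second call returns the stored result, which is the behavior that makes a persisted step safe to replay.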

Replay

Lambda keeps a running log of all durable operations as your function executes. When your function needs to pause or encounters an interruption, Lambda saves this checkpoint log and ends the current invocation; the durable execution itself stays active. When it's time to resume, Lambda invokes your function again from the beginning and replays the checkpoint log:

1. Load checkpoint log: the SDK retrieves the checkpoint log for the execution from Lambda
2. Run from beginning: your handler runs from the start, not from where it paused
3. Skip completed operations: as your code calls durable operations, the SDK checks each against the checkpoint log and returns stored results without re-executing the operation code
4. Resume at interruption point: when the SDK reaches an operation without a checkpoint, it executes normally and creates new checkpoints from that point forward

The SDK enforces determinism by validating that operation names and types match the checkpoint log during replay. Your orchestration code must make the same sequence of durable operation calls on every invocation.
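
A minimal sketch of that validation, assuming the checkpoint log is an ordered list of (type, name) pairs. `ReplayCursor` and `NonDeterminismError` are hypothetical names for illustration; the SDK's actual classes and error types may differ.

```python
from typing import List, Tuple


class NonDeterminismError(Exception):
    """Raised when replayed code diverges from the recorded log (hypothetical)."""


class ReplayCursor:
    def __init__(self, recorded: List[Tuple[str, str]]) -> None:
        self.recorded = recorded
        self.position = 0

    def check(self, op_type: str, name: str) -> bool:
        """Return True if this call replays a recorded operation, False if it is new."""
        if self.position >= len(self.recorded):
            return False  # past the end of the log: execute normally
        expected = self.recorded[self.position]
        if expected != (op_type, name):
            raise NonDeterminismError(
                f"operation {self.position}: expected {expected}, got {(op_type, name)}"
            )
        self.position += 1
        return True


recorded = [("step", "fetch-data"), ("wait", "wait-30s")]

ok = ReplayCursor(recorded)
replayed = [ok.check("step", "fetch-data"), ok.check("wait", "wait-30s")]
is_new = ok.check("step", "process-data")  # not in the log yet, so it is new work

bad = ReplayCursor(recorded)
try:
    bad.check("step", "load-data")  # name differs from the recorded call
    diverged = False
except NonDeterminismError:
    diverged = True
```

Renaming a step or reordering calls between deployments would trip the same check, which is why the sequence of durable operation calls has to stay stable while an execution is in flight.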

Determinism

Because your code runs again on replay, it must be deterministic: given the same inputs, it always produces the same results. During replay, your function runs from the beginning and must follow the same execution path as the original run. Given the same inputs and checkpoint log, it must make the same sequence of durable operation calls. Keep non-deterministic operations and side effects inside steps; outside a step they run again on every replay, can produce different values, and can send execution down a different path.

These are some examples of non-deterministic code:

Random number generation and UUIDs
Current time or timestamps
External API calls and database queries
File system operations

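Wrap such non-deterministic code in steps, so that its result is checkpointed once and returned from the checkpoint on replay instead of being recomputed.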
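
To illustrate why wrapping helps, here is a toy model in which a dict plays the role of the checkpoint log and `toy_step` is a hypothetical stand-in for a durable step (it is not the SDK's `context.step`). A value generated inside the step is recorded once and reused on replay, while a value generated outside the step changes on every run.

```python
import uuid
from typing import Any, Callable, Dict

checkpoints: Dict[str, Any] = {}  # survives across simulated invocations


def toy_step(name: str, fn: Callable[[], Any]) -> Any:
    """Hypothetical stand-in for a durable step: run once, then replay."""
    if name not in checkpoints:
        checkpoints[name] = fn()
    return checkpoints[name]


def handler() -> Dict[str, str]:
    unsafe_id = str(uuid.uuid4())  # regenerated on every replay
    safe_id = toy_step("generate-id", lambda: str(uuid.uuid4()))  # recorded once
    return {"unsafe": unsafe_id, "safe": safe_id}


original = handler()  # first invocation
replay = handler()    # simulated replay from the beginning
```

If later logic branched on `unsafe_id`, or used it as a key in an external system, the replay could take a different path than the original run; `safe_id` stays the same across both.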

Rules for deterministic durable operations

All durable operations in a context must start sequentially.
To run durable operations concurrently, wrap each set of operations in its own child context and then run the child contexts concurrently.
Only use the child DurableContext in the child context scope. Do not use any parent's context in a child context scope.
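
One way to see why these rules help: if each context numbers its own operations, an operation's identity depends only on call order within that context, not on how concurrent branches interleave. The sketch below assumes such a hierarchical numbering scheme purely for illustration; `ToyContext`, `next_id`, and `child` are hypothetical and do not describe the SDK's internal ID format.

```python
from typing import Dict, List


class ToyContext:
    """Assigns operation IDs from a per-context counter (hypothetical scheme)."""

    def __init__(self, prefix: str = "") -> None:
        self.prefix = prefix
        self.counter = 0

    def next_id(self) -> str:
        self.counter += 1
        return f"{self.prefix}{self.counter}"

    def child(self) -> "ToyContext":
        # Starting a child is itself an operation in the parent context.
        return ToyContext(prefix=f"{self.next_id()}-")


def run_with_children(interleaving: List[str]) -> Dict[str, List[str]]:
    root = ToyContext()
    branches = {"a": root.child(), "b": root.child()}  # started sequentially
    ids: Dict[str, List[str]] = {"a": [], "b": []}
    for branch in interleaving:  # simulated concurrent scheduling order
        ids[branch].append(branches[branch].next_id())
    return ids


def run_shared(interleaving: List[str]) -> Dict[str, List[str]]:
    root = ToyContext()  # both branches misuse the parent context
    ids: Dict[str, List[str]] = {"a": [], "b": []}
    for branch in interleaving:
        ids[branch].append(root.next_id())
    return ids


stable_one = run_with_children(["a", "b", "a", "b"])
stable_two = run_with_children(["b", "b", "a", "a"])  # same IDs per branch
shared_one = run_shared(["a", "b", "a", "b"])
shared_two = run_shared(["b", "b", "a", "a"])          # IDs shift with scheduling
```

With child contexts, both interleavings assign the same IDs to each branch. With a shared parent counter, the IDs depend on scheduling order, so a replay could not reliably match operations to their checkpoints.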

Replay Walkthrough

Let's trace through a simple workflow, shown below in TypeScript, Python, and Java, in that order:

import {
  DurableContext,
  withDurableExecution,
} from "@aws/durable-execution-sdk-js";

export const handler = withDurableExecution(
  async (event: { id: string }, context: DurableContext) => {
    // Step 1: Fetch data — result is checkpointed
    const data = await context.step("fetch-data", async () => {
      return fetchData(event.id);
    });

    // Step 2: Wait 30 seconds without consuming compute resources
    await context.wait({ seconds: 30 });

    // Step 3: Process the data — only runs after the wait completes
    const result = await context.step("process-data", async () => {
      return processData(data);
    });

    return result;
  },
);

async function fetchData(id: string): Promise<string> {
  return `data-for-${id}`;
}

async function processData(data: string): Promise<string> {
  return `processed-${data}`;
}

from aws_durable_execution_sdk_python import DurableContext, StepContext, durable_execution, durable_step
from aws_durable_execution_sdk_python.config import Duration


@durable_step
def fetch_data(ctx: StepContext, id: str) -> str:
    return f"data-for-{id}"


@durable_step
def process_data(ctx: StepContext, data: str) -> str:
    return f"processed-{data}"


@durable_execution
def handler(event: dict, context: DurableContext) -> dict:
    # Step 1: Fetch data — result is checkpointed
    data = context.step(fetch_data(event["id"]))

    # Step 2: Wait 30 seconds without consuming compute resources
    context.wait(Duration.from_seconds(30))

    # Step 3: Process the data — only runs after the wait completes
    result = context.step(process_data(data))

    return result

import java.time.Duration;
import software.amazon.lambda.durable.DurableContext;
import software.amazon.lambda.durable.DurableHandler;

public class ExecutionModelExample extends DurableHandler<java.util.Map<String, String>, String> {

    @Override
    public String handleRequest(java.util.Map<String, String> event, DurableContext context) {
        // Step 1: Fetch data — result is checkpointed
        String data = context.step("fetch-data", String.class,
                stepCtx -> fetchData(event.get("id")));

        // Step 2: Wait 30 seconds without consuming compute resources
        context.wait("wait-30s", Duration.ofSeconds(30));

        // Step 3: Process the data — only runs after the wait completes
        String result = context.step("process-data", String.class,
                stepCtx -> processData(data));

        return result;
    }

    private String fetchData(String id) {
        return "data-for-" + id;
    }

    private String processData(String data) {
        return "processed-" + data;
    }
}

First invocation (t=0s):

1. You start a durable execution by invoking a durable function
2. The durable functions service invokes your durable function handler
3. The fetch step runs and calls an external API
4. The SDK checkpoints the result of the fetch step
5. Execution reaches context.wait() and the SDK checkpoints the wait operation
6. The SDK terminates the current Lambda invocation, but the durable execution is still active

Second invocation (t=30s):

1. The durable functions service invokes your function again
2. The function runs from the beginning
3. The fetch step returns its checkpointed result immediately; it does not re-execute the API call
4. The wait has already elapsed, so execution continues
5. The process step runs for the first time
6. The SDK checkpoints the result of the process step
7. The function returns normally and the invocation ends
8. The durable execution ends
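
The two invocations can be simulated with a toy runtime. Everything below is hypothetical scaffolding (`ToyContext`, `Suspend`, `invoke`) that uses a dict as the checkpoint log and an explicit clock value instead of real time; it only mirrors the sequence of events described above and is not how the SDK or the service is implemented.

```python
from typing import Any, Callable, Dict, List


class Suspend(Exception):
    """Ends the current invocation while the execution stays active."""


class ToyContext:
    def __init__(self, log: Dict[str, Any], now: float) -> None:
        self.log = log  # checkpoint log shared across invocations
        self.now = now  # simulated clock, in seconds

    def step(self, name: str, fn: Callable[[], Any]) -> Any:
        if name not in self.log:
            self.log[name] = fn()  # run once and checkpoint the result
        return self.log[name]

    def wait(self, name: str, seconds: float) -> None:
        # Checkpoint the deadline the first time the wait is reached.
        deadline = self.log.setdefault(name, self.now + seconds)
        if self.now < deadline:
            raise Suspend()


trace: List[str] = []  # records which step bodies actually executed


def fetch_data(item_id: str) -> str:
    trace.append("fetch")
    return f"data-for-{item_id}"


def process_data(data: str) -> str:
    trace.append("process")
    return f"processed-{data}"


def handler(event: Dict[str, str], ctx: ToyContext) -> str:
    data = ctx.step("fetch-data", lambda: fetch_data(event["id"]))
    ctx.wait("wait-30s", 30)
    return ctx.step("process-data", lambda: process_data(data))


def invoke(log: Dict[str, Any], now: float) -> str:
    try:
        return handler({"id": "42"}, ToyContext(log, now))
    except Suspend:
        return "SUSPENDED"


log: Dict[str, Any] = {}
first = invoke(log, now=0)    # runs fetch, then suspends at the wait
second = invoke(log, now=30)  # replays fetch, passes the elapsed wait, runs process
```

After both calls, `trace` contains one fetch and one process entry even though the handler body ran twice from the top.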