Handling errors in Step Functions workflows

All states, except Pass and Wait states, can encounter runtime errors. Errors can happen for various reasons, such as the following:

  • State machine definition issues (for example, no matching rule in a Choice state)

  • Task failures (for example, an exception in an AWS Lambda function)

  • Transient issues (for example, network partition events)

By default, when a state reports an error, AWS Step Functions causes the execution to fail entirely.

Tip

To deploy an example of a workflow that includes error handling to your AWS account, see Error Handling in The AWS Step Functions Workshop.

Error names

Step Functions identifies errors in the Amazon States Language using case-sensitive strings, known as error names. The Amazon States Language defines a set of built-in strings that name well-known errors, all beginning with the States. prefix.

States.ALL

A wildcard that matches any known error name.

Note

States.ALL can't catch the States.DataLimitExceeded terminal error type or States.Runtime error types. For more information about these error types, see States.DataLimitExceeded and States.Runtime.

States.DataLimitExceeded

Reported due to the following conditions:

  • When the output of a connector is larger than the payload size quota.

  • When the output of a state is larger than the payload size quota.

  • When, after Parameters processing, the input of a state is larger than the payload size quota.

For more information on quotas, see Step Functions service quotas.

Note

This is a terminal error that cannot be caught by the States.ALL error type.

States.ExceedToleratedFailureThreshold

A Map state failed because the number of failed items exceeded the threshold specified in the state machine definition. For more information, see Setting failure thresholds for Distributed Map states in Step Functions.

States.HeartbeatTimeout

A Task state failed to send a heartbeat for a period longer than the HeartbeatSeconds value.

Note

This error is only available inside the Catch and Retry fields.

States.Http.Socket

This error occurs when an HTTP task times out after 60 seconds. See Quotas related to HTTP Task.

States.IntrinsicFailure

An attempt to invoke an intrinsic function within a payload template failed.

States.ItemReaderFailed

A Map state failed because it couldn't read from the item source specified in the ItemReader field. For more information, see ItemReader (Map).

States.NoChoiceMatched

A Choice state failed to match the input with the conditions defined in the Choice Rule and a Default transition isn't specified.

States.ParameterPathFailure

An attempt to replace a field whose name ends in .$ within a state's Parameters field, using a path, failed.

States.Permissions

A Task state failed because it had insufficient privileges to run the specified code.

States.ResultPathMatchFailure

Step Functions failed to apply a state's ResultPath field to the input the state received.

States.ResultWriterFailed

A Map state failed because it couldn't write results to the destination specified in the ResultWriter field. For more information, see ResultWriter (Map).

States.Runtime

An execution failed due to some exception that it couldn't process. Often these are caused by errors at runtime, such as attempting to apply InputPath or OutputPath on a null JSON payload. A States.Runtime error isn't retriable, and will always cause the execution to fail. A retry or catch on States.ALL won't catch States.Runtime errors.

States.TaskFailed

A Task state failed during the execution. When used in a retry or catch, States.TaskFailed acts as a wildcard that matches any known error name except for States.Timeout.

States.Timeout

A Task state either ran longer than the TimeoutSeconds value, or failed to send a heartbeat for a period longer than the HeartbeatSeconds value.

Additionally, if a state machine runs longer than the specified TimeoutSeconds value, the execution fails with a States.Timeout error.

States can report errors with other names. However, these error names can't begin with the States. prefix.
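The matching rules described for the built-in names can be sketched as a small function. This is only an illustration of the rules as stated above, not how Step Functions is implemented; the names errorMatches and TERMINAL_ERRORS are hypothetical, and treating States.DataLimitExceeded as unmatched even when it is named explicitly is an assumption.

```javascript
// Hypothetical helper: does a reported error name match an ErrorEquals array?
// Error names that, per the descriptions above, States.ALL cannot catch.
// Assumption: they are treated here as unmatched by every entry.
const TERMINAL_ERRORS = new Set(["States.DataLimitExceeded", "States.Runtime"]);

function errorMatches(errorName, errorEquals) {
  if (TERMINAL_ERRORS.has(errorName)) return false;
  return errorEquals.some((candidate) => {
    // States.ALL is a wildcard for any other error name.
    if (candidate === "States.ALL") return true;
    // States.TaskFailed matches any error name except States.Timeout.
    if (candidate === "States.TaskFailed") return errorName !== "States.Timeout";
    // Everything else is an exact, case-sensitive comparison.
    return candidate === errorName;
  });
}

console.log(errorMatches("Lambda.Unknown", ["States.TaskFailed"])); // true
console.log(errorMatches("States.Timeout", ["States.TaskFailed"])); // false
console.log(errorMatches("States.Runtime", ["States.ALL"]));        // false
```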

As a best practice, ensure production code can handle AWS Lambda service exceptions (Lambda.ServiceException and Lambda.SdkClientException). For more information, see Handle transient Lambda service exceptions.

Note

Unhandled errors in Lambda are reported as Lambda.Unknown in the error output. These include out-of-memory errors and function timeouts. You can match on Lambda.Unknown, States.ALL, or States.TaskFailed to handle these errors. When Lambda hits the maximum number of invocations, the error is Lambda.TooManyRequestsException. For more information about Lambda Handled and Unhandled errors, see FunctionError in the AWS Lambda Developer Guide.

Retrying after an error

Task, Parallel, and Map states can have a field named Retry, whose value must be an array of objects known as retriers. An individual retrier represents a certain number of retries, usually at increasing time intervals.

When one of these states reports an error and there's a Retry field, Step Functions scans through the retriers in the order listed in the array. When the error name appears in the value of a retrier's ErrorEquals field, the state machine makes retry attempts as defined in the Retry field.

If your redriven execution reruns a Task workflow state, Parallel workflow state, or Inline Map state, for which you have defined retries, the retry attempt count for these states is reset to 0 to allow for the maximum number of attempts on redrive. For a redriven execution, you can track individual retry attempts of these states using the console. For more information, see Retry behavior of redriven executions in Restarting state machine executions with redrive in Step Functions.

A retrier contains the following fields:

Note

Retries are treated as state transitions. For information about how state transitions affect billing, see Step Functions Pricing.

ErrorEquals (Required)

A non-empty array of strings that match error names. When a state reports an error, Step Functions scans through the retriers. When the error name appears in this array, Step Functions implements the retry policy described in this retrier.

IntervalSeconds (Optional)

A positive integer that represents the number of seconds before the first retry attempt (1 by default). IntervalSeconds has a maximum value of 99999999.

MaxAttempts (Optional)

A positive integer that represents the maximum number of retry attempts (3 by default). If the error recurs more times than specified, retries cease and normal error handling resumes. A value of 0 specifies that the error is never retried. MaxAttempts has a maximum value of 99999999.

BackoffRate (Optional)

The multiplier by which the retry interval denoted by IntervalSeconds increases after each retry attempt. The default BackoffRate value is 2.0.

For example, say your IntervalSeconds is 3, MaxAttempts is 3, and BackoffRate is 2. The first retry attempt takes place three seconds after the error occurs. The second retry takes place six seconds after the first retry attempt, and the third retry takes place 12 seconds after the second retry attempt.

MaxDelaySeconds (Optional)

A positive integer that sets the maximum value, in seconds, up to which a retry interval can increase. This field is helpful to use with the BackoffRate field. The value you specify in this field limits the exponential wait times resulting from the backoff rate multiplier applied to each consecutive retry attempt. You must specify a value greater than 0 and less than 31622401 for MaxDelaySeconds.

If you don't specify this value, Step Functions doesn't limit the wait times between retry attempts.
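The arithmetic behind IntervalSeconds, BackoffRate, and MaxDelaySeconds can be sketched as follows. The function name retryIntervals is hypothetical, and the code only reproduces the nominal wait times described above (without jitter); it is not taken from Step Functions itself.

```javascript
// Hypothetical helper: nominal wait (in seconds) before each retry attempt.
function retryIntervals(retrier) {
  // Defaults taken from the field descriptions above.
  const {
    IntervalSeconds = 1,
    MaxAttempts = 3,
    BackoffRate = 2.0,
    MaxDelaySeconds = Infinity, // no cap when the field is omitted
  } = retrier;
  const waits = [];
  for (let attempt = 0; attempt < MaxAttempts; attempt++) {
    // Each retry multiplies the previous interval by BackoffRate...
    const uncapped = IntervalSeconds * Math.pow(BackoffRate, attempt);
    // ...and MaxDelaySeconds limits how far the interval can grow.
    waits.push(Math.min(uncapped, MaxDelaySeconds));
  }
  return waits;
}

console.log(retryIntervals({ IntervalSeconds: 3, MaxAttempts: 3, BackoffRate: 2 }));
// [ 3, 6, 12 ]
console.log(retryIntervals({ IntervalSeconds: 3, MaxAttempts: 3, BackoffRate: 2, MaxDelaySeconds: 5 }));
// [ 3, 5, 5 ]
```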

JitterStrategy (Optional)

A string that determines whether or not to include jitter in the wait times between consecutive retry attempts. Jitter reduces simultaneous retry attempts by spreading these out over a randomized delay interval. This string accepts FULL or NONE as its values. The default value is NONE.

For example, say you have set MaxAttempts as 3, IntervalSeconds as 2, and BackoffRate as 2. The first retry attempt takes place two seconds after the error occurs. The second retry takes place four seconds after the first retry attempt and the third retry takes place eight seconds after the second retry attempt. If you set JitterStrategy as FULL, the first retry interval is randomized between 0 and 2 seconds, the second retry interval is randomized between 0 and 4 seconds, and the third retry interval is randomized between 0 and 8 seconds.
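The effect of JitterStrategy on a single retry interval can be sketched like this. The helper jitteredWait is hypothetical, and drawing the delay uniformly between 0 and the nominal interval is an assumption; the text above only states that the interval is randomized within that range.

```javascript
// Hypothetical helper: apply a JitterStrategy to a nominal retry interval.
// `random` is injectable so the behaviour can be checked deterministically.
function jitteredWait(nominalSeconds, jitterStrategy = "NONE", random = Math.random) {
  if (jitterStrategy === "FULL") {
    // FULL: pick a delay between 0 and the nominal interval
    // (assumed here to be uniformly distributed).
    return random() * nominalSeconds;
  }
  // NONE (the default): wait exactly the nominal interval.
  return nominalSeconds;
}

// Nominal intervals for IntervalSeconds 2, BackoffRate 2, MaxAttempts 3.
for (const nominal of [2, 4, 8]) {
  const wait = jitteredWait(nominal, "FULL");
  console.log(`nominal ${nominal}s -> waits ${wait.toFixed(2)}s`);
}
```

Because each execution draws its own delay, retries from many concurrent executions are less likely to arrive at the same moment.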

Retry field examples

This section includes the following Retry field examples.

Tip

To deploy an example of an error handling workflow to your AWS account, see the Error Handling module of The AWS Step Functions Workshop.

Example 1 – Retry with BackoffRate

The following example of a Retry makes two retry attempts, with the first retry taking place after waiting for three seconds. Step Functions multiplies the interval by the BackoffRate you specify after each retry until the maximum number of retry attempts is reached. Because BackoffRate is 1 in this example, the interval doesn't grow: the second retry attempt also starts three seconds after the first retry.

"Retry": [
  {
    "ErrorEquals": [ "States.Timeout" ],
    "IntervalSeconds": 3,
    "MaxAttempts": 2,
    "BackoffRate": 1
  }
]

Example 2 – Retry with MaxDelaySeconds

The following example makes three retry attempts and limits the wait time resulting from BackoffRate at 5 seconds. The first retry takes place after waiting for three seconds. The second and third retry attempts take place after waiting for five seconds after the preceding retry attempt because of the maximum wait time limit set by MaxDelaySeconds.

"Retry": [
  {
    "ErrorEquals": [ "States.Timeout" ],
    "IntervalSeconds": 3,
    "MaxAttempts": 3,
    "BackoffRate": 2,
    "MaxDelaySeconds": 5,
    "JitterStrategy": "FULL"
  }
]

Without MaxDelaySeconds, the second retry attempt would take place six seconds after the first retry, and the third retry attempt would take place 12 seconds after the second retry.

Example 3 – Retry all errors except States.Timeout

The reserved name States.ALL that appears in a retrier's ErrorEquals field is a wildcard that matches any error name. It must appear alone in the ErrorEquals array and must appear in the last retrier in the Retry array. The name States.TaskFailed also acts as a wildcard and matches any error except for States.Timeout.

The following example of a Retry field retries any error except States.Timeout.

"Retry": [
  {
    "ErrorEquals": [ "States.Timeout" ],
    "MaxAttempts": 0
  },
  {
    "ErrorEquals": [ "States.ALL" ]
  }
]

Example 4 – Complex retry scenario

A retrier's parameters apply across all visits to the retrier in the context of a single-state execution.

Consider the following Task state.

"X": {
  "Type": "Task",
  "Resource": "arn:aws:states:us-east-1:123456789012:task:X",
  "Next": "Y",
  "Retry": [
    {
      "ErrorEquals": [ "ErrorA", "ErrorB" ],
      "IntervalSeconds": 1,
      "BackoffRate": 2.0,
      "MaxAttempts": 2
    },
    {
      "ErrorEquals": [ "ErrorC" ],
      "IntervalSeconds": 5
    }
  ],
  "Catch": [
    {
      "ErrorEquals": [ "States.ALL" ],
      "Next": "Z"
    }
  ]
}

This task fails four times in succession, outputting these error names: ErrorA, ErrorB, ErrorC, and ErrorB. The following occurs as a result:

  • The first two errors match the first retrier and cause waits of one and two seconds.

  • The third error matches the second retrier and causes a wait of five seconds.

  • The fourth error also matches the first retrier. However, that retrier has already used its maximum of two retries (MaxAttempts), so it fails and the execution redirects the workflow to the Z state through the Catch field.
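The sequence above can be traced with a short simulation. The function simulateRetries is hypothetical and deliberately simplified: it matches error names exactly (no States.ALL or States.TaskFailed wildcards) and keeps one retry counter per retrier, which is the behavior this scenario illustrates.

```javascript
// Hypothetical simulation of how per-retrier attempt counts accumulate.
function simulateRetries(retriers, reportedErrors) {
  const used = retriers.map(() => 0); // retries consumed per retrier
  const log = [];
  for (const error of reportedErrors) {
    // Find the first retrier whose ErrorEquals lists this error name.
    const i = retriers.findIndex((r) => r.ErrorEquals.includes(error));
    if (i === -1) {
      log.push(`${error}: no retrier matches`);
      break;
    }
    const { IntervalSeconds = 1, BackoffRate = 2.0, MaxAttempts = 3 } = retriers[i];
    if (used[i] >= MaxAttempts) {
      // The retrier is shared by all of its error names, so it can be
      // exhausted by earlier errors that were different from this one.
      log.push(`${error}: retrier ${i} exhausted`);
      break;
    }
    const wait = IntervalSeconds * Math.pow(BackoffRate, used[i]);
    used[i] += 1;
    log.push(`${error}: retrier ${i}, wait ${wait}s`);
  }
  return log;
}

const retriers = [
  { ErrorEquals: ["ErrorA", "ErrorB"], IntervalSeconds: 1, BackoffRate: 2.0, MaxAttempts: 2 },
  { ErrorEquals: ["ErrorC"], IntervalSeconds: 5 },
];
console.log(simulateRetries(retriers, ["ErrorA", "ErrorB", "ErrorC", "ErrorB"]));
```

The returned log records waits of 1, 2, and 5 seconds, followed by the first retrier being exhausted on the final ErrorB, at which point the Catch field would take over.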

Fallback states

Task, Map and Parallel states can each have a field named Catch. This field's value must be an array of objects, known as catchers.

A catcher contains the following fields.

ErrorEquals (Required)

A non-empty array of strings that match error names, specified exactly as they are with the retrier field of the same name.

Next (Required)

A string that must exactly match one of the state machine's state names.

ResultPath (Optional)

A path that determines what input the catcher sends to the state specified in the Next field.

When a state reports an error and either there is no Retry field or the retries fail to resolve the error, Step Functions scans through the catchers in the order listed in the array. When the error name appears in the value of a catcher's ErrorEquals field, the state machine transitions to the state named in the Next field.

The reserved name States.ALL that appears in a catcher's ErrorEquals field is a wildcard that matches any error name. It must appear alone in the ErrorEquals array and must appear in the last catcher in the Catch array. The name States.TaskFailed also acts as a wildcard and matches any error except for States.Timeout.

The following example of a Catch field transitions to the state named RecoveryState when a Lambda function outputs an unhandled Java exception. Otherwise, the field transitions to the EndState state.

"Catch": [
  {
    "ErrorEquals": [ "java.lang.Exception" ],
    "ResultPath": "$.error-info",
    "Next": "RecoveryState"
  },
  {
    "ErrorEquals": [ "States.ALL" ],
    "Next": "EndState"
  }
]

Note

Each catcher can specify multiple errors to handle.

Error output

When Step Functions transitions to the state specified in a catcher's Next field, it passes along an object that usually contains the field Cause. This field's value is a human-readable description of the error. This object is known as the error output.

In the preceding Catch example, the first catcher contains a ResultPath field. This works similarly to a ResultPath field in a state's top level, resulting in two possibilities:

  • It takes the results of that state's execution and overwrites either all of, or a portion of, the state's input.

  • It takes the results and adds them to the input. In the case of an error handled by a catcher, the result of the state's execution is the error output.

Thus, for the first catcher in the example, the catcher adds the error output to the input as a field named error-info if there isn't already a field with this name in the input. Then, the catcher sends the entire input to RecoveryState. For the second catcher, the error output overwrites the input and the catcher only sends the error output to EndState.

Note

If you don't specify the ResultPath field, it defaults to $, which selects and overwrites the entire input.
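The two outcomes can be sketched with plain objects. The helper applyCatcherResultPath is hypothetical and handles only $ and single-level paths such as $.error-info; the sample input and error values are made up for illustration.

```javascript
// Hypothetical helper covering only "$" and single-level "$.name" paths.
function applyCatcherResultPath(input, errorOutput, resultPath = "$") {
  if (resultPath === "$") {
    // Default: the error output replaces the whole input.
    return errorOutput;
  }
  const match = /^\$\.([\w-]+)$/.exec(resultPath);
  if (!match) throw new Error(`Unsupported path in this sketch: ${resultPath}`);
  // The error output is stored under the named field; other fields are kept.
  return { ...input, [match[1]]: errorOutput };
}

// Made-up sample data.
const sampleInput = { orderId: 42 };
const sampleErrorOutput = { Error: "java.lang.Exception", Cause: "Something went wrong" };

// First catcher: the input is kept and gains an error-info field.
console.log(applyCatcherResultPath(sampleInput, sampleErrorOutput, "$.error-info"));
// Second catcher: no ResultPath, so only the error output is passed on.
console.log(applyCatcherResultPath(sampleInput, sampleErrorOutput));
```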

When a state has both Retry and Catch fields, Step Functions uses any appropriate retriers first. If the retry policy fails to resolve the error, Step Functions applies the matching catcher transition.

Cause payloads and service integrations

A catcher returns a string payload as an output. When working with service integrations such as Amazon Athena or AWS CodeBuild, you may want to convert the Cause string to JSON. The following example of a Pass state with intrinsic functions shows how to convert a Cause string to JSON.

"Handle escaped JSON with JSONtoString": {
  "Type": "Pass",
  "Parameters": {
    "Cause.$": "States.StringToJson($.Cause)"
  },
  "Next": "Pass State with Pass Processing"
},

State machine examples using Retry and using Catch

The state machines defined in the following examples assume the existence of two Lambda functions: one that always fails and one that waits long enough to allow a timeout defined in the state machine to occur.

This is a definition of a Node.js Lambda function that always fails, returning the message error. In the state machine examples that follow, this Lambda function is named FailFunction. For information about creating a Lambda function, see the Step 1: Create a Lambda function section.

exports.handler = (event, context, callback) => {
    // Always fail by passing an error message to the callback.
    callback("error");
};

This is a definition of a Node.js Lambda function that sleeps for 10 seconds. In the state machine examples that follow, this Lambda function is named sleep10.

Note

When you create this Lambda function in the Lambda console, remember to change the Timeout value in the Advanced settings section from 3 seconds (default) to 11 seconds.

exports.handler = (event, context, callback) => {
    // Sleep for 10 seconds, then signal completion to Lambda.
    setTimeout(() => callback(null), 10000);
};

Handling a failure using Retry

This state machine uses a Retry field to retry a function that fails and outputs the error name HandledError. It retries this function twice with an exponential backoff between retries.

{
  "Comment": "A Hello World example of the Amazon States Language using an AWS Lambda function",
  "StartAt": "HelloWorld",
  "States": {
    "HelloWorld": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:FailFunction",
      "Retry": [
        {
          "ErrorEquals": ["HandledError"],
          "IntervalSeconds": 1,
          "MaxAttempts": 2,
          "BackoffRate": 2.0
        }
      ],
      "End": true
    }
  }
}

This variant uses the predefined error code States.TaskFailed, which matches any error that a Lambda function outputs.

{
  "Comment": "A Hello World example of the Amazon States Language using an AWS Lambda function",
  "StartAt": "HelloWorld",
  "States": {
    "HelloWorld": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:FailFunction",
      "Retry": [
        {
          "ErrorEquals": ["States.TaskFailed"],
          "IntervalSeconds": 1,
          "MaxAttempts": 2,
          "BackoffRate": 2.0
        }
      ],
      "End": true
    }
  }
}

Note

As a best practice, tasks that reference a Lambda function should handle Lambda service exceptions. For more information, see Handle transient Lambda service exceptions.
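The first state machine in this section matches on the error name HandledError. One way a Node.js function could surface such a name is to throw an Error subclass whose name property is set. This sketch assumes the Lambda Node.js runtime reports that name as the error type seen by Step Functions; the class and handler below are illustrative and not part of the FailFunction example.

```javascript
// Hypothetical custom error whose name is what ErrorEquals would match on.
class HandledError extends Error {
  constructor(message) {
    super(message);
    this.name = "HandledError";
  }
}

// Handler sketch; in a Lambda deployment this would be exported as `handler`.
const handler = async () => {
  throw new HandledError("error");
};

// Local check of what the function rejects with.
handler().catch((err) => console.log(err.name, "-", err.message));
```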

Handling a failure using Catch

This example uses a Catch field. When a Lambda function outputs an error, it catches the error and the state machine transitions to the fallback state.

{
  "Comment": "A Hello World example of the Amazon States Language using an AWS Lambda function",
  "StartAt": "HelloWorld",
  "States": {
    "HelloWorld": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:FailFunction",
      "Catch": [
        {
          "ErrorEquals": ["HandledError"],
          "Next": "fallback"
        }
      ],
      "End": true
    },
    "fallback": {
      "Type": "Pass",
      "Result": "Hello, AWS Step Functions!",
      "End": true
    }
  }
}

This variant uses the predefined error code States.TaskFailed, which matches any error that a Lambda function outputs.

{
  "Comment": "A Hello World example of the Amazon States Language using an AWS Lambda function",
  "StartAt": "HelloWorld",
  "States": {
    "HelloWorld": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:FailFunction",
      "Catch": [
        {
          "ErrorEquals": ["States.TaskFailed"],
          "Next": "fallback"
        }
      ],
      "End": true
    },
    "fallback": {
      "Type": "Pass",
      "Result": "Hello, AWS Step Functions!",
      "End": true
    }
  }
}

Handling a timeout using Retry

This state machine uses a Retry field to retry a Task state that times out, based on the timeout value specified in TimeoutSeconds. Step Functions retries the Lambda function invocation in this Task state twice, with an exponential backoff between retries.

{
  "Comment": "A Hello World example of the Amazon States Language using an AWS Lambda function",
  "StartAt": "HelloWorld",
  "States": {
    "HelloWorld": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:sleep10",
      "TimeoutSeconds": 2,
      "Retry": [
        {
          "ErrorEquals": ["States.Timeout"],
          "IntervalSeconds": 1,
          "MaxAttempts": 2,
          "BackoffRate": 2.0
        }
      ],
      "End": true
    }
  }
}

Handling a timeout using Catch

This example uses a Catch field. When a timeout occurs, the state machine transitions to the fallback state.

{
  "Comment": "A Hello World example of the Amazon States Language using an AWS Lambda function",
  "StartAt": "HelloWorld",
  "States": {
    "HelloWorld": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:sleep10",
      "TimeoutSeconds": 2,
      "Catch": [
        {
          "ErrorEquals": ["States.Timeout"],
          "Next": "fallback"
        }
      ],
      "End": true
    },
    "fallback": {
      "Type": "Pass",
      "Result": "Hello, AWS Step Functions!",
      "End": true
    }
  }
}

Note

You can preserve the state input and the error by using ResultPath. See Use ResultPath to Include Both Error and Input in a Catch.