Tolerated failure threshold for Distributed Map state - AWS Step Functions

When you orchestrate large-scale parallel workloads, you can also define a tolerated failure threshold for a Map Run. You can specify the threshold as a maximum number of failed items, a maximum percentage of failed items, or both. The Map Run fails automatically when the threshold you specified is exceeded. If you specify both values, the Map Run fails when either one is exceeded.
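The rule described above can be modeled in a few lines. The following Python sketch is an illustration only, not how Step Functions implements the check. The function name `map_run_should_fail` and its parameters are hypothetical. It assumes that only the thresholds you specify are evaluated and that, with no threshold specified, any failed item fails the Map Run.

```python
from typing import Optional


def map_run_should_fail(
    failed_items: int,
    total_items: int,
    tolerated_count: Optional[int] = None,
    tolerated_percentage: Optional[float] = None,
) -> bool:
    """Simplified model: True if either specified threshold is exceeded."""
    if tolerated_count is None and tolerated_percentage is None:
        # Assumption: no threshold specified means zero tolerance.
        return failed_items > 0

    # Count check: fail when more items failed than the tolerated number.
    if tolerated_count is not None and failed_items > tolerated_count:
        return True

    # Percentage check: failed items as a share of all items.
    if tolerated_percentage is not None and total_items > 0:
        if 100 * failed_items / total_items > tolerated_percentage:
            return True

    return False
```

In this model, with a tolerated count of 10 and a tolerated percentage of 15, 5 failures out of 20 items (25 percent) fail the run even though the count threshold is not exceeded.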

Specifying a threshold lets a specific number of items fail before the entire Map Run fails. When the Map Run fails because the specified threshold is exceeded, Step Functions returns a States.ExceedToleratedFailureThreshold error.
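As a rough sketch of how a workflow might react to this error, the following Python snippet builds a Distributed Map state fragment that sets both thresholds and routes the States.ExceedToleratedFailureThreshold error to a fallback state. It assumes the error can be matched in a Catch field like other named errors. The state names, the Lambda function ARN, and the threshold values are hypothetical placeholders.

```python
import json

# Hypothetical placeholders: "ProcessOneItem", "HandleTooManyFailures",
# and the Lambda ARN are for illustration only.
map_state = {
    "Type": "Map",
    "ItemProcessor": {
        "ProcessorConfig": {"Mode": "DISTRIBUTED", "ExecutionType": "STANDARD"},
        "StartAt": "ProcessOneItem",
        "States": {
            "ProcessOneItem": {
                "Type": "Task",
                "Resource": "arn:aws:lambda:us-east-1:123456789012:function:process-item",
                "End": True,
            }
        },
    },
    # Static thresholds: the Map Run fails when either one is exceeded.
    "ToleratedFailureCount": 10,
    "ToleratedFailurePercentage": 15,
    # Assumption: the threshold error is caught like any other named error.
    "Catch": [
        {
            "ErrorEquals": ["States.ExceedToleratedFailureThreshold"],
            "Next": "HandleTooManyFailures",
        }
    ],
    "End": True,
}

# Serialize the fragment as JSON for use in a state machine definition.
print(json.dumps(map_state, indent=2))
```

The printed JSON is only a state fragment. A complete state machine definition would also need the surrounding States map, including the fallback state named in the Catch field.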

Note

Step Functions may continue to run child workflows in a Map Run even after the tolerated failure threshold is exceeded, but before the Map Run fails.

To specify the threshold value in Workflow Studio, select Set a tolerated failure threshold in Additional configuration under the Runtime settings field.

Tolerated failure percentage

Defines the percentage of failed items to tolerate. Your Map Run fails if this value is exceeded. Step Functions calculates the percentage of failed items as the total number of failed or timed-out items divided by the total number of items. You must specify a value between 0 and 100. The default value is 0, which means that the workflow fails if any one of its child workflow executions fails or times out. If you specify 100, the workflow won’t fail even if all child workflow executions fail.
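The calculation above can be worked through with a small Python sketch. It is a simplified model under stated assumptions, not the service's implementation. The function names are hypothetical, and failed and timed-out items are passed separately only to make clear that both count toward the percentage.

```python
def failed_percentage(failed: int, timed_out: int, total: int) -> float:
    """Failed plus timed-out items as a percentage of all items."""
    if total <= 0:
        raise ValueError("total must be greater than zero")
    return 100 * (failed + timed_out) / total


def percentage_exceeded(
    failed: int,
    timed_out: int,
    total: int,
    tolerated_percentage: float = 0.0,  # default of zero, as described above
) -> bool:
    """True if the failed percentage is above the tolerated percentage."""
    if not 0 <= tolerated_percentage <= 100:
        raise ValueError("tolerated_percentage must be between 0 and 100")
    return failed_percentage(failed, timed_out, total) > tolerated_percentage
```

With the default of 0, a single failure out of 1000 items already exceeds the threshold. With a value of 100, the check can never be exceeded, because the failed percentage cannot rise above 100.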

Alternatively, you can specify the percentage as a reference path to an existing key-value pair in your Distributed Map state input. This path must resolve to an integer between 0 and 100 at runtime. You specify the reference path in the ToleratedFailurePercentagePath sub-field.

For example, given the following input:

{ "percentage": 15 }

You can specify the percentage using a reference path to that input as follows:

{
  ...
  "Map": {
    "Type": "Map",
    ...
    "ToleratedFailurePercentagePath": "$.percentage"
    ...
  }
}

Important

You can specify either ToleratedFailurePercentage or ToleratedFailurePercentagePath, but not both in your Distributed Map state definition.
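To illustrate what resolving the reference path at runtime means, here is a minimal Python sketch. The function `resolve_percentage` is hypothetical and handles only simple dotted paths such as "$.percentage". Actual reference paths support more syntax than this, and the validation only mirrors the range requirement stated above.

```python
from typing import Any, Mapping


def resolve_percentage(state_input: Mapping[str, Any], path: str) -> int:
    """Resolve a simple "$.key" style path and validate the result."""
    if not path.startswith("$."):
        raise ValueError(f"unsupported path: {path!r}")

    # Walk nested keys, e.g. "$.limits.percentage" reads
    # state_input["limits"]["percentage"].
    value: Any = state_input
    for key in path[2:].split("."):
        value = value[key]

    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, int):
        raise ValueError(f"{path} must resolve to an integer")
    if not 0 <= value <= 100:
        raise ValueError(f"{path} must resolve to a value between 0 and 100")
    return value
```

Given the example input above, `resolve_percentage({"percentage": 15}, "$.percentage")` returns 15, while an input of `{"percentage": 150}` raises a ValueError in this sketch.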

Tolerated failure count

Defines the number of failed items to tolerate. Your Map Run fails if this value is exceeded.

Alternatively, you can specify the count as a reference path to an existing key-value pair in your Distributed Map state input. This path must resolve to a positive integer at runtime. You specify the reference path in the ToleratedFailureCountPath sub-field.

For example, given the following input:

{ "count": 10 }

You can specify the number using a reference path to that input as follows:

{
  ...
  "Map": {
    "Type": "Map",
    ...
    "ToleratedFailureCountPath": "$.count"
    ...
  }
}

Important

You can specify either ToleratedFailureCount or ToleratedFailureCountPath, but not both in your Distributed Map state definition.
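The either-or rule for the static and path variants of each field can be checked before deployment. The following Python sketch shows one way to do that. The function `validate_threshold_fields` is a hypothetical helper, not part of any AWS SDK, and it checks only this one rule.

```python
from typing import Any, List, Mapping


def validate_threshold_fields(map_state: Mapping[str, Any]) -> List[str]:
    """Return an error message for each field given in both forms."""
    errors: List[str] = []
    for field in ("ToleratedFailurePercentage", "ToleratedFailureCount"):
        path_field = f"{field}Path"
        # A static value and its reference-path variant are mutually exclusive.
        if field in map_state and path_field in map_state:
            errors.append(f"Specify either {field} or {path_field}, not both.")
    return errors
```

A state that combines a static ToleratedFailureCount with a ToleratedFailurePercentagePath passes this check, because the rule applies to each pair separately.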