Orchestrating large-scale parallel workloads in your state machines - AWS Step Functions

With Step Functions, you can orchestrate large-scale parallel workloads to perform tasks, such as on-demand processing of semi-structured data. These parallel workloads let you concurrently process large-scale data sources stored in Amazon S3. For example, you might process a single JSON or CSV file that contains large amounts of data. Or you might process a large set of Amazon S3 objects.

To set up a large-scale parallel workload in your workflows, include a Map state in Distributed mode. The Map state processes the items in a dataset concurrently. In Distributed mode, it runs each iteration as a child workflow execution, which allows high concurrency. You can specify the number of child workflow executions that can run in parallel. If you don't specify a value, Step Functions runs up to 10,000 child workflow executions in parallel. For more information about the Map state and its Distributed mode, see Map state and Using Map state in Distributed mode.
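
As a rough illustration of the setup described above, the following Python sketch builds a state machine definition that contains a single Distributed Map state reading rows from a CSV file in Amazon S3. The bucket name, object key, Lambda function name, state names, and the concurrency value of 1000 are placeholders chosen for this example, not values from this topic. Check the field names against the Amazon States Language reference before you use them.

```python
import json


def build_distributed_map_definition(bucket, key, function_name, max_concurrency=1000):
    """Return a state machine definition (as a dict) with one Distributed Map state.

    All argument values are illustrative placeholders.
    """
    return {
        "Comment": "Sketch: process each row of a CSV file in Amazon S3 in parallel",
        "StartAt": "ProcessRows",
        "States": {
            "ProcessRows": {
                "Type": "Map",
                # The dataset: a single CSV object stored in Amazon S3.
                "ItemReader": {
                    "Resource": "arn:aws:states:::s3:getObject",
                    "ReaderConfig": {
                        "InputType": "CSV",
                        "CSVHeaderLocation": "FIRST_ROW",
                    },
                    "Parameters": {"Bucket": bucket, "Key": key},
                },
                # The Map workflow: the steps that each child workflow execution runs.
                "ItemProcessor": {
                    "ProcessorConfig": {
                        "Mode": "DISTRIBUTED",
                        "ExecutionType": "STANDARD",
                    },
                    "StartAt": "ProcessRow",
                    "States": {
                        "ProcessRow": {
                            "Type": "Task",
                            "Resource": "arn:aws:states:::lambda:invoke",
                            "Parameters": {
                                "FunctionName": function_name,
                                "Payload.$": "$",
                            },
                            "End": True,
                        }
                    },
                },
                # Upper bound on child workflow executions that run in parallel.
                "MaxConcurrency": max_concurrency,
                "End": True,
            }
        },
    }


definition = build_distributed_map_definition(
    "amzn-s3-demo-bucket", "input/data.csv", "process-row"
)
print(json.dumps(definition, indent=2))
```

The printed JSON could be pasted into the Step Functions console or passed as the definition when you create a state machine. The execution role would also need permission to read the S3 object and invoke the function.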

When you don't specify Distributed mode, the Map state runs in the default Inline mode, which supports up to 40 concurrent iterations. For more information about the two Map state modes, see Map state processing modes.
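
One way to picture the difference between the two modes is a small helper that picks a ProcessorConfig block based on the concurrency a workload needs. This is only a sketch: the helper name and the decision rule are assumptions made for illustration, and it looks at the 40-iteration limit mentioned above and nothing else. In practice, other factors such as dataset size and execution history limits also influence the choice.

```python
# Concurrency limit for Inline mode, as stated in the text above.
INLINE_MAX_CONCURRENCY = 40


def processor_config(required_concurrency):
    """Return a ProcessorConfig dict for a Map state (hypothetical helper).

    Chooses Inline mode when the required concurrency fits within the
    Inline limit, and Distributed mode otherwise.
    """
    if required_concurrency <= INLINE_MAX_CONCURRENCY:
        return {"Mode": "INLINE"}
    return {"Mode": "DISTRIBUTED", "ExecutionType": "STANDARD"}
```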

To get started with orchestrating large-scale parallel workloads in your workflows, see the tutorial Copying large-scale CSV data using Distributed Map.

The following illustration shows how you can set up large-scale parallel workloads in your workflows.


Diagram illustrating the concept of orchestrating large-scale parallel workloads.

Key terms used in this topic

Distributed mode

A processing mode of the Map state. In this mode, each iteration of the Map state runs as a child workflow execution, which enables high concurrency. Each child workflow execution has its own execution history, which is separate from the parent workflow's execution history. This mode supports reading input from large-scale Amazon S3 data sources.

Distributed Map state

A Map state set to Distributed processing mode.

Map workflow

A set of steps that a Map state runs.

Child workflow execution

An iteration of the Distributed Map state. A child workflow execution has its own execution history, which is separate from the parent workflow's execution history.

Map Run

When you run a Map state in Distributed mode, Step Functions creates a Map Run resource. A Map Run refers to the set of child workflow executions that a Distributed Map state starts and the runtime settings that control these executions. Step Functions assigns an Amazon Resource Name (ARN) to your Map Run. You can examine a Map Run in the Step Functions console or by invoking the DescribeMapRun API action. A Map Run also emits metrics to CloudWatch.

For more information, see Examining Map Run.
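
As a minimal sketch of examining a Map Run programmatically, the following Python function calls DescribeMapRun through a client object and returns a few fields from the response. The function name and the choice of fields are assumptions for illustration. The client is passed in as a parameter and is expected to be a boto3 Step Functions client, and the Map Run ARN is whatever Step Functions assigned to your Map Run.

```python
def summarize_map_run(sfn_client, map_run_arn):
    """Return a short summary of a Map Run (hypothetical helper).

    sfn_client is expected to behave like boto3.client("stepfunctions").
    """
    response = sfn_client.describe_map_run(mapRunArn=map_run_arn)
    counts = response["itemCounts"]
    return {
        "status": response["status"],
        "max_concurrency": response["maxConcurrency"],
        "total_items": counts["total"],
        "succeeded_items": counts["succeeded"],
        "failed_items": counts["failed"],
    }


# Example usage (requires boto3 and AWS credentials):
#   import boto3
#   summary = summarize_map_run(boto3.client("stepfunctions"), map_run_arn)
```

Passing the client in keeps the function easy to exercise with a stub object, which is convenient because a real call needs AWS credentials and an existing Map Run.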