
Step 3. Define the pipeline


In this step, you define the sequence and logic of the actions that the pipeline will perform: the discrete steps as well as their logical inputs and outputs. For example, what is the state of the data at the beginning of the pipeline? Does it come from multiple files that are at different levels of granularity or from a single flat file? If the data comes from multiple files, do you need a single step that defines the preprocessing logic for all files, or a separate step for each file? The decision depends on the complexity of the data sources and the extent to which they are preprocessed.

In our reference implementation, we use AWS Step Functions, which is a serverless function orchestrator, to define the workflow steps. However, the ML Max framework also supports other pipeline or state machine engines, such as Apache Airflow (see the Different engines for pipeline orchestration section), to drive the development and deployment of ML pipelines.

Using the Step Functions SDK

To define the ML pipeline, we first use the high-level Python API provided by the AWS Step Functions Data Science SDK (the Step Functions SDK) to define two key components of the pipeline: steps and data. If you think of a pipeline as a directed acyclic graph (DAG), steps represent the nodes of the graph, and data is represented by the directed edges that connect one node (step) to the next. Typical examples of ML steps include preprocessing, training, and evaluation. The Step Functions SDK provides a number of built-in steps (such as the TrainingStep) that you can use. Examples of data include the input, the output, and the many intermediate datasets that are produced by steps in the pipeline.

When you're designing an ML pipeline, you don't know the concrete values of the data items. Instead, you can define data placeholders that serve as a template (similar to function parameters) and contain only the name of the data item and its primitive data type. In this way, you can design a complete pipeline blueprint without knowing in advance the concrete values of the data traveling on the graph. For this purpose, you can use the placeholder classes in the Step Functions SDK to explicitly model these data templates.
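The following minimal sketch illustrates these ideas with the Step Functions SDK. It is not the ML Max implementation; the image URI, role ARNs, S3 paths, and workflow name are hypothetical placeholders that you would replace with your own values.

```python
from sagemaker.estimator import Estimator
from stepfunctions.inputs import ExecutionInput
from stepfunctions.steps import Chain, TrainingStep
from stepfunctions.workflow import Workflow

# Data placeholder: only a name and a primitive type, no concrete value yet.
execution_input = ExecutionInput(
    schema={
        "TrainingJobName": str,
    }
)

# Hypothetical estimator; the image URI, role, and S3 paths are stand-ins.
estimator = Estimator(
    image_uri="<training-image-uri>",
    role="<sagemaker-execution-role-arn>",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://<bucket>/models/",
)

# A node (step) in the DAG. The training job name is an edge that receives
# its concrete value only when the pipeline runs.
training_step = TrainingStep(
    "Train Model",
    estimator=estimator,
    job_name=execution_input["TrainingJobName"],
    data="s3://<bucket>/input/train/",
)

# The pipeline blueprint: steps chained into a graph that is defined
# entirely against the placeholder, not against concrete values.
workflow = Workflow(
    name="example-ml-pipeline",
    definition=Chain([training_step]),
    role="<step-functions-workflow-execution-role-arn>",
)
```

The resulting definition is the blueprint: it can be rendered and reviewed without any training job ever having been named or run.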

ML pipelines also require configuration parameters to perform fine-grained control over the behavior of each ML step. These special data placeholders are called parameter placeholders. Many of their values are unknown when you’re defining the pipeline. Examples of parameter placeholders include infrastructure-related parameters that you define during pipeline design (for example, AWS Region or container image URL) and ML modeling-related parameters (such as hyperparameters) that you define when you run the pipeline.
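Continuing the hypothetical sketch above, the distinction becomes visible when the pipeline is created and run. This is only an illustration, not the ML Max code: infrastructure-related parameters are baked in when the state machine is created in a specific account and Region, whereas run-time parameters are passed as execution inputs.

```python
# Design time: the state machine is created in a specific account and Region,
# which resolves the infrastructure-related parameters.
workflow.create()

# Run time: execution inputs supply concrete values for the placeholders;
# each execution can use different values.
execution = workflow.execute(
    inputs={
        "TrainingJobName": "example-training-job-001",
    }
)
```

In the same way, modeling-related parameters such as hyperparameters can be added to the placeholder schema and supplied as execution inputs, provided that the pipeline steps accept placeholders for those fields, which is the subject of the next section.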

Extending the Step Functions SDK

In our reference implementation, one requirement was to separate the ML pipeline definition from the concrete creation and deployment of the pipeline with specific parameter settings. However, some of the built-in steps in the Step Functions SDK didn't allow us to pass in all of these placeholder parameters. Instead, parameter values were expected to be obtained directly at pipeline design time through SageMaker AI configuration API calls. This works fine if the SageMaker AI design-time environment is identical to the SageMaker AI runtime environment, but that is rarely the case in real-world settings. Such tight coupling between pipeline design time and runtime, together with the assumption that the ML platform infrastructure remains constant, significantly limits the applicability of the designed pipeline: the pipeline breaks as soon as the underlying deployment platform changes even slightly.

To overcome this challenge and produce a robust ML pipeline that we could design once and run anywhere, we implemented our own custom steps by extending some of the built-in steps, including TrainingStep, ModelStep, and TransformStep. These extensions are provided in the ML Max project. The interfaces of these custom steps accept many more parameter placeholders, which can be populated with concrete values when the pipeline is created or when it runs.
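The sketch below shows the general idea behind such an extension; it is not the ML Max implementation. One way to avoid design-time API lookups is to build a training step directly on the SDK's generic Task state and the Step Functions service integration for SageMaker CreateTrainingJob, so that every infrastructure-related and modeling-related field can be wired to a placeholder. The field names come from the CreateTrainingJob API; the placeholder names are hypothetical.

```python
from stepfunctions.inputs import ExecutionInput
from stepfunctions.steps import Task

# Placeholders for both infrastructure-related and modeling-related parameters.
params = ExecutionInput(
    schema={
        "TrainingJobName": str,
        "TrainingImage": str,    # container image URL
        "InstanceType": str,
        "RoleArn": str,
        "TrainInputS3Uri": str,
        "ModelOutputS3Uri": str,
        "MaxDepth": str,         # example hyperparameter, passed as a string
    }
)

# A hypothetical custom training step: a generic Task state that calls the
# SageMaker CreateTrainingJob service integration, with its fields wired to
# placeholders instead of values fetched at design time.
training_step = Task(
    "Train Model",
    resource="arn:aws:states:::sagemaker:createTrainingJob.sync",
    parameters={
        "TrainingJobName": params["TrainingJobName"],
        "RoleArn": params["RoleArn"],
        "AlgorithmSpecification": {
            "TrainingImage": params["TrainingImage"],
            "TrainingInputMode": "File",
        },
        "HyperParameters": {"max_depth": params["MaxDepth"]},
        "InputDataConfig": [
            {
                "ChannelName": "train",
                "DataSource": {
                    "S3DataSource": {
                        "S3DataType": "S3Prefix",
                        "S3Uri": params["TrainInputS3Uri"],
                    }
                },
            }
        ],
        "OutputDataConfig": {"S3OutputPath": params["ModelOutputS3Uri"]},
        "ResourceConfig": {
            "InstanceCount": 1,
            "InstanceType": params["InstanceType"],
            "VolumeSizeInGB": 30,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
    },
)
```

Because none of these fields is resolved until the pipeline is created or run, the same definition can be deployed unchanged to different accounts, Regions, or container images, which is the decoupling described above.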