Step 3. Define the pipeline

In this step, you define the sequence and logic of actions that the pipeline will perform, including the discrete steps and their logical inputs and outputs. For example, what is the state of the data at the beginning of the pipeline? Does it come from multiple files at different levels of granularity or from a single flat file? If the data comes from multiple files, do you need a single step for all files or a separate step for each file to define the preprocessing logic? The decision depends on the complexity of the data sources and the extent to which they are preprocessed.
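The per-file decision can be sketched in plain Python. This is an illustrative example only (the file types, function names, and data are hypothetical, not from the reference implementation): two sources at different granularities each get their own preprocessing step, and a final merge step brings them together.

```python
# Hypothetical example: one preprocessing step per source file, then a merge.
# Each source arrives at a different level of granularity.

def preprocess_transactions(rows):
    # Daily-granularity source: aggregate amounts to one total per day.
    totals = {}
    for day, amount in rows:
        totals[day] = totals.get(day, 0) + amount
    return totals

def preprocess_customers(rows):
    # Per-customer source: index segment labels by customer ID.
    return {customer_id: segment for customer_id, segment in rows}

def merge(daily_totals, customer_segments):
    # Merge step: combine the two preprocessed outputs into one record.
    return {"daily_totals": daily_totals, "segments": customer_segments}

pipeline_input = {
    "transactions": [("2024-01-01", 10), ("2024-01-01", 5), ("2024-01-02", 7)],
    "customers": [("c1", "retail"), ("c2", "wholesale")],
}
result = merge(
    preprocess_transactions(pipeline_input["transactions"]),
    preprocess_customers(pipeline_input["customers"]),
)
print(result["daily_totals"])  # {'2024-01-01': 15, '2024-01-02': 7}
```

If the sources were instead a single flat file already at one granularity, the two preprocessing functions would collapse into one step.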
In our reference implementation, we use AWS Step Functions to orchestrate the ML pipeline.
Using the Step Functions SDK
To define the ML pipeline, we first use the high-level Python API provided by the AWS Step Functions Data Science SDK (the Step Functions SDK) to define two key components of the pipeline: steps and data. If you think of a pipeline as a directed acyclic graph (DAG), steps represent the nodes of the graph, and data flows along the directed edges that connect one node (step) to the next. Typical examples of ML steps include preprocessing, training, and evaluation. The Step Functions SDK provides a number of built-in steps (such as the TrainingStep) that correspond to these common ML tasks.
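The DAG view can be made concrete with the standard library alone. This sketch is conceptual (the step names are illustrative, and this is not the Step Functions SDK API): each step maps to the set of upstream steps whose output it consumes, and a topological sort yields a valid execution order.

```python
# A pipeline as a DAG: steps are nodes, data dependencies are directed edges.
from graphlib import TopologicalSorter

# Each step maps to the set of steps whose output it consumes.
dag = {
    "preprocess": set(),
    "train": {"preprocess"},
    "evaluate": {"train", "preprocess"},
}

# A topological sort gives an order in which every step runs only after
# all of its inputs are available.
order = list(TopologicalSorter(dag).static_order())
print(order)  # ['preprocess', 'train', 'evaluate']
```

In the actual SDK, you express the same structure by chaining step objects rather than by listing edges explicitly.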
ML pipelines also require configuration parameters for fine-grained control over the behavior of each ML step. These are supplied through special data placeholders called parameter placeholders, and many of their values are unknown when you're defining the pipeline. Examples of parameter placeholders include infrastructure-related parameters that you define during pipeline design (for example, the AWS Region or a container image URL) and ML modeling-related parameters (such as hyperparameters) that you define when you run the pipeline.
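The placeholder idea can be sketched as follows. All names here are illustrative stand-ins, not the SDK's API: the pipeline definition records *where* a value goes, and the concrete value is supplied only when the pipeline is executed.

```python
# Conceptual sketch of parameter placeholders (illustrative, not the SDK API).

class Placeholder:
    def __init__(self, name):
        self.name = name

    def resolve(self, inputs):
        # Look up the concrete value supplied at execution time.
        return inputs[self.name]

# Design time: the definition references placeholders, not concrete values.
definition = {
    "region": Placeholder("Region"),               # infrastructure parameter
    "image": Placeholder("ImageUrl"),              # infrastructure parameter
    "learning_rate": Placeholder("LearningRate"),  # hyperparameter
}

# Run time: supply concrete values for this particular execution.
execution_inputs = {
    "Region": "eu-west-1",
    "ImageUrl": "1234.dkr.ecr/img:latest",
    "LearningRate": 0.01,
}
resolved = {k: v.resolve(execution_inputs) for k, v in definition.items()}
print(resolved["learning_rate"])  # 0.01
```

The same pipeline definition can therefore be executed with different Regions, images, or hyperparameters without being redefined.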
Extending the Step Functions SDK
In our reference implementation, one requirement was to separate the ML pipeline definition from the creation and deployment of a concrete pipeline with specific parameter settings. However, some of the built-in steps in the Step Functions SDK didn't allow us to pass in all of these parameter placeholders. Instead, parameter values were expected to be obtained directly at pipeline design time through SageMaker AI configuration API calls. This works fine if the SageMaker AI design-time environment is identical to the SageMaker AI runtime environment, but that is rarely the case in real-world settings. Such tight coupling between pipeline design time and runtime, together with the assumption that the ML platform infrastructure remains constant, significantly limits the applicability of the designed pipeline. In fact, the pipeline breaks as soon as the underlying deployment platform undergoes even the slightest change.
To overcome this challenge and produce a robust ML pipeline (which we wanted to design once and run anywhere), we implemented our own custom steps by extending some of the built-in steps, including TrainingStep, ModelStep, and TransformerStep. These extensions are provided in the ML Max project.
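The extension pattern can be sketched with hypothetical stand-in classes (this is not the actual SDK or ML Max code): the built-in step resolves configuration at design time, while the custom subclass accepts a placeholder and defers resolution to run time.

```python
# Sketch of the extension pattern (all classes here are hypothetical).

class Placeholder:
    def __init__(self, name):
        self.name = name

def design_time_lookup(key):
    # Stand-in for a configuration API call made at pipeline design time.
    return {"image_url": "design-time-image:latest"}[key]

class BuiltInTrainingStep:
    def __init__(self):
        # Tight coupling: the value is baked in when the pipeline is defined.
        self.image_url = design_time_lookup("image_url")

    def to_definition(self, inputs):
        return {"image_url": self.image_url}

class PlaceholderTrainingStep(BuiltInTrainingStep):
    def __init__(self, image_url_placeholder):
        # Override: keep the placeholder instead of resolving it now.
        self.image_url = image_url_placeholder

    def to_definition(self, inputs):
        # Resolution happens at run time, against the target environment.
        return {"image_url": inputs[self.image_url.name]}

step = PlaceholderTrainingStep(Placeholder("ImageUrl"))
print(step.to_definition({"ImageUrl": "runtime-image:latest"}))
# {'image_url': 'runtime-image:latest'}
```

Because the subclass never queries the design-time environment, the same pipeline definition remains valid when the deployment platform changes.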