Caching Pipeline Steps
When you use step signature caching, SageMaker Pipelines tries to find a previous run of your current pipeline step with the same values for certain attributes. If found, SageMaker Pipelines propagates the outputs from the previous run rather than recomputing the step. The attributes checked are specific to the step type, and are listed in Default cache key attributes by pipeline step type.
You must opt in to step caching — it is off by default. When you turn on step caching, you must also define a timeout. This timeout defines how old a previous run can be to remain a candidate for reuse.
Step caching only considers successful runs — it never reuses failed runs. When multiple successful runs exist within the timeout period, SageMaker Pipelines uses the result from the most recent successful run. If no successful runs match within the timeout period, SageMaker Pipelines reruns the step. If SageMaker Pipelines finds a previous run that meets the criteria but is still in progress, both runs continue and update the cache if they succeed.
Step caching is only scoped for individual pipelines, so you can’t reuse a step from another pipeline even if there is a step signature match.
Step caching is available for the following step types: Processing, Training, Tuning, AutoML, Transform, ClarifyCheck, QualityCheck, and EMR.
Turn on step caching
To turn on step caching, you must add a CacheConfig property to the step definition. CacheConfig properties use the following format in the pipeline definition file:

{
    "CacheConfig": {
        "Enabled": false,
        "ExpireAfter": "<time>"
    }
}
The Enabled field indicates whether caching is turned on for the particular step. You can set the field to true, which tells SageMaker to try to find a previous run of the step with the same attributes. Or, you can set the field to false, which tells SageMaker to run the step every time the pipeline runs. ExpireAfter is a string in ISO 8601 duration format that defines the timeout period. The ExpireAfter duration can be a year, month, week, day, hour, or minute value. Each value consists of a number followed by a letter indicating the unit of duration. For example:
- "30d" = 30 days
- "5y" = 5 years
- "T16m" = 16 minutes
- "30dT5h" = 30 days and 5 hours
The following sections describe how to turn on caching for new or pre-existing pipelines using the Amazon SageMaker Python SDK.
Turn on caching for new pipelines
For new pipelines, initialize a CacheConfig instance with enable_caching=True and provide it as an input to your pipeline step. The following example turns on caching with a 1-hour timeout period for a training step:
from sagemaker.estimator import Estimator
from sagemaker.workflow.pipeline_context import PipelineSession
from sagemaker.workflow.steps import CacheConfig, TrainingStep

# cache results and allow reuse for up to one hour
cache_config = CacheConfig(enable_caching=True, expire_after="PT1H")

estimator = Estimator(..., sagemaker_session=PipelineSession())

step_train = TrainingStep(
    name="TrainAbaloneModel",
    step_args=estimator.fit(inputs=inputs),
    cache_config=cache_config
)
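To put the cached step into effect, add it to a pipeline and create or update that pipeline. A minimal sketch, assuming a pipeline name and IAM role ARN of your own (both values below are placeholders):

from sagemaker.workflow.pipeline import Pipeline

# create the pipeline, or update it if it already exists
pipeline = Pipeline(name="MyTrainingPipeline", steps=[step_train])
pipeline.upsert(role_arn="arn:aws:iam::111122223333:role/MySageMakerRole")

# runs within the one-hour window can now reuse the cached training step
execution = pipeline.start()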
Turn on caching for pre-existing pipelines
To turn on caching for pre-existing, already-defined pipelines, turn on the enable_caching property for the step, set expire_after to a timeout value, and update the pipeline with pipeline.upsert() or pipeline.update(). When the pipeline runs again, caching takes effect. The following code example turns on caching with a 1-hour timeout period for a training step:
from sagemaker.estimator import Estimator
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.pipeline_context import PipelineSession
from sagemaker.workflow.steps import CacheConfig, TrainingStep

cache_config = CacheConfig(enable_caching=True, expire_after="PT1H")

estimator = Estimator(..., sagemaker_session=PipelineSession())

step_train = TrainingStep(
    name="TrainAbaloneModel",
    step_args=estimator.fit(inputs=inputs),
    cache_config=cache_config
)

# define pipeline
pipeline = Pipeline(
    steps=[step_train]
)

# additional step for existing pipelines
pipeline.update()

# or, call upsert() to update the pipeline
# pipeline.upsert()
Alternatively, update the cache config after you have already defined the (pre-existing) pipeline, allowing one continuous code run. The following code sample demonstrates this method:
# turn on caching with timeout period of one hour
pipeline.steps[0].cache_config.enable_caching = True
pipeline.steps[0].cache_config.expire_after = "PT1H"

# additional step for existing pipelines
pipeline.update()

# or, call upsert() to update the pipeline
# pipeline.upsert()
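To turn on caching for every step rather than a single one, you can loop over pipeline.steps instead of indexing into it. A minimal sketch, assuming each relevant step already carries a CacheConfig instance (the getattr guard skips step types without one):

# turn on caching for all steps that have a cache configuration
for step in pipeline.steps:
    if getattr(step, "cache_config", None) is not None:
        step.cache_config.enable_caching = True
        step.cache_config.expire_after = "PT1H"

pipeline.update()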
For more detailed code examples and a discussion about how Python SDK parameters affect caching, see Caching Configuration.
Turn off step caching
A pipeline step does not rerun if you change any attributes that are not listed in Default cache key attributes by pipeline step type for its step type. However, you may decide that you want the pipeline step to rerun anyway. In this case, you need to turn off step caching.
To turn off step caching, set the Enabled attribute of the CacheConfig property in the step definition to false, as shown in the following code snippet:
{ "CacheConfig": { "Enabled": false, "ExpireAfter": "<time>" } }
Note that the ExpireAfter attribute is ignored when Enabled is false.
To turn off caching for a pipeline step using the Amazon SageMaker Python SDK, turn off the enable_caching property of your pipeline step and update the pipeline. When the pipeline runs again, caching is off. The following code example turns off caching for a training step:
from sagemaker.estimator import Estimator
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.pipeline_context import PipelineSession
from sagemaker.workflow.steps import CacheConfig, TrainingStep

# expire_after is ignored while enable_caching is False
cache_config = CacheConfig(enable_caching=False, expire_after="PT1H")

estimator = Estimator(..., sagemaker_session=PipelineSession())

step_train = TrainingStep(
    name="TrainAbaloneModel",
    step_args=estimator.fit(inputs=inputs),
    cache_config=cache_config
)

# define pipeline
pipeline = Pipeline(
    steps=[step_train]
)

# update the pipeline
pipeline.update()

# or, call upsert() to update the pipeline
# pipeline.upsert()
Alternatively, turn off the enable_caching property after you have already defined the pipeline, allowing one continuous code run. The following code sample demonstrates this solution:
# turn off caching for the training step
pipeline.steps[0].cache_config.enable_caching = False

# update the pipeline
pipeline.update()

# or, call upsert() to update the pipeline
# pipeline.upsert()
For more detailed code examples and a discussion about how Python SDK parameters affect caching, see Caching Configuration.
Default cache key attributes by pipeline step type
When deciding whether to reuse a previous pipeline step or rerun the step, SageMaker Pipelines checks to see if certain attributes have changed. If the set of attributes is different from all previous runs within the timeout period, the step runs again. These attributes include input artifacts, app or algorithm specification, and environment variables.
The following list shows each pipeline step type and the attributes that, if changed, initiate a rerun of the step. For more information about which Python SDK parameters are used to create the following attributes, see Caching Configuration.
Processing step
- AppSpecification
- Environment
- ProcessingInputs. This attribute contains information about the preprocessing script.

Training step
- AlgorithmSpecification
- CheckpointConfig
- DebugHookConfig
- DebugRuleConfigurations
- Environment
- HyperParameters
- InputDataConfig. This attribute contains information about the training script.

Tuning step
- HyperParameterTuningJobConfig
- TrainingJobDefinition. This attribute is composed of multiple child attributes, not all of which cause the step to rerun. The child attributes that could incur a rerun (if changed) are:
  - AlgorithmSpecification
  - HyperParameterRanges
  - InputDataConfig
  - StaticHyperParameters
  - TuningObjective
- TrainingJobDefinitions

AutoML step
- AutoMLJobConfig. This attribute is composed of multiple child attributes, not all of which cause the step to rerun. The child attributes that could incur a rerun (if changed) are:
  - CompletionCriteria
  - CandidateGenerationConfig
  - DataSplitConfig
  - Mode
- AutoMLJobObjective
- InputDataConfig
- ProblemType

Transform step
- DataProcessing
- Environment
- ModelName
- TransformInput

ClarifyCheck step
- ClarifyCheckConfig
- CheckJobConfig
- SkipCheck
- RegisterNewBaseline
- ModelPackageGroupName
- SuppliedBaselineConstraints

QualityCheck step
- QualityCheckConfig
- CheckJobConfig
- SkipCheck
- RegisterNewBaseline
- ModelPackageGroupName
- SuppliedBaselineConstraints
- SuppliedBaselineStatistics

EMR step
- ClusterId
- StepConfig
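For example, with caching turned on for a training step, changing one of the estimator's hyperparameters changes the step's HyperParameters attribute and forces a rerun on the next pipeline run, while changes outside this list leave the cached result eligible for reuse. A minimal sketch, reusing the estimator and pipeline from the earlier examples (max_depth and the role ARN are illustrative placeholders):

# changing a hyperparameter feeds the HyperParameters cache key attribute,
# so the next run recomputes the training step instead of reusing the cache
estimator.set_hyperparameters(max_depth="5")

pipeline.upsert(role_arn="arn:aws:iam::111122223333:role/MySageMakerRole")
execution = pipeline.start()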
Cached data access control
When a SageMaker pipeline runs, it caches the parameters and metadata associated with the SageMaker jobs launched by the pipeline and saves them for reuse in subsequent runs. This metadata is accessible through a variety of sources in addition to cached pipeline steps, and includes the following types:
- Describe*Job requests
- CloudWatch Logs
- CloudWatch Events
- CloudWatch Metrics
- SageMaker Search
Note that access to each data source in the list is controlled by its own set of IAM permissions. Removing a particular role's access to one data source does not affect the level of access to the others. For example, an account admin might remove IAM permissions for Describe*Job requests from a caller's role. While the caller can no longer make Describe*Job requests, they can still retrieve the metadata from a pipeline run with cached steps as long as they have permission to run the pipeline. If an account admin wants to remove access to the metadata from a particular SageMaker job completely, they need to remove permissions for each of the relevant services that provide access to the data.
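A minimal sketch of the Describe*Job part of such a policy follows; the three actions shown are illustrative, and an admin would list every job type in scope:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Deny",
            "Action": [
                "sagemaker:DescribeTrainingJob",
                "sagemaker:DescribeProcessingJob",
                "sagemaker:DescribeTransformJob"
            ],
            "Resource": "*"
        }
    ]
}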