Baseline calculation, drift detection and lifecycle with ClarifyCheck and QualityCheck steps in Amazon SageMaker Pipelines
The following topic discusses how baselines and model versions evolve in Amazon SageMaker Pipelines when using the ClarifyCheck and QualityCheck steps.
For the ClarifyCheck step, a baseline is a single file that resides in the step properties with the suffix constraints. For the QualityCheck step, a baseline is a combination of two files that reside in the step properties: one with the suffix statistics and the other with the suffix constraints. In the following topics, we discuss these properties with a prefix that describes how they are used, which affects baseline behavior and lifecycle in these two pipeline steps. For example, the ClarifyCheck step always calculates and assigns new baselines in the CalculatedBaselineConstraints property, and the QualityCheck step does the same in the CalculatedBaselineConstraints and CalculatedBaselineStatistics properties.
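The file layout and step properties described above can be summarized in a plain-Python sketch. The dictionaries below are illustrative only; they restate the text, not an SDK data structure:

```python
# Baseline artifacts produced by each check step, keyed by file suffix.
# Illustrative summary of the text above, not an SDK API.
BASELINE_FILES = {
    "ClarifyCheck": ["constraints"],                # a single constraints file
    "QualityCheck": ["statistics", "constraints"],  # a two-file baseline
}

# Both steps expose their newly calculated baselines through step
# properties with the prefix "CalculatedBaseline":
CALCULATED_PROPERTIES = {
    "ClarifyCheck": ["CalculatedBaselineConstraints"],
    "QualityCheck": ["CalculatedBaselineConstraints",
                     "CalculatedBaselineStatistics"],
}
```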
Baseline calculation and registration for ClarifyCheck and QualityCheck steps
Both the ClarifyCheck and QualityCheck steps always calculate new baselines based on step inputs through the underlying processing job run. These newly calculated baselines are accessed through the properties with the prefix CalculatedBaseline. You can record these properties as the ModelMetrics of your model package in the Model step. This model package can be registered with five different baselines, one for each check type: data bias, model bias, and model explainability from running the ClarifyCheck step, and model quality and data quality from running the QualityCheck step. The register_new_baseline parameter dictates the value set in the properties with the prefix BaselineUsedForDriftCheck after a step runs.
The following table of potential use cases shows the different behaviors resulting from the step parameters you can set for the ClarifyCheck and QualityCheck steps:
| Possible use case for selecting this configuration | skip_check / register_new_baseline | Does the step run a drift check? | Value of step property CalculatedBaseline | Value of step property BaselineUsedForDriftCheck |
|---|---|---|---|---|
| You are doing regular retraining with checks enabled to get a new model version, but you want to carry over the previous baselines as the DriftCheckBaselines. | False / False | Drift check runs against existing baselines | New baselines calculated by running the step | Baseline from the latest approved model in the Model Registry, or the baseline supplied as a step parameter |
| You are doing regular retraining with checks enabled to get a new model version, but you want to refresh the DriftCheckBaselines. | False / True | Drift check runs against existing baselines | New baselines calculated by running the step | Newly calculated baseline from running the step (the value of the CalculatedBaseline property) |
| You are initiating the pipeline to retrain a new model version because Amazon SageMaker Model Monitor detected a violation on an endpoint for a particular type of check, and you want to skip this type of check against the previous baseline but carry it over as the DriftCheckBaselines. | True / False | No drift check | New baselines calculated by running the step | Baseline from the latest approved model in the Model Registry, or the baseline supplied as a step parameter |
| You want to skip the check against previous baselines and refresh the DriftCheckBaselines with the newly calculated ones, for example when no previous baseline exists yet. | True / True | No drift check | New baselines calculated by running the step | Newly calculated baseline from running the step (the value of the CalculatedBaseline property) |
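The parameter combinations in the table reduce to a simple rule: skip_check controls whether a drift check runs, register_new_baseline controls which baseline is exposed for future drift checks, and new baselines are always calculated. A plain-Python helper (illustrative only, not an SDK function) makes this explicit:

```python
def check_step_behavior(skip_check: bool, register_new_baseline: bool) -> dict:
    """Summarize a ClarifyCheck/QualityCheck step's behavior for a given
    parameter combination, per the table above (illustrative helper)."""
    return {
        # A drift check runs only when skip_check is False.
        "runs_drift_check": not skip_check,
        # The step always computes new baselines (CalculatedBaseline).
        "calculates_new_baseline": True,
        # BaselineUsedForDriftCheck is either the newly calculated baseline
        # (register_new_baseline=True) or the previous/supplied baseline.
        "baseline_used_for_drift_check": (
            "newly calculated baseline" if register_new_baseline
            else "previous or supplied baseline"
        ),
    }
```

For example, the regular-retraining case in the first table row corresponds to `check_step_behavior(False, False)`.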
Note
If you use scientific notation in your constraints, you need to convert the values to floats. For a preprocessing script example of how to do this, see Create a Model Quality Baseline.
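A minimal sketch of such a conversion is shown below. The function name and the recursive traversal are hypothetical, and the full preprocessing script lives in the linked topic; this only illustrates turning scientific-notation strings into floats inside a parsed constraints document:

```python
import json

def floats_from_scientific(obj):
    """Recursively convert numeric strings in scientific notation
    (e.g. "3.2E-5") into floats; leave everything else untouched.
    Minimal illustrative sketch, not the official preprocessing script."""
    if isinstance(obj, dict):
        return {k: floats_from_scientific(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [floats_from_scientific(v) for v in obj]
    if isinstance(obj, str) and "e" in obj.lower():
        try:
            return float(obj)        # "3.2E-5" -> 3.2e-05
        except ValueError:
            return obj               # non-numeric strings pass through
    return obj

# Hypothetical constraints fragment for demonstration:
constraints = json.loads('{"threshold": "3.2E-5", "metric_name": "mse"}')
converted = floats_from_scientific(constraints)
```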
When you register a model with the Model step, you can register the BaselineUsedForDriftCheck property as DriftCheckBaselines. These baseline files can then be used by Model Monitor for model and data quality checks. In addition, these baselines can also be used by the ClarifyCheck and QualityCheck steps in future pipeline runs to compare newly trained models against the existing models registered in the model registry.
Drift detection against previous baselines in Pipelines
In the case of the QualityCheck step, when you initiate the pipeline for regular retraining to get a new model version, you may not want to run the training step if the data quality or data bias check produces violations (a constraint_violations.json file, following the Schema for Violations) against the baselines of your previous approved model version. You also may not want to register the newly trained model version if the model quality, model bias, or model explainability check violates the registered baseline of your previous approved model version when running the ClarifyCheck step. In these cases, you can enable the checks you want by setting the skip_check property of the corresponding check step to False, so that the ClarifyCheck or QualityCheck step fails if a violation is detected against previous baselines. The pipeline process then does not proceed, so the model that drifted from the baseline isn't registered. The ClarifyCheck and QualityCheck steps can retrieve the DriftCheckBaselines of the latest approved model version in a given model package group to compare against. Previous baselines can also be supplied directly through supplied_baseline_constraints (in addition to supplied_baseline_statistics for a QualityCheck step) and are always prioritized over any baselines pulled from the model package group.
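The precedence rule above can be sketched as a small plain-Python helper (illustrative only, not an SDK function): a baseline supplied directly as a step parameter always wins over one pulled from the model package group.

```python
def resolve_drift_check_baseline(supplied_baseline=None,
                                 registry_baseline=None):
    """Pick the baseline a check step compares against. A baseline supplied
    directly via step parameters (supplied_baseline_constraints /
    supplied_baseline_statistics) always takes precedence over the
    DriftCheckBaselines of the latest approved model version pulled from
    the model package group. Illustrative helper, not an SDK function."""
    if supplied_baseline is not None:
        return supplied_baseline
    return registry_baseline
```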
Baseline and model version lifecycle and evolution with Pipelines
By setting register_new_baseline to False for your ClarifyCheck and QualityCheck steps, your previous baseline is accessible through the step properties with the prefix BaselineUsedForDriftCheck. You can then register these baselines as the DriftCheckBaselines in the new model version when you register a model with the Model step. Once you approve this new model version in the model registry, the DriftCheckBaselines in this model version become available to the ClarifyCheck and QualityCheck steps in the next pipeline process. If you want to refresh the baseline of a certain check type for future model versions, you can set register_new_baseline to True so that the properties with the prefix BaselineUsedForDriftCheck become the newly calculated baselines. In these ways, you can preserve your preferred baselines for models trained in the future, or refresh the baselines for drift checks when needed, managing your baseline evolution and lifecycle throughout your model training iterations.
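This lifecycle can be simulated in plain Python. The sketch below (illustrative names, not an SDK API) iterates several retraining runs: each run calculates a new baseline, but the baseline carried forward as DriftCheckBaselines only changes when register_new_baseline is True.

```python
def next_drift_check_baseline(register_new_baseline: bool,
                              calculated: str, previous: str) -> str:
    """One iteration of the baseline lifecycle (illustrative): the step
    always calculates a new baseline, but the DriftCheckBaselines carried
    into the next approved model version only change when
    register_new_baseline is True."""
    return calculated if register_new_baseline else previous

# Simulate three retraining iterations: preserve, preserve, then refresh.
baseline = "baseline-v1"
for version, refresh in enumerate([False, False, True], start=2):
    calculated = f"baseline-v{version}"    # always produced by the step
    baseline = next_drift_check_baseline(refresh, calculated, baseline)
```

After the first two iterations the original baseline is preserved; the third iteration (register_new_baseline=True) replaces it with the newly calculated one.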
The following diagram illustrates a model-version-centric view of the baseline evolution and lifecycle.