Baseline calculation, drift detection and lifecycle with ClarifyCheck and QualityCheck steps in Amazon SageMaker Model Building Pipelines

This topic discusses how baselines and model versions evolve in Amazon SageMaker Model Building Pipelines when you use the ClarifyCheck and QualityCheck steps.

For the ClarifyCheck step, a baseline is a single file, exposed through a step property with the suffix Constraints. For the QualityCheck step, a baseline is a pair of files, exposed through two step properties: one with the suffix Statistics and the other with the suffix Constraints. The following sections refer to these properties by a prefix that describes how each baseline is used, which determines baseline behavior and lifecycle in the two pipeline steps. For example, the ClarifyCheck step always calculates a new baseline and assigns it to the CalculatedBaselineConstraints property, and the QualityCheck step does the same with the CalculatedBaselineConstraints and CalculatedBaselineStatistics properties.
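The naming scheme can be sketched in a few lines of Python. This is only an illustration of how a prefix and a suffix combine into a property name: the helper `baseline_property_names` is hypothetical and is not part of the SageMaker SDK, and the two prefixes are the ones this topic uses (CalculatedBaseline and BaselineUsedForDriftCheck).

```python
# Prefixes describe how a baseline is used; suffixes name the file type.
PREFIXES = ("CalculatedBaseline", "BaselineUsedForDriftCheck")

# A ClarifyCheck baseline is a single constraints file; a QualityCheck
# baseline pairs a statistics file with a constraints file.
SUFFIXES = {
    "ClarifyCheck": ("Constraints",),
    "QualityCheck": ("Statistics", "Constraints"),
}


def baseline_property_names(step_type):
    """Return the baseline property names for a check step type (hypothetical helper)."""
    return [prefix + suffix for prefix in PREFIXES for suffix in SUFFIXES[step_type]]


print(baseline_property_names("ClarifyCheck"))
print(baseline_property_names("QualityCheck"))
```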

Baseline calculation and registration for ClarifyCheck and QualityCheck steps

Both the ClarifyCheck and QualityCheck steps always calculate new baselines from the step inputs by running the underlying processing job. You access these newly calculated baselines through the properties with the prefix CalculatedBaseline, and you can record them as the ModelMetrics of your model package in the Model Step. A model package can be registered with up to five baselines, one for each check type: data bias, model bias, and model explainability from running the ClarifyCheck step, and model quality and data quality from running the QualityCheck step. The register_new_baseline parameter determines the value set in the properties with the prefix BaselineUsedForDriftCheck after a step runs.

The following use cases show the behavior that results from each combination of the skip_check and register_new_baseline parameters that you can set for the ClarifyCheck and QualityCheck steps.

skip_check / register_new_baseline: False / False

Use case: You are doing regular retraining with checks enabled to get a new model version, but you want to carry over the previous baselines as the DriftCheckBaselines in the model registry for your new model version.
Does the step do a drift check? Yes. The drift check runs against existing baselines.
Value of step property CalculatedBaseline: New baselines calculated by running the step.
Value of step property BaselineUsedForDriftCheck: Baseline from the latest approved model in the model registry, or the baseline supplied as a step parameter.

skip_check / register_new_baseline: False / True

Use case: You are doing regular retraining with checks enabled to get a new model version, but you want to refresh the DriftCheckBaselines in the model registry with the newly calculated baselines for your new model version.
Does the step do a drift check? Yes. The drift check runs against existing baselines.
Value of step property CalculatedBaseline: New baselines calculated by running the step.
Value of step property BaselineUsedForDriftCheck: Newly calculated baseline from running the step (the value of property CalculatedBaseline).

skip_check / register_new_baseline: True / False

Use case: You are initiating the pipeline to retrain a new model version because Amazon SageMaker Model Monitor detected a violation on an endpoint for a particular type of check, and you want to skip this type of check against the previous baseline but carry over the previous baseline as DriftCheckBaselines in the model registry for your new model version.
Does the step do a drift check? No drift check.
Value of step property CalculatedBaseline: New baselines calculated by running the step.
Value of step property BaselineUsedForDriftCheck: Baseline from the latest approved model in the model registry, or the baseline supplied as a step parameter.

skip_check / register_new_baseline: True / True

Use case: This configuration applies in the following cases:
  • You are starting the initial run of the pipeline, building your first model version, and generating the initial baselines.

  • You are initiating the pipeline to retrain a new model version because Model Monitor detected a violation on the endpoint for a particular type of check, and you want to skip the check against the previous baseline and refresh the DriftCheckBaselines in the model registry directly with the newly calculated baseline.

Does the step do a drift check? No drift check.
Value of step property CalculatedBaseline: New baselines calculated by running the step.
Value of step property BaselineUsedForDriftCheck: Newly calculated baseline from running the step (the value of property CalculatedBaseline).

Note

If your constraints use scientific notation, you need to convert the values to float. For a preprocessing script example of how to do this, see Create a Model Quality Baseline.
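The four parameter combinations above can be summarized as a small decision function. The following is a minimal sketch of the documented behavior in plain Python, not the SageMaker SDK itself; the names `StepOutcome` and `resolve_step_outcome` and the placeholder baseline values are hypothetical.

```python
from typing import NamedTuple, Optional


class StepOutcome(NamedTuple):
    runs_drift_check: bool
    calculated_baseline: str
    baseline_used_for_drift_check: Optional[str]


def resolve_step_outcome(skip_check, register_new_baseline,
                         new_baseline, previous_baseline):
    """Model one check-step run for a given parameter combination."""
    return StepOutcome(
        # The drift check runs unless it is explicitly skipped.
        runs_drift_check=not skip_check,
        # A new baseline is always calculated, whatever the parameters.
        calculated_baseline=new_baseline,
        # register_new_baseline selects which baseline is carried forward.
        baseline_used_for_drift_check=(
            new_baseline if register_new_baseline else previous_baseline
        ),
    )


for skip_check in (False, True):
    for register_new_baseline in (False, True):
        outcome = resolve_step_outcome(
            skip_check, register_new_baseline,
            new_baseline="new", previous_baseline="previous",
        )
        print(skip_check, register_new_baseline, outcome)
```

Here `previous_baseline` stands for either the baseline of the latest approved model in the model registry or a baseline supplied as a step parameter. On the initial pipeline run no previous baseline exists, which is why that case uses True / True.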

When you register a model with Model Step, you can register the BaselineUsedForDriftCheck properties as DriftCheckBaselines. Model Monitor can then use these baseline files for model quality and data quality checks. The ClarifyCheck and QualityCheck steps can also use these baselines in future pipeline runs to compare newly trained models against the existing models registered in the model registry.
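As a rough illustration of what gets registered, the following sketch builds a DriftCheckBaselines structure for the data quality check from the two QualityCheck baseline files. The nesting is assumed to follow the DriftCheckBaselines field of the CreateModelPackage request (a ModelDataQuality entry with Statistics and Constraints sources), so verify it against the API reference before relying on it. The helper names and the S3 URIs are hypothetical.

```python
import json


def metrics_source(s3_uri, content_type="application/json"):
    """Build an entry that points at one baseline file in Amazon S3."""
    return {"ContentType": content_type, "S3Uri": s3_uri}


def data_quality_drift_check_baselines(statistics_uri, constraints_uri):
    """Assumed request shape for the data quality part of DriftCheckBaselines."""
    return {
        "ModelDataQuality": {
            "Statistics": metrics_source(statistics_uri),
            "Constraints": metrics_source(constraints_uri),
        }
    }


# Hypothetical URIs standing in for the resolved values of the
# BaselineUsedForDriftCheckStatistics and
# BaselineUsedForDriftCheckConstraints step properties.
baselines = data_quality_drift_check_baselines(
    "s3://example-bucket/baselines/statistics.json",
    "s3://example-bucket/baselines/constraints.json",
)
print(json.dumps(baselines, indent=2))
```

In a pipeline definition you would normally pass the step properties themselves rather than literal URIs, because the values are only known when the step has run.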

Drift detection against previous baselines in SageMaker Pipelines

When you initiate the pipeline for regular retraining to get a new model version, you may not want to run the training step if the data quality or data bias check reports violations (in the constraint_violations.json file) against the baselines of your previous approved model version. You also may not want to register the newly trained model version if the model quality, model bias, or model explainability check violates the registered baseline of your previous approved model version. In these cases, you can enable the checks you want by setting the skip_check parameter of the corresponding check step to False. The ClarifyCheck or QualityCheck step then fails if it detects a violation against the previous baselines, and the pipeline does not proceed, so the model that drifted from the baseline isn't registered.

The ClarifyCheck and QualityCheck steps can retrieve the DriftCheckBaselines of the latest approved model version of a given model package group to compare against. You can also supply previous baselines directly through supplied_baseline_constraints (and supplied_baseline_statistics for a QualityCheck step). Supplied baselines always take priority over any baselines pulled from the model package group.
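The two rules in this section, baseline precedence and failing on violations, can be modeled as follows. This is a plain-Python sketch of the described behavior, not the logic the steps actually run. The function names and `DriftDetectedError` are hypothetical, and the sample report assumes the violations file holds a top-level "violations" list with feature_name, constraint_check_type, and description fields.

```python
import json


class DriftDetectedError(Exception):
    """Raised when a check finds violations against the previous baseline."""


def choose_baseline(supplied_baseline, registry_baseline):
    """A supplied baseline takes priority over one from the model package group."""
    return supplied_baseline if supplied_baseline is not None else registry_baseline


def run_check(violations_json, skip_check):
    """Fail when the check is enabled and the violations report is non-empty."""
    if skip_check:
        return  # No drift check runs, so nothing can fail here.
    violations = json.loads(violations_json).get("violations", [])
    if violations:
        raise DriftDetectedError(f"{len(violations)} violation(s) detected")


# Hypothetical report content in the assumed constraint_violations.json layout.
report = json.dumps({
    "violations": [{
        "feature_name": "age",
        "constraint_check_type": "baseline_drift_check",
        "description": "Example violation used for illustration only.",
    }]
})

print(choose_baseline("s3://example-bucket/supplied", "s3://example-bucket/registry"))
try:
    run_check(report, skip_check=False)
except DriftDetectedError as error:
    print("Step fails:", error)
```

With skip_check set to True the same report does not stop the pipeline, because the comparison against the previous baseline is never made.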

Baseline and model version lifecycle and evolution with SageMaker Pipelines

When you set register_new_baseline to False in your ClarifyCheck and QualityCheck steps, your previous baseline is accessible through the step properties with the prefix BaselineUsedForDriftCheck. You can then register these baselines as the DriftCheckBaselines of the new model version when you register a model with Model Step. Once you approve this new model version in the model registry, its DriftCheckBaselines become available to the ClarifyCheck and QualityCheck steps in the next pipeline run. If you want to refresh the baseline of a certain check type for future model versions, set register_new_baseline to True so that the properties with the prefix BaselineUsedForDriftCheck hold the newly calculated baseline. In this way, you can preserve your preferred baselines for models trained in the future, or refresh the baselines for drift checks when needed, and so manage baseline evolution and lifecycle across your model training iterations.
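To make the evolution concrete, the following sketch simulates a sequence of pipeline runs for one check type and records which baseline each model version carries as its DriftCheckBaselines. It assumes every resulting model version is approved, so its baseline feeds the next run; the function name `simulate_versions` and the baseline labels are hypothetical.

```python
def simulate_versions(runs, initial_baseline=None):
    """Return the baseline registered with each model version.

    runs is a list of (new_baseline, register_new_baseline) pairs,
    one per pipeline run, in order.
    """
    registered = initial_baseline
    history = []
    for new_baseline, register_new_baseline in runs:
        # BaselineUsedForDriftCheck is either refreshed or carried over.
        used = new_baseline if register_new_baseline else registered
        history.append(used)
        # Assumption: the version is approved, so its DriftCheckBaselines
        # become the previous baseline for the next run.
        registered = used
    return history


runs = [
    ("baseline-1", True),   # initial run: register the first baseline
    ("baseline-2", False),  # retrain: carry over the previous baseline
    ("baseline-3", True),   # retrain: refresh with the new baseline
]
print(simulate_versions(runs))  # ['baseline-1', 'baseline-1', 'baseline-3']
```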

The following diagram illustrates a model-version-centric view of the baseline evolution and lifecycle.