Run SageMaker Clarify Processing Jobs for Bias Analysis and Explainability
To analyze your data and models for bias and explainability using SageMaker Clarify, you must
configure a SageMaker Clarify processing job. This guide shows how to configure the job inputs,
outputs, resources, and analysis configuration using the SageMaker Python SDK API
SageMakerClarifyProcessor. The API acts as a high-level wrapper of the SageMaker
CreateProcessingJob API. It hides many of the details that are involved in setting up a
SageMaker Clarify processing job, such as retrieving the SageMaker Clarify container image URI and
generating the analysis configuration file. The following steps show you how to configure,
initialize, and launch a SageMaker Clarify processing job.
1. To configure a SageMaker Clarify processing job using the API, first define the configuration objects for each portion of the job configuration. These portions can include the following:

   - The input dataset and output location: DataConfig.
   - The model or endpoint to be analyzed: ModelConfig.
   - Bias analysis parameters: BiasConfig.
   - SHAP analysis parameters: SHAPConfig.
The configuration objects for a SageMaker Clarify processing job vary for different types of data formats and use cases. Configuration examples for tabular data in CSV and JSON Lines format, natural language processing (NLP), and computer vision problems are provided in the following sections.
2. After you define the configuration objects, your next step is to create a SageMakerClarifyProcessor object and initialize it with parameters that specify the job resources. These resources include parameters such as the number of compute instances to use. The following code example shows how to create a SageMakerClarifyProcessor object and instruct it to use one ml.c4.xlarge compute instance to do the analysis.

from sagemaker import clarify

clarify_processor = clarify.SageMakerClarifyProcessor(
    role=role,
    instance_count=1,
    instance_type='ml.c4.xlarge',
    sagemaker_session=session,
)
3. Call the specific run method of the SageMakerClarifyProcessor object with the configuration objects for your use case to launch the job. These run methods include the following:

   - run_pre_training_bias
   - run_post_training_bias
   - run_bias
   - run_explainability
   - run_bias_and_explainability

The SageMakerClarifyProcessor handles several tasks behind the scenes. These tasks include retrieving the SageMaker Clarify container image uniform resource identifier (URI), composing an analysis configuration file based on the provided configuration objects, uploading the file to an Amazon S3 bucket, and configuring the SageMaker Clarify processing job.

The following expandable sections show how to compute pre-training and post-training bias metrics, SHAP values, and partial dependence plots (PDPs). The sections show feature importance for these data types:
- Tabular datasets in CSV format or JSON Lines format
- Natural language processing (NLP) datasets
- Computer vision datasets
After the expandable sections, you'll learn how to run a SageMaker Clarify processing job using Spark.
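For example, the following minimal sketch runs bias analysis and SHAP explainability in a single job. It assumes the configuration objects shown in the following sections (data_config, model_config, bias_config, shap_config, and predicted_label_config) have already been defined.

clarify_processor.run_bias_and_explainability(
    data_config=data_config,
    model_config=model_config,
    explainability_config=shap_config,
    bias_config=bias_config,
    model_predicted_label_config=predicted_label_config,
)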
The following examples show how to configure bias analysis and explainability
analysis for a tabular dataset in CSV format. In these examples, the incoming
dataset has four feature columns and one binary label column,
Target. The contents of the dataset are as follows. A label
value of 1 indicates a positive outcome.

Target,Age,Gender,Income,Occupation
0,25,0,2850,2
1,36,0,6585,0
1,22,1,1759,1
0,48,0,3446,1
...
This DataConfig object specifies the input dataset and where to
store the output. The s3_data_input_path parameter can either be a
URI of a dataset file or an Amazon S3 URI prefix. If you provide an S3 URI prefix, the
SageMaker Clarify processing job recursively collects all Amazon S3 files located under the
prefix. The value for s3_output_path should be an S3 URI prefix to
hold the analysis results. The following code example shows how to specify a
data configuration for the previous sample input dataset.

data_config = clarify.DataConfig(
    s3_data_input_path=dataset_s3_uri,
    dataset_type='text/csv',
    headers=['Target', 'Age', 'Gender', 'Income', 'Occupation'],
    label='Target',
    s3_output_path=clarify_job_output_s3_uri,
)
How to compute all pre-training bias metrics for a CSV dataset
The following code sample shows how to configure a BiasConfig
object to measure bias of the previous sample input towards samples with a
Gender value of 0.

bias_config = clarify.BiasConfig(
    label_values_or_threshold=[1],
    facet_name='Gender',
    facet_values_or_threshold=[0],
)
The following code example shows how to use a run statement to launch a SageMaker Clarify processing job that computes all pre-training bias metrics for an input dataset.
clarify_processor.run_pre_training_bias(
    data_config=data_config,
    data_bias_config=bias_config,
    methods="all",
)
Alternatively, you can choose which metrics to compute by assigning a list
of pre-training bias metrics to the methods parameter. For example,
replacing methods="all"
with methods=["CI", "DPL"]
instructs the SageMaker Clarify Processor to compute only Class
Imbalance and Difference in Proportions of Labels.
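The following sketch shows that run statement with only those two metrics selected.

clarify_processor.run_pre_training_bias(
    data_config=data_config,
    data_bias_config=bias_config,
    methods=["CI", "DPL"],  # Class Imbalance, Difference in Proportions of Labels
)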
How to compute all post-training bias metrics for a CSV dataset
You can compute pre-training bias metrics prior to training. However, to compute post-training bias metrics, you must have a trained model. The following example output is from a binary classification model that outputs data in CSV format. In this example output, each row contains two columns. The first column contains the predicted label, and the second column contains the probability value for that label.
0,0.028986845165491
1,0.825382471084594
...
In the following example configuration, the ModelConfig
object instructs the job to deploy the SageMaker model to an ephemeral endpoint.
The endpoint uses one ml.m4.xlarge inference instance. Because
the parameters content_type and accept_type are not
set, they automatically use the value of the parameter
dataset_type, which is text/csv.
model_config = clarify.ModelConfig(
    model_name=your_model,
    instance_type='ml.m4.xlarge',
    instance_count=1,
)
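If you prefer to set the formats explicitly, the following sketch shows the same configuration with the request and response formats spelled out; the values shown simply match the dataset_type default.

model_config = clarify.ModelConfig(
    model_name=your_model,
    instance_type='ml.m4.xlarge',
    instance_count=1,
    content_type='text/csv',  # format of requests sent to the endpoint
    accept_type='text/csv',   # format of responses returned by the endpoint
)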
The following configuration example uses a
ModelPredictedLabelConfig object with a label index of
0. This instructs the SageMaker Clarify processing job to locate the
predicted label in the first column of the model output. The processing job
uses zero-based indexing in this example.

predicted_label_config = clarify.ModelPredictedLabelConfig(
    label=0,
)
Combined with the previous configuration example, the following code example launches a SageMaker Clarify processing job to compute all the post-training bias metrics.
clarify_processor.run_post_training_bias(
    data_config=data_config,
    data_bias_config=bias_config,
    model_config=model_config,
    model_predicted_label_config=predicted_label_config,
    methods="all",
)
Similarly, you can choose which metrics to compute by assigning a list of
post-training bias metrics to the methods parameter. For
example, replace methods="all" with methods=["DPPL", "DI"]
to compute only Difference in Positive Proportions in Predicted Labels and
Disparate Impact.
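A sketch of that run statement with only those two metrics selected:

clarify_processor.run_post_training_bias(
    data_config=data_config,
    data_bias_config=bias_config,
    model_config=model_config,
    model_predicted_label_config=predicted_label_config,
    methods=["DPPL", "DI"],  # DPPL and Disparate Impact only
)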
How to compute all bias metrics for a CSV dataset
The following configuration example shows how to run all pre-training and post-training bias metrics in one SageMaker Clarify processing job.
clarify_processor.run_bias(
    data_config=data_config,
    bias_config=bias_config,
    model_config=model_config,
    model_predicted_label_config=predicted_label_config,
    pre_training_methods="all",
    post_training_methods="all",
)
For an example notebook with instructions on how to run a SageMaker Clarify processing job
in SageMaker Studio to detect bias, see Fairness and Explainability with SageMaker Clarify.
How to compute SHAP values for a CSV dataset
SageMaker Clarify provides feature attributions using the KernelSHAP algorithm. In the
following configuration example, the ModelPredictedLabelConfig object has a probability
index of 1. This instructs the SageMaker Clarify processing job to extract the
probability score from the second column of the model output (using
zero-based indexing).
probability_config = clarify.ModelPredictedLabelConfig(
    probability=1,
)
The SHAPConfig object provides SHAP analysis parameters. In
this example, the SHAP baseline parameter is omitted and the
value of the num_clusters parameter is 1. This
instructs the SageMaker Clarify Processor to compute one SHAP baseline sample based on
clustering the input dataset. If you want to choose the baseline dataset,
see SHAP Baselines for Explainability.

shap_config = clarify.SHAPConfig(
    num_clusters=1,
)
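If you provide your own baseline instead, a minimal sketch follows. The single row of feature values (Age, Gender, Income, Occupation) is hypothetical; choose values that represent a meaningful reference point for your data.

custom_shap_config = clarify.SHAPConfig(
    baseline=[[35, 0, 3500, 1]],  # hypothetical reference record
)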
The following code example launches a SageMaker Clarify processing job to compute SHAP values.
clarify_processor.run_explainability(
    data_config=data_config,
    model_config=model_config,
    model_scores=probability_config,
    explainability_config=shap_config,
)
For an example notebook with instructions on how to run a SageMaker Clarify processing
job in SageMaker Studio to compute SHAP values, see Fairness and Explainability with SageMaker Clarify.
How to compute partial dependence plots (PDPs) for a CSV dataset
PDPs show the dependence of the predicted target response on one or more input features of interest, while holding all other features constant. An upward-sloping line or curve in the PDP indicates that the relationship between the target and the input feature(s) is positive, and the steepness indicates the strength of the relationship. A downward-sloping line or curve indicates that the target variable increases as the input feature decreases. Intuitively, you can interpret the partial dependence as the response of the target variable to each input feature of interest.
The following configuration example uses a PDPConfig
object to instruct the SageMaker Clarify processing job to compute the importance of the
Income feature.

pdp_config = clarify.PDPConfig(
    features=["Income"],
    grid_resolution=10,
)
In the previous example, the grid_resolution parameter
divides the range of the Income feature values into
10 buckets. The SageMaker Clarify processing job will generate PDPs for
Income split into 10 segments on the x-axis.
The y-axis will show the marginal impact of Income on the
target variable.
The following code example launches a SageMaker Clarify processing job to compute PDPs.
clarify_processor.run_explainability(
    data_config=data_config,
    model_config=model_config,
    model_scores=probability_config,
    explainability_config=pdp_config,
)
For an example notebook with instructions on how to run a SageMaker Clarify processing
job in SageMaker Studio to compute PDPs, see Explainability with SageMaker Clarify - Partial Dependence Plots (PDP).
How to compute both SHAP values and PDPs for a CSV dataset
You can compute both SHAP values and PDPs in a single SageMaker Clarify processing
job. In the following configuration example, the top_k_features
parameter of a new PDPConfig object is set to 2.
This instructs the SageMaker Clarify processing job to compute PDPs for the
2 features that have the largest global SHAP values.

shap_pdp_config = clarify.PDPConfig(
    top_k_features=2,
    grid_resolution=10,
)
The following code example launches a SageMaker Clarify processing job to compute both SHAP values and PDPs.
clarify_processor.run_explainability(
    data_config=data_config,
    model_config=model_config,
    model_scores=probability_config,
    explainability_config=[shap_config, shap_pdp_config],
)
The following examples show how to configure bias analysis and explainability
analysis for a tabular dataset in SageMaker JSON Lines dense format. See JSONLINES Request Format for more information.
In these examples, the incoming dataset has the same data as the previous
section, but it's in the JSON Lines format. Each line is a valid JSON object.
The key Features points to an array of feature values, and the key
Label points to the ground truth label.

{"Features":[25,0,2850,2],"Label":0}
{"Features":[36,0,6585,0],"Label":1}
{"Features":[22,1,1759,1],"Label":1}
{"Features":[48,0,3446,1],"Label":0}
...
In the following configuration example, the DataConfig object
specifies the input dataset and where to store the output. The features
parameter is set to the JMESPath expression Features so that the SageMaker Clarify processing job can extract
the array of features from each record. The label parameter is set
to the JMESPath expression Label so that the SageMaker Clarify processing job can
extract the ground truth label from each record.

data_config = clarify.DataConfig(
    s3_data_input_path=jsonl_dataset_s3_uri,
    dataset_type='application/jsonlines',
    headers=['Age', 'Gender', 'Income', 'Occupation', 'Target'],
    label='Label',
    features='Features',
    s3_output_path=clarify_job_output_s3_uri,
)
You must have a trained model to compute post-training bias metrics or feature
importance. The following example is from a binary classification model that
outputs JSON Lines data in the example's format. Each row of the model output is
a valid JSON object. The key predicted_label points to the
predicted label, and the key probability points to the probability
value.

{"predicted_label":0,"probability":0.028986845165491}
{"predicted_label":1,"probability":0.825382471084594}
...
In the following configuration example, a ModelConfig object
instructs the SageMaker Clarify processing job to deploy the SageMaker model to an ephemeral
endpoint. The endpoint uses one ml.m4.xlarge inference
instance.

model_config = clarify.ModelConfig(
    model_name=your_model,
    instance_type='ml.m4.xlarge',
    instance_count=1,
    content_template='{"Features":$features}',
)
In the previous configuration example, the parameters content_type and
accept_type are not set. Therefore, they automatically use the
value of the dataset_type parameter of the DataConfig
object, which is application/jsonlines. The SageMaker Clarify processing job
uses the content_template parameter to compose the model input by
replacing the $features placeholder with an array of features.
The following example configuration shows how to set the label parameter of
the ModelPredictedLabelConfig object to the JMESPath expression
predicted_label. This will extract the predicted label from the
model output.

predicted_label_config = clarify.ModelPredictedLabelConfig(
    label='predicted_label',
)
The following example configuration shows how to set the
probability parameter of the
ModelPredictedLabelConfig object to the JMESPath expression
probability. This will extract the score from the model
output.

probability_config = clarify.ModelPredictedLabelConfig(
    probability='probability',
)
To compute bias metrics and feature importance for datasets in JSON Lines
format, use the same run statements and configuration objects as the previous
section for CSV datasets. You can run a SageMaker Clarify processing job in SageMaker Studio
to detect bias and compute feature importance. For instructions and an example
notebook, see Fairness and Explainability with SageMaker Clarify (JSON Lines Format).
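For example, a sketch of the combined bias run, reusing the JSON Lines configuration objects defined above:

clarify_processor.run_bias(
    data_config=data_config,
    bias_config=bias_config,
    model_config=model_config,
    model_predicted_label_config=predicted_label_config,
    pre_training_methods="all",
    post_training_methods="all",
)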
SageMaker Clarify supports explanations for natural language processing (NLP) models. These explanations help you understand which sections of text are the most important for your model predictions. You can explain either the model prediction for a single instance of the input dataset, or model predictions from the baseline dataset. To understand and visualize a model's behavior, you can specify multiple levels of granularity. To do this, define the length of the text segment, such as its tokens, sentences, or paragraphs.
SageMaker Clarify NLP explainability is compatible with both classification and regression models. You can also use SageMaker Clarify to explain your model's behavior on multi-modal datasets that contain text, categorical, or numerical features. NLP explainability for multi-modal datasets can help you understand how important each feature is to the model's output. SageMaker Clarify supports 62 languages and can handle text that includes multiple languages.
The following example shows an analysis configuration file for computing feature importance for NLP. In this example, the incoming dataset is a tabular dataset in CSV format, with one binary label column and two feature columns.
0,2,"Flavor needs work" 1,3,"They taste good" 1,5,"The best" 0,1,"Taste is awful" ...
The following configuration example shows how to specify an input dataset in
CSV format and output data path using the DataConfig object.

nlp_data_config = clarify.DataConfig(
    s3_data_input_path=nlp_dataset_s3_uri,
    dataset_type='text/csv',
    headers=['Target', 'Rating', 'Comments'],
    label='Target',
    s3_output_path=clarify_job_output_s3_uri,
)
The following example output was created from a binary classification model
trained on the previous input dataset. The classification model accepts CSV
data, and it outputs a single score between 0 and
1.

0.491656005382537
0.569582343101501
...
The following example shows how to configure the ModelConfig
object to deploy a SageMaker model. In this example, an ephemeral endpoint deploys
the model. This endpoint uses one ml.g4dn.xlarge inference instance
equipped with a GPU, for accelerated inferencing.

nlp_model_config = clarify.ModelConfig(
    model_name=your_nlp_model_name,
    instance_type='ml.g4dn.xlarge',
    instance_count=1,
)
The following example shows how to configure the
ModelPredictedLabelConfig object to locate the probability
(score) in the first column with an index of 0.

probability_config = clarify.ModelPredictedLabelConfig(
    probability=0,
)
The following example SHAP configuration shows how to run a token-wise explainability analysis using a model and an input dataset in the English language.
text_config = clarify.TextConfig(
    language='english',
    granularity='token',
)

nlp_shap_config = clarify.SHAPConfig(
    baseline=[[4, '[MASK]']],
    num_samples=100,
    text_config=text_config,
)
In the previous example, the TextConfig object activates the NLP
explainability analysis. The granularity parameter indicates that
the analysis should parse tokens. In English, each token is a word. For other
languages, see the spaCy documentation for tokenization. The previous example
also uses an average Rating of 4 to set an in-place SHAP
baseline instance. A special mask token [MASK] is used to replace
a token (word) in Comments.
In the previous example, if the instance is 2,"Flavor needs
work", set the baseline to an average Rating of
4 with the comment masked, as in the following baseline.

4, '[MASK]'
In the previous example, the SageMaker Clarify explainer iterates through each token and replaces it with the mask, as follows.
2,"[MASK] needs work" 4,"Flavor [MASK] work" 4,"Flavor needs [MASK]"
Then, the SageMaker Clarify explainer will send each line to your model for predictions. This is so that the explainer learns the predictions with and without the masked words. The SageMaker Clarify explainer then uses this information to compute the contribution of each token.
The following code example launches a SageMaker Clarify processing job to compute SHAP values.
clarify_processor.run_explainability(
    data_config=nlp_data_config,
    model_config=nlp_model_config,
    model_scores=probability_config,
    explainability_config=nlp_shap_config,
)
For an example notebook with instructions on how to run a SageMaker Clarify processing job
in SageMaker Studio for NLP explainability analysis, see Explaining Text Sentiment Analysis Using SageMaker Clarify.
SageMaker Clarify generates heat maps that provide insights into how your computer vision models classify and detect objects in your images.
In the following configuration example, the input dataset consists of JPEG images.
cv_data_config = clarify.DataConfig(
    s3_data_input_path=cv_dataset_s3_uri,
    dataset_type="application/x-image",
    s3_output_path=clarify_job_output_s3_uri,
)
In the previous configuration example, the DataConfig object
contains an s3_data_input_path set to an S3 URI prefix. The
SageMaker Clarify processing job recursively collects all image files located under the
prefix.
How to explain an image classification model
The SageMaker Clarify processing job explains images using the KernelSHAP algorithm, which treats the image as a collection of super pixels. Given a dataset consisting of images, the processing job outputs a dataset of images where each image shows the heat map of the relevant super pixels.
The following configuration example shows how to configure an explainability analysis using a SageMaker image classification model. See Image Classification - MXNet for more information.
ic_model_config = clarify.ModelConfig(
    model_name=your_cv_ic_model,
    instance_type="ml.p2.xlarge",
    instance_count=1,
    content_type="image/jpeg",
    accept_type="application/json",
)
In the previous configuration example, a model named
your_cv_ic_model has been trained to classify the animals
in input JPEG images. The ModelConfig object in the previous
example instructs the SageMaker Clarify processing job to deploy the SageMaker model to an
ephemeral endpoint. For accelerated inferencing, the endpoint uses one
ml.p2.xlarge inference instance equipped with a GPU.
After a JPEG image is sent to an endpoint, the endpoint classifies it and
returns a list of scores. Each score is for a category. The
ModelPredictedLabelConfig object provides the name of each
category, as follows.

ic_prediction_config = clarify.ModelPredictedLabelConfig(
    label_headers=['bird', 'cat', 'dog'],
)
An example output for the previous input of ['bird','cat','dog'] could be 0.3,0.6,0.1, where 0.3 represents the confidence score for classifying an image as a bird.
The following example SHAP configuration shows how to generate
explanations for an image classification problem. It uses an
ImageConfig object to activate the analysis.

ic_image_config = clarify.ImageConfig(
    model_type="IMAGE_CLASSIFICATION",
    num_segments=20,
    segment_compactness=5,
)

ic_shap_config = clarify.SHAPConfig(
    num_samples=100,
    image_config=ic_image_config,
)
SageMaker Clarify extracts features using the Simple Linear Iterative Clustering (SLIC)
method from the scikit-image library. In the previous configuration example, the
model_type parameter indicates the type of image
classification problem. The num_segments parameter specifies
the approximate number of segments to be labeled in the input image.
The number of segments is then passed to the slic n_segments
parameter.

Each segment of the image is considered a super-pixel, and local SHAP
values are computed for each segment. The parameter
segment_compactness determines the shape and size of the
image segments that are generated by the scikit-image slic method. The sizes
and shapes of the image segments are then passed to the slic
compactness parameter.
The following code example launches a SageMaker Clarify processing job to generate heat maps for your images. Positive heat map values show that the feature increased the confidence score of detecting the object. Negative values indicate that the feature decreased the confidence score.
clarify_processor.run_explainability(
    data_config=cv_data_config,
    model_config=ic_model_config,
    model_scores=ic_prediction_config,
    explainability_config=ic_shap_config,
)
For a sample notebook that uses SageMaker Clarify to classify images and explain its
classification, see Explaining Image Classification with SageMaker Clarify.
How to explain an object detection model
A SageMaker Clarify processing job can detect and classify objects in an image and then provide an explanation for the detected object. The process for explanation is as follows.
1. Image objects are first categorized into one of the classes in a specified collection. For example, if an object detection model can recognize cat, dog, and fish, then these three classes are in a collection. This collection is specified by the label_headers parameter as follows.

clarify.ModelPredictedLabelConfig(
    label_headers=object_categories,
)
2. The SageMaker Clarify processing job produces a confidence score for each object. A high confidence score indicates that the object belongs to one of the classes in the specified collection. The SageMaker Clarify processing job also produces the coordinates of a bounding box that delimits the object. For more information about confidence scores and bounding boxes, see Response Formats.
3. SageMaker Clarify then provides an explanation for the detection of an object in the image scene. It uses the methods described in the How to explain an image classification model section.
In the following configuration example, a SageMaker object detection model
your_cv_od_model is trained on JPEG images to identify the
animals in them.

od_model_config = clarify.ModelConfig(
    model_name=your_cv_od_model,
    instance_type="ml.p2.xlarge",
    instance_count=1,
    content_type="image/jpeg",
    accept_type="application/json",
)
The ModelConfig object in the previous configuration example
instructs the SageMaker Clarify processing job to deploy the SageMaker model to an ephemeral
endpoint. For accelerated inferencing, this endpoint uses one
ml.p2.xlarge inference instance equipped with a GPU.
In the following example configuration, the
ModelPredictedLabelConfig object provides the name of each
category for classification.

od_prediction_config = clarify.ModelPredictedLabelConfig(
    label_headers=['bird', 'cat', 'dog'],
)
The following example SHAP configuration shows how to generate explanations for an object detection problem.

od_image_config = clarify.ImageConfig(
    model_type="OBJECT_DETECTION",
    num_segments=20,
    segment_compactness=5,
    max_objects=5,
    iou_threshold=0.5,
    context=1.0,
)

od_shap_config = clarify.SHAPConfig(
    num_samples=100,
    image_config=od_image_config,
)
In the previous example configuration, the ImageConfig object
activates the analysis. The model_type parameter indicates that
the type of problem is object detection. For a detailed description of the
other parameters, see Configure the Analysis.
The following code example launches a SageMaker Clarify processing job to generate heat maps for your images. Positive heat map values show that the feature increased the confidence score of detecting the object. Negative values indicate that the feature decreased the confidence score.
clarify_processor.run_explainability(
    data_config=cv_data_config,
    model_config=od_model_config,
    model_scores=od_prediction_config,
    explainability_config=od_shap_config,
)
For a sample notebook that uses SageMaker Clarify to detect objects in an image and
explain its predictions, see Explaining object detection models with Amazon SageMaker
Clarify.
How to Run a SageMaker Clarify Processing Job with Spark
When working with large datasets, you can use Apache Spark to increase the speed of
your SageMaker Clarify processing jobs. The following configuration example shows how to use
SageMakerClarifyProcessor to create a SageMaker Clarify processor with
5 compute instances. To run any jobs associated with the
SageMakerClarifyProcessor, SageMaker Clarify uses Spark distributed
processing.

from sagemaker import clarify

spark_clarify_processor = clarify.SageMakerClarifyProcessor(
    role=role,
    instance_count=5,
    instance_type='ml.c5.xlarge',
)
If you set the save_local_shap_values parameter of SHAPConfig
to True, the SageMaker Clarify processing job saves the
local SHAP values as multiple part files in the job output location.
To associate the local SHAP values with the input dataset instances, use the
joinsource parameter of DataConfig. If you add more
compute instances, we recommend that you also increase the
instance_count of ModelConfig.
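A sketch of these two settings together, based on the CSV dataset from the first section. The ID column and its values are hypothetical; use a column from your own data that uniquely identifies each row.

spark_data_config = clarify.DataConfig(
    s3_data_input_path=dataset_s3_uri,
    dataset_type='text/csv',
    headers=['ID', 'Target', 'Age', 'Gender', 'Income', 'Occupation'],  # hypothetical ID column added
    label='Target',
    joinsource='ID',  # join local SHAP values back to the input rows
    s3_output_path=clarify_job_output_s3_uri,
)

spark_shap_config = clarify.SHAPConfig(
    num_clusters=1,
    save_local_shap_values=True,  # write local SHAP values to the output location
)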