Measure Posttraining Data and Model Bias - Amazon SageMaker

# Measure Posttraining Data and Model Bias

Amazon SageMaker Clarify provides eleven posttraining data and model bias metrics to help quantify various conceptions of fairness. These conceptions cannot all be satisfied simultaneously, and the selection depends on the specifics of the case being analyzed for potential bias. Most of these metrics are combinations of the numbers taken from the binary classification confusion matrices for the different demographic groups. Because fairness and bias can be defined by a wide range of metrics, human judgment is required to understand and choose which metrics are relevant to the individual use case, and customers should consult with appropriate stakeholders to determine the appropriate measure of fairness for their application.

We use the following notation to discuss the bias metrics. The conceptual model described here is for binary classification, where events are labeled as having only two possible outcomes in their sample space, referred to as positive (with value 1) and negative (with value 0). This framework is usually extensible to multicategory classification in a straightforward way or to cases involving continuous valued outcomes when needed. In the binary classification case, positive and negative labels are assigned to outcomes recorded in a raw dataset for a favored facet a and for a disfavored facet d. These labels y are referred to as observed labels to distinguish them from the predicted labels y' that are assigned by a machine learning model during the training or inference stages of the ML lifecycle. The observed labels are used to define probability distributions Pa(y) and Pd(y) for their respective facet outcomes.

• labels:

• y represents the n observed labels for event outcomes in a training dataset.

• y' represents the labels that a trained model predicts for the n observed labels in the dataset.

• outcomes:

• A positive outcome (with value 1) for a sample, such as an application acceptance.

• n(1) is the number of observed labels for positive outcomes (acceptances).

• n'(1) is the number of predicted labels for positive outcomes (acceptances).

• A negative outcome (with value 0) for a sample, such as an application rejection.

• n(0) is the number of observed labels for negative outcomes (rejections).

• n'(0) is the number of predicted labels for negative outcomes (rejections).

• facet values:

• facet a – The feature value that defines a demographic that bias favors.

• na is the number of observed labels for the favored facet value: na = na(1) + na(0), the sum of the positive and negative observed labels for the facet value a.

• n'a is the number of predicted labels for the favored facet value: n'a = n'a(1) + n'a(0), the sum of the positive and negative predicted labels for the facet value a. Note that n'a = na.

• facet d – The feature value that defines a demographic that bias disfavors.

• nd is the number of observed labels for the disfavored facet value: nd = nd(1) + nd(0), the sum of the positive and negative observed labels for the facet value d.

• n'd is the number of predicted labels for the disfavored facet value: n'd = n'd(1) + n'd(0), the sum of the positive and negative predicted labels for the facet value d. Note that n'd = nd.

• probability distributions for the outcomes of the labeled facet data:

• Pa(y) is the probability distribution of the observed labels for facet a. For binary labeled data, this distribution is given by the ratio of the number of samples in facet a labeled with positive outcomes to the total number, Pa(y1) = na(1)/na, and the ratio of the number of samples with negative outcomes to the total number, Pa(y0) = na(0)/na.

• Pd(y) is the probability distribution of the observed labels for facet d. For binary labeled data, this distribution is given by the ratio of the number of samples in facet d labeled with positive outcomes to the total number, Pd(y1) = nd(1)/nd, and the ratio of the number of samples with negative outcomes to the total number, Pd(y0) = nd(0)/nd.
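The notation above can be illustrated with a minimal Python sketch. It is not SageMaker Clarify code: the helper names `facet_counts` and `label_distribution` and the toy data are hypothetical, chosen only to show how the counts n, n(1), n(0) and the distributions Pa(y) and Pd(y) relate to one another.

```python
from collections import Counter


def facet_counts(labels, facets, facet_value):
    """Return (n, n1, n0) for one facet: total, positive (1), negative (0)."""
    in_facet = [y for y, f in zip(labels, facets) if f == facet_value]
    counts = Counter(in_facet)
    return len(in_facet), counts[1], counts[0]


def label_distribution(labels, facets, facet_value):
    """Return (P(y1), P(y0)) for one facet, that is n(1)/n and n(0)/n."""
    n, n1, n0 = facet_counts(labels, facets, facet_value)
    if n == 0:
        raise ValueError(f"no samples found for facet {facet_value!r}")
    return n1 / n, n0 / n


# Hypothetical toy data (not from any real dataset).
facets = ["a", "a", "a", "a", "d", "d", "d", "d"]
y = [1, 1, 1, 0, 1, 0, 0, 0]       # observed labels y
y_pred = [1, 1, 0, 0, 1, 1, 0, 0]  # predicted labels y'

n_a, n_a1, n_a0 = facet_counts(y, facets, "a")
np_a, np_a1, np_a0 = facet_counts(y_pred, facets, "a")
assert np_a == n_a  # n'a = na: each observed label has exactly one prediction

P_a = label_distribution(y, facets, "a")
P_d = label_distribution(y, facets, "d")
print("Pa(y1), Pa(y0) =", P_a)
print("Pd(y1), Pd(y0) =", P_d)
```

On this toy data, facet a has Pa(y1) = 0.75 and facet d has Pd(y1) = 0.25, so the observed labels already differ between facets before any model is involved.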

The following table contains a cheat sheet for quick guidance and links to the posttraining bias metrics.

Posttraining Bias Metrics

Each entry gives, in order: the posttraining bias metric, its description, an example question, and guidance for interpreting metric values.
Difference in Positive Proportions in Predicted Labels (DPPL)

Measures the difference in the proportion of positive predictions between the favored facet a and the disfavored facet d.

Has there been an imbalance across demographic groups in the predicted positive outcomes that might indicate bias?

Range for normalized binary & multicategory facet labels: [-1,+1]

Range for continuous labels: (-∞, +∞)

Interpretation:

• Positive values indicate that the favored facet a has a higher proportion of predicted positive outcomes.

• Values near zero indicate a more equal proportion of predicted positive outcomes between facets.

• Negative values indicate the disfavored facet d has a higher proportion of predicted positive outcomes.

Disparate Impact (DI)

Measures the ratio of proportions of the predicted labels for the favored facet a and the disfavored facet d.

Has there been an imbalance across demographic groups in the predicted positive outcomes that might indicate bias?

Range for normalized binary, multicategory facet, and continuous labels: [0,∞)

Interpretation:

• Values greater than 1 indicate the disfavored facet d has a higher proportion of predicted positive outcomes (taking DI as the ratio of the proportion for facet d to the proportion for facet a).

• A value of 1 indicates demographic parity.

• Values less than 1 indicate the favored facet a has a higher proportion of predicted positive outcomes.

Difference in Conditional Acceptance (DCAcc)

Compares the observed labels to the labels predicted by a model and assesses whether this is the same across facets for predicted positive outcomes (acceptances).

Are there more or fewer acceptances for loan applications than predicted for one age group as compared to another based on qualifications?

Range for binary, multicategory facet, and continuous labels: (-∞, +∞)

Interpretation:

• Positive values indicate a possible bias against the qualified applicants from the disfavored facet d.

• Values near zero indicate that qualified applicants from both facets are being accepted in a similar way.

• Negative values indicate a possible bias against the qualified applicants from the favored facet a.

Difference in Conditional Rejection (DCR)

Compares the observed labels to the labels predicted by a model and assesses whether this is the same across facets for negative outcomes (rejections).

Are there more or fewer rejections for loan applications than predicted for one age group as compared to another based on qualifications?

Range for binary, multicategory facet, and continuous labels: (-∞, +∞)

Interpretation:

• Positive values indicate a possible bias against the qualified applicants from the disfavored facet d.

• Values near zero indicate that qualified applicants from both facets are being rejected in a similar way.

• Negative values indicate a possible bias against the qualified applicants from the favored facet a.

Recall Difference (RD)

Compares the recall of the model for the favored and disfavored facets.

Is there an age-based bias in lending due to a model having higher recall for one age group as compared to another?

Range for binary and multicategory classification: [-1, +1]

Interpretation:

• Positive values suggest that the model finds more of the true positives for facet a and is biased against the disfavored facet d.

• Values near zero suggest that the model finds about the same number of true positives in both facets and is not biased.

• Negative values suggest that the model finds more of the true positives for facet d and is biased against the favored facet a.

Difference in Acceptance Rates (DAR)

Measures the difference in the ratios of the observed positive outcomes (TP) to the predicted positives (TP + FP) between the favored and disfavored facets.

Does the model have equal precision when predicting loan acceptances for qualified applicants across all age groups?

Range for binary, multicategory facet, and continuous labels: [-1, +1]

Interpretation:

• Positive values indicate a possible bias against facet d caused by the occurrence of relatively more false positives in the disfavored facet d.

• Values near zero indicate the observed labels for positive outcomes (acceptances) are being predicted with equal precision for both facets by the model.

• Negative values indicate a possible bias against facet a caused by the occurrence of relatively more false positives in the favored facet a.

Difference in Rejection Rates (DRR)

Measures the difference in the ratios of the observed negative outcomes (TN) to the predicted negatives (TN + FN) between the disfavored and favored facets.

Does the model have equal precision when predicting loan rejections for unqualified applicants across all age groups?

Range for binary, multicategory facet, and continuous labels: [-1, +1]

Interpretation:

• Positive values indicate a possible bias caused by the occurrence of relatively more false negatives in the favored facet a.

• Values near zero indicate the observed labels for negative outcomes (rejections) are being predicted with equal precision for both facets by the model.

• Negative values indicate a possible bias caused by the occurrence of relatively more false negatives in the disfavored facet d.

Accuracy Difference (AD)

Measures the difference between the prediction accuracy for the favored and disfavored facets.

Does the model predict labels as accurately for applications across all demographic groups?

Range for binary and multicategory facet labels: [-1, +1]

Interpretation:

• Positive values indicate that facet d suffers more from some combination of false positives (Type I errors) or false negatives (Type II errors). This means there is a potential bias against the disfavored facet d.

• Values near zero occur when the prediction accuracy for facet a is similar to that for facet d.

• Negative values indicate that facet a suffers more from some combination of false positives (Type I errors) or false negatives (Type II errors). This means there is a potential bias against the favored facet a.

Treatment Equality (TE)

Measures the difference in the ratio of false positives to false negatives between the favored and disfavored facets.

In loan applications, is the relative ratio of false positives to false negatives the same across all age demographics?

Range for binary and multicategory facet labels: (-∞, +∞)

Interpretation:

• Positive values occur when the ratio of false positives to false negatives for facet a is greater than that for facet d.

• Values near zero occur when the ratio of false positives to false negatives for facet a is similar to that for facet d.

• Negative values occur when the ratio of false positives to false negatives for facet a is less than that for facet d.

Conditional Demographic Disparity in Predicted Labels (CDDPL)

Measures the disparity of predicted labels between the facets as a whole, but also by subgroups.

Do some demographic groups have a larger proportion of rejections for loan application outcomes than their proportion of acceptances?

Range for binary, multicategory, and continuous outcomes: [-1, +1]

Interpretation:

• Positive values indicate outcomes where facet d is rejected more than accepted.

• Near zero indicates no demographic disparity on average.

• Negative values indicate outcomes where facet a is rejected more than accepted.

Counterfactual Fliptest (FT)

Examines each member of facet d and assesses whether similar members of facet a have different model predictions.

Is a group of a specific age demographic, matched closely on all features with another age group, paid more on average than that other age group?

Range for binary and multicategory facet labels: [-1, +1]

Interpretation:

• Positive values occur when the number of unfavorable counterfactual fliptest decisions for the disfavored facet d exceeds the favorable ones.

• Values near zero occur when the number of unfavorable and favorable counterfactual fliptest decisions balance out.

• Negative values occur when the number of unfavorable counterfactual fliptest decisions for the disfavored facet d is less than the favorable ones.
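Several of the metrics above (DPPL, RD, DAR, DRR, and AD) are described directly in terms of per-facet confusion-matrix counts, so they can be sketched in a few lines of Python. The sketch below is illustrative only and is not the SageMaker Clarify implementation: the `Confusion` class, the `posttraining_metrics` function, and the counts are hypothetical, and the sign conventions (facet a minus facet d, except DRR, which is facet d minus facet a) are an assumption chosen to agree with the interpretation bullets above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Binary confusion-matrix counts for one facet."""
    tp: int  # observed 1, predicted 1
    fp: int  # observed 0, predicted 1
    tn: int  # observed 0, predicted 0
    fn: int  # observed 1, predicted 0

    @property
    def n(self) -> int:
        return self.tp + self.fp + self.tn + self.fn


def _ratio(num: int, den: int) -> float:
    if den == 0:
        raise ZeroDivisionError("metric undefined: denominator is zero")
    return num / den


def posttraining_metrics(a: Confusion, d: Confusion) -> dict:
    """Confusion-matrix metrics for favored facet a and disfavored facet d."""
    return {
        # Proportion of positive predictions, a minus d.
        "DPPL": _ratio(a.tp + a.fp, a.n) - _ratio(d.tp + d.fp, d.n),
        # Recall TP / (TP + FN), a minus d.
        "RD": _ratio(a.tp, a.tp + a.fn) - _ratio(d.tp, d.tp + d.fn),
        # Precision of acceptances TP / (TP + FP), a minus d.
        "DAR": _ratio(a.tp, a.tp + a.fp) - _ratio(d.tp, d.tp + d.fp),
        # Precision of rejections TN / (TN + FN), d minus a.
        "DRR": _ratio(d.tn, d.tn + d.fn) - _ratio(a.tn, a.tn + a.fn),
        # Accuracy (TP + TN) / n, a minus d.
        "AD": _ratio(a.tp + a.tn, a.n) - _ratio(d.tp + d.tn, d.n),
    }


# Hypothetical counts for illustration only.
facet_a = Confusion(tp=40, fp=10, tn=40, fn=10)
facet_d = Confusion(tp=20, fp=20, tn=50, fn=10)

metrics = posttraining_metrics(facet_a, facet_d)
for name, value in metrics.items():
    print(f"{name}: {value:+.3f}")
```

With these counts every metric is positive. For example, DAR is positive because facet d has relatively more false positives among its predicted acceptances, which matches the interpretation given for DAR above.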

For additional information about posttraining bias metrics, see A Family of Fairness Measures for Machine Learning in Finance.
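The CDDPL entry can also be sketched in code, because conditioning on subgroups is easier to follow as a computation. The sketch below rests on an assumption that the text above does not state: the demographic disparity (DD) for facet d is taken as its share of all predicted rejections minus its share of all predicted acceptances, and CDDPL is the subgroup-size-weighted average of DD. The function names and the toy data are hypothetical.

```python
def demographic_disparity(pred, facets, facet_value="d"):
    """Share of predicted rejections minus share of predicted acceptances
    that belong to `facet_value`."""
    rejected = sum(1 for p in pred if p == 0)
    accepted = sum(1 for p in pred if p == 1)
    if rejected == 0 or accepted == 0:
        raise ValueError("DD is undefined unless both outcomes are predicted")
    rej_f = sum(1 for p, f in zip(pred, facets) if p == 0 and f == facet_value)
    acc_f = sum(1 for p, f in zip(pred, facets) if p == 1 and f == facet_value)
    return rej_f / rejected - acc_f / accepted


def conditional_demographic_disparity(pred, facets, groups, facet_value="d"):
    """Average of per-subgroup DD, weighted by subgroup size."""
    total = len(pred)
    cdd = 0.0
    for g in set(groups):
        idx = [i for i, gi in enumerate(groups) if gi == g]
        dd = demographic_disparity(
            [pred[i] for i in idx], [facets[i] for i in idx], facet_value
        )
        cdd += len(idx) / total * dd
    return cdd


# Hypothetical toy data: predicted labels, facet membership, and a
# conditioning attribute (for example, a department or income band).
y_pred = [1, 1, 0, 1, 1, 0, 0, 0]
facets = ["a", "a", "d", "d", "a", "d", "d", "a"]
groups = ["X", "X", "X", "X", "Y", "Y", "Y", "Y"]

dd_overall = demographic_disparity(y_pred, facets)
cddpl = conditional_demographic_disparity(y_pred, facets, groups)
print(f"DD overall: {dd_overall:+.3f}")
print(f"CDDPL:      {cddpl:+.3f}")
```

Both values are positive here, which under the assumed definition corresponds to facet d being rejected more than accepted, both overall and after conditioning on the subgroups.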
