Total Variation Distance (TVD)

The total variation distance (TVD) data bias metric is half the L₁-norm of the difference between the label distributions of facets a and d. It is the largest possible difference between the probabilities that the two distributions assign to the same set of label outcomes. For binary data, the L₁-norm corresponds to the Hamming distance, a metric used to compare two binary strings of equal length by counting the minimum number of substitutions required to change one string into the other. If one string is meant to be a copy of the other, the Hamming distance is the number of errors that occurred during copying. In the bias detection context, TVD quantifies how many outcomes in facet a would have to change to match the outcomes in facet d.
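The Hamming distance mentioned above can be illustrated with a minimal Python sketch. The function name hamming_distance and the two example strings are hypothetical, chosen only for illustration.

```python
def hamming_distance(s: str, t: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(s) != len(t):
        raise ValueError("strings must have the same length")
    return sum(a != b for a, b in zip(s, t))


# Hypothetical example: a binary string and a copy with two flipped bits.
original = "1011101"
copy = "1001001"
print(hamming_distance(original, copy))  # prints 2
```

Each differing position corresponds to one substitution needed to turn the copy back into the original.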

The formula for the total variation distance is as follows:

TVD = ½ * L₁(P_a, P_d)

For example, assume you have an outcome distribution with three categories, y_i = {y₀, y₁, y₂} = {accepted, waitlisted, rejected}, in a college admissions multicategory scenario. To calculate TVD, you take the absolute differences between the counts of facets a and d for each outcome, sum them to get the L₁-norm, and halve the sum. The L₁-norm is as follows:

L₁(P_a, P_d) = |n_a⁽⁰⁾ − n_d⁽⁰⁾| + |n_a⁽¹⁾ − n_d⁽¹⁾| + |n_a⁽²⁾ − n_d⁽²⁾|

Where:

n_a⁽ⁱ⁾ is the number of ith category outcomes in facet a: for example, n_a⁽⁰⁾ is the number of facet a acceptances.
n_d⁽ⁱ⁾ is the number of ith category outcomes in facet d: for example, n_d⁽²⁾ is the number of facet d rejections.
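The count-based calculation can be sketched in Python as follows. The function names and the counts are hypothetical, and the sketch assumes both facets have the same number of members (100 each), so that the result reads directly as a number of outcomes.

```python
def l1_distance(counts_a, counts_d):
    """Sum of absolute per-category differences between two count lists."""
    if len(counts_a) != len(counts_d):
        raise ValueError("both facets need one count per category")
    return sum(abs(na - nd) for na, nd in zip(counts_a, counts_d))


def tvd(counts_a, counts_d):
    """Half the L1 distance, following TVD = 1/2 * L1(P_a, P_d)."""
    return 0.5 * l1_distance(counts_a, counts_d)


# Hypothetical counts, ordered [accepted, waitlisted, rejected].
n_a = [40, 20, 40]  # facet a
n_d = [30, 20, 50]  # facet d

print(l1_distance(n_a, n_d))  # prints 20
print(tvd(n_a, n_d))          # prints 10.0
```

With these assumed counts, moving 10 facet a outcomes from accepted to rejected would make the two facets match, which is the TVD value.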

The range of TVD values for binary, multicategory, and continuous outcomes is [0, 1), where:
- Values near zero mean the labels are similarly distributed.
- Positive values mean the label distributions diverge. The larger the value, the greater the divergence.
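The interpretation of these values can be checked with a short Python sketch. It assumes that each facet's counts are first converted to proportions, so that facets of different sizes are comparable and the result stays within the stated range. The function name and the counts are hypothetical.

```python
def tvd_from_proportions(counts_a, counts_d):
    """TVD computed on per-facet label proportions (an assumed normalization)."""
    total_a, total_d = sum(counts_a), sum(counts_d)
    p_a = [n / total_a for n in counts_a]
    p_d = [n / total_d for n in counts_d]
    return 0.5 * sum(abs(pa - pd) for pa, pd in zip(p_a, p_d))


# Same label proportions in both facets, different facet sizes: no divergence.
similar = tvd_from_proportions([40, 20, 40], [80, 40, 80])

# Facet a is mostly accepted, facet d is mostly rejected: strong divergence.
divergent = tvd_from_proportions([90, 5, 5], [5, 5, 90])

print(similar, divergent)
```

The first pair gives a value of zero, and the second gives a value of about 0.85, consistent with larger values indicating greater divergence.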
