Deterministic overconfidence

Gal and Ghahramani (2016) warned against interpreting softmax probabilities as confidence scores. They empirically showed that passing a point estimate through the softmax activation function yields large probabilities, whereas passing a distribution of estimates through the softmax yields more reasonable, lower confidence scores. This deterministic overconfidence is partly what motivates learning a predictive distribution $p(y \mid x)$ instead of a single prediction $y = f(x)$.

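Formally, the deterministic overconfidence conjecture can be detailed by the following inequality:

$$\mathcal{H}\left(P^{\text{Bayesian}}\right) \geq \mathcal{H}\left(P^{\text{deterministic}}\right)$$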

The $\mathcal{H}(\cdot)$ operator represents Shannon's entropy, which is larger when elements of the input vector are more similar, and is therefore largest for uniform vectors. Thus, the previous equation states that the uncertainty, in terms of Shannon's entropy $\mathcal{H}(\cdot)$, of the expected softmax probability vector from a Bayesian model, $P^{\text{Bayesian}}$ (the average of a distribution), will be larger than or equal to that of the softmax probability vector from a deterministic model, $P^{\text{deterministic}}$ (from a model that produces a single point estimate). For a proof and demonstration of the inequality in the previous equation, see Appendix A.
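The entropy comparison described above can be checked numerically. The following is a minimal sketch, not the proof from Appendix A. It assumes a hypothetical Gaussian distribution over the logits of a three-class model, and it treats the deterministic model as the softmax of the mean logits. The names `mean_logits`, `logit_samples`, `p_bayesian`, and `p_deterministic` are illustrative only.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax along the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def entropy(probs, eps=1e-12):
    """Shannon entropy (in nats) of a single probability vector."""
    return float(-np.sum(probs * np.log(probs + eps)))


rng = np.random.default_rng(0)

# Assumption: a hypothetical distribution over logits for one input,
# standing in for the stochastic forward passes of a Bayesian model.
mean_logits = np.array([2.0, 0.0, -1.0])
logit_samples = rng.normal(loc=mean_logits, scale=2.0, size=(1000, 3))

# Bayesian-style estimate: average the softmax outputs over all samples.
p_bayesian = softmax(logit_samples).mean(axis=0)

# Deterministic estimate: softmax of a single point estimate
# (assumed here to be the sample mean of the logits).
p_deterministic = softmax(logit_samples.mean(axis=0))

h_bayesian = entropy(p_bayesian)
h_deterministic = entropy(p_deterministic)

print("Bayesian probabilities:     ", np.round(p_bayesian, 3))
print("Deterministic probabilities:", np.round(p_deterministic, 3))
print("Entropy (Bayesian):     ", round(h_bayesian, 3))
print("Entropy (deterministic):", round(h_deterministic, 3))
```

In this setup, the averaged softmax vector is closer to uniform than the point-estimate vector, so its entropy is larger and its top-class probability is lower.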

Deterministic overconfidence affects the reliability and safety of our deep learning models. Consider the case where a model confidently predicts that an item on an assembly line isn't defective when, in fact, it is, so the item skips the quality review process. This faulty item might then be embedded into a larger product, compromising its integrity. At best, the defect is caught further down the line and the result is an inefficiency. At worst, the defect isn't found and the product fails entirely. Therefore, it is critical to understand and overcome deterministic overconfidence issues for the success of our projects, and for the future of deep learning.

Three ways to improve the quality of uncertainty measurements and overcome overconfidence are:

Calibrating softmax probabilities, post-hoc, with temperature scaling (Guo et al. 2017)
Approximating Bayesian inference by MC dropout (that is, keeping dropout on during inference) (Gal and Ghahramani 2016)
Approximating Bayesian inference with deep ensembles (Lakshminarayanan, Pritzel, and Blundell 2017)
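As an illustration of the first approach, temperature scaling divides the logits by a single scalar T before the softmax and chooses T to minimize the negative log-likelihood on held-out validation data. The following is a minimal sketch under stated assumptions: the validation logits and labels are made up, the helper names `nll` and `fit_temperature` are hypothetical, and a simple grid search replaces the gradient-based optimization used by Guo et al. (2017).

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax along the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def nll(logits, labels, temperature):
    """Mean negative log-likelihood after dividing the logits by T."""
    probs = softmax(logits / temperature)
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.mean(np.log(picked + 1e-12)))


def fit_temperature(logits, labels, candidates=None):
    """Grid search for the temperature with the lowest validation NLL."""
    if candidates is None:
        candidates = np.linspace(0.5, 5.0, 91)  # step of 0.05
    losses = [nll(logits, labels, t) for t in candidates]
    return float(candidates[int(np.argmin(losses))])


# Assumption: a tiny, made-up validation set. Every prediction is
# highly confident, but only 3 of the 4 predictions are correct.
val_logits = np.array([
    [8.0, 0.0, 0.0],
    [0.0, 8.0, 0.0],
    [0.0, 0.0, 8.0],
    [8.0, 0.0, 0.0],
])
val_labels = np.array([0, 1, 2, 1])  # the last example is misclassified

temperature = fit_temperature(val_logits, val_labels)
uncalibrated = softmax(val_logits)
calibrated = softmax(val_logits / temperature)

print("Fitted temperature:", temperature)
print("Top probability before scaling:", round(float(uncalibrated.max()), 4))
print("Top probability after scaling: ", round(float(calibrated.max()), 4))
```

Because every logit is divided by the same positive scalar, the predicted class does not change. Only the confidence is softened, which in this toy example brings the top probability close to the 75 percent validation accuracy.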

Deterministic overconfidence is a theory that applies to both in-distribution and out-of-distribution data.¹ The next sections explain how to split the total quantifiable uncertainty² into its two constituent components: epistemic (model) uncertainty and aleatoric (data) uncertainty (Kendall and Gal 2017).

Notes

¹ In particular, rectified linear unit (ReLU) overconfidence has recently been found to be a significant contributor to overconfidence when data is far away from the decision boundary, especially when data becomes out of distribution (Hein, Andriushchenko, and Bitterwolf 2019). One suggested way to become robust against ReLU overconfidence is to model the information-theoretic notion of aleatoric uncertainty (Gal and Ghahramani 2016; Hein, Andriushchenko, and Bitterwolf 2019; van Amersfoort et al. 2020), which is explained later in this guide.

² Some fields decompose total uncertainty into uncertainty that is quantifiable, and uncertainty that is not quantifiable. The discussion in this guide is limited to quantifiable uncertainty; therefore, the terms total uncertainty and total quantifiable uncertainty are used interchangeably.
