Deterministic overconfidence
Gal and Ghahramani (2016) warned against interpreting softmax probabilities as confidence scores. They showed empirically that passing a point estimate through the softmax activation function yields large probabilities, whereas passing a distribution of estimates through the softmax yields more reasonable, lower confidence scores. This deterministic overconfidence is part of what motivates learning a predictive distribution instead of a single prediction.
Formally, the deterministic overconfidence conjecture can be stated as the following inequality:

$$\mathbb{H}\left[\mathbb{E}_{p(\theta \mid \mathcal{D})}\big[\mathrm{softmax}\big(f^{\theta}(x)\big)\big]\right] \;\geq\; \mathbb{H}\left[\mathrm{softmax}\big(f^{\hat{\theta}}(x)\big)\right]$$
The operator $\mathbb{H}$ represents Shannon's entropy, which is larger when the elements of the input probability vector are more similar, and is therefore largest for the uniform vector. Thus, the previous inequality states that the uncertainty, in terms of Shannon's entropy, of the expected softmax probability vector from a Bayesian model (the average of a distribution) is larger than or equal to that of the softmax probability vector from a deterministic model (a model that produces a single point estimate). For a proof and demonstration of this inequality, see Appendix A.
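The inequality can be illustrated numerically. Below is a minimal numpy sketch with hypothetical logits for a three-class problem; the Gaussian spread over the logits is a purely illustrative stand-in for a distribution of estimates, and all values are made up for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def entropy(p):
    # Shannon entropy of a probability vector (natural log)
    return -np.sum(p * np.log(p + 1e-12), axis=-1)

# Hypothetical logits for a 3-class problem: one point estimate, plus a
# distribution of estimates around it (illustrative Gaussian spread).
point_logits = np.array([2.0, 0.5, -1.0])
sampled_logits = point_logits + rng.normal(scale=1.5, size=(1000, 3))

# Deterministic model: softmax of the single point estimate.
p_det = softmax(point_logits)

# Bayesian-style prediction: average the softmax over the distribution.
p_bayes = softmax(sampled_logits).mean(axis=0)

print(entropy(p_det), entropy(p_bayes))  # the averaged vector has higher entropy
```

Averaging softmax vectors over many perturbed estimates pulls the result toward the uniform vector, which is why the Bayesian-style prediction carries more entropy.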
Deterministic overconfidence affects the reliability and safety of our deep learning models. Consider the case where a model confidently predicts that an item on an assembly line isn't defective when, in fact, it is, so the item skips the quality review process. This faulty item might then be embedded into a larger product, compromising its integrity. At best, the result is an inefficiency if the defect is caught down the line; at worst, it is a total failure of the product if the defect isn't found. Therefore, it is critical to understand and overcome deterministic overconfidence, both for the success of our projects and for the future of deep learning.
Three ways to improve the quality of uncertainty measurements and overcome overconfidence are:

Calibrating softmax probabilities post hoc with temperature scaling (Guo et al. 2017)

Approximating Bayesian inference by MC dropout (that is, keeping dropout on during inference) (Gal and Ghahramani 2016)

Approximating Bayesian inference with deep ensembles (Lakshminarayanan, Pritzel, and Blundell 2017)
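To make the second approach concrete, here is a minimal numpy sketch of MC dropout on a tiny, hypothetical two-layer classifier. The weights are random stand-ins for a trained model, and the dropout mask is simply resampled on every forward pass at inference time; all sizes and rates are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical "trained" weights for a tiny 4 -> 8 -> 3 classifier
# (random stand-ins; a real model would be trained with dropout).
W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 3)), np.zeros(3)

def forward(x, drop_rate=0.5, mc_dropout=True):
    h = np.maximum(x @ W1 + b1, 0.0)            # ReLU hidden layer
    if mc_dropout:                              # keep dropout ON at inference
        mask = rng.random(h.shape) > drop_rate
        h = h * mask / (1.0 - drop_rate)        # inverted-dropout rescaling
    return softmax(h @ W2 + b2)

x = rng.normal(size=4)

# Deterministic prediction: dropout off, a single pass.
p_det = forward(x, mc_dropout=False)

# MC dropout: average the softmax over T stochastic passes.
T = 200
p_mc = np.mean([forward(x) for _ in range(T)], axis=0)
```

In a framework such as PyTorch or Keras, the same effect is obtained by leaving the dropout layers in training mode at inference time and averaging the softmax outputs over repeated passes; a deep ensemble replaces the T dropout passes with the predictions of independently trained models.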
Deterministic overconfidence is a theory that applies to both in-distribution and out-of-distribution data.^{1} The next sections explain how to split the total quantifiable uncertainty^{2} into its two constituent components: epistemic (model) uncertainty and aleatoric (data) uncertainty (Kendall and Gal 2017).
^{1} In particular, rectified linear unit (ReLU) overconfidence has recently been found to be a significant contributor to overconfidence when data is far away from the decision boundary, especially when data becomes out of distribution (Hein, Andriushchenko, and Bitterwolf 2019). One suggested way to become robust against ReLU overconfidence is to model the information-theoretic notion of aleatoric uncertainty (Gal and Ghahramani 2016; Hein, Andriushchenko, and Bitterwolf 2019; van Amersfoort et al. 2020), which is explained later in this guide.
^{2} Some fields decompose total uncertainty into uncertainty that is quantifiable, and uncertainty that is not quantifiable. The discussion in this guide is limited to quantifiable uncertainty; therefore, the terms total uncertainty and total quantifiable uncertainty are used interchangeably.
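As a preview of that split, one common information-theoretic decomposition takes total uncertainty as the entropy of the mean predicted probabilities, aleatoric uncertainty as the mean of the per-sample entropies, and epistemic uncertainty as their difference (a mutual information). A minimal numpy sketch, with the function name and example probabilities my own:

```python
import numpy as np

def entropy(p, axis=-1):
    # Shannon entropy (natural log); epsilon guards against log(0)
    return -np.sum(p * np.log(p + 1e-12), axis=axis)

def decompose_uncertainty(probs):
    """probs: (T, C) softmax vectors from T stochastic forward passes
    (e.g. MC dropout samples or ensemble members) for a single input."""
    total = entropy(probs.mean(axis=0))   # entropy of the mean prediction
    aleatoric = entropy(probs).mean()     # mean of the per-sample entropies
    epistemic = total - aleatoric         # mutual information, >= 0
    return total, aleatoric, epistemic

# Two confident but disagreeing samples: low aleatoric, high epistemic.
probs = np.array([[0.99, 0.01],
                  [0.01, 0.99]])
total, aleatoric, epistemic = decompose_uncertainty(probs)
```

The example captures the intuition behind the split: each individual prediction is sharp (little data noise), yet the predictions disagree, so most of the total uncertainty is attributed to the model rather than the data.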