Quantitative comparison of uncertainty methods

This section describes how we compared the methods for estimating uncertainty by using the Corpus of Linguistic Acceptability (CoLA) dataset (Warstadt, Singh, and Bowman 2019). The CoLA dataset is a collection of sentences, drawn from examples in linguistics publications, each labeled with a binary indicator of whether the sentence is acceptable. Sentences can be labeled as unacceptable for a variety of reasons, including improper syntax, semantics, or morphology. There are two validation sets: one drawn from the same sources that were used to build the training set (in domain), and one drawn from sources that are not represented in the training set (out of domain). The following table summarizes the splits.

Dataset                      Total size   Positive   Negative
Training                     8551         6023       2528
Validation (in domain)       527          363        164
Validation (out of domain)   516          354        162
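
As a reference, the following is a minimal sketch of loading the three CoLA splits with pandas, assuming the raw TSV files from the public CoLA release. The file paths and column layout shown here are assumptions and might need to be adjusted to match your local copy of the data.

import pandas as pd

# Assumed column layout of the raw CoLA TSV files (no header row).
COLUMNS = ["source", "label", "original_annotation", "sentence"]

def load_cola_split(path):
    """Read one CoLA TSV split into a DataFrame with named columns."""
    return pd.read_csv(path, sep="\t", header=None, names=COLUMNS)

# Hypothetical paths; adjust to where the CoLA release is stored locally.
train = load_cola_split("cola_public/raw/in_domain_train.tsv")              # 8551 rows
val_in_domain = load_cola_split("cola_public/raw/in_domain_dev.tsv")        # 527 rows
val_out_of_domain = load_cola_split("cola_public/raw/out_of_domain_dev.tsv")  # 516 rows

# The label is binary: 1 = acceptable, 0 = unacceptable.
print(train["label"].value_counts())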

The comparison uses a RoBERTa (Liu et al. 2019) base architecture with pretrained weights and a randomly initialized classification head that has a single hidden layer. The hyperparameters largely follow those suggested in the RoBERTa paper, with a few minor modifications.
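
The following is a minimal sketch of such a model, assuming PyTorch and the Hugging Face transformers library: a pretrained RoBERTa base encoder with a randomly initialized head that has a single hidden layer. The hidden size, dropout rate, and activation shown here are illustrative assumptions, not the exact settings used in the comparison.

import torch.nn as nn
from transformers import RobertaModel, RobertaTokenizer

class RobertaClassifier(nn.Module):
    def __init__(self, hidden_dim=768, dropout=0.1, num_labels=2):
        super().__init__()
        # Encoder initialized from pretrained RoBERTa base weights.
        self.encoder = RobertaModel.from_pretrained("roberta-base")
        # Randomly initialized head with a single hidden layer.
        self.head = nn.Sequential(
            nn.Dropout(dropout),
            nn.Linear(self.encoder.config.hidden_size, hidden_dim),
            nn.Tanh(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, num_labels),
        )

    def forward(self, input_ids, attention_mask):
        outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        # Use the first (<s>) token's representation as the sentence embedding.
        cls_embedding = outputs.last_hidden_state[:, 0, :]
        return self.head(cls_embedding)

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaClassifier()
batch = tokenizer(["The cat sat on the mat."], return_tensors="pt", padding=True)
logits = model(batch["input_ids"], batch["attention_mask"])
print(logits.shape)  # torch.Size([1, 2])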