Document coverage and accuracy – out of domain

We also examined out-of-domain data, which was taken from syntax textbooks that weren't used to source the training data. However, we didn't observe a noticeable difference in relative performance. This is likely because the quantitative distribution of linguistic features differs very little between sources, even though the sentences come from different textbooks.

The following chart provides a comparison of the most frequent linguistic features across the two data sources. It shows very little difference between the distributions of the in-domain and out-of-domain datasets. Furthermore, with respect to vocabulary, the model had at least some exposure to out-of-domain language during training on in-domain examples: all words found in the out-of-domain set had a frequency count of at least 100 over the entire training set (Warstadt, Singh, and Bowman 2019). Thus, the out-of-domain data wasn't considered truly out of distribution. For more information about the linguistic features, see Warstadt, Singh, and Bowman (2019).
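The following sketch illustrates how such a vocabulary-overlap check could be reproduced. It isn't the guide's implementation; the function names, whitespace tokenization, and the threshold of 100 as a parameter are assumptions for illustration.

```python
# Minimal sketch: count word frequencies over the training sentences and
# verify that every out-of-domain word appears at least min_count times.
from collections import Counter


def train_word_frequencies(train_sentences):
    """Count lowercase word occurrences across the training set."""
    counts = Counter()
    for sentence in train_sentences:
        counts.update(sentence.lower().split())
    return counts


def below_threshold_words(train_sentences, ood_sentences, min_count=100):
    """Return out-of-domain words that fall below the training-frequency threshold."""
    counts = train_word_frequencies(train_sentences)
    ood_vocab = {word for s in ood_sentences for word in s.lower().split()}
    return sorted(word for word in ood_vocab if counts[word] < min_count)


# Hypothetical usage: an empty result indicates that the out-of-domain set
# isn't truly out of distribution with respect to vocabulary.
# rare = below_threshold_words(in_domain_sentences, out_of_domain_sentences)
```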

Comparison of the most frequent linguistic features across the two data sources

To gain a better sense of the deep ensemble’s and Monte Carlo dropout’s ability to estimate uncertainty in a truly out-of-distribution setting, we created three adversarial datasets that included random words injected into the sentences:

  • One dataset with rare words not found in the training dataset

  • One dataset with non-English words not found in the training dataset

  • One dataset with a mixture of the previous two datasets

All of the injected words were present in the original vocabulary used for pretraining the model.
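The following sketch shows one way such adversarial sentences could be generated. The function names, the single-word injection, and the 55 percent alteration probability (chosen to match the fraction of the validation set that was altered) are assumptions for illustration, not the exact procedure used in this guide.

```python
# Hedged sketch: with a given probability, insert one word drawn from a
# candidate pool (rare words, non-English words, or a mixture) at a random
# position in the sentence; otherwise leave the sentence unchanged.
import random


def maybe_inject(sentence, candidate_words, prob=0.55, rng=random):
    """Alter a sentence with probability `prob` by injecting a random word."""
    if rng.random() >= prob:
        return sentence
    tokens = sentence.split()
    position = rng.randrange(len(tokens) + 1)
    tokens.insert(position, rng.choice(candidate_words))
    return " ".join(tokens)


# Hypothetical usage: three adversarial datasets from different candidate pools,
# all drawn from the pretraining vocabulary so no injected word is unknown
# to the tokenizer.
# rare_set    = [maybe_inject(s, rare_words) for s in validation_sentences]
# foreign_set = [maybe_inject(s, non_english_words) for s in validation_sentences]
# mixed_set   = [maybe_inject(s, rare_words + non_english_words) for s in validation_sentences]
```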

The following graph shows the correlation between accuracy and coverage for the third dataset. The first and second datasets show similar patterns.

Correlation between accuracy and coverage for adversarial datasets

The graph shows a clear benefit from using either MC dropout or deep ensembles with MC dropout for coverage levels above 40%. We suspect that these two methods show similar performance because the pretrained model doesn't introduce much diversification across ensemble members; this opens the way for further investigation. The significant performance degradation for the naïve softmax method above 40% document coverage is likely because our adversarial data generation process altered approximately 55% of the validation set. In the low-coverage region, the methods have similar accuracy values because the retained examples aren't out of distribution.
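To make the coverage metric concrete, the following sketch shows how an accuracy-versus-coverage curve can be computed: at each coverage level, only the most confident predictions are retained and accuracy is measured on that subset. The function name and the NumPy-array inputs (per-example confidence scores and 0/1 correctness flags) are assumptions for illustration.

```python
# Minimal sketch: sort predictions by confidence, then compute accuracy on the
# top fraction of examples retained at each coverage level.
import numpy as np


def accuracy_coverage_curve(confidences, correct, coverages=np.linspace(0.1, 1.0, 10)):
    """Return (coverage, accuracy) pairs for the given confidence scores."""
    order = np.argsort(-np.asarray(confidences))   # most confident first
    correct_sorted = np.asarray(correct)[order]
    points = []
    for c in coverages:
        k = max(1, int(round(c * len(correct_sorted))))
        points.append((c, float(correct_sorted[:k].mean())))
    return points


# The confidence score depends on the method: the maximum softmax probability
# for the naïve baseline, or the mean predicted probability across MC dropout
# samples or ensemble members for the other methods.
```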