Local interpretability
The most popular methods for local interpretability of complex models are based on either Shapley Additive Explanations (SHAP) [8] or integrated gradients [11]. Each method has a number of variants specific to particular model types.
For tree ensemble models, use tree SHAP
In the case of tree-based models, dynamic programming allows for fast and exact computation of the Shapley values.
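To make concrete what tree SHAP computes, the following sketch evaluates exact Shapley values by brute force over feature subsets for a toy model with an interaction term. This enumeration is exponential in the number of features; the point of tree SHAP is that, for tree ensembles, dynamic programming over the tree paths produces the same exact values in polynomial time (in practice you would call `shap.TreeExplainer` rather than write this yourself).

```python
from itertools import combinations
from math import factorial

def exact_shapley(f, x, baseline):
    """Brute-force Shapley values for f at x against a baseline.

    Enumerates all feature subsets, so it is only feasible for a
    handful of features; tree SHAP obtains the same answer
    efficiently for tree ensembles.
    """
    n = len(x)
    phi = [0.0] * n

    def eval_subset(subset):
        # Features in the subset take their values from x;
        # all others are filled in from the baseline.
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return f(z)

    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            for s in combinations(others, size):
                weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                phi[i] += weight * (eval_subset(set(s) | {i}) - eval_subset(set(s)))
    return phi

# Toy model with an interaction between features 0 and 1.
def model(z):
    return 2.0 * z[0] + z[0] * z[1] + 0.5 * z[2]

x = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]
phi = exact_shapley(model, x, baseline)

# Efficiency property: attributions sum to f(x) - f(baseline).
assert abs(sum(phi) - (model(x) - model(baseline))) < 1e-9
```

Note how the interaction term `z[0] * z[1]` is split evenly between the two participating features, which is exactly the averaging-over-coalitions behavior that can become misleading under strong feature interactions.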
For neural networks and differentiable models, use integrated gradients and conductance
Integrated gradients provide a straightforward way to compute feature attributions in neural networks. Conductance builds on integrated gradients to help you interpret attributions from portions of neural networks such as layers and individual neurons. (See [3, 11]; an implementation is available at https://captum.ai/.)
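The following minimal sketch shows the integrated gradients computation itself, using NumPy and a hand-written gradient for a toy differentiable model; in practice a framework such as Captum supplies the gradients through autograd. The key property to check is completeness: the attributions sum to the difference between the model output at the input and at the baseline.

```python
import numpy as np

def integrated_gradients(grad_f, x, baseline, steps=200):
    """Approximate integrated gradients with a midpoint Riemann sum.

    grad_f must return the gradient of the scalar model output with
    respect to its inputs; frameworks such as Captum do this step
    with automatic differentiation.
    """
    x, baseline = np.asarray(x, float), np.asarray(baseline, float)
    alphas = (np.arange(steps) + 0.5) / steps  # midpoints in (0, 1)
    total = np.zeros_like(x)
    for a in alphas:
        total += grad_f(baseline + a * (x - baseline))
    return (x - baseline) * total / steps

# Toy differentiable model f(x) = x0 * x1 + x2**2 with its gradient.
def f(x):
    return x[0] * x[1] + x[2] ** 2

def grad_f(x):
    return np.array([x[1], x[0], 2.0 * x[2]])

x = np.array([1.0, 2.0, 3.0])
baseline = np.zeros(3)
attr = integrated_gradients(grad_f, x, baseline)

# Completeness: attributions sum to f(x) - f(baseline).
assert abs(attr.sum() - (f(x) - f(baseline))) < 1e-6
```

The sensitivity to the base point mentioned later in this section is visible here: changing `baseline` changes both the integration path and the `(x - baseline)` scaling, and thus the attributions.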
For all other cases, use Kernel SHAP
You can use Kernel SHAP to compute feature attributions for any model, but it is an approximation to the full Shapley values and remains computationally expensive (see [8]). The computational resources required for Kernel SHAP grow quickly with the number of features, which forces the use of approximation methods that can reduce the fidelity, repeatability, and robustness of explanations. Amazon SageMaker Clarify provides convenience methods that deploy prebuilt containers for computing Kernel SHAP values in separate instances. (For an example, see the GitHub repository Fairness and Explainability with SageMaker Clarify.)
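To illustrate what Kernel SHAP solves under the hood, the sketch below enumerates every coalition for a small feature count and fits the Shapley-kernel-weighted linear regression; the efficiency constraint is enforced by eliminating the last attribution. Real implementations such as `shap.KernelExplainer` sample coalitions instead of enumerating them, which is where the approximation error and the cost growth with feature count come from. For a linear model the method recovers the exact attributions, which is the check used here.

```python
import numpy as np
from itertools import product
from math import comb

def kernel_shap(f, x, baseline):
    """Kernel SHAP for a small number of features, enumerating every
    coalition instead of sampling (sampling is what makes the method
    approximate and expensive as the feature count grows)."""
    M = len(x)
    x, baseline = np.asarray(x, float), np.asarray(baseline, float)
    delta = f(x) - f(baseline)

    rows, targets, weights = [], [], []
    for z in product([0, 1], repeat=M):
        s = sum(z)
        if s == 0 or s == M:
            continue  # handled exactly by the efficiency constraint
        masked = np.where(np.array(z) == 1, x, baseline)
        # Shapley kernel weight for a coalition of size s.
        w = (M - 1) / (comb(M, s) * s * (M - s))
        # Enforce sum(phi) = f(x) - f(baseline) by eliminating phi_{M-1}.
        rows.append([z[i] - z[M - 1] for i in range(M - 1)])
        targets.append(f(masked) - f(baseline) - z[M - 1] * delta)
        weights.append(w)

    A = np.asarray(rows, float)
    y = np.asarray(targets, float)
    W = np.diag(weights)
    phi_head = np.linalg.solve(A.T @ W @ A, A.T @ W @ y)
    return np.append(phi_head, delta - phi_head.sum())

# For a linear model, Kernel SHAP recovers the exact attributions
# w_i * (x_i - baseline_i).
w = np.array([2.0, -1.0, 0.5])
f = lambda v: float(w @ v)
phi = kernel_shap(f, x=np.array([1.0, 1.0, 2.0]), baseline=np.zeros(3))
assert np.allclose(phi, [2.0, -1.0, 1.0])
```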
For single tree models, the split variables and leaf values provide an immediately explainable model, and the methods discussed previously do not provide additional insight. Similarly, for linear models, the coefficients provide a clear explanation of model behavior. (SHAP and integrated gradient methods both return contributions that are determined by the coefficients.)
Both SHAP and integrated gradient-based methods have weaknesses. SHAP requires attributions to be derived from a weighted average of all feature combinations. Attributions obtained in this way can be misleading when estimating feature importance if there is a strong interaction between features. Methods that are based on integrated gradients can be difficult to interpret because of the large number of dimensions that are present in large neural networks, and these methods are sensitive to the choice of a base point. More generally, models can use features in unexpected ways to achieve a certain level of performance, and these behaviors vary from model to model; feature importance is always model dependent.
Recommended visualizations
The following chart presents several recommended ways to visualize the local interpretations that were discussed in the previous sections. For tabular data, we recommend a simple bar graph of the attributions so that they can be easily compared and used to infer how the model makes predictions.
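A minimal sketch of such a bar graph with Matplotlib follows; the feature names and attribution values are hypothetical, and a horizontal layout with a zero line makes positive and negative contributions easy to compare.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for script use
import matplotlib.pyplot as plt

# Hypothetical per-feature attributions for one tabular prediction.
features = ["age", "income", "tenure", "balance"]
attributions = [0.42, -0.17, 0.08, -0.31]

fig, ax = plt.subplots()
bars = ax.barh(features, attributions)
ax.axvline(0, color="black", linewidth=0.8)  # separate +/- contributions
ax.set_xlabel("Attribution")
ax.set_title("Feature attributions for a single prediction")
fig.savefig("attributions.png")
```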
For text data, embedding tokens leads to a large number of scalar inputs. The methods recommended in the previous sections produce an attribution for each dimension of the embedding and for each output. In order to distill this information into a visualization, the attributions for a given token can be summed. The following example shows the sum of the attributions for the BERT-based question answering model that was trained on the SQuAD dataset. In this case, the predicted and true label is the token for the word “france.”
Otherwise, the vector norm of the token attributions can be assigned as a total attribution value, as shown in the following example.
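Both aggregations reduce a per-dimension attribution matrix to one scalar per token. The sketch below shows the two reductions on a hypothetical attribution array of shape `(num_tokens, embedding_dim)`, as would be returned by the methods above for a text input; the sum preserves sign, while the norm is always non-negative.

```python
import numpy as np

# Hypothetical attributions for a 4-token input: one scalar per token
# per embedding dimension, shape (num_tokens, embedding_dim).
rng = np.random.default_rng(0)
token_attributions = rng.normal(size=(4, 8))

# Option 1: sum the attributions across the embedding dimensions.
# Keeps the sign, so tokens can contribute for or against the output.
summed = token_attributions.sum(axis=1)

# Option 2: assign the vector norm as the total attribution.
# Always non-negative; measures magnitude of influence only.
norms = np.linalg.norm(token_attributions, axis=1)

assert summed.shape == (4,) and norms.shape == (4,)
```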
For intermediate layers in deep learning models, similar aggregations can be applied to conductances for visualization, as shown in the following example. This vector norm of the token conductance for transformer layers shows the eventual activation for the end token prediction (“france”).
Concept activation vectors provide a method for studying deep neural networks in more detail [6]. This method extracts features from a layer in an already trained network and trains a linear classifier on those features to make inferences about the information in the layer. For example, you might want to determine which layer of a BERT-based language model contains the most information about the parts of speech. In this case, you could train a linear part-of-speech model on each layer output and make a rough estimate that the best performing classifier is associated with the layer that has the most part-of-speech information. Although we do not recommend this as a primary method for interpreting neural networks, it can be an option for more detailed study and aid in the design of network architecture.
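The probing recipe described above can be sketched with synthetic data: extract features from each layer, fit a linear classifier per layer, and compare held-out accuracy. In the sketch below both the layer activations and the labels are fabricated (one layer has the concept injected into a single direction, the other is pure noise), and a least-squares probe stands in for the linear classifier; a logistic regression or the CAV machinery of [6] would normally be used.

```python
import numpy as np

rng = np.random.default_rng(7)
n, d = 400, 16
y = rng.integers(0, 2, size=n)  # hypothetical binary concept labels

# Hypothetical activations from two layers of a trained network:
# layer_a is pure noise, layer_b encodes the label plus noise.
layer_a = rng.normal(size=(n, d))
layer_b = rng.normal(size=(n, d))
layer_b[:, 0] += 3.0 * (2 * y - 1)  # inject the concept into one direction

def probe_accuracy(features, labels, split=300):
    """Fit a least-squares linear probe on a train split and report
    held-out accuracy (a stand-in for a proper linear classifier)."""
    X = np.hstack([features, np.ones((len(features), 1))])  # add bias
    w, *_ = np.linalg.lstsq(X[:split], labels[:split], rcond=None)
    preds = (X[split:] @ w) > 0.5
    return float((preds == labels[split:]).mean())

acc_a = probe_accuracy(layer_a, y)
acc_b = probe_accuracy(layer_b, y)
# The layer that encodes the concept supports a much better probe,
# which is the signal used to locate where the information lives.
assert acc_b > acc_a
```

As the passage notes, this comparison is only a rough estimate: probe accuracy reflects how linearly accessible the information is in a layer, not how much information the layer contains.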