My job is to keep up with research on interpretable machine learning.
But I fail.
It’s not me, it’s arxiv. The daily flood of papers is too much.
If interpretability researchers can’t keep up, how can a data scientist or machine learning engineer?
I’ve written dozens of in-depth chapters on ML interpretation techniques and read hundreds of papers. Over time, I have found some useful categories to map the space of interpretation approaches.
You can use these 5 questions to quickly assess ML interpretation approaches:
Point-wise or global interpretation?
Interpretable by design or post-hoc interpretation?
Is the explanation outcome a model?
For which models and data?
What is the output?
Often, you can answer these questions just by reading the abstract of a method's paper or a blog post about it.
Let’s dive in.
#1: Point-wise or global interpretation?
Point-wise (also called local) interpretations deal with individual predictions; global interpretations deal with the overall behavior of the model and are sometimes an average of point-wise explanations.
Examples of point-wise (local) interpretation:
Shapley values for individual predictions (like SHAP).
Grad-CAM and other gradient-based interpretation methods that produce explanations for individual image predictions.
Examples of global interpretation:
SHAP dependence plot: An example of how an aggregate of point-wise explanations (Shapley values) can become a global interpretation.
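To make this concrete, here is a minimal sketch with the shap package. The dataset and model are arbitrary choices for illustration: Shapley values are computed per prediction (point-wise), and the dependence plot aggregates them into a global view.

```python
import shap
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import GradientBoostingRegressor

# Fit any tree-based model on tabular data (arbitrary choices, just for illustration)
X, y = fetch_california_housing(return_X_y=True, as_frame=True)
X, y = X.iloc[:2000], y.iloc[:2000]  # subsample to keep the example fast
model = GradientBoostingRegressor(random_state=0).fit(X, y)

# Point-wise (local): one Shapley value per feature value of a single prediction
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
print(dict(zip(X.columns, shap_values[0])))

# Global: aggregate the point-wise Shapley values over the whole dataset,
# e.g. as a SHAP dependence plot for one feature
shap.dependence_plot("MedInc", shap_values, X)
```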
Point-wise versus global is not a binary classification. The interpretation can also be applicable for subsets of the data:
Decision tree paths express predictions for a subset of the data.
LIME creates a local model around a data point. In theory, it explains the neighborhood of the point as well, but the explanation quality degrades with distance, so it is safest to use LIME as a point-wise interpretation.
Linear regression coefficients are both point-wise and global, because the same coefficients describe the overall model and decompose each individual prediction.
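A quick sketch of this dual role, on toy data:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# Toy data, just to illustrate the dual role of the coefficients
X, y = make_regression(n_samples=200, n_features=3, noise=0.1, random_state=0)
lr = LinearRegression().fit(X, y)

# Global: one coefficient per feature describes the entire model
print(lr.coef_)

# Point-wise: the same coefficients decompose any single prediction
x = X[0]
contributions = lr.coef_ * x                 # per-feature contribution for this data point
print(lr.intercept_ + contributions.sum())   # equals lr.predict(X[:1])[0]
```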
As you can see with linear regression coefficients, the questions don’t always lead to a perfect categorization. And that’s just fine. Don’t try too hard to squeeze an interpretation method into the categories. If a method doesn’t fit into one of the categories, that’s already valuable information.
#2: Interpretable by design or post-hoc interpretation?
A model can be interpretable by design, or we apply a method to make it interpretable post-hoc, after training.
You can answer this question easily by this thought experiment: Someone else trains the model and gives it to you. Can you already interpret it? Or do you have to apply additional computations for the interpretation?
If you have to do additional computations, it's usually a post-hoc interpretation method, for example permutation feature importance, partial dependence plots, or SHAP; a small code sketch contrasting the two cases follows below.
If you don’t have to do additional computations, then the model is interpretable by design, for example:
Most decision tree and decision rule models
ProtoPNet, a neural network that dissects an image and uses prototypes for classification
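To make the contrast concrete, here is a rough sketch in scikit-learn (my choice of models, and of permutation importance as the post-hoc step; other combinations work just as well): the random forest needs an extra computation, the shallow tree can simply be printed.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True, as_frame=True)

# Post-hoc: the random forest needs an extra computation step (here: permutation importance)
forest = RandomForestClassifier(random_state=0).fit(X, y)
result = permutation_importance(forest, X, y, n_repeats=10, random_state=0)
print(dict(zip(X.columns, result.importances_mean)))

# By design: a shallow decision tree can be read directly after training
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))
```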
The question is not perfect, because for many models that are interpretable by design, you have to do some computations to get the interpretation out of the model. For GAMs, for example, to get the effect curves for non-linear features, you have to evaluate and plot the fitted splines. But this often happens automatically within the GAM software. So, again, take the question with a grain of salt.
Do you want to learn more about ML interpretation methods? Check out my book Interpretable Machine Learning. It’s the best deep dive into interpretability you can get — in my biased opinion.
#3: Is the explanation outcome a model?
This question only makes sense for post-hoc methods (see #2).
Many explanation methods for black-box models produce models that are interpretable by design. The idea is: The original model is too complex, let’s interpret a simpler model that has similar predictions.
Examples of interpretation methods that produce models:
Partial dependence plots are prediction functions: a 1D partial dependence plot marginalizes over all other features and is a prediction function of just one feature.
LIME fits a weighted linear regression model (or similar) around the data point of interest.
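A small sketch of what this looks like with the lime package (dataset and model are arbitrary choices; the local surrogate's weights are what LIME returns as the explanation):

```python
from lime.lime_tabular import LimeTabularExplainer
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

X, y = load_diabetes(return_X_y=True, as_frame=True)
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# LIME fits a weighted linear model in the neighborhood of one data point
explainer = LimeTabularExplainer(
    X.values, feature_names=list(X.columns), mode="regression"
)
explanation = explainer.explain_instance(X.values[0], model.predict, num_features=5)
print(explanation.as_list())  # the (feature, weight) pairs of the local surrogate model
```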
If the outcome is a model, we can predict with it, and we can use model fidelity to measure how close its predictions are to those of the original model.
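For example, with a global surrogate model (a sketch under arbitrary dataset and model assumptions; R² is just one possible fidelity measure):

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True, as_frame=True)
black_box = GradientBoostingRegressor(random_state=0).fit(X, y)

# Global surrogate: a small tree trained to mimic the black-box predictions
surrogate = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, black_box.predict(X))

# Model fidelity: how closely does the surrogate reproduce the black-box predictions?
fidelity = r2_score(black_box.predict(X), surrogate.predict(X))
print(f"Surrogate fidelity (R^2 against the black box): {fidelity:.2f}")
```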
Most interpretation methods don't produce models. Some methods in this diverse group: permutation feature importance, Shapley values, Grad-CAM, and counterfactual explanations.
Thinking that all explanation methods produce models is a common misconception about interpretable machine learning, and question #3 helps avoid this mistake.
#4: For which models and data?
Interpretation approaches have limitations on which model classes and data types they can be used with. Model-agnostic methods are the most widely applicable because they are not restricted in terms of model class. Examples of model-agnostic methods and their corresponding data types:
Accumulated Local Effect Plots for tabular data.
H-statistic for feature interaction for tabular data.
Model-agnostic methods are hugely popular because they don’t lock you in with your current model. LIME and SHAP are extra popular because they are not only model-agnostic, but can handle many different types of model input data.
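What makes a method model-agnostic is that it only talks to the model through its predictions. A hand-rolled partial dependence sketch makes this visible; the function, dataset, and model here are my own illustrative choices:

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

def partial_dependence_1d(model, X, feature, grid):
    """Model-agnostic 1D partial dependence: only model.predict is ever called."""
    averages = []
    for value in grid:
        X_mod = X.copy()
        X_mod[feature] = value                        # fix the feature at one grid value for all rows
        averages.append(model.predict(X_mod).mean())  # average the resulting predictions
    return np.array(averages)

X, y = load_diabetes(return_X_y=True, as_frame=True)
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
grid = np.linspace(X["bmi"].min(), X["bmi"].max(), 20)
print(partial_dependence_1d(model, X, "bmi", grid))
```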
Model-specific methods have the advantage that they are usually faster to compute because they can rely on information internal to the model, such as gradients. Some examples:
Grad-CAM: Neural networks with convolutional layers and image data.
Gini Importance: Tree-based models and tabular data.
Monotonicity constraints: Boosting models and tabular data.
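Gini importance, for example, comes for free with a fitted tree ensemble in scikit-learn because it is read from the trees' internal split statistics (a minimal sketch, dataset chosen arbitrarily):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Model-specific: Gini importance is read from the internal splits of the trees,
# so nothing extra is computed, but it only exists for tree-based models
gini_importance = dict(zip(X.columns, forest.feature_importances_))
print(sorted(gini_importance.items(), key=lambda kv: kv[1], reverse=True)[:5])
```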
For models that are interpretable by design (#2), the interpretation is coupled with the model, so the compatible data types are determined by what the model itself can handle.
#5: What is the output?
Interpretation methods can have different types of outputs:
Feature effect. Explains how changing a feature, on average, changes the output. Examples: PDP and ALE, but also the coefficients of a linear regression model or the splits in a decision tree.
Feature importance. The ranking of how relevant a feature is for the model outcome. Examples: Permutation feature importance and SHAP importance.
Attribution. Many point-wise interpretation methods attribute an individual prediction to the feature values. Examples: Shapley values or Grad-CAM. Attributions are somewhere between importance and effect, as taking the absolute value can turn an attribution into an importance ranking.
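A tiny sketch of that attribution-to-importance step, with made-up numbers:

```python
import numpy as np

# Hypothetical attributions of shape (n_samples, n_features),
# e.g. Shapley values, one row per explained prediction
attributions = np.array([[ 0.8, -1.2,  0.1],
                         [ 0.5, -0.3,  0.9]])

# Taking absolute values and averaging turns attributions into a global importance ranking
importance = np.abs(attributions).mean(axis=0)
ranking = np.argsort(importance)[::-1]  # feature indices, most to least important
print(importance, ranking)
```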
Again, not all methods fall into these neat categories. For example, counterfactual explanations are closest to attributions, but don’t really fit into any of these categories.
Test Your New Knowledge
Do you want to try out this categorization? Try these two things:
Classify your favorite interpretation method with these 5 questions; a small example of what this could look like follows below.
The ultimate test: Pick a new method from the latest publications on arxiv. You might be able to answer all questions just by reading the abstract of a paper.
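Here is what the first exercise could look like for SHAP, as a small sketch (the dictionary keys are just my shorthand for the 5 questions, and the answers are my own reading):

```python
# Purely illustrative: how I would answer the 5 questions for SHAP
shap_profile = {
    "point_wise_or_global": "point-wise, but aggregable into global summaries",
    "by_design_or_post_hoc": "post-hoc",
    "explanation_is_a_model": False,  # Shapley values are attributions, not a model
    "models_and_data": "model-agnostic; tabular data, with variants for text and images",
    "output": "attribution (one Shapley value per feature value)",
}
print(shap_profile)
```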
For a deep dive into machine learning interpretability, consider buying my book Interpretable Machine Learning. Or read it for free first.
The flooding effect within machine learning research really does make it hard to sort out the signal from the noise. I read a lot of papers that get more attention in pre-print form than from a journal. On the other hand, a lot of pre-print papers end up being a decent read but don't advance the field or make a solid scholarly contribution.