LIME was my first step into the world of interpretable machine learning.
At the time (2019), there weren't many methods to explain the predictions of machine learning models. Local interpretable model-agnostic explanations (LIME) was a wonderful proposal.
What I love most about LIME is the intuition behind it:
We have a complex machine learning model with some arbitrary prediction function. Not really interpretable. But what if we zoom into one prediction for one input data point? Maybe then the prediction function looks simple enough and we can approximate it with a simpler model. Like a linear regression model.
And that’s the idea behind LIME: a local surrogate model.
But there’s one problem for which I haven’t seen a good solution yet: How close should we “zoom in”? Or in other words: What does local mean?
But first, let’s see how LIME works.
How LIME works
The goal of LIME is to fit a local surrogate model. These are the steps to get there:
1. Select your instance of interest for which you want to have an explanation of its model prediction.
2. Perturb your dataset and get the model predictions for these new points.
3. Weight the new samples according to their proximity to the instance of interest.
4. Train a weighted, interpretable model on the dataset with the variations.
5. Explain the prediction by interpreting the local model.
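To make these steps concrete, here is a minimal sketch of a local surrogate with numpy and scikit-learn. This is not the LIME package itself: the Gaussian perturbation, the Ridge surrogate, and the names predict_fn, x_interest, and X_train are stand-ins I chose for illustration.

import numpy as np
from sklearn.linear_model import Ridge

def local_surrogate(predict_fn, x_interest, X_train, kernel_width, n_samples=5000):
    """Sketch of a local surrogate model around x_interest."""
    rng = np.random.default_rng(0)
    # Step 1 happens outside this function: the caller picks x_interest
    # Step 2: perturb around the instance (scaled by the training data spread)
    #         and get the black-box predictions for the new points
    Z = x_interest + rng.normal(scale=X_train.std(axis=0),
                                size=(n_samples, X_train.shape[1]))
    y_hat = predict_fn(Z)
    # Step 3: weight the samples by proximity with an exponential kernel
    d = np.sqrt(((Z - x_interest) ** 2).sum(axis=1))  # Euclidean distance
    weights = np.exp(-d ** 2 / kernel_width ** 2)
    # Step 4: train a weighted, interpretable (here: linear) model
    surrogate = Ridge(alpha=1.0).fit(Z, y_hat, sample_weight=weights)
    # Step 5: interpret the local model -- its coefficients are the explanation
    return surrogate.coef_

The kernel_width argument in this sketch is exactly the knob the rest of this post is about.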
The tricky part occurs in "Weight the new samples." Weight them how? LIME uses a so-called exponential kernel function that takes two data points (here x1 and x2) and computes a weight from the distance between them. The kernel has one hyperparameter: the kernel width. The exponential kernel is defined like this:

$$k(x_1, x_2) = \exp\left(-\frac{d(x_1, x_2)^2}{\sigma^2}\right)$$

where d is a distance measure (by default the Euclidean distance for tabular data) and σ is the kernel width.
There’s nothing unusual about this choice, and the exponential kernel can also be found elsewhere in machine learning. However, this innocent-looking kernel function is problematic when used in the context of explaining predictions. And that has to do with actually defining what a local explanation is.
What does “local” mean?
Let’s talk about two of the main problems with defining locality:
Problem #1: The kernel treats distances in all feature directions the same. In part, that is of course solvable by standardizing all features to a similar scale. But the problem remains that some of the features might be completely irrelevant to your problem yet still get a large “say” in the distance computation. That’s a typical problem in unsupervised learning as well. Also, the curse of dimensionality strikes: the volume of the space grows exponentially with the number of features, and the data becomes increasingly sparse, making it difficult to compute meaningful distances.
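To see problem #1 in numbers, here is a toy example of my own (not from the LIME package): point b is close to the reference point a and point c is far away, but after appending 50 irrelevant noise features the distances barely distinguish them anymore.

import numpy as np

rng = np.random.default_rng(1)

# One relevant feature: b is close to a, c is far from a
a, b, c = np.array([0.0]), np.array([0.1]), np.array([3.0])
print(np.linalg.norm(a - b), np.linalg.norm(a - c))  # 0.1 vs 3.0

# Append 50 irrelevant noise features to each point
a2, b2, c2 = (np.concatenate([p, rng.normal(size=50)]) for p in (a, b, c))
print(np.linalg.norm(a2 - b2), np.linalg.norm(a2 - c2))  # both around 10: the noise dominates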
But what I find really problematic is this:
Problem #2: The σ. The σ is the kernel width and controls the smoothness of the kernel. Small values of σ result in a more peaked kernel so that only data points very close to the data point to be explained (x1) get a large weight. Large values of σ spread out the weights more. The kernel width, therefore, controls the locality of the explanations.
How do you pick the right kernel width? Let’s consider the extreme options first: Very large σ and very small σ.
If we pick a very small σ, then only very close points get a large weight. This makes LIME more of a gradient method when we use linear regression as the surrogate model.
And if we make σ very large, all points get a similar weight, so we go toward a global surrogate model. It’s like giving weight 1 to every perturbed sample.
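Here is a quick numeric illustration of the two extremes, using the exponential kernel from above and some distances I made up:

import numpy as np

def kernel_weights(d, sigma):
    return np.exp(-d ** 2 / sigma ** 2)

distances = np.array([0.1, 0.5, 1.0, 2.0, 5.0])  # distances to the instance of interest

print(kernel_weights(distances, sigma=0.1))  # ~[0.37, 0.00, 0.00, 0.00, 0.00]: only the closest point counts
print(kernel_weights(distances, sigma=100))  # ~[1.00, 1.00, 1.00, 1.00, 1.00]: effectively a global surrogate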
Have a look at the following figure, which shows the predicted value over a single feature x1, together with local surrogate models for different kernel widths:
Alright, so that doesn’t look good. And in this example, we can at least see what’s happening and would maybe have a preference for kernel width 0.25 or so. But good luck when adding more features, some categorical, some with very different scales, and so on. Then we can’t just “have a look” at the prediction function and guess the appropriate kernel width.
So, when using the LIME Python package, how is the kernel width question solved? Let’s have a look at the Python code:
# Default in the LIME package (lime.lime_tabular.LimeTabularExplainer)
if kernel_width is None:
    kernel_width = np.sqrt(training_data.shape[1]) * .75
So if the user does not specify a width, it’s automatically set to the square root of the number of features times a magical 0.75. Where does the 0.75 come from? Unclear. At least to me. And the kernel width is highly application- and data-dependent, so it’s hard to see how a single default could be right. If you do nothing, you get this default choice. And if you pick the width on your own, there’s no clear advice on how to do it.
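If you do want to pick the width yourself, you can pass it to the explainer. Here is a rough sketch with the LIME package on toy data (the random forest, the toy features, and the width of 0.5 are arbitrary choices of mine; check the package documentation for the exact arguments and defaults):

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from lime.lime_tabular import LimeTabularExplainer

# Toy data and a black-box model, stand-ins for your own data and model
rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 4))
y_train = X_train[:, 0] ** 2 + X_train[:, 1] + rng.normal(scale=0.1, size=500)
model = RandomForestRegressor(random_state=0).fit(X_train, y_train)

explainer = LimeTabularExplainer(
    X_train,
    mode="regression",
    feature_names=["x1", "x2", "x3", "x4"],
    kernel_width=0.5,  # your own choice instead of the default 0.75 * sqrt(n_features)
)
explanation = explainer.explain_instance(X_train[0], model.predict, num_features=4)
print(explanation.as_list())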
But the kernel width is essential for LIME explanations. As you can see in the example above, you can arrive at opposite explanations, which are both “correct” but have to be interpreted in different locality contexts. The kernel defines how large the neighborhood is (in a smooth sense).
As long as you don’t know how to pick a good kernel width, I wouldn’t use LIME. At least not for tabular data.
Others (also this paper) have argued that we just have to make the kernel width large enough, because especially for small values we are in a “small bandwidth regime” where the LIME coefficients can even become all zero. But I disagree, since it’s still unclear what the right value should be, and having large values leads to a more global surrogate model, defeating the purpose of LIME.
Where does that leave us? There are plenty of other options that don’t require specifying a neighborhood, like Shapley values, counterfactual explanations, or what-if analysis. If you are interested in a book about Shapley values: I’m writing one! Get notified when it’s available.
Further references
You can find many interpretation methods in my book Interpretable Machine Learning
Learn more about LIME in a free chapter of the Interpretable Machine Learning book
Read more about the neighborhood problem by Philipp Kopper