LIME was my first step into the world of interpretable machine learning.
At the time (2019), there weren't many methods to explain the predictions of machine learning models. Local interpretable model-agnostic explanations (LIME) was a wonderful proposal.
What I love most about LIME is the intuition behind it:
We have a complex machine learning model with some arbitrary prediction function. Not really interpretable. But what if we zoom into one prediction for one input data point? Maybe then the prediction function looks simple enough and we can approximate it with a simpler model. Like a linear regression model.
And that’s the idea behind LIME: a local surrogate model.
But there’s one problem for which I haven’t seen a good solution yet: How close should we “zoom in”? Or in other words: What does local mean?
But first, let’s see how LIME works.
How LIME works
The goal of LIME is to fit a local surrogate model. These are the steps to get there:
1. Select your instance of interest for which you want to have an explanation of its model prediction.
2. Perturb your dataset and get the model predictions for these new points.
3. Weight the new samples according to their proximity to the instance of interest.
4. Train a weighted, interpretable model on the dataset with the variations.
5. Explain the prediction by interpreting the local model.
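To make these steps concrete, here is a minimal sketch of a local surrogate with numpy and scikit-learn. This is not the LIME package itself: the Gaussian perturbation, the Ridge surrogate, and the names predict_fn, x_interest, and X_train are stand-ins I chose for illustration.

import numpy as np
from sklearn.linear_model import Ridge

def local_surrogate(predict_fn, x_interest, X_train, kernel_width, n_samples=5000):
    """Sketch of a local surrogate model around x_interest."""
    rng = np.random.default_rng(0)
    # Step 1 happens outside this function: the caller picks x_interest
    # Step 2: perturb around the instance (scaled by the training data spread)
    #         and get the black-box predictions for the new points
    Z = x_interest + rng.normal(scale=X_train.std(axis=0),
                                size=(n_samples, X_train.shape[1]))
    y_hat = predict_fn(Z)
    # Step 3: weight the samples by proximity with an exponential kernel
    d = np.sqrt(((Z - x_interest) ** 2).sum(axis=1))  # Euclidean distance
    weights = np.exp(-d ** 2 / kernel_width ** 2)
    # Step 4: train a weighted, interpretable (here: linear) model
    surrogate = Ridge(alpha=1.0).fit(Z, y_hat, sample_weight=weights)
    # Step 5: interpret the local model -- its coefficients are the explanation
    return surrogate.coef_

The kernel_width argument in this sketch is exactly the knob the rest of this post is about.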
The tricky part occurs in "Weight the new samples." Weight them how? LIME uses a so-called exponential kernel function that takes two data points (here x1 and x2) and computes a weight from the distance between them. The kernel has one hyperparameter: the kernel width. The exponential kernel is defined like this:

$$k(x_1, x_2) = \exp\left(-\frac{d(x_1, x_2)^2}{\sigma^2}\right)$$

where d is a distance measure (by default the Euclidean distance for tabular data) and σ is the kernel width.
There’s nothing unusual about this choice, and the exponential kernel can also be found elsewhere in machine learning. However, this innocent-looking kernel function is problematic when used in the context of explaining predictions. And that has to do with actually defining what a local explanation is.
What does “local” mean?
Let’s talk about two of the main problems with defining locality:
Problem #1: The kernel treats distances in all feature directions the same. In part, that is of course solvable by standardizing all features to a similar scale. But the problem remains that some of the features might be completely irrelevant to your problem yet still get a large “say” in the distance computation. That’s a typical problem in unsupervised learning as well. Also, the curse of dimensionality strikes: the volume of the space grows exponentially with the number of features, and the data becomes increasingly sparse, making it difficult to compute meaningful distances.
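To see problem #1 in numbers, here is a toy example of my own (not from the LIME package): point b is close to the reference point a and point c is far away, but after appending 50 irrelevant noise features the distances barely distinguish them anymore.

import numpy as np

rng = np.random.default_rng(1)

# One relevant feature: b is close to a, c is far from a
a, b, c = np.array([0.0]), np.array([0.1]), np.array([3.0])
print(np.linalg.norm(a - b), np.linalg.norm(a - c))  # 0.1 vs 3.0

# Append 50 irrelevant noise features to each point
a2, b2, c2 = (np.concatenate([p, rng.normal(size=50)]) for p in (a, b, c))
print(np.linalg.norm(a2 - b2), np.linalg.norm(a2 - c2))  # both around 10: the noise dominates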
But what I find really problematic is this:
Problem #2: The σ. The σ is the kernel width and controls the smoothness of the kernel. Small values of σ result in a more peaked kernel so that only data points very close to the data point to be explained (x1) get a large weight. Large values of σ spread out the weights more. The kernel width, therefore, controls the locality of the explanations.
How do you pick the right kernel width? Let’s consider the extreme options first: Very large σ and very small σ.
If we pick a very small σ, then only very close points get a large weight. This makes LIME more of a gradient method when we use linear regression as the surrogate model.
And if we make σ very large, all points get a similar weight, so we go toward a global surrogate model. It’s like giving weight 1 to every perturbed sample.
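Here is a quick numeric illustration of the two extremes, using the exponential kernel from above and some distances I made up:

import numpy as np

def kernel_weights(d, sigma):
    return np.exp(-d ** 2 / sigma ** 2)

distances = np.array([0.1, 0.5, 1.0, 2.0, 5.0])  # distances to the instance of interest

print(kernel_weights(distances, sigma=0.1))  # ~[0.37, 0.00, 0.00, 0.00, 0.00]: only the closest point counts
print(kernel_weights(distances, sigma=100))  # ~[1.00, 1.00, 1.00, 1.00, 1.00]: effectively a global surrogate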
Have a look at the following figure, which shows the predicted value over a single feature x1, together with local surrogate models for different kernel widths:
Alright, so that doesn’t look good. And in this example, we can at least see what’s happening and would maybe have a preference for kernel width 0.25 or so. But good luck when adding more features, some categorical, some with very different scales, and so on. Then we can’t just “have a look” at the prediction function and guess the appropriate kernel width.
So, when using the LIME Python package, how is the kernel width question solved? Let’s have a look at the Python code:
# Default in the LIME package (lime.lime_tabular.LimeTabularExplainer)
if kernel_width is None:
    kernel_width = np.sqrt(training_data.shape[1]) * .75
So if the user does not specify a width, it’s automatically set to the square root of the number of features times a magical 0.75. Where does the 0.75 come from? Unclear. At least to me. And the kernel width is highly application- and data-dependent, so it’s hard to see how a single default could be right. If you do nothing, you get this default choice. And if you pick the width on your own, there’s no clear advice on how to do it.
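If you do want to pick the width yourself, you can pass it to the explainer. Here is a rough sketch with the LIME package on toy data (the random forest, the toy features, and the width of 0.5 are arbitrary choices of mine; check the package documentation for the exact arguments and defaults):

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from lime.lime_tabular import LimeTabularExplainer

# Toy data and a black-box model, stand-ins for your own data and model
rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 4))
y_train = X_train[:, 0] ** 2 + X_train[:, 1] + rng.normal(scale=0.1, size=500)
model = RandomForestRegressor(random_state=0).fit(X_train, y_train)

explainer = LimeTabularExplainer(
    X_train,
    mode="regression",
    feature_names=["x1", "x2", "x3", "x4"],
    kernel_width=0.5,  # your own choice instead of the default 0.75 * sqrt(n_features)
)
explanation = explainer.explain_instance(X_train[0], model.predict, num_features=4)
print(explanation.as_list())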
But the kernel width is essential for LIME explanations. As you can see in the example above, you can arrive at opposite explanations, which are both “correct” but have to be interpreted in different locality contexts. The kernel defines how large the neighborhood is (in a smooth sense).
As long as you don’t know how to pick a good kernel width, I wouldn’t use LIME. At least not for tabular data.
Others (also this paper) have argued that we just have to make the kernel width large enough, because especially for small values we are in a “small bandwidth regime” where the LIME coefficients can even become all zero. But I disagree, since it’s still unclear what the right value should be, and having large values leads to a more global surrogate model, defeating the purpose of LIME.
Where does that leave us? There are plenty of other options that don’t require specifying a neighborhood, like Shapley values, counterfactual explanations, or what-if analysis. If you are interested in a book about Shapley values: I’m writing one! Get notified when it’s available.
Further references
You can find many interpretation methods in my book Interpretable Machine Learning
Learn more about LIME in a free chapter of the Interpretable Machine Learning book
Read more about the neighborhood problem by Philipp Kopper