I have received alarmed messages about the paper “Impossibility Theorems for Feature Attribution”. Some people saw the paper and concluded that SHAP is completely unreliable.
Is SHAP doomed?
Let’s discuss “Impossibility Theorems for Feature Attribution”: what does it actually study, and how does it affect recommendations regarding SHAP and other interpretation methods?
Bilodeau et al. (2024) studied the limitations of feature attribution methods like SHAP and Integrated Gradients. They specifically emphasize the limitations of “complete” and linear methods. These are methods that explain predictions as a sum of attributions:
$$\hat{f}(x) = \phi_0 + \sum_{j=1}^{p} \phi_j$$

Each $\phi_j$ (except the baseline term $\phi_0$) indicates how much the respective feature contributed to the prediction $\hat{f}(x)$.
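Here is a minimal sketch of that completeness property using the shap library; the synthetic data and the random forest model are made up purely for the example:

```python
import numpy as np
import shap
from sklearn.ensemble import RandomForestRegressor

# Made-up synthetic data, purely for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=500)

model = RandomForestRegressor(random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)
phi = explainer.shap_values(X[:1])[0]               # one attribution per feature
phi_0 = np.atleast_1d(explainer.expected_value)[0]  # the baseline term

# Completeness: the baseline plus the attributions reconstructs the prediction
print(phi_0 + phi.sum())
print(model.predict(X[:1])[0])  # approximately the same number
```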
In 38 pages, they demonstrate with theorems and experiments that SHAP and Integrated Gradients are “unreliable” and “fail to improve on random guessing for inferring model behavior.” This surely doesn’t sound good for practitioners who use SHAP and Integrated Gradients.
But we have to look more closely at the tasks for which they studied the interpretation methods:
Algorithmic Recourse: For example, when someone gets a bad credit score and wants to know what they could change to get a better score.
Local model behavior: Characterizing how the model behaves in the neighborhood of a data point.
Spurious feature identification: This is also defined as a local task. Is the model sensitive to local perturbations of the features?
These three tasks have one thing in common: They all ask, “How does the prediction for a certain data point change when we change the features just a little bit?”
For this use case, I 100% agree that no one should be using SHAP, as I also explain in my book Interpreting Machine Learning Models with SHAP. One of the reasons I wrote the book was to highlight such limitations of SHAP. But if I believed that SHAP shouldn't be used at all, I wouldn't have written a book; I would have just written a critical blog post.
The reason you shouldn't use SHAP for these tasks can be explained briefly: SHAP compares the prediction to the average prediction on a background (aka baseline) dataset; it does not measure how small changes in the features change the prediction.
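To make the dependence on the background data concrete: for a linear model, the (interventional) SHAP value of a feature is its coefficient times the difference between the feature value and that feature's mean in the background data. The toy coefficients and background datasets below are made up for illustration:

```python
import numpy as np

beta = np.array([2.0, -1.0])   # coefficients of a toy linear model
x = np.array([1.0, 3.0])       # the data point to explain

# Two made-up background datasets with different feature means
background_a = np.zeros((100, 2))
background_b = np.full((100, 2), [4.0, 3.0])

# For a linear model, the SHAP value of feature j is
# beta_j * (x_j - mean of feature j in the background data)
for name, bg in [("background A", background_a), ("background B", background_b)]:
    phi = beta * (x - bg.mean(axis=0))
    print(name, phi)

# Same model, same data point, same local behavior (the coefficients are fixed),
# yet the attributions differ, and can even flip sign, because the baseline changed.
```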
See the following visualization: We want to explain the prediction for the red dot with SHAP. The prediction is rather low, and so is the feature value. The SHAP value for this feature would be negative, since replacing the feature value with any value from the background data would increase the prediction. Locally, however, slightly increasing or decreasing the feature makes the prediction smaller, so from a pure "neighborhood" perspective we might attribute a positive effect to that feature value.
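The scenario in the visualization can also be reproduced numerically. The 1D model and background data below are made up, but they show the same effect: the SHAP value is negative, while nudging the feature in either direction makes the prediction smaller:

```python
import numpy as np

# Toy 1D model: a small local bump around x = -2,
# but much higher predictions for typical feature values x > 0.
def predict(x):
    x = np.asarray(x, dtype=float)
    return np.where(x < 0, -1.0 - (x + 2.0) ** 2, 5.0)

x0 = -2.0                                              # the "red dot"
background = np.random.default_rng(0).uniform(0.5, 3.0, size=1000)

# With a single feature, the SHAP value reduces to
# f(x0) minus the average prediction over the background data.
shap_value = predict(x0) - predict(background).mean()
print("SHAP value:", shap_value)                       # negative

# Local view: nudging the feature slightly in either direction
# makes the prediction smaller, so x0 sits at a local maximum.
eps = 0.1
print("nudge up:  ", predict(x0 + eps) - predict(x0))  # negative
print("nudge down:", predict(x0 - eps) - predict(x0))  # negative
```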
For me, this paper didn't change anything about how I see SHAP, and I suspect it didn't come as a surprise to many others either. But do people actually use SHAP for this kind of counterfactual analysis? Bilodeau et al. (2024) cite five papers where SHAP is used that way, so it seems this misuse does happen.
tl;dr: Don’t use SHAP for counterfactual questions or any questions about “slightly changing feature values of a data point.”
Could you elaborate further on the definition of counterfactual questions?