This is so helpful. Please give this view for partial dependence plots (PDP) as well. Also closely related are partial effect plots (PEP); I don't know whether the intuition differs between the two. (Here's a brief explanation of the difference: https://stats.stackexchange.com/questions/371439/partial-effects-plots-vs-partial-dependence-plots-for-random-forests). PEP seems to be more popular with those who model with a statistical mindset, whereas PDP is more popular with the machine learning mindset.

For PDP, I'd start with ICE curves. ICE stands for individual conditional expectation, and the PDP is just the average of these curves.

A single ICE curve is made from one data point. The feature of interest, say X1, is swapped out with different values from a grid, and the prediction for this data point is plotted on the y-axis (feature value on the x-axis).
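To make that concrete, here's a minimal sketch of a single ICE curve for a hypothetical toy model `f` with a main effect of X1 plus an X1:X2 interaction (the function and numbers are made up for illustration):

```python
def f(x1, x2):
    # hypothetical black-box model: main effects plus an x1*x2 interaction
    return 2 * x1 + x2 + 3 * x1 * x2

def ice_curve(point, grid):
    """ICE curve for one data point: swap out X1 over the grid,
    keep the point's other feature values fixed."""
    _, x2 = point
    return [f(g, x2) for g in grid]

grid = [0.0, 0.5, 1.0]
curve = ice_curve((0.7, 1.0), grid)  # the point's own x1=0.7 is replaced by each grid value
print(curve)  # plot with grid on the x-axis, curve on the y-axis
```

Because X2 stays fixed at the point's own value, the curve's shape reflects both the X1 main effect and the interaction terms evaluated at that X2.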

The decomposition view of an ICE curve: we keep all components that don't involve X1 the same, but we vary X1, so all components involving X1 change. We therefore see the main effect of X1 plus its interaction effects, with everything else held equal.

But what happens when we average these ICE curves across our data to get the Partial Dependence Plot?

Here it gets interesting, because you could say that by averaging we cancel out the interaction effects, so that the PDP only shows the main effect of X1 (f1). But that depends on which assumptions you use for the decomposition. There are decompositions for which the PDP is not an estimate of f1, for example the ALE decomposition.
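A toy sketch of the averaging step, reusing the same hypothetical model: when the X2 values in the data happen to average to zero, the x1*x2 interaction cancels out across the ICE curves and the PDP recovers the 2*x1 main effect. (All numbers here are illustrative assumptions, not a general proof.)

```python
def f(x1, x2):
    # same hypothetical model: main effects plus an x1*x2 interaction
    return 2 * x1 + x2 + 3 * x1 * x2

def pdp(data, grid):
    """PDP = average of the ICE curves over all data points."""
    n = len(data)
    return [sum(f(g, x2) for _, x2 in data) / n for g in grid]

data = [(0.3, -1.0), (0.8, 1.0)]  # the X2 values average to zero
grid = [0.0, 1.0]
print(pdp(data, grid))  # matches the main effect 2*x1 on the grid: [0.0, 2.0]
```

With a different X2 distribution the interaction term would not cancel, which is one way to see why the PDP-as-main-effect reading rests on assumptions about the decomposition.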

For PEP: I have to look into it.

Thanks for the response, but it's still not quite clear to me about ICE and PDP. I think a follow-up article on the functional decomposition of these two would be totally worthwhile. :-)

This is a beautifully intuitive explanation. However, I feel you left us hanging with ALE, which is precisely what interests me the most. So, I understand that for x2, ALE gives only the f_2(x2) component and nothing else. But then what about the interactions? Am I correct to understand that

* the simple ALE x2 score does not incorporate anything whatsoever of the x2 interactions; and

* the ALE interactions x1_x2 and x2_x3 map directly to the functional decomposition components f_12(x1,x2) and f_23(x2,x3)?

(I won't comment for now on the three-way interaction, since that is not yet well-developed for ALE).

Yes, your understanding is correct. For a deep dive on ALE, have a look at the chapter in Interpretable Machine Learning: https://christophm.github.io/interpretable-ml-book/ale.html

Thanks!