Hi Christoph, thanks for writing a very useful article. You said: "If X2 is a noisy copy of a strong feature X1, then by sampling the features that a node can use for splitting, in many cases X2 will be available, but not X1, and the split will be based on X2." Why would X2 be picked over X1? Since X1 is less noisy, it would lead to a more homogeneous split, right? Ideally, X1 should be picked more.
X1 will be picked more often. But because random forests sample the features that each node can use for splitting, X2 will be picked when the sampled subset contains X2 but not X1.
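To put a rough number on how often that happens, here is a small sketch. It assumes each node draws `max_features` candidate features uniformly without replacement from `n_features` features; the function names and the example values (10 features, 3 per node) are made up for illustration. It only computes how often X2 is a candidate while X1 is not. Whether X2 then wins the split still depends on the other candidates.

```python
import random
from math import comb


def prob_x2_without_x1(n_features: int, max_features: int) -> float:
    """Exact probability that a node's candidate set contains X2 but not X1.

    Assumes 1 <= max_features <= n_features and n_features >= 2.
    """
    # X2 takes one slot; the remaining max_features - 1 slots are filled
    # from the n_features - 2 features that are neither X1 nor X2.
    return comb(n_features - 2, max_features - 1) / comb(n_features, max_features)


def simulate(n_features: int, max_features: int,
             n_nodes: int = 100_000, seed: int = 0) -> float:
    """Monte Carlo estimate of the same probability."""
    rng = random.Random(seed)
    x1, x2 = 0, 1  # hypothetical column indices for X1 and its noisy copy X2
    hits = 0
    for _ in range(n_nodes):
        # Draw the candidate features for one node, without replacement.
        candidates = rng.sample(range(n_features), max_features)
        if x2 in candidates and x1 not in candidates:
            hits += 1
    return hits / n_nodes


if __name__ == "__main__":
    p, m = 10, 3  # illustrative values, not taken from the article
    print(f"exact:     {prob_x2_without_x1(p, m):.4f}")
    print(f"simulated: {simulate(p, m):.4f}")
```

Under these assumptions the exact value simplifies to m(p - m) / (p(p - 1)), so with 10 features and 3 candidates per node, roughly 23% of nodes see X2 without X1.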
Brilliant article! I'm a new reader, but I'm trying to catch up on the older articles and books, and I'm enjoying them very much. I'm really enjoying this inductive bias series.
I have a question: Why do you think tabular data tends to exhibit non-smooth patterns in the underlying 'real' function? Coming from a background in physics, I'm accustomed to mostly using smooth and continuous functions to model reality. It feels unusual to expect a non-smooth model to perform better. Could you elaborate on this? Do you have any papers or information that could provide more insight into this? Thanks!
Thanks Pietro. That gives me a lot of motivation.
Good question. I'm just guessing, but let's say both smooth and non-smooth patterns occur in the "true" relationship between features and target. Tree-based models can capture non-smooth patterns well, and they can at least approximate smooth patterns. Maybe it's that neural networks can't approximate non-smooth patterns well enough. But that's just speculation.
As for papers, I relied most on "Why do tree-based models still outperform deep learning on tabular data?" (https://arxiv.org/abs/2207.08815) when writing this post.