The rise of tabular foundation models
Moving from gradient boosting to in-context learning
Tree-based boosting algorithms have sat on the tabular throne for many years now. Time and again, deep learning approaches have tried to dethrone XGBoost, CatBoost, and other tree-based algorithms, without success.
Then came the Transformer, and with it, a new hope for the deep learners. Many different architectures and algorithms for tabular data have been proposed: CARTE, XTaB, TabLLM, TabPFN, TabICL, …
Could one of those finally beat the tree-based algorithms?
Unlike previous attempts that tried to beat trees at their own game (train & predict), the most interesting newcomers are changing how we play the game (pre-train & learn in-context).
I’m talking about TabICL and TabPFN, which are both based on Prior-Data Fitted Networks (PFNs). They have shown remarkable performance, often outperforming boosted trees, as shown in this graph:

However, PFNs are not just one more algorithm to drop into scikit-learn: they turn the way we model tabular data upside down.
It’s time we talk about why tabular ML is (maybe) getting weird.
TabPFN and TabICL
Both TabPFN and TabICL are so-called Prior-Data Fitted Networks (PFNs): they are pre-trained on lots of tabular data, and at prediction time we provide the entire “training data” plus labels along with the test data for which we want predictions. Calling it training data at this point is misleading; it’s more useful to think of this step as in-context learning, where the specific “training data” becomes context data. This also comes with a caveat, since the context window has limitations. The following shows a usage example for TabICL, which follows the sklearn API style. Note that while I provide “training data” in the “fit” step, no training happens here.
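A minimal sketch of what such a call looks like (assuming the `tabicl` package is installed; the `TabICLClassifier` name and its sklearn-style interface are taken from the project, the rest is a toy dataset for illustration):

```python
# Sketch: TabICL with the familiar sklearn API. Despite the method names,
# `fit` does no gradient updates; it only stores the context data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from tabicl import TabICLClassifier  # assumes the tabicl package

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = TabICLClassifier()
clf.fit(X_train, y_train)    # stores (X_train, y_train) as context, no training
preds = clf.predict(X_test)  # in-context learning happens here, at inference time
```

The interface looks identical to any sklearn estimator, which is exactly what makes the shift easy to miss: all the work that `fit` normally does has moved into the model's pre-training.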
PFNs can be thought of as a framework for training neural networks to perform Bayesian inference. The prior distribution is represented by the data used for pre-training, and the prediction is the posterior distribution. PFNs can also be trained with other priors: for example, if you pre-train a PFN on sine waves, it will learn to predict sine waves. TabICL and TabPFN are specific PFNs trained on tabular data.
PFNs change how we model
PFN-style modeling turns the classic tabular ML on its head. It replaces the classic training + prediction with pre-training + in-context learning. In a way, PFNs shift the process by one abstraction level: The PFN pre-training is about “learning to learn,” and the prediction phase is about ingesting labelled and unlabelled data at once and doing in-context learning.
Breaking the classic train+predict has a lot of implications:
With tabular PFNs, you start with a pre-trained model and download the weights, just like with open-source LLMs.
You don’t need to train anything. You just throw your labelled and unlabelled data at the PFN and get the predictions out. This also means you no longer need to tune hyperparameters (except for setting a few inference-time parameters).
Since PFNs are based on neural networks, we can also fine-tune them on further datasets. For example, if you work mostly with time series data, you can fine-tune the model on time series datasets.
Each time we want to make a prediction, we have to provide the entire “training data” (context data). This makes it expensive if you can’t batch your predictions, but only want one prediction at a time.
Because pre-training is required (and expensive), we might see more closed models or restrictive licenses on the weights.
These changes are really shaking things up. But it gets weirder.
PFNs are trained on synthetic data
I would have expected that TabPFN and TabICL are pre-trained on many of the classic tabular dataset suites, like those from OpenML.
To my surprise, tabular data repositories are not the main source of training data. Instead, it’s synthetic data. Both TabPFN (v1 and v2) and TabICL are trained on lots of synthetic data; TabICL even exclusively so.
This raises the question: How does one design a data-generating process that produces realistic-looking datasets, or at least datasets that are useful for pre-training?
The answer: structural causal models (SCMs).
Sampling a single dataset looks like this (very simplified):
Sample high-level parameters such as dataset size and number of features
Randomly create an acyclic graph that describes the causal structure of the features and the target
Sample data from the distribution defined by that causal graph
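The three steps above can be sketched in pure Python. This is an illustrative toy prior of my own making, not the actual data-generating process used by TabPFN or TabICL (which is far more elaborate), but it shows the core idea: draw a random DAG, then propagate noise through it.

```python
import math
import random

def sample_scm_dataset(n_rows=100, n_features=4, seed=0):
    """Sample one synthetic dataset from a randomly drawn structural causal
    model. Toy sketch, not the actual TabPFN/TabICL prior."""
    rng = random.Random(seed)
    n_nodes = n_features + 1  # the last node will serve as the target

    # Step 2: random acyclic graph. Node j may only depend on earlier
    # nodes i < j, so the graph is a DAG by construction.
    parents = {j: [i for i in range(j) if rng.random() < 0.5]
               for j in range(n_nodes)}
    weights = {(i, j): rng.uniform(-1, 1)
               for j in range(n_nodes) for i in parents[j]}

    # Step 3: sample rows; each node = nonlinearity(weighted parents) + noise.
    rows = []
    for _ in range(n_rows):
        values = []
        for j in range(n_nodes):
            signal = sum(weights[(i, j)] * values[i] for i in parents[j])
            values.append(math.tanh(signal) + rng.gauss(0, 0.1))
        rows.append(values)

    X = [row[:-1] for row in rows]
    # Binarize the target node at its median to get a classification label.
    median = sorted(row[-1] for row in rows)[n_rows // 2]
    y = [int(row[-1] > median) for row in rows]
    return X, y

X, y = sample_scm_dataset()
```

Repeating this with millions of different seeds (and varying the step-1 parameters `n_rows` and `n_features`) yields an endless stream of structurally diverse datasets to pre-train on.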
The pre-training is based on millions of unique, synthetic datasets generated in this way, a scale that would be difficult to match with real datasets. Even large repositories like OpenML only host around 24k datasets.
As a side note, I find it very wild that pre-training on purely synthetic data can produce an algorithm that (sometimes) outperforms boosted tree ensembles on real data.
PFNs have crossed the threshold
TabICL and TabPFN truly seem to be very performant out of the box (I also tried them on two datasets, and the results were very promising). I currently see this development going one of two ways:
Scenario “Revolution”: These foundation models may become the state-of-the-art of how we do supervised machine learning for tabular data. In this case, we really should get started understanding these models better.
Scenario “One More Algo”: If no tabular revolution takes place, I still expect the tabular foundation models to stay, for three reasons: 1) They are highly performant. 2) They are super easy to apply. 3) Their prediction errors have a low correlation with those of other methods, making them excellent candidates for model ensembles.
No matter which scenario comes true, I strongly believe that it’s worth learning about PFNs.
Weirdly enough, these tabular foundation models have been flying under the radar. My theory: there have been several attempts to dethrone the tabular titans (XGBoost, CatBoost, …), which often turned out to be a nothing-burger. If you cry wolf too many times, nobody will listen when the wolf actually comes. Plus, TabPFN has lots of teething problems, like limitations on the number of rows and features (this is getting better), or predictions that are sensitive to the order of features (again, more weirdness).
Anyways. From my perspective, tabular foundation models have crossed the threshold from “yet another tabular deep learning approach” to “this thing might at least stay and has a chance to revolutionize tabular ML”. That’s why I want to spend some time learning about TabPFN and TabICL. And I want to take you with me on that journey. This includes tutorials on using TabICL and TabPFN, understanding the underlying architecture, having a look at their interpretability, and trying to figure out best practices. More specifically, I’m thinking about creating a series of posts on TabPFN and TabICL, just like I did with Conformal Prediction.
If you are interested in TabPFN/TabICL, please leave a like, a comment, an e-mail, or any other indicator of interest so that I can justify putting more effort into this topic. 😁



