Am I missing something in the passage 'Datapoint attention attends to cells from other data points. So if you look at the vector representation for, say, cell x_12 (column 1, row 2)'? Don't you mean row 1, column 2?
This article is phenomenal by the way. Thank you!
Good catch! I fixed it.
Thanks for the writeup. Wanted to point out something -- it's a bit confusing to call this an "encoder-decoder" architecture. When people say that about transformers, they usually mean two transformer stacks that handle sequences differently. TabPFN is an encoder-only transformer with a *feature* encoder and an *output* decoder, not to be confused with an encoder-decoder model like T5.
Thanks for writing this, it clarifies a lot. Does the cell-based abstraction scale to huge datasets? Brilliant insights!
Is there reason to think it could completely botch certain datasets? Real-world datasets just too far out of left field compared to its pretraining universe?
It's possible, and I've anecdotally heard from other people who experienced this on their data. After all, there's no free lunch.
Indeed
The usual suspects will probably say, "No really! This time it *is* a free lunch!!" Oh well