The Intricate Link Between Compression and Prediction
How Gzip and K-Nearest Neighbors Can Outperform Deep Learning Models
You can beat deep learning for text classification with gzip + a short Python script, according to this paper.1
The idea: Use k-nearest neighbors based on compressed documents.
Here’s how it works:
Compress (gzip) the new, unlabeled document and measure the compressed size (= C(x)).
Compress one training document and measure its compressed size (= C(y)).
Concatenate the new document and the training document, compress the result, and measure its size (= C(xy)).
Compute the normalized compression distance:
\(NCD(x,y) = \frac{C(xy) - \min\{C(x),\,C(y)\}}{\max\{C(x),\,C(y)\}}\)
Repeat this procedure for all documents in the training data. This gives you the distance between your new document and every training document. Pick the k documents with the smallest NCD and classify the new document by the majority class among those k.
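Here's a minimal sketch of the whole procedure in Python, using only the standard library. The documents, labels, and choice of k are made up for illustration, and the paper's implementation differs in details, but the core logic is exactly the steps above: gzip, NCD, and a majority vote.

import gzip
from collections import Counter

def C(text):
    # Compressed size in bytes; a stand-in for C(.) in the formula above.
    return len(gzip.compress(text.encode("utf-8")))

def ncd(x, y):
    # Normalized compression distance between two documents.
    cx, cy = C(x), C(y)
    cxy = C(x + " " + y)  # concatenate, then compress
    return (cxy - min(cx, cy)) / max(cx, cy)

def classify(new_doc, train_docs, train_labels, k=3):
    # k-nearest neighbors under NCD, then a majority vote over the labels.
    distances = [ncd(new_doc, doc) for doc in train_docs]
    nearest = sorted(range(len(train_docs)), key=lambda i: distances[i])[:k]
    return Counter(train_labels[i] for i in nearest).most_common(1)[0][0]

# Tiny made-up example:
train_docs = ["the cat sat on the mat", "stocks fell sharply today", "dogs love long walks"]
train_labels = ["pets", "finance", "pets"]
print(classify("the dog chased the cat", train_docs, train_labels, k=3))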
But why does this work at all?
From Compression to Prediction
If you haven't heard about the amazing connection between compression and prediction — you're in for a treat.
Prediction and compression are two sides of the same coin. An intuitive explanation: if you have a good theory about the world, you can use it both to compress observations and to predict them well.
Let’s say you have the sequence: 2, 4, 8, 16, 32, 64, 128, 256, 512
To compress this sequence, we could write a short program that reproduces it. The shorter the program, the better. This brings us to the realm of Kolmogorov complexity, which measures the complexity of information not by the information itself, but by the length of the shortest program that produces it.
A short program to compress this sequence would be:
for i in range(9):
print(2 ** (i + 1))
We know that 1024 would be a good prediction for the next number in the sequence. The program above comes in handy for prediction as well: we could just increase the 9 to a 10. Writing the program forces us to find the "rules" that produced the sequence, and these same rules are useful for predicting it. If you can compress a sequence well, then you can usually predict it well. Not guaranteed, but almost always.2
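Written as code, the same rule does both jobs (this is just the program above restated):

def rule(i):
    # The "rule" behind the sequence: powers of two.
    return 2 ** (i + 1)

compressed = [rule(i) for i in range(9)]  # reproduces 2, 4, ..., 512
prediction = rule(9)                      # predicts the next number: 1024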
Prediction Can Mean Compression
The relation also goes in the other direction: Good predictors can be good compressors. For example, language models are compressors.3
Stable Diffusion, the image-generating AI, has a file size of 10-20 GB. That's the size of just 2,000 photos at 5 MB each. Yet the generative model compresses far more visual information than that. If you've used AI image generators, you know what I mean.
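The standard way to make this precise is via code lengths: a symbol that a model predicts with probability p can be encoded in roughly -log2(p) bits, for example with arithmetic coding, so a better predictor means a shorter code. A toy sketch with a made-up unigram character model (a drastic simplification of a language model) shows the accounting:

import math
from collections import Counter

text = "abracadabra"

# Toy "predictor": unigram character probabilities estimated from the text itself.
counts = Counter(text)
prob = {ch: n / len(text) for ch, n in counts.items()}

# Ideal code length under this predictor: -log2 p(ch) bits per character.
bits = sum(-math.log2(prob[ch]) for ch in text)
print(f"{bits:.1f} bits under the model vs. {8 * len(text)} bits as raw bytes")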
If you have found a good predictor, it encodes the rules that govern the relation between the features and the target. However, it's usually not the smallest possible program; think of random forests or neural networks.
Most of the literature I found was about generative models like language models. But a big part of machine learning is discriminative models, for example for classification or regression. They of course model a much simpler distribution, since they usually just model \(E[Y|X]\) and don't have to generate a multivariate data point like an image or a text.
I still think of discriminative models as lossy compression: they compress the dataset into a (usually) smaller program that models the relation between Y and X. And just like MP3, it's lossy compression. MP3 cuts out acoustic signals that most people can't hear; discriminative models cut out anything unhelpful for predicting Y.
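A rough way to see this (a toy sketch, assuming scikit-learn; the data and model are made up and only the orders of magnitude matter): the fitted model takes far fewer bytes than the dataset it summarizes.

import pickle
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up regression dataset: 100,000 rows, 10 features, near-linear target.
rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=0.1, size=100_000)

model = LinearRegression().fit(X, y)

print("dataset size (bytes):", X.nbytes + y.nbytes)       # roughly 8.8 MB
print("model size (bytes):  ", len(pickle.dumps(model)))  # on the order of a kilobyte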
And there are even more links between compression and prediction: If you want to show that something isn't learnable (→ ML is useless for the task), it's sufficient to show that you can't compress it.4 You can also study ML algorithms through the lens of compression, as has been done for AdaBoost, for example (see Chapter 4.2).5
The bridge between compression and prediction resonates a lot with me. It gives an interesting view on ML interpretability: By linking compression and prediction, you can motivate why we want to interpret the model. The model can be seen as capturing a good compression of the world. By interpreting the model, we try to learn about that compression.
Jiang, Zhiying, et al. "'Low-Resource' Text Classification: A Parameter-Free Classification Method with Compressors." Findings of the Association for Computational Linguistics: ACL 2023.
Vitányi, Paul, and Ming Li. "On prediction by data compression." Machine Learning: ECML-97, Lecture Notes in Computer Science, vol. 1224, Springer, 1997. https://doi.org/10.1007/3-540-62858-4_69
Delétang, Grégoire, et al. "Language modeling is compression." arXiv preprint arXiv:2309.10668 (2023).
David, Ofir, Shay Moran, and Amir Yehudayoff. "On statistical learning via the lens of compression." arXiv preprint arXiv:1610.03592 (2016).
Schapire, Robert E., and Yoav Freund. Boosting: Foundations and Algorithms. MIT Press, 2012.
Thank you, Christoph, for this very good explanation.
In particular, I love how you make a concerted effort to also give the intuition behind the points you presented. All too often, it's just "well, the math says X" - but many people have a hard time understanding what X really means, and then what its implications are, and so on.
I love this connection between compression and prediction; it gives another lens for understanding how these models work.
Shameless plug, but I covered the Gzip + kNN paper and the Language Modeling is Compression paper in two of my articles.
[1]: https://codeconfessions.substack.com/p/decoding-the-acl-paper-gzip-and-knn
[2]: https://codeconfessions.substack.com/p/language-modeling-is-compression