Points, Rules, Weights, Distributions: The Elements of Machine Learning
How all (?) ML models boil down to 4 learnable elements
In this post, I explore how to deconstruct machine learning, an idea from my upcoming book Elements of Machine Learning Algorithms.
Neural networks are often built from well-defined building blocks with weight tensors as the learnable elements. That made me wonder: what are the building blocks of all the other machine learning models? I narrowed it down to the elements that are actually learnable: the logistic (sigmoid) function at the end of a logistic regression model, for example, is fixed, not learned.
But which elements remain if we put random forests, CNNs, and k-nearest neighbors into a pressure cooker? What’s left are four learnable elements that make up most of modern machine learning. You can think of them as ways of storing and applying patterns to (intermediate) features.
The Four Elements
Points: Models like k-nearest neighbors and k-means store patterns directly as points in feature space. These points can be instances from the training data, but also newly created ones, such as cluster centers. Making predictions (or clustering) involves measuring distances between new data points and the stored ones and working with the closest matches. Only using points means we restrict ourselves to the “raw” feature space without any newly learned features.
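To make this concrete, here is a minimal sketch of point-based prediction, with made-up 2-D cluster centers standing in for learned ones: the stored patterns are literally just points, and prediction is a distance comparison.

```python
import numpy as np

# Hypothetical cluster centers "learned" by k-means on 2-D data.
# The stored patterns are nothing more than these points.
centers = np.array([
    [0.0, 0.0],
    [5.0, 5.0],
    [0.0, 5.0],
])

def assign_cluster(x, centers):
    """Assign a new point to the closest stored point (Euclidean distance)."""
    distances = np.linalg.norm(centers - x, axis=1)
    return int(np.argmin(distances))

print(assign_cluster(np.array([4.2, 4.8]), centers))  # -> 1, closest to [5, 5]
```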
Rules: These are if-then structures where conditions of the form feature = value (or feature ≤ threshold for numerical features), combined with ANDs, lead to a prediction. Models such as decision trees, rule lists, rule sets, random forests, and boosted trees all rely on if-then rules. Applying decision rules means checking which rules a new data point matches and using the corresponding predictions. Rule-based models thrive by combining many rules in parallel: ensembles of trees built through bagging and boosting are prime examples, and sit at the top of the (tabular) food chain. Rule-based machine learning works particularly well for tabular data where features have clear meaning.
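As a rough illustration, here is a tiny decision list with two hypothetical rules for a made-up churn example; feature names and values are invented for the sketch.

```python
# A hypothetical two-rule decision list for a made-up churn example.
rules = [
    ({"contract": "monthly", "senior": True}, "churn"),
    ({"contract": "yearly"}, "no churn"),
]

def matches(x, condition):
    """A data point matches if all feature = value conditions hold (the ANDs)."""
    return all(x[feature] == value for feature, value in condition.items())

def predict(x, rules, default="no churn"):
    """Return the prediction of the first matching rule (a decision list)."""
    for condition, prediction in rules:
        if matches(x, condition):
            return prediction
    return default

print(predict({"contract": "monthly", "senior": True}, rules))  # -> "churn"
```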
Weights: Weight tensors are structured sets of numbers and are used in a wide range of models, from linear models to transformers. To apply the patterns stored in a weight tensor, we multiply the weight tensor by (intermediate) features. Weight-based models can be built sequentially. Each layer transforms features before passing them on. Weight-based machine learning is typically combined with non-linear transformation functions to learn complex models. Convolutional neural networks are just one example. Weight-based machine learning, specifically deep learning, works well for non-tabular data such as images and text.
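A minimal sketch of the idea, with randomly initialized weights standing in for learned ones: prediction is repeated multiplication of weights with (intermediate) features, with a non-linear transformation in between.

```python
import numpy as np

rng = np.random.default_rng(0)

# Randomly initialized weights of a tiny two-layer network (the learnable elements).
W1 = rng.normal(size=(3, 4))   # maps 3 input features to 4 intermediate features
W2 = rng.normal(size=(4, 1))   # maps intermediate features to 1 output

def forward(x):
    """Apply the stored patterns: multiply weights with (intermediate) features,
    with a non-linearity in between."""
    hidden = np.maximum(0, x @ W1)   # ReLU non-linearity
    return hidden @ W2

print(forward(np.array([0.5, -1.2, 2.0])))
```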
Distributions: These describe how variables behave probabilistically. Naive Bayes, the (semi-parametric) proportional hazard model, and Gaussian processes all store patterns as distributions. Predictions are made by applying probability laws, often using the conditional expectation of the target given the features. Some distribution-based models “collapse” into weight-based representations. For example, assuming a normal distribution of a target conditional on a linear combination of features yields a linear model, which we can represent with a weight vector.
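For illustration, a stripped-down Gaussian naive Bayes sketch with made-up parameters: the stored patterns are per-class feature distributions plus class priors, and prediction applies Bayes' rule.

```python
import numpy as np

# Stored patterns: per-class feature distributions (mean, std) and class priors.
# All numbers are made up; the features could be, say, the number of links
# and the fraction of capital letters in an email.
params = {
    "spam":     {"prior": 0.3, "mean": np.array([7.0, 0.9]), "std": np.array([2.0, 0.3])},
    "not spam": {"prior": 0.7, "mean": np.array([2.0, 0.2]), "std": np.array([1.5, 0.2])},
}

def gaussian_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

def predict(x):
    """Bayes' rule with a conditional independence assumption (naive Bayes)."""
    scores = {label: p["prior"] * np.prod(gaussian_pdf(x, p["mean"], p["std"]))
              for label, p in params.items()}
    return max(scores, key=scores.get)

print(predict(np.array([6.5, 0.8])))  # -> "spam"
```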
Each of these storage units forms its own “language” of representation. However, they are sometimes combined: Think of Gaussian Mixture Models (points + distribution), model-based trees (rules + weights), and RBF SVM (points + weights). But what about evaluation and optimization, the other ingredients, apart from representation?
Differences in Evaluation
Once we represent a model with points, rules, weights, or distributions, we also need different tools for evaluation and optimization, at least to a degree. Of course, the type of supervised evaluation, such as measuring the accuracy for classification tasks or MSE for regression tasks, is independent of the representation we use, but there are some representation-specific evaluation criteria:
Rules are often evaluated in terms of coverage and accuracy, and the trade-off between generality and specialization (see the small sketch after this list). When it comes to ensembling decision trees, we want diverse trees.
Weight-based machine learning models have the special advantage that we can often evaluate them with the same metric that we use to optimize them.
Distribution-based models have their own logic and are often evaluated based on likelihoods and likelihood ratios.
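As a small illustration of the rule-specific criteria, here is how coverage and accuracy of a single rule could be computed on a made-up toy dataset.

```python
# A minimal sketch of rule evaluation on a toy dataset (made-up values).
# Coverage: fraction of data points the rule applies to.
# Accuracy: fraction of covered points where the rule's prediction is correct.
data = [
    ({"contract": "monthly"}, "churn"),
    ({"contract": "monthly"}, "no churn"),
    ({"contract": "yearly"},  "no churn"),
    ({"contract": "yearly"},  "no churn"),
]
rule_condition, rule_prediction = {"contract": "monthly"}, "churn"

covered = [(x, y) for x, y in data
           if all(x[f] == v for f, v in rule_condition.items())]
coverage = len(covered) / len(data)
accuracy = sum(y == rule_prediction for _, y in covered) / len(covered)

print(coverage, accuracy)  # 0.5 and 0.5: a general rule, but not very accurate
```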
Differences in Optimization
When it comes to optimization, some procedures work independently of the elementary model representation: For example, we can use evolutionary algorithms no matter whether the model is based on points, if-then rules, weights, or distributions (though it wouldn’t be very effective). And, of course, “external” optimization, aka hyperparameter tuning and model selection, can also be done independently of the representation. But for internal optimization (aka training or learning the model), each representation needs a different optimization strategy:
With point-based machine learning, we do a (simple) optimization at prediction time, since we typically have to search for the closest data points. Doing optimization during prediction is unique to this form of representation. Another odd optimization fact: for k-nearest neighbor models, we can even do without any internal optimization during training, which is why this is also called lazy learning. If, however, we create new data points (such as cluster centers), optimization during training is needed.
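A minimal k-nearest-neighbor sketch on toy data (k = 3): "training" only stores the points, and the argmin over distances, the actual optimization, happens at prediction time.

```python
import numpy as np

# Lazy learning: "training" just stores the data.
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_train = np.array(["A", "A", "B", "B", "B"])

def predict(x, k=3):
    """The 'optimization' is a search over distances to the stored points."""
    distances = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(distances)[:k]
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]  # majority vote among the k neighbors

print(predict(np.array([5.0, 4.8])))  # -> "B"
```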
Rule-based machine learning requires discrete, combinatorial optimization. Even just creating a single rule puts you in combinatorial hell: with just 10 binary features, there are already 3^10 = 59,049 possible rules, since each feature can be left out of the rule, required to be 0, or required to be 1. This doesn’t even include all the possibilities of how we binarize non-binary features. So optimization is typically done in a “greedy” way, one IF-condition after the other. This greedy optimization is either wrapped in the sequential covering algorithm (when growing a decision list or set) or applied recursively (for decision trees).
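To illustrate the greedy strategy, here is a rough sketch that grows a single rule one condition at a time on made-up binary data, using precision for the target class as an assumed selection criterion.

```python
# Greedy rule growing on toy binary data (all values made up).
data = [
    ({"f1": 1, "f2": 0, "f3": 1}, 1),
    ({"f1": 1, "f2": 1, "f3": 0}, 1),
    ({"f1": 0, "f2": 1, "f3": 1}, 0),
    ({"f1": 0, "f2": 0, "f3": 0}, 0),
    ({"f1": 1, "f2": 0, "f3": 0}, 0),
]

def covered(rule, data):
    return [(x, y) for x, y in data if all(x[f] == v for f, v in rule.items())]

def grow_rule(data, target=1, max_conditions=2):
    """Greedy search: at each step, add the single condition that most
    improves the precision of the rule for the target class."""
    rule = {}
    for _ in range(max_conditions):
        candidates = [(f, v) for f in ["f1", "f2", "f3"] for v in (0, 1) if f not in rule]
        def precision(cond):
            cov = covered({**rule, cond[0]: cond[1]}, data)
            return sum(y == target for _, y in cov) / len(cov) if cov else 0.0
        best = max(candidates, key=precision)
        rule[best[0]] = best[1]
    return rule

print(grow_rule(data))  # -> {'f1': 1, 'f2': 1}: perfectly accurate, but covers only one point
```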
Weight-based machine learning has the big advantage that we can often directly optimize a loss function of our choice, as long as it has a derivative we can work with. And by applying the chain rule, we can do this for very complex models (deep learning).
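A minimal sketch of this gradient-based optimization for the simplest weight-based model, linear regression with an MSE loss, on made-up data.

```python
import numpy as np

# Toy data: the "true" weights are [2.0, -1.0].
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([2.0, -1.0]) + rng.normal(scale=0.1, size=100)

w = np.zeros(2)                               # learnable element: the weight vector
for _ in range(500):
    grad = 2 * X.T @ (X @ w - y) / len(y)     # derivative of the MSE loss w.r.t. w
    w -= 0.1 * grad                           # gradient-descent step
print(w)                                      # close to [2.0, -1.0]
```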
Distributions follow their own logic and language yet again. First, we often have to make an assumption about the general form of the distribution. Then we optimize the likelihood for the parameters of that distribution given the training data. Sometimes, this leads to a weight-based representation, for which we can use gradient-based optimization methods. Sometimes there is even an exact solution (like for linear regression models).
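Sticking with the same made-up linear-regression setup, the “exact solution” case can be sketched with the normal equation instead of gradient descent: assuming a normally distributed target given a linear combination of features, maximizing the likelihood is equivalent to minimizing squared error.

```python
import numpy as np

# Same toy data as in the gradient-descent sketch above.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([2.0, -1.0]) + rng.normal(scale=0.1, size=100)

w = np.linalg.solve(X.T @ X, X.T @ y)   # normal equation: w = (X^T X)^{-1} X^T y
print(w)                                # close to the true weights [2.0, -1.0]
```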
If you like such a deconstructed view, you might be interested in my new book project called “The Elements of Machine Learning Algorithms”. Even though it seems more and more that I am writing an (alternative) Introduction to Machine Learning, it contains a lot of insights for more advanced modelers as well. At least I’m learning a lot of new stuff while writing the book, by deconstructing ML into elements and seeing commonalities between algorithms that I hadn’t realized before. If you want to stay up to date on the writing process, and maybe even get the book at a discount, sign up for updates here:


