Machine learning algorithms to live by
The theories you learn shape how you see the world. A lawyer has the legal perspective in mind, an entrepreneur smells business opportunities everywhere, and a social worker might have a keen eye on social problems in any given situation.
But what about us machine learning practitioners?
Do we have ideas in machine learning that we can apply to real life?
I’ve argued that machine learning changes how we see the world and there are some lessons we can draw from machine learning for our lives if we try hard enough.
This post is inspired by the book “Algorithms to live by: The computer science of human decisions”, which does a similar generalization of ideas from computer science to real life.
Let’s go.
Success comes from focusing on failure
A recipe for future failure is to focus on past successes. You know, people stuck reliving past glories instead of working on their weaknesses. A stark contrast to this is machine learning. Many machine learning algorithms have an almost masochistic focus on past mistakes:
Gradient descent is driven by errors: the more wrongly a data point is predicted, the larger its contribution to the gradient and thus to the next update.
Gradient-boosted trees build decision trees iteratively. At each iteration, the model is trained on the residuals, which are an embodiment of failure.
Sticking with trees: What happens when all the data points in a node have the same class? That’s right, nothing. The tree-growing algorithm focuses on splitting nodes with many wrong classifications.
Does that mean successes count for nothing? No. Successful predictions tell us how much weight to give to the failed ones. Say the model gets 10 predictions wrong. How much gradient boosting adapts in the next iteration depends on whether it got another 10 predictions right or 10,000.
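The residual-fitting loop of gradient boosting can be sketched in a few lines of NumPy. This is a toy version using one-split “stumps” as the base trees; the helper name, data, and hyperparameters are all made up for illustration, not taken from any particular library:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=200)
y = np.sin(X) + rng.normal(0, 0.1, size=200)

def fit_stump(x, residuals):
    """Fit a one-split regression tree (a stump) to the current residuals."""
    best = None
    for t in np.quantile(x, np.linspace(0.05, 0.95, 19)):
        left, right = residuals[x <= t], residuals[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, threshold, left_value, right_value = best
    return threshold, left_value, right_value

learning_rate = 0.1
prediction = np.full_like(y, y.mean())   # round 0: just predict the mean
for _ in range(100):
    residuals = y - prediction           # the failures of the current ensemble
    t, lv, rv = fit_stump(X, residuals)  # each new stump is trained on failure
    prediction += learning_rate * np.where(X <= t, lv, rv)

mse = np.mean((y - prediction) ** 2)     # shrinks round after round
```

Each stump only ever sees what the ensemble so far got wrong, which is exactly the “focus on failure” described above.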
The same applies to life: If you want to improve, focus on areas where you fail, but weight them by your successes. Do you want to improve your writing? Listen to the critics. One critic asks for more references to unicorns, while everyone else is happy with the number of unicorn references? Maybe dismiss that criticism. Half of your audience requests more unicorns? Focus on unicorns.
Steer your career with stochastic gradient descent
Well, hopefully, your career is more of a stochastic gradient ascent, but fortunately, maximization and minimization are just a multiplication by −1 apart.
Anyways. Why does machine learning use stochastic gradient descent? Wouldn’t it be great to just force-feed neural networks with all data at once and find a globally optimal solution? We might even get more statisticians on board with deep learning if we could make that work. But unfortunately, that doesn’t work.
Stochastic gradient descent helps the model escape local minima and makes the model training manageable in the first place when it comes to memory constraints and computational efficiency.
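In case it has been a while: stochastic gradient descent means taking small, noisy downhill steps based on random mini-batches instead of the full dataset. A minimal sketch in NumPy, fitting a linear model (the data, learning rate, and batch size are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(0, 0.1, size=1000)

w = np.zeros(3)
learning_rate, batch_size = 0.05, 32
for step in range(2000):
    idx = rng.integers(0, len(X), size=batch_size)  # random mini-batch: the noise
    error = X[idx] @ w - y[idx]
    grad = 2 * X[idx].T @ error / batch_size        # gradient of the batch MSE
    w -= learning_rate * grad                       # one small, noisy downhill step
```

Each step looks at only 32 of the 1,000 data points, so every update is a bit wrong, yet the parameters still drift toward the true values. The noise is a feature, not a bug.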
The same goes for your career. You could stubbornly plan out your ideal career, one that ends with you being the CEO of Google, each milestone carefully mapped. But the truth is that you don’t know whether this strategy leads to your optimal career: it’s neither guaranteed that you become CEO nor that this is the best career for you. You might just get stuck in a local optimum consisting of free lunches and endless performance reviews.
What does a stochastic gradient ascent career look like? You follow a general direction that you want your career to develop in, but you also embrace a bit of noise in the process to kick you out of local optima. What do I mean by noise? LinkedIn recruiters, for example. Maybe accept an interview here and there.
My current stochastic gradient step: I got involved in a rather complex water supply forecasting competition (with ML). I found it by chance when someone mentioned it on Twitter. I embraced the randomness and now find myself looking for more hydrology-related projects.
Put your trust in tight feedback loops
The very essence of supervised machine learning: predictions are compared to the ground truth. Training a machine learning model is essentially about creating a feedback loop between the predictor and this ground truth. That feedback loop is the basis of what makes (supervised) machine learning successful.
We can apply that same lens to people and their tasks. You can learn to separate signals from noise, learn underlying patterns that generalize to new data, and compress knowledge. But only if there is a brutally honest and tight feedback loop that connects the expert’s prediction/decision with the actual outcome. Things that come to mind: anesthesiologists, having a conversation in a foreign language, tracking progress in the gym, prediction markets, juggling, salespeople, and baking bread.
This advice also has an inverse to it: Don’t blindly trust people who have no feedback. You know, writers like me who are just making things up on the go. Don’t trust my words blindly. But more seriously: One of the reasons I write this newsletter every week is to create a feedback loop that helps me with book writing.
Don’t overfit when buying stuff online
Having less information to make a decision is sometimes better. In machine learning, it’s often good to have a feature selection step, at least for tabular data, so that only relevant features are presented to the ML algorithm. This lets the model train on more signal and less noise and gives it less opportunity to overfit the training data. Regularization, likewise, produces models that are less swayed by extreme data points and spurious features.
Now let’s talk about your buying behavior on Amazon. Have you ever fallen into the trap of over-researching buying decisions for items that cost $5 or even less? I certainly have. If you want to buy a pen, you shouldn’t get lost in the weeds because a random Amazon reviewer wrote a memoir about how the pen stopped working and what all of this has to do with her uncle whom she hasn’t been in touch with for over 20 years. Knowing that the pen has >4 stars, >100 reviews, and a price <$5 should have been enough information for the decision.
Live a biased life
The No Free Lunch (NFL) theorem in machine learning says that there is no best predictor when we average performance across all possible problems: averaged that way, all predictors perform equally well.
We have to look at this theorem three times to see its beauty.
At first look, it has a nihilistic shine. Does the No Free Lunch theorem tell us we can’t learn anything? Can’t we even beat the most nonsensical predictors?
The second look makes the NFL seem like a boring object that only philosophers with too much time on their hands get excited about. If we want to predict whether the sun will rise tomorrow, “all possible problems” include universes in which the sun is replaced by a large energy-saving LED, or someone accidentally unplugs our simulation, or the sun is a German employee with a certain number of contractually agreed holidays. Based on experience, we don’t live in such a wild universe and there are regularities we can rely on, like physical laws, AI influencers hyping each new LLM release, and the sun’s constant hustle.
It’s the third glance that makes one see the beautiful implications that the NFL offers. Nothing about the future is guaranteed. We only get to reach into the future by making assumptions about the state of the world. We must assume that some regularities of the past will hold in the future. We have to take an inductive leap.
To act means to make assumptions as well. You can’t know for sure if they were the right ones until after the fact. Don’t be afraid to have some inductive biases. Work out your theories about life and the universe. Live your life with conviction. Maybe you will get some lunch in return.