Beyond Glorified Curve Fitting: Exploring the Probabilistic Foundations of Machine Learning
An introduction to probabilistic thinking — and why it’s the foundation for robust and explainable AI systems. The post Beyond Glorified Curve Fitting: Exploring the Probabilistic Foundations of Machine Learning appeared first on Towards Data Science.

Your instinct? Stop reading.
Don’t.
That’s exactly what I told myself when I started reading Probabilistic Machine Learning – An Introduction by Kevin P. Murphy.
And it was absolutely worth it.
It changed how I think about machine learning.
Sure, some formulas might look complicated at first glance.
But if we look at the formula closely, what it describes is actually simple.
When a machine learning model makes a prediction (for example, a classification), what is it really doing?
It’s distributing probabilities across all possible outcomes / classes.
And those probabilities must always add up to 100 % — or 1.
Let’s take a look at an example: Imagine we show the model an image of an animal and ask: “What animal is this?”
The model might respond:
- Cat: 85%
- Dog: 10%
- Fox: 5%
Add them up?
Exactly 100%.
This means the model believes it’s most likely a cat — but it’s also leaving a small chance for dog or fox.
This simple formula reminds us that machine learning models can not only give us an answer (“It’s a cat!”), but also reveal how confident they are in their prediction.
And we can use this uncertainty to make better decisions.
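The idea above can be sketched in a few lines of Python. The scores below are made-up logits for the hypothetical cat/dog/fox classifier; softmax is the standard way such raw scores are turned into probabilities that sum to 1.

```python
import math

def softmax(scores):
    """Convert raw model scores (logits) into probabilities that sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for the classes cat, dog, fox (invented for illustration)
labels = ["cat", "dog", "fox"]
probs = softmax([3.0, 0.86, 0.17])

for label, p in zip(labels, probs):
    print(f"{label}: {p:.0%}")

print("sum:", sum(probs))  # always 1 (up to floating-point rounding)
```

However the logits shift, the outputs remain a valid probability distribution, which is exactly the property the cat/dog/fox example relies on.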
Table of Contents
1 What does machine learning from a probabilistic view mean?
2 So, what is supervised learning?
3 So, what is unsupervised learning?
4 So, and what is reinforcement learning?
5 From a mathematical perspective: What are we actually learning?
Final Thought — What’s the point of understanding the probabilistic view anyway?
Where Can You Continue Learning?
What does machine learning from a probabilistic view mean?
Tom Mitchell, an American computer scientist, defines machine learning as follows:
> A computer program is said to learn from experience E with respect to some class of tasks T, and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
Let’s break this down:
- T (Task): The task to be solved, such as classifying images or forecasting how much electricity needs to be purchased.
- E (Experience): The experience the model learns from. For example, training data such as images or past electricity purchases versus actual consumption.
- P (Performance Measure): The metric used to evaluate performance, such as accuracy, error rate or mean squared error (MSE).
Where does the probabilistic view come in?
In classical machine learning, a value is often simply predicted:
> “The house price is 317k CHF.”
The probabilistic view, however, focuses on learning probability distributions.
Instead of generating fixed predictions, we are interested in how likely the different outcomes (in this example, prices) are.
Everything that is uncertain — outputs, parameters, predictions — is treated as a random variable.
In the case of a house price, uncertainty is less critical: there is still room for negotiation, and risks can be mitigated through mechanisms like insurance.
But let’s now look at an example where it is really crucial for good decisions that the uncertainty is explicitly modelled:
Imagine an energy supplier who needs to decide today how much electricity to buy.
The uncertainty lies in the fact that energy demand depends on many factors: temperature, weather, the economic situation, industrial production, self-production through photovoltaic systems, and so on. All of these are uncertain variables.
And where does probability help us now?
If we rely only on a single best estimate, we risk either:
- that we have too much energy (leading to costly overproduction).
- that we have too little energy (causing a supply gap).
With a probability calculation, on the other hand, we can plan that there is a 95% probability that demand will remain below 850 MWh, for example. And this, in turn, allows us to calculate the safety buffer correctly — not based on a single point prediction, but on the entire range of possible outcomes.
If we have to make an optimal decision under uncertainty, this is only possible if we explicitly model the uncertainty.
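As a rough sketch, suppose the probabilistic forecast can be summarised as a normal distribution over tomorrow's demand (the mean and standard deviation below are invented for illustration). Then the 95% purchase level is simply a quantile of that distribution:

```python
from statistics import NormalDist

# Hypothetical probabilistic forecast: tomorrow's demand ~ Normal(780, 43), in MWh
forecast = NormalDist(mu=780, sigma=43)

# Point prediction alone: buy the mean and risk a shortfall about half the time.
point_estimate = forecast.mean

# Probabilistic planning: buy enough to cover demand with 95% probability.
purchase_95 = forecast.inv_cdf(0.95)

print(f"point estimate: {point_estimate:.0f} MWh")
print(f"95% coverage:   {purchase_95:.0f} MWh")
print(f"safety buffer:  {purchase_95 - point_estimate:.0f} MWh")
```

The safety buffer falls straight out of the distribution's spread; with a point forecast alone there is nothing to compute it from.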
Why is this important?
- Making better decisions under uncertainty:
If our model understands uncertainty, we can better weigh risks. For example, in credit scoring, a customer labelled as an ‘unsafe customer’ could trigger additional verification steps.
- Increasing trust and interpretability:
For us humans, probabilities are more tangible than rigid point predictions. Probabilistic outputs help stakeholders understand not only what a model predicts, but also how confident it is in its predictions.
To understand why the probabilistic view is so powerful, we need to look at how machines actually learn (Supervised Learning, Unsupervised Learning or Reinforcement Learning). So, this is next.
Many machine learning models are deterministic — but the world is uncertain.
So, what is supervised learning?
In simple terms, Supervised Learning means that we have examples — and for each example, we know what it means.
For instance:
> If you see this picture (input x), then the flower is called Setosa (output y).
The aim is to find a rule that makes good predictions for new, unseen inputs. Typical examples of supervised learning tasks are classification or regression.
What does the probabilistic view add?
The probabilistic view reminds us that there is no absolute certainty in the real world.
In the real world, nothing is perfectly predictable.
- Sometimes information is missing — this is known as epistemic uncertainty.
- Sometimes the world is inherently random — this is known as aleatoric uncertainty.
Therefore, instead of working with a single ‘fixed answer’, probabilistic models work with probabilities:
> “The model is 95% certain that this is a Setosa.”
This way, the model does not just guess, but also expresses how confident it is.
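That expressed confidence can drive decisions directly. A minimal sketch of this idea (the 0.90 threshold and the probabilities are invented for illustration): act on the prediction only when the model is confident enough, otherwise defer the case.

```python
def decide(probs, labels, threshold=0.90):
    """Return the top label only if the model is confident enough;
    otherwise defer the case to a human reviewer."""
    best_p = max(probs)
    best_label = labels[probs.index(best_p)]
    if best_p >= threshold:
        return best_label
    return "defer to human review"

labels = ["Setosa", "Versicolor", "Virginica"]
print(decide([0.95, 0.03, 0.02], labels))  # confident prediction
print(decide([0.55, 0.30, 0.15], labels))  # uncertain, so deferred
```

A deterministic classifier that only outputs a label cannot support this kind of triage; the probability vector is what makes the threshold meaningful.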
And what about the No Free Lunch Theorem?
In machine learning, there is no single “best method” that works for every problem.
The No Free Lunch Theorem tells us:
> If an algorithm performs particularly well on a certain type of task, it will perform worse on other types of tasks.
Why is that?
Because every algorithm makes assumptions about the world. These assumptions help in some situations — and hurt in others.
Or as George Box famously said:
> All models are wrong, but some models are useful.
Supervised learning as “glorified curve fitting”
J. Pearl describes supervised learning as ‘glorified curve fitting’.
What he meant is that supervised learning is, at its core, about connecting known points (x, y) as smoothly as possible — like drawing a clever curve through data.
In contrast, Unsupervised Learning is about making sense of the data without any labels — trying to understand the underlying structure without a predetermined target.
So, what is unsupervised learning?
Unsupervised learning means that the model receives data — but no explanations or labels.
For example:
When the model sees an image (input x), it is not told whether it shows a Setosa, Versicolor or Virginica.
The model has to find out for itself whether there are groups, patterns or structures in the data. A typical example of unsupervised learning is clustering.
The aim is therefore not to learn a fixed rule, but to better understand the hidden structure of the world.
How does the probabilistic view help us here?
We are not trying to say:
> “This picture is definitely a Setosa.”
but rather:
> “What structures or patterns are probably hidden in the data?”
Probabilistic thinking allows us to capture uncertainty and diversity in possible explanations. Instead of forcing a hard classification, we model possibilities.
Why do we need unsupervised learning?
Sometimes there are no labels for the data — or they would be very expensive or difficult to collect (e.g. medical diagnoses).
Sometimes the categories are not clearly defined (for example, when exactly an action starts and when it is finished).
Or sometimes the task of the model is to discover patterns that we do not yet recognise ourselves.
Let’s look at an example:
Imagine we have a collection of animal images — but we don’t tell the model which animal is shown.
The task: the model should group similar animals together, purely based on patterns it can detect.
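This probabilistic view of clustering can be sketched with a tiny hand-rolled two-component Gaussian mixture fitted with EM (the one-dimensional “animal size” data is invented for illustration). Instead of hard labels, every point gets soft membership probabilities, and the cluster parameters are re-estimated from those:

```python
import math
import random

# Toy 1-D data: two unlabelled groups of hypothetical animal "sizes"
random.seed(0)
data = [random.gauss(5, 1) for _ in range(50)] + [random.gauss(15, 1) for _ in range(50)]

def gauss_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std sigma."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# EM for a two-component Gaussian mixture: clustering with soft assignments
mus = [min(data), max(data)]  # crude initialisation at the data extremes
sigmas = [1.0, 1.0]
weights = [0.5, 0.5]

for _ in range(30):
    # E-step: responsibilities p(cluster k | x) for every point
    resp = []
    for x in data:
        p = [weights[k] * gauss_pdf(x, mus[k], sigmas[k]) for k in range(2)]
        total = sum(p)
        resp.append([pk / total for pk in p])
    # M-step: re-estimate parameters from the soft assignments
    for k in range(2):
        nk = sum(r[k] for r in resp)
        mus[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
        sigmas[k] = max(1e-3, (sum(r[k] * (x - mus[k]) ** 2 for r, x in zip(resp, data)) / nk) ** 0.5)
        weights[k] = nk / len(data)

print("estimated cluster means:", sorted(round(m, 1) for m in mus))
```

No label ever enters the loop: the model recovers the two groups from the shape of p(x) alone, which is exactly what “learning the structure of the data” means here.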
So, and what is reinforcement learning?
Reinforcement learning means that a system learns from experience by acting and receiving feedback about whether its actions were good or bad.
In other words:
- The system sees a situation (input x).
- The system selects an action (a).
- The system receives a reward or punishment.
In simple words, it’s actually similar to how we train a dog.
Let’s take a look at an example:
A robot is trying to learn how to walk. It tries out various movements. If the robot falls over, it learns that the action was bad. If the robot manages a few steps, it gets a positive reward.
Behind the scenes, the robot builds a strategy or a rule called a policy π(x):
> “In situation x, choose action a.”
Initially, these rules are purely random or very bad. The robot is in the exploration phase to find out what works and what does not. Through each experience (e.g. falling or walking), the robot receives feedback (rewards) such as +1 point for standing upright, -10 points for falling over.
Over time, the robot adjusts its policy to prefer actions that lead to higher cumulative rewards. It changes its rule π(x) to make more out of good experiences and avoid bad experiences.
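This update loop can be sketched with a minimal two-armed bandit in plain Python (the reward probabilities are invented for illustration). The agent explores occasionally and otherwise prefers the action with the higher estimated value, so good experiences gradually reshape its policy:

```python
import random

random.seed(42)

# Hypothetical two-action problem: action 1 pays off more often than action 0
def reward(action):
    return 1.0 if random.random() < (0.3 if action == 0 else 0.7) else 0.0

q = [0.0, 0.0]     # estimated value of each action, learned from experience
counts = [0, 0]
epsilon = 0.1      # exploration rate

for step in range(2000):
    # Explore occasionally, otherwise exploit the current best estimate
    if random.random() < epsilon:
        action = random.randrange(2)
    else:
        action = 0 if q[0] > q[1] else 1
    r = reward(action)
    counts[action] += 1
    q[action] += (r - q[action]) / counts[action]  # incremental mean update

print("learned action values:", [round(v, 2) for v in q])
```

After training, the greedy policy picks action 1, the action with the higher expected reward, even though no single step ever told the agent which action was “correct”.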
What is the robot’s goal?
The robot wants to find actions that bring the highest reward over time (e.g. staying upright, moving forwards).
Mathematically, the robot tries to maximise its expected future reward.
How does the probabilistic view help us?
The system (in this example the robot) often does not know exactly which of its many actions has led to the reward. This means that it has to learn under uncertainty which strategies (policies) are good.
In reinforcement learning, we are therefore trying to learn a policy:
π(x)
This policy defines which action the system should perform in each situation to maximise rewards over time.
Why is reinforcement learning so fascinating?
Reinforcement learning mirrors the way humans and animals learn.
It is perfect for tasks where there are no clear examples, but where improvement comes through experience.
AlphaGo’s breakthrough (and the film about it) was built on reinforcement learning.
From a mathematical perspective: What are we actually learning?
When we talk about a model in machine learning, we mean more than just a function in the probabilistic view.
A model is a distributional assumption about the world.
Let’s take a look at the classical view:
A model is a function f(x)=y that translates an input into an output.
Let’s now take a look at the probabilistic view:
A model explicitly describes uncertainty — for example in f(x)=p(y∣x).
It is not about providing one “best answer”, but about modelling how likely different answers are.
- In supervised learning, we learn a function that describes the conditional probability p(y|x): the probability of a label y, given an input x.
We ask: “What is the correct answer to this input?”
Formula: f(x) = p(y∣x)
- In unsupervised learning, we learn a function that describes the probability distribution p(x) of the input data: the probability of the data itself, without explicit target values.
We ask: “How probable is this data itself?”
Formula: f(x) = p(x)
- In reinforcement learning, we learn a policy π(x) that determines the optimal action a for a state x: a rule that suggests, for every possible state x, an action a that brings as much reward as possible in the long term.
We ask: “Which action should be carried out now so that the system receives the best reward in the long term?”
Formula: a = π(x)
On my Substack, I regularly write summaries about the published articles in the fields of Tech, Python, Data Science, Machine Learning and AI. If you’re interested, take a look or subscribe.
Final Thought — What’s the point of understanding the probabilistic view, anyway?
In the real world, almost nothing is truly certain.
Uncertainty, incomplete information and randomness characterise every decision we make.
Probabilistic machine learning helps us to deal with exactly that.
Instead of just trying to be “more accurate”, a probabilistic approach is:
- More robust against errors and uncertainties.
For example, in a medical diagnostic system, we want a model that indicates its uncertainty (“it is 60% certain that it is cancer”) instead of making a fixed diagnosis. This way, additional tests can be carried out when uncertainty is high.
- More flexible and therefore more adaptable to new situations.
For example, a model that treats weather data probabilistically can react more easily to new climate conditions because it learns about uncertainties.
- More comprehensible and interpretable, in that models not only give us an answer, but also tell us how certain they are.
For example, in a credit scoring system, we can show stakeholders that the model is 90% certain that a customer is creditworthy. The remaining 10% uncertainty is explicitly communicated, which helps with transparent decisions and risk assessments.
These advantages make probabilistic models more transparent, trustworthy and interpretable systems (instead of black box algorithms).
Where Can You Continue Learning?
- Book — Probabilistic Machine Learning: An Introduction by Kevin P. Murphy
- Book — Probabilistic Machine Learning: Advanced Topics by Kevin P. Murphy
- GeeksForGeeks Blog — Supervised vs Unsupervised vs Reinforcement Learning
- Nvidia Blog — SuperVize Me: What’s the Difference Between Supervised, Unsupervised, Semi-Supervised and Reinforcement Learning?
- DataCamp Blog — Supervised Machine Learning
- DataCamp Blog — Introduction to Unsupervised Learning
- DataCamp Blog — Unsupervised Machine Learning Cheat Sheet