Machine Learning From First Principles Part 1: the Bias vs Variance Trade-Off

Steven Ellis
15 min read · Feb 6, 2024

Introduction to learning from data

At its unveiling in 2010, the US Air Force Research Laboratory’s Condor Cluster ranked as the 33rd-largest supercomputer in the world, with a core made of 1,760 Sony PlayStation 3 consoles. It performed 500 trillion floating-point operations per second, making it the fastest interactive computer in the US Defense Department. The PS3s for the Cluster’s core cost about $2 million, roughly 10% of the cost of an equivalent system built from off-the-shelf computer parts. The Condor Cluster was also very energy efficient, consuming just 10% of the power of comparable supercomputers.

With powerful computing becoming ever more accessible, it behoves the enthusiast to take a step back and ask: what are computers most useful for?

Well, it turns out that computers are excellent at two things: solving deterministic problems and learning from data.

Wikipedia defines a deterministic algorithm as “an algorithm that, given a particular input, will always produce the same output, with the underlying machine always passing through the same sequence of states”.

Some examples of deterministic algorithmic problems are

  • classifying numbers into primes and non-primes
  • determining the time it takes for a falling object to hit the ground
  • calculating Pi to the millionth decimal

One can analytically derive a solution (or an attempt at a solution) for such problems without the need to see any data, simply by following a set of steps based on clear pre-defined laws of maths and physics.

Certain problems, however, do not have a well-defined set of steps or algorithms that can be followed to calculate an outcome or solution. Imagine any of the following:

  • Building a movie recommendation engine based on movies a user has previously seen
  • Determining whether to extend credit to a user who wishes to purchase a motor vehicle
  • Determining the next word in a sentence, for example: ‘The cat sat on the…’

In this set of problems, the output is not always uniquely determined by the inputs alone. For example, in the credit application problem, two customers could have identical salaries, outstanding loans, ages, etc., but still end up with different credit behaviours.

Such scenarios are difficult or impossible to explicitly model into an analytic solution. Nonetheless these problems are often not intractable: computers can learn from past data to come up with an empirical hypothesis, which can be used to make informed predictions against unseen data. Financial forecasting, computer vision, and large language models (LLMs) are a few examples of applications that successfully utilise learning from data in this way.

The Components of learning from data

With the learning paradigm broadly defined, let’s condense the main components of the learning problem into the following set of names and symbols:

1. There is an input x

2. There is an unknown target function f : X → Y that we hope to approximate through learning, where

  • X is the input space (all possible inputs x)
  • Y is the output space (all possible outputs y)

3. There is a data set D of input-output examples (x1, y1), (x2, y2), …, (xN, yN), called data points, where yn = f(xn) for n = 1, …, N

4. Finally, there is the learning algorithm that uses the data set D to pick a formula g : X → Y that approximates f

The learning algorithm chooses g from a set of candidate formulas, called the hypothesis set H

H could, for example, be the set of all linear functions from which the algorithm would choose the best linear fit to the data. H thus consists of multiple hypotheses h1, h2, … hM
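To make these symbols concrete before the credit example, here is a minimal R sketch (a toy of my own, with an arbitrary target function, noise level and sample size) in which an unknown f generates a data set D, H is the set of all straight lines, and least squares acts as the learning algorithm that picks the final hypothesis g:

```r
# Toy illustration of the learning components: f, D, H and g
set.seed(42)

f <- function(x) 2 * x + 1          # the 'unknown' target function (known only to us here)
N <- 20                             # number of data points in D
x <- runif(N, -1, 1)                # inputs drawn from the input space X = [-1, 1]
y <- f(x) + rnorm(N, sd = 0.2)      # noisy outputs in the output space Y

D <- data.frame(x = x, y = y)       # the data set D of input-output examples

# H is the set of all lines h(x) = a*x + b; least squares is the learning
# algorithm that selects the final hypothesis g from H
g <- lm(y ~ x, data = D)

# g can now be used to predict outputs for unseen inputs
predict(g, newdata = data.frame(x = 0.5))
```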

Let’s see a practical example of these theoretical constructs in a credit approval application. For a learning algorithm that automates credit approval decisions, the names and symbols described above would be as follows:



+----------------+---------------------------------------------------------------------------------------------------+
| Component      | Example                                                                                           |
+----------------+---------------------------------------------------------------------------------------------------+
| Input (x)      | The information in a single credit application (salary, outstanding debt, etc.)                  |
| Input space X  | All possible credit applications                                                                  |
| Output space Y | A yes/no decision (whether or not to approve credit)                                              |
| f : X -> Y     | The ideal formula for credit approval                                                             |
| D              | Historical records of previous customers, with the correct credit decision for them in hindsight |
| H              | The set of candidate formulas that are trained using training examples from D                     |
| g : X -> Y     | The final hypothesis that the bank chooses from H, used to decide which credit applications to approve |
+----------------+---------------------------------------------------------------------------------------------------+

Learning = approximating AND generalising well

Key to the concept of a learning model is that the target function f (the ideal function that determines the outcome of our non-deterministic problem) is unknown to us and can, at best, only be approximated by g.

Moreover, g (the formula used for decision making) is only effective to the extent that it faithfully replicates f on data it hasn’t seen yet. To achieve this, g must generalise well, so that it approximates f when run against unseen data.

Think of it like studying for a maths exam. To prepare, you might do some exercises in your maths textbook and work through a bunch of past exam papers, comparing your answers against the model answers provided. How well you have learnt the content from the training data set (the textbook exercises and past papers) is determined by how well you do in the exam, when you are confronted with maths questions you haven’t yet seen. If you try to learn the answers to questions “off by heart”, you would have memorised the content but would not have learnt from the data, because you would not be able to generalise to unseen questions.

Our choice of H (from which our candidate hypothesis g is chosen) thus needs to strike a balance between approximating f on the training data D and generalising to new data.

So the question then becomes: can we generalise from the learning process?

The answer is: yes, but in a probabilistic way. To help us understand why, let’s consider a simple hypothetical bin scenario.

The probability of the truth will set you free

Imagine a bin that contains a very large (perhaps infinite) collection of red and green marbles.

The probability of picking a red marble from this collection, given by the symbol μ, is unknown to us (this equates to the unknown target function f).

We pick a marble randomly N times from the bin, each time taking note of the colour and replacing the marble in the bin.

We take note of the fraction of red marbles within our sample, given by the symbol v (this equates to our candidate hypothesis g).

Whilst we will never know the true underlying proportion of red marbles in the collection, as our sample becomes large, the probability of our sample fraction v differing significantly from the population value μ becomes very small.

For example, assume our bin proportion μ is 0.9 (9 out of 10 marbles in the bin are red), and we randomly select 10 marbles. The probability of our sample having a sample fraction v of 0.1 (9 of our marbles being green and only 1 red) is roughly 9 × 10⁻⁹. Observing 9 green marbles in a sample of 10 is possible, but not at all probable.
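As a quick sanity check, that probability can be read straight off the binomial distribution; a one-line R calculation (assuming independent draws, each with a 0.9 chance of being red):

```r
# Probability of drawing exactly 1 red (and 9 green) marbles in 10 draws
# when the true proportion of red marbles is 0.9
dbinom(1, size = 10, prob = 0.9)
#> [1] 9e-09

# Probability of a sample fraction at least this far below 0.9 (i.e. 0 or 1 red)
pbinom(1, size = 10, prob = 0.9)
#> [1] 9.1e-09
```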

This is formalised in the law of large numbers where, given a sample of independent and identically distributed values, the sample mean converges to the true mean.

How much divergence from the true mean we are comfortable with (how closely g must approximate f) dictates the sample size N needed in the learning process. If we are comfortable with a larger divergence (g sometimes being wrong), N can be smaller. If we want to be probabilistically very close to f (g very often being right), N must be large.

A result called the Hoeffding Inequality (a quantitative counterpart of the law of large numbers) tells us that, as the sample size N grows, it becomes exponentially unlikely that v will deviate from μ by more than a chosen tolerance for error.

If we choose a very small error tolerance, we need a larger sample size N to achieve a good approximation of μ. We can then confidently assert that v will be a good approximation of μ, as long as the marbles are picked randomly and independently.
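Formally, for any tolerance ε > 0, the Hoeffding Inequality states that P(|v − μ| > ε) ≤ 2·e^(−2ε²N). The short R simulation below (an illustrative sketch; μ = 0.9 and ε = 0.05 are arbitrary choices) compares the empirical frequency of large deviations with this bound as N grows:

```r
# Empirical check of the Hoeffding bound P(|v - mu| > eps) <= 2 * exp(-2 * eps^2 * N)
set.seed(1)
mu     <- 0.9     # true fraction of red marbles in the bin
eps    <- 0.05    # tolerance for the deviation |v - mu|
trials <- 100000  # number of repeated sampling experiments

for (N in c(10, 100, 1000)) {
  # v is the fraction of red marbles in each sample of N draws
  v <- rbinom(trials, size = N, prob = mu) / N
  empirical <- mean(abs(v - mu) > eps)
  bound     <- 2 * exp(-2 * eps^2 * N)  # loose (even vacuous) for small N, but shrinks exponentially
  cat(sprintf("N = %4d  empirical: %.4f  Hoeffding bound: %.4f\n", N, empirical, bound))
}
```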

This is a powerful idea because it tells us that, regardless of the size of the bin, we can probabilistically bound our calculated v to be very close to μ. The bin can be large or even infinite, and we will still get the same bound for the same sample size.

The bin scenario is a somewhat simplified analogy of the learning paradigm, because we have a single hypothesis (the sample fraction v). In real learning, we explore an entire hypothesis set H looking for some h ∈ H that has a small error rate. (Having one fixed hypothesis is not learning, but verifying.) If we extend the bin model (with its single hypothesis) to the learning model (with an entire hypothesis set H), a suitably modified Hoeffding Inequality still applies, and tells us that learning from data can allow us to make predictions outside D within a probabilistic bound. The size of this probabilistic bound depends on the complexity of our model, the complexity of the unknown target function f, and the size of the training data. We’ll see some practical examples of this a bit later.
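For a finite hypothesis set containing M hypotheses, the modified bound derived in ‘Learning From Data’ takes the following form for the final hypothesis g chosen from H (here Eᵢₙ(g) is g’s error rate on the sample, the analogue of v, and Eₒᵤₜ(g) is its true error rate, the analogue of μ; both are defined properly in the next section):

$$
\mathbb{P}\big[\,\lvert E_{\text{in}}(g) - E_{\text{out}}(g)\rvert > \epsilon\,\big] \;\le\; 2M\,e^{-2\epsilon^{2}N}
$$

The factor of M is the price paid for choosing among many hypotheses; the VC generalisation bound discussed later replaces M with a quantity that captures model complexity even when the hypothesis set is infinite.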

Measuring error to gauge the effectiveness of learning

We now introduce the concept of error, which allows us to measure the ability of a model to learn from data.

Learning, as we’ve mentioned, is not expected to replicate the target function perfectly — the final function g is only an approximation of f. To quantify how well g approximates f, there are two measures of error:

When training a model, the error rate within the sample (v in the bin model) is called the in-sample error, Eᵢₙ(g). The in-sample error, like v, is a random variable that depends on the sample.

The error in the performance of the hypothesis on unseen data is called the out-of-sample error, Eₒᵤₜ(g). The out-of-sample error, like μ, is unknown but not random.

In the practical learning model, Eᵢₙ is calculated using training data (from the data set D), and Eₒᵤₜ is typically estimated using unseen test data that has been set aside, against which the final hypothesis g is evaluated.
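As a concrete sketch of that split (using synthetic data and a plain logistic regression model, both invented purely for illustration), Eᵢₙ and an estimate of Eₒᵤₜ might be computed like this in R:

```r
# Illustrative estimate of in-sample and out-of-sample error using a held-out test set
set.seed(7)
n  <- 500
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- as.integer(x1 + x2 + rnorm(n, sd = 0.5) > 0)   # synthetic yes/no target
data <- data.frame(x1 = x1, x2 = x2, y = y)

# Put 20% of the data aside as an unseen test set
test_idx <- sample(n, size = 0.2 * n)
train <- data[-test_idx, ]
test  <- data[test_idx, ]

# The learning algorithm picks g from H (here, logistic regression)
g <- glm(y ~ x1 + x2, data = train, family = binomial)

# E_in: error rate on the training data
E_in  <- mean((predict(g, train, type = "response") > 0.5) != train$y)
# Estimate of E_out: error rate on the held-out test data
E_out <- mean((predict(g, test,  type = "response") > 0.5) != test$y)

c(E_in = E_in, E_out = E_out)
```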

If learning is successful, g should approximate f well, meaning Eₒᵤₜ(g) ≈ 0 (the candidate hypothesis performs well against unseen test data).

Recall that the Hoeffding Inequality tells us that if a learning model generalises well, then Eₒᵤₜ(g) ≈ Eᵢₙ(g), because the two are probabilistically bound to one another. Thus, in the process of learning from data, we aim for:

  1. Eᵢₙ(g) ≈ 0 (g approximating f well), and
  2. Eₒᵤₜ(g) as close as possible to Eᵢₙ(g) (g generalising well to unseen data)

If we can achieve these two objectives, we can confidently conclude that Eₒᵤₜ(g) ≈ 0, and that we have learnt from data, because our Eᵢₙ (which we can ascertain) becomes a proxy for Eₒᵤₜ (which we cannot ascertain).

The feasibility of learning can be condensed to two core questions:

  • Can we make Eᵢₙ(g) small enough?
  • Can we make sure Eₒᵤₜ(g) is close enough to Eᵢₙ(g)?

Effective learning comes down to a trade-off

Theory tells us that the more parameters a learning model has (the more complex it is), the more diverse its hypothesis set, and the more data is needed to achieve a final hypothesis that generalises well on unseen data. This is formalised in a theorem called the VC Generalisation Bound. As models become more complex, we are likely to fit the training data better, resulting in a lower in-sample error Eᵢₙ, but we pay a penalty for this model complexity (it requires more data to learn properly), which can result in worse performance on the test data, i.e. a higher out-of-sample error Eₒᵤₜ.
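For reference, one common statement of the VC generalisation bound (the form given in ‘Learning From Data’, where m_H is the growth function of the hypothesis set and 1 − δ is the desired confidence) is:

$$
E_{\text{out}}(g) \;\le\; E_{\text{in}}(g) + \sqrt{\frac{8}{N}\,\ln\frac{4\,m_{\mathcal{H}}(2N)}{\delta}} \qquad \text{with probability at least } 1-\delta
$$

The square-root term grows with the complexity of H (through the growth function) and shrinks with the amount of data N, which is exactly the penalty described above.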

This is sometimes colloquially termed ‘overfitting the data’: a hypothesis set that is too complex can start to “learn the noise”, memorising patterns of irrelevant, random information within the training data set and thereby becoming ineffective at generalising to unseen data, which has its own random, unpredictable noise that can obscure the true underlying target distribution.

Simpler models (with fewer parameters and less complexity) therefore generalise more reliably, because a reasonable amount of training data will ensure that Eᵢₙ is close to Eₒᵤₜ for such models. But this represents a complexity trade-off: if H is too simple, we may fail to approximate f well and end up with a large in-sample error; if H is too complex, we may fail to generalise well, because the model complexity pushes up Eₒᵤₜ.

Simpler models hurt Eᵢₙ. More complex models help Eᵢₙ but hurt Eₒᵤₜ. The optimal model is a compromise that minimises a combination of the two, striking a balance between approximating f on the training data and generalising well on new data.

Bias versus variance defined

And so, finally, we arrive at the terms bias and variance, the main theme of this post. Both are theoretical constructs, and both are used to measure the approximation-generalisation trade-off described above.

Bias is a concept that measures how much the average function that we would learn using different data sets differs from the target function that generated those data sets. In other words, it is a measure of how much our learning model is biased away from the target function.

In theory, we have the benefit of being able to learn from an unlimited number of data sets, so the model’s average function is only limited in its ability to approximate f by the limitation of the learning model itself. Bias is thus a measure of the simplicity of a model.

Variance measures the variation in the final hypothesis depending on the data set: in other words, how much the hypothesis learned from a particular data set varies around the average hypothesis (the average of the final hypotheses learned from many different data sets).

One can think of variance as a measure of the instability of the learning model. Instability manifests in wild reactions to small variations or idiosyncrasies in the data, resulting in vastly different hypotheses. Variance is thus a measure of the complexity of a model.

Out-of-sample error is composed of both bias and variance. For a very simple model (say, one with just a single hypothesis), variance is 0, because the final hypothesis g(D) will be the same for any data set. The bias will depend solely on how well that single hypothesis approximates the target function f, and unless we are extremely lucky, the bias will be large.

For a large, complex hypothesis set H, different data sets will lead to different hypotheses, each of which agrees with f on its own data set or approximates f well there. Thus bias ≈ 0, because the average hypothesis is likely to be close to f, but the variance is large.
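The decomposition behind these statements (as presented in ‘Learning From Data’ for squared error and a noiseless target, where g⁽ᴰ⁾ is the hypothesis learned from data set D and ḡ is the average of these hypotheses over many data sets) is:

$$
\mathbb{E}_{D}\!\left[E_{\text{out}}\!\left(g^{(D)}\right)\right]
= \underbrace{\mathbb{E}_{x}\!\left[\big(\bar{g}(x) - f(x)\big)^{2}\right]}_{\text{bias}}
+ \underbrace{\mathbb{E}_{x}\!\left[\mathbb{E}_{D}\!\left[\big(g^{(D)}(x) - \bar{g}(x)\big)^{2}\right]\right]}_{\text{variance}}
$$

This noiseless squared-error setting is exactly the one used in the sin(πx) example below.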

An example of bias and variance

Let’s consider a target function f(x) = sin(πx) and a data set of size N = 2.

We sample x uniformly in [-1, 1] to generate a data set (x1,y1), (x2,y2) and fit the data to the following three models:

H₀ : the set of all constant functions of the form h(x) = b

H₁ : the set of all lines of the form h(x) = ax + b

H₂ : the set of all third-degree polynomials of the form h(x) = ax³ + bx² + cx + d

For each hypothesis set, we fit the 2 data points, choosing the hypothesis that best fits the target function on them. Repeating this process over many data sets, we then estimate the bias and variance of each hypothesis set. In the graphs below, the true target function is depicted by the black line, the average hypothesis of each set is depicted by the red line, and the thin blue lines represent the hypothesis fitted on each individual data set.

The R code used to generate the plots below can be found here.
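For readers who want the gist without the full script, here is a minimal sketch of the experiment (my own illustrative reimplementation, not the author’s linked code) that estimates bias and variance for H₀ and H₁ at N = 2:

```r
# Minimal bias-variance experiment for f(x) = sin(pi * x).
# For each of many data sets of size N, fit H0 (constant) and H1 (line),
# then estimate bias and variance on a dense grid of test points.
set.seed(123)
f <- function(x) sin(pi * x)
N <- 2
runs <- 10000
x_test <- seq(-1, 1, length.out = 200)

pred_h0 <- matrix(NA, nrow = runs, ncol = length(x_test))
pred_h1 <- matrix(NA, nrow = runs, ncol = length(x_test))

for (r in 1:runs) {
  x <- runif(N, -1, 1)
  y <- f(x)                               # noiseless targets, as in the example
  pred_h0[r, ] <- mean(y)                 # H0: the best constant is the mean of y
  fit <- lm(y ~ x)                        # H1: least-squares line through the points
  pred_h1[r, ] <- predict(fit, data.frame(x = x_test))
}

bias_var <- function(pred) {
  g_bar <- colMeans(pred)                            # average hypothesis g-bar
  bias  <- mean((g_bar - f(x_test))^2)               # squared deviation of g-bar from f
  var   <- mean(colMeans(sweep(pred, 2, g_bar)^2))   # spread of hypotheses around g-bar
  c(bias = bias, variance = var, total = bias + var)
}

rbind(H0 = bias_var(pred_h0), H1 = bias_var(pred_h1))
```

Exact numbers will differ from run to run (and from the plots below), in particular because a line fitted through two nearly coincident points can have an enormous slope, which inflates the variance estimate for H₁.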

H₀ ; N = 2; bias: 0.5, variance: 0.25; bias + variance: 0.75
H₁ ; N = 2; bias: 0.23, variance: 60.9; bias + variance: 61

Using a small data set of size N = 2, we observe that with H₁ the learned hypothesis is wilder and varies extensively with the data set. For H₁, the average hypothesis (the red line) is a reasonable fit, with a small bias of 0.23. However, the large variability leads to a very large expected out-of-sample error of 61. With the simpler model H₀, the fits are much less volatile, and even though the bias is much higher for the simpler model, the expected out-of-sample error is much smaller than for H₁. The simpler model H₀ wins by significantly decreasing the variance at the expense of a smaller increase in bias.

Now let’s increase the data set size to N = 5 and repeat the learning process (this time also incorporating the model H₂, the third-degree polynomial).

H₀ ; N = 5; bias: 0.5, variance: 0.1; bias + variance: 0.6
H₁ ; N = 5; bias: 0.2, variance: 0.2; bias + variance: 0.4
H₂ ; N = 5; bias: 5.7, variance: 6990; bias + variance: 6996

Increasing the size of our data set to N = 5, we see that the simple hypothesis set H₀ has the same bias as before and only slightly improves its out-of-sample error through a small reduction in variance. This is because the simple hypothesis set does not benefit much from having more data to learn from.

H₁, the linear model, does benefit from an increase in N: because the model is more complex, learning improves with more data. The variance of the model drops significantly, resulting in a sharp improvement in the model’s out-of-sample error.

The third-degree polynomial hypothesis set H₂, however, displays pathological variance across data sets and, as a result, a very high out-of-sample error and an average hypothesis that fits the target function very poorly. The model is too complex to learn from such a small training data set: more training data is needed.

Let’s now test using a data set of size N = 10

H₀ ; N = 10; bias: 0.5, variance: 0.05; bias + variance: 0.56
H₁ ; N = 10; bias: 0.2, variance: 0.06; bias + variance: 0.26
H₂ ; N = 10; bias: 0.005, variance: 0.14; bias + variance: 0.14

Fitting the hypotheses on data sets of size N = 10, we observe that the larger training set makes no significant improvement to the H₀ out-of-sample error; the bias stays constant at 0.5.

H₁ improved its performance by slightly reducing its variance.

H₂ enjoys the biggest performance gain on this larger data set: the extra ‘degrees of freedom’ that this more complex model allows mean that the hypothesis set achieves an extremely low bias and a much improved variance. The more complex model therefore benefits from the larger training data set, enabling it to generalise better.

Finally let’s test using a data set of size N = 20

H₀ ; N = 20; bias: 0.5, variance: 0.02; bias + variance: 0.52
H₁ ; N = 20; bias: 0.2, variance: 0.03; bias + variance: 0.23
H₂ ; N = 20; bias: 0.005, variance: 0.02; bias + variance: 0.02

Against a data set of N = 20, H₀ is virtually unaffected and H₁ sees a slight improvement in variance, but the biggest performance gain goes to H₂, the most complex model, which uses the increased training data to learn further nuances of the target function and generalise even better.

It’s important to note that bias and variance cannot be computed in practice, since they depend on the target function, which is unknown. The bias-variance decomposition is thus a conceptual tool that is helpful when developing a model. The goal is to lower the variance without significantly increasing the bias (and vice versa). Reducing variance, for example, can be achieved through techniques such as regularisation (which will be the topic of a future post). Reducing the bias requires some knowledge of the target function to steer the hypothesis set towards the appropriate set of algorithms during learning, which is somewhat heuristic and application-specific.

References

This post borrows heavily from the e-book ‘Learning From Data’ by Yaser S. Abu-Mostafa, Malik Magdon-Ismail and Hsuan-Tien Lin. For an in-depth and much more technical overview of these principles and the theorems that underpin them, I strongly recommend this book and the associated video lecture series!


Steven Ellis

Computer scientist, currently also pursuing a Masters in Data Science.