![](https://crypto4nerd.com/wp-content/uploads/0mBSas63o6Pds1Ypc.jpeg)
In this article, I’ll explore another key concept in our battle against over/underfitting: the bias-variance trade-off.
One of the main goals of machine learning is to build models that generalize well to data that has not yet been seen.
When building a model, it is important to ensure that it is not so simple that it cannot capture the intricacies of the problem, nor so complex that it overfits the training data.
We'll discuss the trade-off between bias and variance in machine learning and how finding the right balance between the two can improve model performance.
Let’s start by defining what bias and variance are:
Bias: represents the model's prediction error due to its assumptions about the data. For example, a linear model will have high bias if the problem is inherently non-linear.
Colloquially, bias means prejudice: it expresses how strongly the model is inclined to interpret the data in a certain way before it has even seen the data.
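To make this concrete, here is a minimal sketch of a high-bias model (the synthetic data and the use of scikit-learn are my own illustrative choices, not part of the original example): a straight line fit to data generated by a quadratic function underfits no matter how much data it sees.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# Synthetic data: the true relationship is quadratic, not linear
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(scale=0.5, size=200)

# A linear model assumes y is a straight-line function of X;
# that assumption is the source of its bias on this problem
model = LinearRegression().fit(X, y)
print("Training MSE:", mean_squared_error(y, model.predict(X)))
# The error stays high even on the training data: the model
# underfits because its assumptions cannot capture the curve
```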
Variance: represents the model's prediction error due to its sensitivity to the particular training data it was given. For example, a model that fits the training data extremely well may have high variance and fail to generalize to unseen data.
A model with high variance fits the training data very closely, but its predictions change drastically when the training set changes, so it generalizes poorly to unseen data.
For example, let’s say you have a regression model that predicts the price of homes based on several factors like size, location, number of bedrooms, etc.
If the model has high variance, this means that it fits the training data too tightly and may predict wildly wrong prices for homes that are not in the training set.
On the other hand, a model that is too simple has low variance but high bias: it cannot capture the complexities of the problem, so it performs poorly not only on new data but on the training data as well.
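As a hedged illustration (synthetic data standing in for the house-price example), the sketch below fits a very flexible polynomial to a handful of points: the training error collapses to nearly zero while the test error stays large, which is the signature of high variance.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)

def sample(n):
    X = rng.uniform(0, 1, size=(n, 1))
    y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(scale=0.2, size=n)
    return X, y

X_train, y_train = sample(15)   # only a few training points
X_test, y_test = sample(200)    # unseen data

# A degree-12 polynomial is flexible enough to chase the noise
model = make_pipeline(PolynomialFeatures(degree=12), LinearRegression())
model.fit(X_train, y_train)

print("Train MSE:", mean_squared_error(y_train, model.predict(X_train)))
print("Test MSE: ", mean_squared_error(y_test, model.predict(X_test)))
# Near-zero train error but much larger test error: high variance
```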
The main goal of machine learning is to reduce the model's generalization error, which is the error that occurs when the model is applied to unseen data. To do that, you need to strike the right balance between bias and variance.
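For squared-error loss, this balance can be stated precisely with the standard bias-variance decomposition (textbook material, not derived in the original article). Writing $f$ for the true function, $\hat{f}$ for the model trained on a random dataset, and $\sigma^2$ for the irreducible noise:

$$
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
= \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
+ \underbrace{\sigma^2}_{\text{noise}}
$$

Only the first two terms depend on the model, and pushing one down typically pushes the other up.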
Choosing the right model depends on the needs of the problem. Here is a list of useful levers to consider when balancing a model; short code sketches for several of them follow the list.
- Assess model complexity: simple models such as linear regressions tend to have high bias and low variance, while complex models such as neural networks tend to have low bias and high variance (the polynomial sketches above show both extremes).
- Dataset size: increasing the dataset size can reduce the model's variance. With more training data, the model has more information to generalize from and less room to memorize individual points (see the learning-curve sketch below).
- Regularization: regularization is a technique used to control the complexity of a model. For example, L1 (lasso) and L2 (ridge) penalties shrink coefficients and can reduce the model's variance (see the ridge sketch below).
- Cross-validation: cross-validation is a technique used to estimate model performance on unseen data. It helps to detect overfitting and to strike the right balance between bias and variance (see the cross-validation sketch below).
- Feature selection: feature selection is another technique used to control model complexity. Removing irrelevant or redundant features can reduce model variance (see the feature-selection sketch below).
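On dataset size, here is a minimal learning-curve sketch (synthetic data; the model and numbers are illustrative assumptions, not from the article). A fully grown decision tree has high variance, and the gap between its training and validation error shrinks as the training set grows:

```python
import numpy as np
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=500)

# Train a flexible (high-variance) model on growing subsets of data
sizes, train_scores, val_scores = learning_curve(
    DecisionTreeRegressor(), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5,
    scoring="neg_mean_squared_error",
)
for n, tr, va in zip(sizes, -train_scores.mean(axis=1), -val_scores.mean(axis=1)):
    print(f"n={n:3d}  train MSE={tr:.3f}  validation MSE={va:.3f}")
# As n grows, the gap between train and validation error shrinks:
# more data reduces the variance component of the error
```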
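On regularization, a hedged sketch comparing an unregularized degree-12 polynomial with the same model under an L2 (ridge) penalty; the penalty strength alpha=0.01 is an arbitrary illustrative choice, and exact numbers vary with the random seed:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(3)
X = rng.uniform(0, 1, size=(60, 1))
y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(scale=0.2, size=60)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for name, reg in [("no regularization", LinearRegression()),
                  ("L2 ridge, alpha=0.01", Ridge(alpha=0.01))]:
    model = make_pipeline(PolynomialFeatures(degree=12), reg).fit(X_tr, y_tr)
    print(name, "-> test MSE:", mean_squared_error(y_te, model.predict(X_te)))
# The L2 penalty shrinks the polynomial coefficients, trading a
# little extra bias for a reduction in variance
```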
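On cross-validation, a sketch that uses 5-fold cross-validation to compare three model complexities; the degree with the lowest cross-validated error is the one that best balances bias and variance on this synthetic problem:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)
X = rng.uniform(0, 1, size=(80, 1))
y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(scale=0.2, size=80)

# Estimate out-of-sample error for several model complexities
for degree in (1, 3, 12):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    scores = cross_val_score(model, X, y, cv=5,
                             scoring="neg_mean_squared_error")
    print(f"degree={degree:2d}  CV MSE={-scores.mean():.3f}")
# Too low a degree underfits (bias), too high a degree overfits
# (variance); cross-validation exposes both failure modes
```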
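On feature selection, a sketch where only 3 of 33 features are informative (the data-generating process is invented for illustration); selecting the top features with a univariate F-test typically lowers the cross-validated error of a linear model:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(5)
n = 100
X_useful = rng.normal(size=(n, 3))    # 3 informative features
X_noise = rng.normal(size=(n, 30))    # 30 irrelevant features
X = np.hstack([X_useful, X_noise])
y = X_useful @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=n)

for name, model in [
    ("all 33 features", LinearRegression()),
    ("top 3 by F-test", make_pipeline(SelectKBest(f_regression, k=3),
                                      LinearRegression())),
]:
    scores = cross_val_score(model, X, y, cv=5,
                             scoring="neg_mean_squared_error")
    print(f"{name}: CV MSE={-scores.mean():.3f}")
# Dropping irrelevant features shrinks the hypothesis space,
# lowering variance without adding much bias in this setup
```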
It is important to remember that there is no single perfect model for all machine learning problems. It is necessary to carefully evaluate the specific needs of the problem and choose the most suitable model for that context.
For this reason, the analyst should iterate over different candidate models, in a phase called model selection, and pick the best-performing one for the given problem.
Also, it is important to always keep in mind that balancing bias and variance is a continuous and dynamic process. Model performance may vary over time and as new information becomes available. Therefore, it is necessary to constantly monitor the performance of the model and make any adjustments when necessary.
Let’s see how models perform across various levels of bias and variance. Models typically underfit, overfit, or are balanced.
Given a fixed dummy dataset, the graphs show, respectively, what underfitting (high bias, low variance) and overfitting (low bias, high variance) look like, and what a balanced model looks like.
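Since the original figures are not reproduced here, below is a minimal sketch that generates comparable three-panel plots on a dummy dataset; the polynomial degrees 1, 4, and 15 are my own illustrative choices:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(6)
X = np.sort(rng.uniform(0, 1, size=(30, 1)), axis=0)
y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(scale=0.2, size=30)
grid = np.linspace(0, 1, 300).reshape(-1, 1)

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5), sharey=True)
for ax, degree, title in zip(
    axes, (1, 4, 15),
    ("Underfit: high bias", "Balanced", "Overfit: high variance"),
):
    # Fit a polynomial of the given degree to the same dummy data
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X, y)
    ax.scatter(X, y, s=15, label="training data")
    ax.plot(grid, model.predict(grid), color="red", label=f"degree {degree}")
    ax.set_title(title)
    ax.set_ylim(-1.8, 1.8)  # keep the overfit wiggles in frame
    ax.legend()
plt.tight_layout()
plt.show()
```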
Our goal as analysts is to find the right balance such that unseen data is modeled with reasonably low error compared to the ground truth (the truth of the observable world).
This short article joins the following collection of articles that have interpretability and generalization as their central topic:
See you soon!
Andrew