![](https://crypto4nerd.com/wp-content/uploads/2023/06/103ws3GvPNq_bEwwLZ_OX5Q-1024x678.png)
In the constantly evolving realm of machine learning and statistical modeling, a pervasive challenge lies in striking a delicate balance between bias and variance. These two sources of error determine how well a supervised learning algorithm can generalize accurately beyond its training data.
Consider a financial institution trying to predict the risk of loan default based on customer characteristics. A high-bias model might oversimplify the patterns, considering only a few factors, such as the applicant’s income. This oversimplification can lead to underfitting, where the model fails to capture critical patterns in the data and subsequently fails to predict accurately. Conversely, a high-variance model may become excessively complex, including even insignificant details, leading to overfitting. The model performs excellently on the training data but falters when faced with new, unseen data.
These are everyday manifestations of the bias-variance tradeoff in industry-specific machine learning applications. However, the question remains: Are we always destined to juggle between bias and variance when designing machine learning models? Or are there ways to escape, or at least mitigate, this seeming inevitability? This article uses a Python case study to dive deeper into this intriguing phenomenon, focusing on a practical application within the financial sector.
Understanding Bias and Variance
Bias is the error from erroneous assumptions in the learning algorithm. High bias can cause an algorithm to miss relevant relations between features and target outputs (underfitting). Variance, on the other hand, is the error from sensitivity to small fluctuations in the training set. High variance can cause an algorithm to model the random noise in the training data, rather than the intended outputs (overfitting).
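For squared-error loss, this intuition has a precise form: a model's expected prediction error decomposes into three additive terms,

Expected error = Bias² + Variance + Irreducible noise

which is why pushing one term down (say, by adding complexity to reduce bias) tends to push the other up.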
Imagine a tightrope walker performing a thrilling high-wire act. The act involves walking across a thin wire strung high above the ground, maintaining a delicate equilibrium to avoid falling. Analogously, the bias-variance tradeoff in machine learning represents a similar high-stakes balancing act. It’s all about adding just the right level of complexity to our model — not too simple to miss out on critical insights, and not too intricate to fit the noise and lose the real signal.
Consider our earlier example from the financial industry, predicting loan defaults. If we have a high-bias model that primarily considers income as the determinant of loan default risk, we will inevitably miss out on the nuanced understanding derived from other factors such as credit history, loan amount, and employment type. By making our model more sophisticated to include these factors, we decrease this bias but, in turn, may increase variance. If we continue adding complexity, we might start fitting our model to specific instances of the training data, rather than to the overall pattern. This is known as overfitting, and while it may lead to exceptional performance on training data, it will likely fail to generalize and perform poorly on unseen data.
However, the advent of advanced techniques has brought a ray of hope to this perennial problem. By leveraging modern methods like regularization, ensemble methods, and advanced neural networks, we can possibly diminish the impact of this tradeoff, enabling the creation of more accurate and reliable predictive models. The question is, how effective are these techniques? Can we completely avoid this tradeoff? Our Python case study will investigate these pressing questions, demonstrating the practical implications in the financial industry.
We will illustrate the bias-variance tradeoff using synthetic data that represents fictitious bank loan applications. Our target variable will be a binary value indicating whether or not the loan was defaulted on.
We will start by importing the necessary libraries and generating some synthetic data, representing the ‘income’, ‘credit_score’, and ‘loan_amount’, along with the target variable ‘default’:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_squared_error, accuracy_score
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
# Set random seed for reproducibility
np.random.seed(0)
# Generate synthetic data
n_samples = 1000
X = np.random.normal(size=(n_samples, 3)) # income, credit_score, loan_amount
true_fun = lambda X: np.round(1 / (1 + np.exp(-(X[:, 0] - X[:, 1] + 2 * X[:, 2]))))  # true rule: default if the logistic of the linear score exceeds 0.5
y = (true_fun(X) + np.random.binomial(1, 0.1, n_samples)) % 2  # flip roughly 10% of labels as noise, keeping the target binary
# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
Now, we will create three models: a high-bias model considering only income, a more balanced model considering income and credit score, and a high-variance model considering interaction terms up to the 3rd degree:
# High bias model (Income only)
model1 = LogisticRegression()
model1.fit(X_train[:, 0].reshape(-1, 1), y_train)
# More balanced model (Income and Credit Score)
model2 = LogisticRegression()
model2.fit(X_train[:, :2], y_train)
# High variance model (Interaction terms up to 3rd degree)
model3 = make_pipeline(PolynomialFeatures(3), LogisticRegression())
model3.fit(X_train, y_train)
Each model has a different level of complexity and therefore different bias and variance characteristics. We can evaluate these models using accuracy and mean squared error (MSE) on both the training and test datasets:
# Evaluate models
models = [model1, model2, model3]
names = ['High Bias', 'Balanced', 'High Variance']
for model, name in zip(models, names):
    if name == 'High Bias':
        train_preds = model.predict(X_train[:, 0].reshape(-1, 1))
        test_preds = model.predict(X_test[:, 0].reshape(-1, 1))
    elif name == 'Balanced':
        train_preds = model.predict(X_train[:, :2])
        test_preds = model.predict(X_test[:, :2])
    else:
        train_preds = model.predict(X_train)
        test_preds = model.predict(X_test)
    print(f"{name} Model: Train MSE = {mean_squared_error(y_train, train_preds):.2f}, Test MSE = {mean_squared_error(y_test, test_preds):.2f}")
    print(f"{name} Model: Train Accuracy = {accuracy_score(y_train, train_preds):.2f}, Test Accuracy = {accuracy_score(y_test, test_preds):.2f}\n")
High Bias Model: Train MSE = 0.45, Test MSE = 0.51
High Bias Model: Train Accuracy = 0.61, Test Accuracy = 0.55
Balanced Model: Train MSE = 0.39, Test MSE = 0.40
Balanced Model: Train Accuracy = 0.65, Test Accuracy = 0.65
High Variance Model: Train MSE = 0.08, Test MSE = 0.06
High Variance Model: Train Accuracy = 0.92, Test Accuracy = 0.94
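The loop above only prints these numbers. The two plots discussed next can be generated with a short visualization step; the bar-chart layout below is an illustrative sketch of one way to produce them, not code from the original run:
# Re-collect the metrics from each model so they can be plotted side by side
feature_slices = [slice(0, 1), slice(0, 2), slice(0, 3)]  # columns each model was trained on
train_mse, test_mse, train_acc, test_acc = [], [], [], []
for model, cols in zip(models, feature_slices):
    train_preds = model.predict(X_train[:, cols])
    test_preds = model.predict(X_test[:, cols])
    train_mse.append(mean_squared_error(y_train, train_preds))
    test_mse.append(mean_squared_error(y_test, test_preds))
    train_acc.append(accuracy_score(y_train, train_preds))
    test_acc.append(accuracy_score(y_test, test_preds))
# Grouped bar charts: MSE on the left, accuracy on the right
x = np.arange(len(names))
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
ax1.bar(x - 0.2, train_mse, width=0.4, label='Train')
ax1.bar(x + 0.2, test_mse, width=0.4, label='Test')
ax1.set_xticks(x)
ax1.set_xticklabels(names)
ax1.set_title('Mean Squared Error')
ax1.legend()
ax2.bar(x - 0.2, train_acc, width=0.4, label='Train')
ax2.bar(x + 0.2, test_acc, width=0.4, label='Test')
ax2.set_xticks(x)
ax2.set_xticklabels(names)
ax2.set_title('Accuracy')
ax2.legend()
plt.tight_layout()
plt.show()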
The first plot, titled “Mean Squared Error”, shows the MSE for each of our models on both the training and test sets. MSE measures how far the model’s predictions fall from the true values, so lower values are better.
The second plot titled “Accuracy” presents the accuracy of the models on the training and testing datasets. Accuracy is a classification metric, measuring the proportion of correct predictions made by the model. Higher accuracy values indicate better performance.
Now, let’s analyze the plots:
- Mean Squared Error (MSE)
The High Bias model has an MSE of 0.45 for the training set and 0.51 for the test set, indicating a significant amount of error and an inability to fit the data precisely. The Balanced model reduces the error to 0.39 on the training set and 0.40 on the test set, which shows an improvement in fitting the data more closely and generalizing well. The High Variance model demonstrates a substantially reduced MSE of 0.08 on the training set and an even lower 0.06 on the test set, indicating a very precise fit to the data.
- Accuracy
The High Bias model has an accuracy of 0.61 on the training set and 0.55 on the test set, suggesting it struggles both to make accurate predictions and to generalize well. The Balanced model’s accuracy improves to 0.65 on both the training and test sets, a positive sign that it generalizes well and does not overfit. The High Variance model reaches an accuracy of 0.92 on the training set and 0.94 on the test set, which are outstanding values. That said, strong scores on a single test split are not proof of good generalization: a model of this complexity remains prone to overfitting, which would typically show up as a gap between training and test performance on other splits or on genuinely new data.
Overall, these results illustrate the bias-variance tradeoff: the High Bias model underfits the data, the High Variance model fits it very closely but at the cost of complexity that risks overfitting, and the Balanced model strikes a reasonable balance between bias and variance.
High Bias Model
A high-bias model oversimplifies the problem by only considering income to predict loan default. This oversimplification leads to a high error rate on both the training and test sets (Mean Squared Error, or MSE, of 0.45 and 0.51 respectively). The accuracy is also relatively low at 0.61 on the training set and 0.55 on the test set. This model doesn’t perform well because it has a “biased” view of the problem and fails to capture important patterns in the data.
Balanced Model
The balanced model performs better by considering both income and credit score. It has a lower error rate (MSE of 0.39 on training and 0.40 on the test) and higher accuracy (0.65 on both sets) compared to the high-bias model. This is because it captures more relevant patterns in the data by considering an additional feature.
High Variance Model
The high variance model performs the best on both the training and test datasets, having the lowest error rates (MSE of 0.08 and 0.06) and the highest accuracy (0.92 and 0.94). This model is more complex, considering interaction terms up to the 3rd degree, and is able to capture more subtle patterns in the data. However, there’s a cautionary tale here. Although this model performs best on the given data, its high complexity makes it susceptible to overfitting. Overfitting occurs when a model learns the noise along with the signal in the training data, reducing its ability to generalize to new, unseen data.
Deciphering the Bias-Variance Tango
In essence, our findings underscore the delicate dance between bias and variance in machine learning models. High-bias models may underestimate the complexity of the problem, resulting in suboptimal performance due to oversimplification. Conversely, high-variance models, in their quest to accommodate intricate patterns, risk learning the noise rather than the signal, leading to overfitting.
Striking a balance between these two extremes, the balanced model presents a promising alternative. It outperforms the high-bias model while also sidestepping the overfitting pitfalls associated with the high-variance model. However, the selection of an ideal model should take into account these tradeoffs in conjunction with the unique demands and constraints of the problem at hand.
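One practical way to weigh these tradeoffs is to compare the candidate models with k-fold cross-validation before committing to one. The sketch below reuses the synthetic data and model definitions from earlier; the five folds and accuracy scoring are illustrative choices, not settings from the original study:
from sklearn.model_selection import cross_val_score
# Each candidate is paired with the feature columns it is allowed to see
candidates = {
    'High Bias (income only)': (LogisticRegression(), X[:, :1]),
    'Balanced (income + credit score)': (LogisticRegression(), X[:, :2]),
    'High Variance (3rd-degree terms)': (make_pipeline(PolynomialFeatures(3), LogisticRegression()), X),
}
for name, (model, features) in candidates.items():
    scores = cross_val_score(model, features, y, cv=5, scoring='accuracy')
    print(f"{name}: mean CV accuracy = {scores.mean():.2f} (+/- {scores.std():.2f})")
A model whose cross-validated score is both high and stable across folds is usually a safer choice than one that merely shines on a single train/test split.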
Stepping Beyond the Tradeoff
With contemporary methodologies such as regularization, boosting, and bagging, we can potentially soften the blow of the bias-variance tradeoff. These techniques amalgamate several models or impose penalties on over-complex models, thereby curtailing overfitting and potentially minimizing both bias and variance concurrently.
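To make this concrete, here is a minimal sketch applying L2 regularization, bagging, and boosting to the same synthetic data; the hyperparameters (the penalty strength C, the number of estimators, the tree depth) are illustrative assumptions rather than tuned values:
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
# Regularization: keep the 3rd-degree interaction terms but shrink the coefficients
# (smaller C = stronger L2 penalty), reining in variance without discarding features
regularized = make_pipeline(PolynomialFeatures(3), LogisticRegression(C=0.1))
# Bagging: average many trees trained on bootstrap samples to reduce variance
bagged = BaggingClassifier(n_estimators=50, random_state=0)
# Boosting: fit shallow learners sequentially, chipping away at bias while each step stays simple
boosted = GradientBoostingClassifier(n_estimators=100, max_depth=2, random_state=0)
for name, model in [('Regularized', regularized), ('Bagging', bagged), ('Boosting', boosted)]:
    model.fit(X_train, y_train)
    print(f"{name}: Test Accuracy = {accuracy_score(y_test, model.predict(X_test)):.2f}")
Comparing these test scores against the three baseline models above gives a quick read on whether the extra machinery actually buys better generalization on this data.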
In the simplest terms, while bias and variance traditionally represent a tradeoff, innovative techniques can help us navigate or even circumvent this constraint. Consequently, the response to the question, “Is there always a tradeoff between bias and variance?” is a resounding “Not necessarily.” By meticulously managing our models, leveraging advanced techniques, and tuning hyperparameters with precision, we can effectively tame bias and variance, crafting models that aptly generalize to unseen data.