It would be nice if there were a single, global solution to the bias-variance tradeoff, so that researchers could all agree to use the same set of optimal analytical approaches.
Why quantify bias and variance explicitly, rather than treating prediction error as a single undifferentiated quantity? One reason is that data scientists typically have some control over both the bias and the variance of a model, and can thereby indirectly influence its total error as well.
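In formal terms, the standard decomposition makes this accounting explicit. Stated in generic notation (ours, not the text's), with $f$ the true function, $\hat{f}$ the model fitted to a random training sample, and $\sigma^2$ the irreducible noise, the expected squared prediction error at a point $x$ is

\[
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
= \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
+ \underbrace{\operatorname{Var}\big[\hat{f}(x)\big]}_{\text{variance}}
+ \underbrace{\sigma^{2}}_{\text{noise}} .
\]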
To see this intuitively, suppose we are tasked with developing a statistical model that can predict a person's Extraversion score from demographic variables such as age, gender, and so on.
Then suppose that, instead of taking the exercise seriously, we make the same prediction for everyone, asserting by fiat that every human on the planet has an Extraversion score of exactly 15 units.
The bias of this estimator is likely to be high, as it is exceedingly unlikely that the true mean Extraversion level across all potential samples is 15.
The estimator has no variance at all, since we always make exactly the same prediction, no matter what our data look like.
Thus, 100% of the model's expected total error can be attributed to bias.
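A minimal simulation makes the accounting concrete. The sketch below is our illustration, not the text's: the population mean of 25, the SD of 5, and the sample size of 50 are invented values chosen only to mirror the example. It repeatedly draws samples, applies both the constant "everyone is a 15" estimator and the sample mean, and measures each estimator's squared bias and variance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population (invented numbers for the demo): true mean
# Extraversion of 25, individual differences with SD 5, samples of size 50.
TRUE_MEAN, SD, N = 25.0, 5.0, 50

n_sims = 10_000
constant_preds = np.empty(n_sims)  # the "everyone scores exactly 15" estimator
mean_preds = np.empty(n_sims)      # the sample-mean estimator, for contrast

for i in range(n_sims):
    sample = rng.normal(TRUE_MEAN, SD, size=N)
    constant_preds[i] = 15.0        # ignores the data entirely
    mean_preds[i] = sample.mean()   # responds to the data

for name, preds in [("constant", constant_preds), ("sample mean", mean_preds)]:
    bias_sq = (preds.mean() - TRUE_MEAN) ** 2
    variance = preds.var()
    print(f"{name:12s} bias^2 = {bias_sq:7.3f}   variance = {variance:7.3f}")

# The constant estimator has zero variance: all of its (large) error is bias.
# The sample mean is unbiased but pays for that with nonzero variance.
```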
This is the fundamental tradeoff between bias and variance.
Other things being equal, when we increase the bias of an estimator, we decrease its variance, because by biasing our estimator to preferentially search one part of the parameter space, we simultaneously inhibit its ability to explore other, non-preferred points in the space.
Whether this trade is helpful or harmful will depend entirely on the context.
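One way to watch this knob being turned is with an explicitly biased estimator such as ridge regression (our choice of example; the text does not name a particular method). In the sketch below, the data-generating process is invented for the demo; increasing the ridge penalty `alpha` deliberately biases the fitted coefficients toward zero, and the simulation shows the resulting fall in variance.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)

# Invented data-generating process for the demo.
n_features, n_train, noise_sd = 20, 30, 1.0
true_beta = rng.normal(0.0, 1.0, n_features)
x_test = rng.normal(0.0, 1.0, n_features)   # one fixed test point
true_y = x_test @ true_beta                 # its noiseless target value

n_sims = 2_000
for alpha in [0.1, 1.0, 10.0, 100.0]:       # larger penalty = more bias
    preds = np.empty(n_sims)
    for i in range(n_sims):
        X = rng.normal(0.0, 1.0, (n_train, n_features))
        y = X @ true_beta + rng.normal(0.0, noise_sd, n_train)
        preds[i] = Ridge(alpha=alpha).fit(X, y).predict(x_test[None, :])[0]
    bias_sq = (preds.mean() - true_y) ** 2
    variance = preds.var()
    print(f"alpha={alpha:6.1f}  bias^2={bias_sq:7.3f}  "
          f"variance={variance:7.3f}  total={bias_sq + variance:7.3f}")

# As alpha grows, variance shrinks while bias^2 grows; the best total error
# typically sits at an intermediate alpha that depends on the context.
```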
The bias-variance tradeoff offers an intuitive way of understanding what is at stake in the ongoing debate over p-hacking (see the upcoming Block).
One can think of a research strategy that favors liberal, flexible data analysis as a relatively low-bias, high-variance approach, and a strategy that favors strict adherence to a fixed set of procedures as a high-bias, low-variance approach.