![](https://crypto4nerd.com/wp-content/uploads/2023/01/1WvnIkfWmajvEgGAzqt9sew.png)
In the machine learning paradigm, the common objective is to generalize to unseen data, i.e., a low error rate on test data “related to” the training data is desirable (Mitchell, 1980).
Given any dataset, there can be more than one model that fits the training data well; the question is how to choose among them.
For example, given any two points on the Cartesian plane, there is exactly one line that passes through both of them. But if the hypothesis function is not restricted to being linear, there are infinitely many higher-order polynomials that pass through those same points.
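The two-points example above can be checked in a few lines of NumPy (the point coordinates are arbitrary, chosen for illustration):

```python
import numpy as np

# Two training points: (0, 1) and (2, 5). Exactly one line fits them,
# but there are infinitely many higher-order polynomials through both.
x = np.array([0.0, 2.0])
y = np.array([1.0, 5.0])

# The unique line through the two points (degree-1 fit): slope 2, intercept 1.
line = np.polyfit(x, y, 1)

# One of infinitely many higher-order polynomials through the same points:
# p(t) = 2t + 1 + c * t * (t - 2), a parabola for any c != 0.
def parabola(t, c=3.0):
    return 2 * t + 1 + c * t * (t - 2)

# Both hypotheses interpolate the training data perfectly.
assert np.allclose(np.polyval(line, x), y)
assert np.allclose([parabola(xi) for xi in x], y)
```

Without an inductive bias (e.g., “prefer the linear fit”), the data alone cannot distinguish between these hypotheses.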
Inductive bias refers to the assumptions made ‘a priori’ by a model about the relationship between inputs and outputs, which help it choose one form of generalization over another.
The constraints placed on the hypothesis space are called the inductive bias.
Researchers have shown that inductive bias restricts the hypothesis space to be searched, reflecting prior knowledge about the task at hand. This restriction is what makes practical learning possible.
Based on their prior knowledge of the task, domain experts can suggest a model likely to fit the data well. For example, a biochemist who understands the nature of a biological process and knows that the data it produces has a linear relationship would suggest a linear regression model.
More can be learnt by studying the related concept of ‘causality’.
There are two kinds of inductive bias: Strong (Restrictive) and Weak (Preference).
Strong (Restrictive)
Limiting the search space of hypotheses. For example, to classify fruits such as apples, bananas, and oranges, we can assume independence among the features weight, shape, and color given the class, thereby selecting a reasonable hypothesis (e.g., naïve Bayes).
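As a sketch of this restrictive bias, here is a tiny Gaussian naïve Bayes classifier written from scratch in NumPy; the fruit measurements (weight in grams, diameter in centimeters) are made-up illustrative numbers, not real data:

```python
import numpy as np

# Toy fruit data: features are weight (g) and diameter (cm); numbers are made up.
X = np.array([[150, 7.5], [170, 8.0],   # apple
              [120, 3.5], [130, 3.8],   # banana
              [140, 7.0], [160, 7.8]])  # orange
y = np.array(["apple", "apple", "banana", "banana", "orange", "orange"])

def naive_bayes_predict(X, y, query, eps=1e-2):
    """Gaussian naive Bayes: assume the features are independent given the
    class, so the joint likelihood factorizes into per-feature Gaussians."""
    best, best_logp = None, -np.inf
    for cls in np.unique(y):
        Xc = X[y == cls]
        mu, var = Xc.mean(axis=0), Xc.var(axis=0) + eps  # eps avoids zero variance
        # Uniform class prior + sum of per-feature Gaussian log-likelihoods.
        logp = -0.5 * np.sum(np.log(2 * np.pi * var) + (query - mu) ** 2 / var)
        if logp > best_logp:
            best, best_logp = cls, logp
    return best

print(naive_bayes_predict(X, y, np.array([125.0, 3.6])))  # -> banana
```

The independence assumption restricts the hypothesis space drastically: the classifier only has to estimate one mean and one variance per feature per class.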
Weak (Preference)
Imposing an ordering (preference) on the hypothesis space rather than excluding hypotheses outright. For example, a tree model assumes that the relationship between inputs and outputs can be represented as a tree-like structure (e.g., a decision tree).
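To see how a tree learner's preference bias plays out in practice, the sketch below computes ID3-style information gain, the criterion a decision tree uses to prefer one split over another (the toy data is illustrative only):

```python
import numpy as np

def entropy(labels):
    # Shannon entropy of a label array, in bits.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature, labels):
    """Gain from splitting on a categorical feature: ID3-style trees greedily
    prefer the feature with the highest gain, which biases the search toward
    short trees with informative attributes near the root."""
    gain = entropy(labels)
    for v in np.unique(feature):
        mask = feature == v
        gain -= mask.mean() * entropy(labels[mask])
    return gain

# Toy data: 'shape' separates the classes perfectly, 'color' does not.
labels = np.array(["apple", "apple", "banana", "banana"])
shape  = np.array(["round", "round", "long", "long"])
color  = np.array(["red", "yellow", "yellow", "green"])

print(information_gain(shape, labels))  # 1.0 -- a perfect split
print(information_gain(color, labels))  # 0.5 -- a weaker split
```

Neither feature is excluded from the hypothesis space; the tree simply prefers to split on `shape` first, which is a weak (preference) bias rather than a strong (restrictive) one.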
The following is a list of common inductive biases in machine learning algorithms.
Occam’s razor: a simpler function is preferred over a more complex one. Prefer the simplest hypothesis consistent with the data; when the data is very noisy, a simple linear relationship, or even a plain average, may be the best fit.
Principle of Minimum Description Length (an extension of Occam’s razor): the best hypothesis is the one that minimizes the combined length of the hypothesis itself and the description of its exceptions.
Locally connected, translation-invariant data: convolutional neural networks.
Maximum conditional independence: a naïve assumption that the features are conditionally independent given the class. This is the primary assumption of the naïve Bayes classifier.
Linearity assumption: assuming the output variable depends linearly on the inputs yields linear models such as linear regression.
Maximum margin between support vectors and the decision boundary: SVM classifier.
Minimum cross-validation error: motivated by the “no free lunch” theorem, we prefer the model with the lowest possible cross-validation error. Such models tend to generalize well.
Minimum features: select only those features that add significant variance to the prediction; if two features are strongly correlated, one of them may be removed.
Nearest neighbors: KNN (a lazy learning algorithm that simply stores the data points during training) finds the K nearest neighbors of a query point in feature space and classifies the new point by a majority vote among them.
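The nearest-neighbors bias from the last item can be sketched in a few lines of NumPy (cluster locations and labels are illustrative):

```python
import numpy as np

def knn_predict(X_train, y_train, query, k=3):
    """Lazy learner: no training step -- just store the data and, at query
    time, take a majority vote among the k nearest points (Euclidean)."""
    dists = np.linalg.norm(X_train - query, axis=1)
    nearest = y_train[np.argsort(dists)[:k]]
    values, counts = np.unique(nearest, return_counts=True)
    return values[np.argmax(counts)]

# Two toy clusters around (0, 0) and (5, 5) -- illustrative points only.
X_train = np.array([[0, 0], [1, 0], [0, 1], [5, 5], [6, 5], [5, 6]])
y_train = np.array(["a", "a", "a", "b", "b", "b"])

print(knn_predict(X_train, y_train, np.array([0.5, 0.5])))  # -> a
print(knn_predict(X_train, y_train, np.array([5.5, 5.5])))  # -> b
```

The inductive bias here is locality: points that are close in feature space are assumed to share a label.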
Many such examples of inductive biases are found in deep learning as well; regularization, too, can be viewed as an outcome of them.
Relational Inductive Biases in Deep Learning: attention models have taken the field by storm, and almost all relevant new work contains some form of attention module. The Transformer paper (Vaswani et al., “Attention Is All You Need”) points out that “At each step the model is auto-regressive, consuming the previously generated symbols as additional input when generating the next.”
Relationships are the central idea behind the attention module, which tries to find long-range relations between sequential inputs. Earlier, LSTM and GRU models were the benchmark for NLP datasets and tasks, but they could not capture long-range dependencies as effectively.
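A minimal sketch of the scaled dot-product attention at the heart of the Transformer, in NumPy (the shapes and random toy sequence are illustrative; real models add masking, multiple heads, and learned projections):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Core of the Transformer's attention: every query attends to every key,
    so dependencies between distant positions are captured in one step."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # pairwise query-key affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over key positions
    return weights @ V, weights

# Random toy sequence of 4 positions with dimension 8 (illustrative only).
rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(4, 8))  # self-attention: same sequence for Q, K, V
out, w = scaled_dot_product_attention(Q, K, V)
assert out.shape == (4, 8)
assert np.allclose(w.sum(axis=-1), 1.0)  # each position's weights sum to 1
```

Because the attention weights connect all pairs of positions directly, the relational bias does not weaken with distance the way a recurrent model's does.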
In this blog post, we have discussed the concept of inductive bias and its impact on model performance. These assumptions can significantly affect a model’s ability to generalize from the training data to new, unseen data; inductive biases and prior knowledge lie at the heart of that ability. The literature of the past several years has found these guiding principles (knowledge, biases, and observation) helpful in designing new algorithms. Hence, in real-world use cases, the domain expertise of the human reviewer is as essential as the data itself.
– IIT KGP NPTEL course: Machine Learning, Sudeshna Sarkar
– Prof. Mausam’s Machine Learning course slides (IIT Delhi): https://www.cse.iitd.ac.in/~mausam/courses/csl333/spring2015/lectures/20-mlintro.pdf
– Battaglia, Peter W., et al. “Relational inductive biases, deep learning, and graph networks.” arXiv preprint arXiv:1806.01261 (2018).