Why Regularization?
What is machine learning all about? If I had to explain it in layman's language, I would say: we build a model and tune it in such a way that when we ask it a new question (feed it new data), we get the desired answer (categorical or numerical).
Simple, right!!
So, what is the one biggest challenge that model building faces?
Is it data cleaning, exploratory data analysis, or missing data? Well, all of those are mandatory steps that need to be followed. Some datasets may have these challenges, while others may not have them at all! But there is one challenge that every model has to face: it should have low bias and low variance. In other words, the model shouldn't be overfitted or underfitted.
There are models where the accuracy for the training data is low and the accuracy for the testing data is also low. The model neither predicts the training data correctly nor the test data. In layman's language, it is an uneducated model, an undertrained model, a bad model. Here the bias is high (the training error is high) and the variance is also high (the testing error is high). The graph looks somewhat like this:
We see here that this line is good for nothing: a probable case of underfitting.
Now, overfitting! Here the model literally hugs the training data, memorizing it 100%. But when we pass the testing data, the model says, "Boss... don't ask questions that are out of syllabus!!!" Hahaha, well, this is true. Let me show you how the graph may look:
We can see the training data sitting completely on the model. This is why the bias is low: the training error is low. But the variance is very high, as the testing data shows high error and low accuracy.
This is a sheer case of overfitting.
Neither situation is desirable. We try to find a mid point, the best-fit point, where the bias is low and the variance is also low. There, the accuracy for the training and testing data is more or less the same, and it is high.
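As a small illustration of all three situations (this sketch is mine, not from the original post; it uses scikit-learn and synthetic data), we can fit polynomials of increasing degree and compare the training and testing scores. Degree 1 underfits (both scores low), a moderate degree is close to the best fit (both scores high and similar), and a very high degree overfits (train score high, test score much lower):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Noisy non-linear data: a straight line will underfit it,
# while a very high-degree polynomial will memorize the noise.
rng = np.random.RandomState(42)
X = np.sort(rng.uniform(-3, 3, 80)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.2, 80)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

for degree in (1, 4, 15):  # underfit, near best fit, overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print(f"degree={degree:2d}  "
          f"train R2={model.score(X_train, y_train):.2f}  "
          f"test R2={model.score(X_test, y_test):.2f}")
```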
All said and explained, what does regularization have to do with all this?
To tune the model, to get to the best-model point, to stop the overfitting or underfitting, we use regularization.
There is one more very interesting observation:
Underfitting happens mostly in regression problems, if no regularization is done.
Overfitting happens mostly in decision trees, if no regularization is done.
So, what are the regularization techniques in regression and decision tree models?
For regression: L1 and L2. Here we talk about coefficients; by looking at the coefficients, we can say whether the model is overfitted or not.
For decision trees: bagging, boosting, and other ensemble techniques.
Let's focus on the regression problem and on L1 and L2.
So, L2 is nothing but Ridge Regression, while L1 is Lasso Regression.
Both are mostly the same, with one small difference that makes a huge impact.
L2 Regularization: Ridge Regression
L2, i.e. Ridge Regression, tries to change the coefficients to get the best-fit model. To change the coefficients, we start from the loss formula we already have:
Loss = Σ(yᵢ − ŷᵢ)²
Ridge regression is just a small modification of this loss function.
L2 (Ridge): Loss = Σ(yᵢ − ŷᵢ)² + lambda · slope²
L2 (Ridge): Loss = Σ(yᵢ − ŷᵢ)² + lambda · (m₁² + m₂² + m₃² + … + mₙ²)
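That formula translates directly into code. Below is a minimal sketch; the function name and arguments are my own choices, for illustration only:

```python
import numpy as np

def ridge_loss(y, y_hat, coefs, lam):
    """Sum of squared errors plus lambda times the sum of squared coefficients."""
    y, y_hat, coefs = map(np.asarray, (y, y_hat, coefs))
    return np.sum((y - y_hat) ** 2) + lam * np.sum(coefs ** 2)

# The same prediction errors cost much more when the coefficients are large:
print(ridge_loss([3, 5], [2.5, 5.5], coefs=[0.5, 1.0], lam=1.0))   # 0.5 + 1.25  = 1.75
print(ridge_loss([3, 5], [2.5, 5.5], coefs=[5.0, 10.0], lam=1.0))  # 0.5 + 125.0 = 125.5
```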
Now, in the above equation, lambda is fixed. As for m₁, m₂, …, mₙ: if any of them is very high, the loss will also be high. So the coefficients have to be tuned in such a way that their values are not very high. In other words, ridge regression tries to keep the coefficients within a limit and shrinks their values.
No coefficient can ever become exactly ZERO in ridge regression.
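We can watch this shrinking with scikit-learn's Ridge (note that scikit-learn calls lambda "alpha"; the dataset below is synthetic, just for illustration). As lambda grows, the coefficients shrink towards zero, but even the smallest absolute coefficient never becomes exactly zero:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# 10 features, but only 3 of them actually drive the target
X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=10, random_state=0)

for lam in (0.1, 10, 1000):
    coefs = Ridge(alpha=lam).fit(X, y).coef_
    print(f"lambda={lam:>6}: min |coef|={np.abs(coefs).min():.4f}, "
          f"max |coef|={np.abs(coefs).max():.2f}")
```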
L1 Regularization: Lasso Regression
L1 (Lasso): Loss = Σ(yᵢ − ŷᵢ)² + lambda · (|m₁| + |m₂| + … + |mₙ|)
Here too, the coefficients are kept in control. But this time the modulus of each slope is taken (some may be positive, some may be negative) and they are all added up.
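Compared with the ridge_loss sketch above, the only change in code is the penalty term (again, the names here are my own):

```python
import numpy as np

def lasso_loss(y, y_hat, coefs, lam):
    """Sum of squared errors plus lambda times the sum of absolute coefficients."""
    y, y_hat, coefs = map(np.asarray, (y, y_hat, coefs))
    return np.sum((y - y_hat) ** 2) + lam * np.sum(np.abs(coefs))
```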
When we are working on a high-dimensional dataset, we can tune the lambda value in such a way that one or more coefficients become ZERO in lasso regression.
We know that in both lasso and ridge regression we are trying to shrink the coefficient values. But in lasso, if we increase the lambda value, the coefficients whose values are negligible or small become zero. This doesn't happen in ridge regression.
The coefficients that are not impacting the model, in other words the features that are not important, are removed from the model. Well, well, that's feature selection!!! It reduces the dimension. Wow!! That's great, isn't it?! So lasso is doing feature selection, or dimensionality reduction.
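Here is a small sketch of that effect with scikit-learn's Lasso on the same kind of synthetic data as before (only 3 of the 10 features actually matter; the exact counts depend on the data and on lambda, which scikit-learn again calls "alpha"):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# 10 features, but only 3 of them actually drive the target
X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=10, random_state=0)

for lam in (0.1, 1, 10, 50):
    coefs = Lasso(alpha=lam, max_iter=10000).fit(X, y).coef_
    n_zero = int((coefs == 0).sum())
    print(f"lambda={lam:>5}: {n_zero} of {coefs.size} coefficients are exactly zero")
```

The features whose coefficients hit exactly zero are effectively dropped from the model.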
This is the biggest difference between lasso and ridge regression: lasso not only regularizes the coefficients, it also helps in feature selection.
That was a quick look at the L1 and L2 regularisation techniques.
Happy Learning!!!