![](https://crypto4nerd.com/wp-content/uploads/2023/09/0XW-g7y8q-HgxhOv5.jpg)
Linear regression is a type of statistical analysis used to predict the relationship between two variables. It assumes a linear relationship between one or more independent variables (x) and the dependent variable (y), and aims to find the best-fitting line that describes that relationship. The line is determined by minimizing the sum of the squared differences between the predicted values and the actual values.
The variable you want to predict is called the dependent variable. The variable you are using to predict the other variable's value is called the independent variable. Linear regression comes under supervised learning and performs regression tasks.
Linear regression is commonly used in many fields, including economics, finance, and social sciences, to analyze and predict trends in data.
In a simple linear regression, there is one independent variable and one dependent variable. The model estimates the slope and intercept of the line of best fit, which represents the relationship between the variables. The slope represents the change in the dependent variable for each unit change in the independent variable, while the intercept represents the predicted value of the dependent variable when the independent variable is zero.
Linear regression shows the linear relationship between the independent (predictor) variable, plotted on the X-axis, and the dependent (output) variable, plotted on the Y-axis. If there is a single input variable X (independent variable), the model is called simple linear regression.
Multiple Linear Regression assumes there is a linear relationship between two or more independent variables and one dependent variable.
The Formula for multiple linear regression:
Y = B0 + B1X1 + B2X2 + …… + BnXn + e
- Y = the predicted value of the dependent variable
- B0 = the y-intercept (value of y when all other parameters are set to 0)
- B1 = the regression coefficient of the first independent variable (X1)
- Bn = the regression coefficient of the last independent variable (Xn)
- e = the residual (error term)
The multiple linear regression model can be represented as a plane (with two predictors) or a hyperplane (with more than two).
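To make the formula concrete, here is a minimal sketch, assuming a small synthetic dataset with two made-up predictors (X1, X2), that fits a multiple linear regression with scikit-learn and reads off B0 (intercept) and B1, B2 (coefficients):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y = 2 + 3*X1 - 1.5*X2 + noise (made-up coefficients)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))          # columns are X1 and X2
y = 2 + 3 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=100)

model = LinearRegression().fit(X, y)
print("B0 (intercept):", model.intercept_)    # close to 2
print("B1, B2 (coefficients):", model.coef_)  # close to [3, -1.5]
```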
Random Error (Residuals): In regression, the difference between the observed value of the dependent variable (yi) and the predicted value y(predicted) is called the residual. A residual measures how far a point is from the regression line.
εi = yi - y(predicted)
where y(predicted) = B0 + B1Xi
- In a residual analysis, residuals are used to assess the validity of a statistical or ML model. The model is considered a good fit if the residuals are randomly distributed. If there are patterns in the residuals, then the model is not accurately capturing the relationship between the variables. It may need to be improved, or another model may need to be selected.
Residual Plot:
- A residual plot is a scatterplot in which the X-axis represents the independent variable (or the predicted values) and the Y-axis represents the residual values from the ML model.
- A residual plot is used to identify the underlying patterns in the residual values. We can assess the ML model’s validity based on the observed patterns.
- The model is a good fit if the residuals are randomly distributed. If there are patterns in the residuals, then the model is not accurately capturing the relationships between the variables.
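As an illustration, here is a minimal sketch (synthetic data, matplotlib) that fits a simple linear regression, computes the residuals εi = yi - y(predicted), and draws a residual plot with the independent variable on the X-axis:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 80)
y = 4 + 2.5 * x + rng.normal(scale=1.0, size=x.size)  # roughly linear data

# Fit y = B0 + B1*x by least squares
B1, B0 = np.polyfit(x, y, deg=1)
y_pred = B0 + B1 * x
residuals = y - y_pred            # residual = yi - y(predicted)

# Residual plot: points scattered randomly around zero suggest a good fit
plt.scatter(x, residuals)
plt.axhline(0, color="red", linestyle="--")
plt.xlabel("x (independent variable)")
plt.ylabel("residual")
plt.title("Residual plot")
plt.show()
```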
Based on the patterns observed in the residual values, there are several types of residual plots, as described below:
Random Pattern
- In this category of residual plots, residual values are randomly distributed, and there is no visible pattern in the values. In this case, the developed ML model is considered a good fit.
U-Shaped Pattern
- In this category, the residual plot follows a U-shaped curve. In this case, the model is not considered a good fit, and a non-linear model might be required.
Before evaluating the linear regression models using residual plot analysis, let’s first understand three basic assumptions of linear regression models regarding residuals.
Independent and Identically Distributed (IID)
- The linear regression model assumes that residuals or error terms are independent and that no visible pattern exists. It means that their pairwise covariance is zero.
- If the error terms are not independent, then the uniqueness of the least squares solution is lost, and the model is not considered a good fit.
Normality
- In this assumption, it is assumed that residuals are normally distributed. If the residuals are not normally distributed, then it implies that the model is not able to explain the relationships among the features in the data.
Homoscedasticity
- It is also called the constant variance assumption: the variance of the error term (residual) is assumed to be constant across all values of the independent variables, i.e., the spread of the residuals stays the same across the range of predicted values.
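These assumptions can be checked numerically as well as visually. Below is a minimal sketch, assuming a statsmodels OLS model fitted on synthetic data, that applies a Shapiro-Wilk test for normality of the residuals and a Breusch-Pagan test for homoscedasticity:

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 200)
y = 1.5 + 0.8 * x + rng.normal(scale=1.0, size=x.size)

X = sm.add_constant(x)                 # adds the intercept column
model = sm.OLS(y, X).fit()
resid = model.resid

# Normality: a large p-value means we cannot reject that residuals are normal
shapiro_stat, shapiro_p = stats.shapiro(resid)
print("Shapiro-Wilk p-value:", shapiro_p)

# Homoscedasticity: a large p-value means we cannot reject constant variance
lm_stat, lm_p, f_stat, f_p = het_breuschpagan(resid, X)
print("Breusch-Pagan p-value:", lm_p)
```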
There exist infinitely many lines that could fit the data. Out of these possibilities, we need to find the line with the minimum error. The best-fit line is the line that fits the given scatter plot best. Mathematically, the best-fit line is obtained by minimizing the Residual Sum of Squares (RSS).
This helps to work out the optimal values for B0 and B1, which provide the best-fit line for the data points.
In linear regression, the Mean Squared Error (MSE) cost function is generally used; it is the average of the squared errors between the predicted values y(predicted) and the actual values yi.
For the simple linear model y = mx + b, the cost is MSE = (1/n) Σ (yi - (m·xi + b))².
Using the MSE function, we update the values of B0 and B1 so that the MSE settles at its minimum. These parameters can be determined using the gradient descent method so that the value of the cost function is minimized.
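As a rough illustration of this idea, here is a minimal sketch (plain NumPy, synthetic data, a hand-picked learning rate) of gradient descent updating B0 and B1 to drive the MSE toward its minimum:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 5, 100)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=x.size)

B0, B1 = 0.0, 0.0          # initial guesses
lr = 0.01                  # learning rate (hand-picked for this data)

for _ in range(5000):
    y_pred = B0 + B1 * x
    error = y_pred - y
    # Gradients of MSE = mean((y_pred - y)^2) with respect to B0 and B1
    grad_B0 = 2 * error.mean()
    grad_B1 = 2 * (error * x).mean()
    B0 -= lr * grad_B0
    B1 -= lr * grad_B1

print("B0:", B0, "B1:", B1)   # should approach 2.0 and 3.0
```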
The main reason for squaring the differences is to keep the distances from the mean positive so that they do not cancel each other out.
For example, suppose we have a data set of x coordinates [-2, -1, 0, 1, 2]. The mean here is 0, so if we sum the signed distances of the points from the mean, the negative and positive values cancel each other out and the sum is zero, which misrepresents the spread. To avoid this problem, we take the square of the differences.
- One of the major reasons is that x² is differentiable, while |x| is not differentiable at x = 0.
- Minimizing squared error is not the same as minimizing absolute error.
- Minimizing squared error penalizes large errors more heavily. Absolute error gives equal weight to all errors, whether large or small, and it ignores the direction of the error, i.e., whether the error is positive or negative.
- With mean squared error, squaring the errors likewise removes the effect of the sign. Note also that squaring makes larger errors contribute much more to the total than small errors, so a model whose predictions stay closer to the actual values is automatically favored by the mean squared error approach.
- The derivative of the absolute value function is discontinuous (it jumps at zero), while the derivative of the squared function is continuous, which makes optimization easier.
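The points above can be verified with a short sketch on a made-up set of errors: raw (signed) deviations cancel out, absolute error weights every error equally, and squared error lets large errors dominate:

```python
import numpy as np

deviations = np.array([-2, -1, 0, 1, 2])    # the example from the text
print(deviations.sum())                     # 0 -> signed deviations cancel out

errors = np.array([0.5, 0.5, 0.5, 4.0])     # one large error among small ones
mae = np.abs(errors).mean()                 # each error weighted equally
mse = (errors ** 2).mean()                  # the large error dominates
print("MAE:", mae)    # 1.375
print("MSE:", mse)    # 4.1875
```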
Regression is a parametric approach. 'Parametric' means it makes assumptions about the data for the purpose of analysis. Due to its parametric nature, regression is restrictive: it fails to deliver good results on data sets that do not fulfill its assumptions. Therefore, for a successful regression analysis, it is essential to validate these assumptions.
So, how would you check (validate) whether a data set follows all the regression assumptions? You check it using the regression plots (explained below) along with some statistical tests.
Assumptions of linear regression include:
- Linearity: The relationship between the dependent and independent variables is linear.
- Independence: The observations are independent of each other.
- Homoscedasticity: The variance of the errors is constant across all levels of the independent variables.
- Normality: The errors follow a normal distribution.
- No multicollinearity: The independent variables are not highly correlated with each other.
- No endogeneity: There is no relationship between the errors and the independent variables.
- No autocorrelation: There should be no correlation between the residual (error) terms; the presence of such correlation is known as autocorrelation (see the sketch after this list).
Beyond these assumptions, adding more and more independent variables raises some practical concerns:
- Overfitting: When more and more variables are added to a model, the model may become far too complex and usually ends up memorizing all the data points in the training set. This phenomenon is known as the overfitting of a model. It usually leads to high training accuracy and very low test accuracy.
- Multicollinearity: The phenomenon where, in a model with several independent variables, some of those variables are correlated with each other.
- Feature Selection: With more variables present, selecting the optimal set of predictors from the pool of given features (many of which might be redundant) becomes an important task for building a relevant and better model.
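As a quick check on the no-autocorrelation assumption mentioned above, here is a minimal sketch, assuming a statsmodels OLS model fitted on synthetic data, that computes the Durbin-Watson statistic on the residuals:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(4)
x = rng.uniform(0, 10, 150)
y = 0.5 + 1.2 * x + rng.normal(scale=0.8, size=x.size)

model = sm.OLS(y, sm.add_constant(x)).fit()

# Durbin-Watson ranges from 0 to 4; values close to 2 indicate
# little or no correlation between successive residuals.
print("Durbin-Watson:", durbin_watson(model.resid))
```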
Because multicollinearity makes it difficult to find out which variable is actually contributing towards the prediction of the response variable, it can lead one to incorrect conclusions about the effect of a variable on the target variable. Although it does not affect the precision of the predictions, it is essential to properly detect and deal with the multicollinearity present in the model, because randomly removing any of these correlated variables from the model can cause the coefficient values to swing wildly and even change signs.
Multicollinearity can be detected using the following methods.
- Pairwise Correlations: Checking the pairwise correlations between different pairs of independent variables can provide useful insights for detecting multicollinearity.
- Variance Inflation Factor (VIF): Pairwise correlations may not always be sufficient, because a single variable might not be able to completely explain some other variable on its own, yet several variables combined might be able to do so. To check these sorts of relationships between variables, one can use VIF. VIF basically explains the relationship of one independent variable with all the other independent variables. It is given by VIFi = 1 / (1 - Ri²), where Ri² is the R-squared obtained by regressing the i-th variable on the rest of the independent variables.
The common heuristic for VIF values is: if VIF > 10, the value is definitely high and the variable should be dropped; if VIF is between 5 and 10, it may be acceptable but should be inspected first; if VIF < 5, it is considered a good value.
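Both checks can be scripted. Below is a minimal sketch, assuming a pandas DataFrame of made-up predictors (x1 and x2 deliberately collinear), that inspects the pairwise correlations and computes the VIF of each variable with statsmodels:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(5)
x1 = rng.normal(size=200)
x2 = 0.9 * x1 + rng.normal(scale=0.1, size=200)   # strongly correlated with x1
x3 = rng.normal(size=200)
X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

print(X.corr())                                   # pairwise correlations

X_const = sm.add_constant(X)                      # VIF needs the intercept column
for i, col in enumerate(X_const.columns):
    if col != "const":
        print(col, "VIF:", variance_inflation_factor(X_const.values, i))
```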
There have always been situations where a model performs well on training data but not on the test data. While training models on a dataset, overfitting, and underfitting are the most common problems faced by people.
Bias:
Bias is a measure of how accurate the model is likely to be on future, unseen data. Complex models, assuming there is enough training data available, can make accurate predictions, whereas models that are too naive are very likely to perform badly. Simply put, bias is the error the model makes on the training data due to its simplifying assumptions.
Generally, linear algorithms have a high bias, which makes them fast to learn and easier to understand, but in general less flexible, implying lower predictive performance on complex problems.
Variance:
Variance is the sensitivity of the model towards the training data; it quantifies how much the model will change when the input data is changed.
Ideally, the model shouldn’t change too much from one training dataset to the next training data, which will mean that the algorithm is good at picking out the hidden underlying patterns between the inputs and the output variables.
Ideally, a model should have lower variance which means that the model doesn’t change drastically after changing the training data(it is generalizable). Having higher variance will make a model change drastically even on a small change in the training dataset.
Let’s understand what the bias-variance tradeoff is.
The aim of any supervised machine learning algorithm is to achieve low bias and low variance, so that the model is robust and achieves better performance.
There is no escape from the relationship between bias and variance in machine learning.
There is an inverse relationship between bias and variance,
- An increase in bias will decrease the variance.
- An increase in the variance will decrease the bias.
There is a trade-off that plays between these two concepts and the algorithms must find a balance between bias and variance.
As a matter of fact, one cannot calculate the real bias and variance error terms because we do not know the actual underlying target function.
Now coming to the overfitting and underfitting.
When a model learns each and every pattern and noise in the data to such an extent that it affects the performance of the model on unseen future data, it is referred to as overfitting. The model fits the data so well that it interprets noise as patterns in the data.
When a model has low bias and higher variance it ends up memorizing the data and causing overfitting. Overfitting causes the model to become specific rather than generic. This usually leads to high training accuracy and very low test accuracy.
Detecting overfitting is useful, but it doesn’t solve the actual problem. There are several ways to prevent overfitting, which are stated below:
- Cross-validation
- If the training data is too small, add more relevant and clean data.
- If the data has too many features, do some feature selection and remove unnecessary features.
- Regularization (see the sketch after this list)
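Two of these remedies can be sketched together. The snippet below (synthetic data, an arbitrary regularization strength) uses scikit-learn to evaluate a plain linear regression and a Ridge-regularized one with 5-fold cross-validation:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(6)
X = rng.normal(size=(60, 15))                  # more features than the signal needs
y = 2 * X[:, 0] - X[:, 1] + rng.normal(scale=1.0, size=60)

# 5-fold cross-validated R^2 for an unregularized and a regularized model
plain = cross_val_score(LinearRegression(), X, y, cv=5).mean()
ridge = cross_val_score(Ridge(alpha=1.0), X, y, cv=5).mean()
print("LinearRegression CV R^2:", plain)
print("Ridge CV R^2:", ridge)
```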
Underfitting is not discussed as often as overfitting. When the model fails to learn from the training dataset and is also not able to generalize on the test dataset, it is referred to as underfitting. This type of problem can be detected very easily by the performance metrics.
When a model has high bias and low variance it ends up not generalizing the data and causing underfitting. It is unable to find the hidden underlying patterns from the data. This usually leads to low training accuracy and very low test accuracy. The ways to prevent underfitting are stated below,
- Increase the model complexity (see the sketch after this list)
- Increase the number of features in the training data
- Remove noise from the data.
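To illustrate the first remedy (increasing model complexity), here is a minimal sketch that fits a straight line and a degree-2 polynomial to clearly non-linear synthetic data; the underfit straight line scores noticeably worse:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(7)
x = np.linspace(-3, 3, 120).reshape(-1, 1)
y = x.ravel() ** 2 + rng.normal(scale=0.5, size=120)   # quadratic relationship

line = LinearRegression().fit(x, y)                     # underfits: too simple
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(x, y)

print("Straight line R^2:", line.score(x, y))           # low
print("Degree-2 model R^2:", poly.score(x, y))          # much higher
```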
- What are the parameters of a linear regression?
Linear regression has two main parameters: slope (weight) and intercept. The slope represents the change in the dependent variable for a unit change in the independent variable. The intercept is the value of the dependent variable when the independent variable is zero. The goal is to find the best-fitting line that minimizes the difference between predicted and actual values.
- What is the formula for linear regression line?
The formula for a linear regression line is:
y = mx + b
Where y is the dependent variable, x is the independent variable, m is the slope (weight), and b is the intercept. It represents the best-fitting straight line that describes the relationship between the variables by minimizing the sum of squared differences between actual and predicted values.
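As a closing illustration, here is a minimal sketch (synthetic data) that recovers m and b with NumPy's least-squares polyfit and uses them to predict a new value:

```python
import numpy as np

rng = np.random.default_rng(8)
x = np.linspace(0, 10, 50)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=x.size)   # true m = 2, b = 1

m, b = np.polyfit(x, y, deg=1)     # least-squares fit of y = mx + b
print("m:", m, "b:", b)
print("prediction at x = 12:", m * 12 + b)
```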