![](https://crypto4nerd.com/wp-content/uploads/2023/02/1XbNTW6GrrXDnstZM9zBHPA.gif)
Quick and simple procedure
In my previous post, we built and used a linear regression model. However, we skipped an important step: before we use the model to predict or interpret results, we need to check that its assumptions hold.
The assumptions for linear regression are as follows:
- Linearity: The relationship between the dependent variable and the independent variable(s) is linear. This means that the change in the dependent variable is proportional to the change in the independent variable(s).
- Independence: The observations in the dataset are independent of each other. This means that the value of one observation does not depend on the value of another observation.
- Homoscedasticity: The variance of the errors (the difference between the predicted value and the actual value) is constant across all levels of the independent variables. This means that the spread of the errors should be consistent across the range of the independent variables.
- Normality: The errors are normally distributed. This means that the errors should follow a normal distribution with a mean of zero.
- No Multicollinearity: There is no perfect multicollinearity between the independent variables. This means that no independent variable is an exact linear combination of the other independent variables.
The first and second assumptions are hard to test formally (they depend largely on the theory you have and the sampling method you used), but the other three can be tested directly. In this post, I’m going to show you how to test these three assumptions on your linear model using RStudio.
The Data and Model
We are going to use the same data from the previous post.
The model used for this post is the multiple linear regression model with the formula
To import and build the model we need, you can refer to this post.
Homoscedasticity
To test whether our errors have constant variance, we can use the Breusch-Pagan test. Its null hypothesis is that the error variance is constant (homoscedasticity), and the alternative hypothesis is that the variance changes with the explanatory variables (heteroscedasticity).
I’m not going to go into the mathematical detail of this test. To run the test in RStudio, we first need to load the ‘lmtest’ package. To do this, we can use the code
library(lmtest)
If you have not installed lmtest yet, you need to install it first with the code:
install.packages("lmtest")
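The install and load steps above can also be combined so that the package is only downloaded when it is missing. This is a minimal sketch, not the only way to do it; it uses `requireNamespace()` from base R and assumes you have a working CRAN mirror configured.

```r
# Install lmtest only if it is not already available on this machine.
if (!requireNamespace("lmtest", quietly = TRUE)) {
  install.packages("lmtest")
}

# Load the package so that bptest() can be called directly.
library(lmtest)
```

Running this more than once is harmless, because the installation is skipped once the package is present.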
Finally, we can run the test:
bptest(model2)
The results are as follows:
Notice that the p-value of this test is 0.6707. This is larger than the usual 0.05 significance level, so we don’t have enough evidence to reject the null hypothesis of constant variance.
Therefore, we can treat our model as fulfilling the homoscedasticity assumption!
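Without going into the full derivation, here is a rough sketch of what the default (studentized) version of the test does: it regresses the squared residuals on the explanatory variables and checks whether that auxiliary regression explains anything. Since the data behind `model2` is not reproduced in this post, the sketch assumes a stand-in model built on R's built-in `mtcars` data; the names `fit`, `aux`, `bp_stat`, and `p_value` are only illustrative.

```r
# Stand-in model (assumption): mtcars replaces the data used for model2.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Auxiliary regression: squared residuals on the same explanatory variables.
res_sq <- residuals(fit)^2
aux <- lm(res_sq ~ wt + hp, data = mtcars)

# Test statistic: n times the R-squared of the auxiliary regression.
n <- nobs(fit)
bp_stat <- n * summary(aux)$r.squared

# Degrees of freedom: number of explanatory variables (intercept excluded).
df <- length(coef(aux)) - 1

# p-value from the chi-squared distribution (upper tail).
p_value <- pchisq(bp_stat, df = df, lower.tail = FALSE)

print(c(statistic = bp_stat, df = df, p.value = p_value))
```

A large p-value means the squared residuals are not explained by the explanatory variables, which is what we expect under constant variance.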
Normality
If you’ve read my previous post on the normal distribution, then you should know about the Shapiro-Wilk test.
Basically, we need to run the Shapiro-Wilk test on the errors (residuals) of our model. To get the residuals of our model, we can use the following code
model2$residuals
Therefore, to test this assumption, we only need to use this code
shapiro.test(model2$residuals)
From the results, we have a p-value of 0.8208, so we don’t have enough evidence to reject the null hypothesis that the errors are normally distributed.
Therefore, we can treat our model as fulfilling the normality assumption!
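The same check can be sketched end to end on a stand-in model, together with a Q-Q plot as a visual companion to the test. This example assumes R's built-in `mtcars` data in place of the data behind `model2`, and a 0.05 significance level; both are illustrative choices, not part of the original analysis.

```r
# Stand-in model (assumption): mtcars replaces the data used for model2.
fit <- lm(mpg ~ wt + hp, data = mtcars)
res <- residuals(fit)  # same values as fit$residuals

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal.
sw <- shapiro.test(res)
print(sw)

# Compare the p-value with the chosen significance level (assumed 0.05).
alpha <- 0.05
if (sw$p.value < alpha) {
  message("Evidence against normality of the residuals.")
} else {
  message("No evidence against normality of the residuals.")
}

# Visual check: points close to the line suggest approximate normality.
qqnorm(res)
qqline(res)
```

The plot is useful alongside the test because it shows where any departure from normality happens, for example in the tails.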
Multicollinearity
To test for multicollinearity in our model, we can use the Variance Inflation Factor (VIF) value of each explanatory variable.
To do this test, we first need to load the ‘car’ package in RStudio (install the package first if you haven’t).
library(car)
Next, we use the following code to get the VIF value of each explanatory variable:
vif(model2)
Now, there is no definitive threshold that shows whether or not your model has multicollinearity. However, a good rule of thumb is that if all your VIF values are less than 5, multicollinearity is not a serious problem in your model.
This threshold varies from person to person: some say less than 5, while others tolerate values up to 10.
From the results, the VIF values are all less than 5, so we can conclude that our model fulfills this assumption.
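To see where a VIF value comes from, it can be computed by hand in base R: regress each explanatory variable on the others and take 1 / (1 - R²) of that regression. The sketch below assumes three numeric predictors from R's built-in `mtcars` data as stand-ins for the variables in `model2`; the names `predictors` and `manual_vif` are illustrative.

```r
# Stand-in explanatory variables (assumption): not the ones used in model2.
predictors <- c("wt", "hp", "disp")

# For each predictor, regress it on the remaining predictors and
# convert the R-squared of that regression into a VIF.
manual_vif <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  aux <- lm(reformulate(others, response = p), data = mtcars)
  1 / (1 - summary(aux)$r.squared)
})

print(manual_vif)

# Flag predictors above the rule-of-thumb threshold of 5.
print(names(manual_vif)[manual_vif > 5])
```

A VIF of 1 means the predictor is uncorrelated with the others, and the value grows as the other predictors explain more of it.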
Our model passes all three tests. Therefore, we can start using our model.
Have you tested the linear regression assumptions before? Are there other methods you recommend? Comment them down below!