Testing the Assumptions of Linear Regression in RStudio | by Insufficient

Quick and simple procedure

In my previous post, we built and used our linear regression model. However, we skipped an important step. Before we can use the linear regression model to predict or interpret results, we need to check that its assumptions hold.

The assumptions for linear regression are as follows:

Linearity: The relationship between the dependent variable and the independent variable(s) is linear. This means that the change in the dependent variable is proportional to the change in the independent variable(s).
Independence: The observations in the dataset are independent of each other. This means that the value of one observation does not depend on the value of another observation.
Homoscedasticity: The variance of the errors (the difference between the predicted value and the actual value) is constant across all levels of the independent variables. This means that the spread of the errors should be consistent across the range of the independent variables.
Normality: The errors are normally distributed. This means that the errors should follow a normal distribution with a mean of zero.
No Multicollinearity: There is no perfect multicollinearity between the independent variables. This means that no independent variable is an exact linear combination of the others.

The first two assumptions are hard to test formally here (they depend largely on the theory behind your model and the sampling method you used), but the other three can be tested directly. In this post, I’m going to show you how to test those three assumptions on your linear model using RStudio.

The Data and Model

We are going to use the same data from the previous post.

The model used for this post is the multiple linear regression model with the formula

Multiple Linear Regression has more than one explanatory variable
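For reference, a multiple linear regression with k explanatory variables is usually written in the generic textbook form below. The symbols are generic placeholders and not the specific variables of model2.

```latex
y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_k x_{ik} + \varepsilon_i
```

Here \varepsilon_i is the error term that the homoscedasticity and normality assumptions refer to.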

To import and build the model we need, you can refer to this post.
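If you don’t have the earlier data at hand, a minimal sketch of fitting a comparable model looks like this. The built-in mtcars data and the name demo_model are assumptions for illustration only; they stand in for the data and model2 from the previous post.

```r
# mtcars ships with R; it is a stand-in for the data from the previous post.
# Fit a multiple linear regression with two explanatory variables.
demo_model <- lm(mpg ~ wt + hp, data = mtcars)

# Coefficient table: estimates, standard errors, t values and p-values.
summary(demo_model)$coefficients
```

Any model fitted with lm() can be passed to the tests in the rest of this post in the same way as model2.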

Homoscedasticity

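To test whether our errors have constant variance or not, we can use the Breusch-Pagan test. The test has the following hypotheses: the null hypothesis is that the error variance is constant (homoscedasticity), and the alternative hypothesis is that it is not (heteroscedasticity).

The Hypotheses for the Breusch-Pagan Test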

I’m not going to go into the mathematical detail of this test. To run the test in RStudio, we first need to load the ‘lmtest’ package into R. To do this, we can use the code

library(lmtest)

If you have not installed lmtest yet, you need to install it first with the code:

install.packages("lmtest")

Finally, we can do the test

bptest(model2)

The results are as follows:

Results of the Breusch-Pagan Test in RStudio

Notice that the p-value of this test is 0.6707, which is well above the usual 0.05 significance level, so we don’t have enough evidence to reject the null hypothesis of constant variance.

Therefore, we can treat our model as fulfilling the homoscedasticity assumption!
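To see roughly what the test does under the hood, here is a minimal base-R sketch of the studentized (Koenker) form of the Breusch-Pagan statistic, which, as far as I know, is what bptest() computes by default. The mtcars data and the names demo_model, aux, bp_stat, bp_df and bp_p are assumptions for illustration, not the data or model from this post.

```r
# Stand-in model on the built-in mtcars data (an assumption, not the post's data).
demo_model <- lm(mpg ~ wt + hp, data = mtcars)

# Auxiliary regression: squared residuals on the same explanatory variables.
sq_resid <- residuals(demo_model)^2
aux <- lm(sq_resid ~ wt + hp, data = mtcars)

# Studentized statistic: number of observations times the auxiliary R-squared.
bp_stat <- nobs(aux) * summary(aux)$r.squared

# Degrees of freedom: number of explanatory variables in the auxiliary fit.
bp_df <- length(coef(aux)) - 1

# Upper-tail chi-squared p-value.
bp_p <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

c(statistic = bp_stat, df = bp_df, p.value = bp_p)
```

As with bptest(), a large p-value means there is no evidence against constant variance.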

Normality

If you’ve read my previous post on the normal distribution, then you should know about the Shapiro-Wilk test.

Basically, we need to run the Shapiro-Wilk test on the errors (residuals) of our model. To get the residuals of our model, we can use the following code

model2$residuals

Therefore, to test this assumption, we only need to use this code

shapiro.test(model2$residuals)

From the results, we have a p-value of 0.8208, so we don’t have enough evidence to reject the null hypothesis that the residuals are normally distributed.

Therefore, we can treat our model as fulfilling the normality assumption!
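The decision rule used here can be sketched in code as well. This is only an illustration: the mtcars data, the name demo_model and the 0.05 significance level are assumptions, not values taken from this post.

```r
# Stand-in model on the built-in mtcars data (an assumption, not the post's data).
demo_model <- lm(mpg ~ wt + hp, data = mtcars)

# Shapiro-Wilk test on the residuals; the null hypothesis is normality.
sw <- shapiro.test(residuals(demo_model))

# Compare the p-value with a chosen significance level.
alpha <- 0.05
reject_normality <- sw$p.value < alpha

c(W = unname(sw$statistic), p.value = sw$p.value)
reject_normality
```

If reject_normality is FALSE, the test gives no evidence against normally distributed residuals.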

Multicollinearity

To test for multicollinearity in our model, we can use the Variance Inflation Factor (VIF) value of each explanatory variable.

To do this test, we first need to load the ‘car’ package in RStudio (install the package first if you haven’t).

library(car)

Next, we use the following code to get the VIF value of each explanatory variable

vif(model2)

The VIF value of each variable

Now, there is no hard threshold that shows whether or not your model has multicollinearity. However, a common rule of thumb is that if all your VIF values are less than 5, multicollinearity is not a serious concern.

The threshold varies from person to person: some use 5 as the cutoff, while others tolerate values up to 10.

From the results, all the VIF values are less than 5, so we can conclude that our model fulfills this assumption.
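For intuition, the VIF of an explanatory variable can be computed by hand as 1 / (1 - R²), where R² comes from regressing that variable on all the other explanatory variables. Below is a minimal base-R sketch of that calculation. The mtcars data, the three predictors and the names vif_manual and any_high are assumptions for illustration; for a model with only numeric predictors the result should agree with vif() from the car package.

```r
# Hypothetical predictors from the built-in mtcars data (not the post's data).
predictors <- c("wt", "hp", "disp")

# For each predictor, regress it on the others and convert R-squared to a VIF.
vif_manual <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  aux <- lm(reformulate(others, response = p), data = mtcars)
  1 / (1 - summary(aux)$r.squared)
})

vif_manual

# Flag any predictor above the rule-of-thumb cutoff of 5.
any_high <- any(vif_manual >= 5)
any_high
```

A VIF of 1 means the variable is uncorrelated with the others; the larger the value, the more of its variation the other variables already explain.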

Our model fulfills all the assumptions we tested. Therefore, we can start using our model.

Have you tested the linear regression assumptions before? Are there other methods you recommend? Comment them down below!

A follow and a clap are much appreciated.
