![](https://crypto4nerd.com/wp-content/uploads/2023/07/1jS1YNcVL43agmsyMFoHvYg.png)
In the field of machine learning and statistical modeling, regression analysis plays a vital role in predicting and understanding relationships between variables. Linear regression, a popular regression technique, assumes a linear relationship between the dependent variable and one or more independent variables.
However, in many real-world scenarios, the relationship between variables may not be strictly linear. This is where polynomial linear regression comes into play, allowing for more flexible modeling by incorporating polynomial terms. In this article, we will delve into the concept of polynomial linear regression.
- Introduction
- Polynomial Linear Regression
- Performing Polynomial Regression in Python
- Advantages of Polynomial Linear Regression
- Disadvantages of Polynomial Linear Regression
- Conclusion
Polynomial linear regression is an extension of simple linear regression that allows for fitting polynomial functions to the data. In polynomial regression, the relationship between the independent variable (x) and the dependent variable (y) is modeled using a polynomial equation of degree ’n’. The degree represents the highest power of the independent variable in the equation.
The general form of a polynomial regression equation is:
y = β₀ + β₁x + β₂x² + β₃x³ + … + βₙxⁿ
Here, y represents the dependent variable, x is the independent variable, and β₀, β₁, β₂, …, βₙ are the coefficients.
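To make the equation concrete, here is a tiny numeric sketch that evaluates a degree-2 polynomial for made-up coefficients (the values of β here are purely illustrative):

```python
import numpy as np

# Made-up coefficients [beta0, beta1, beta2] for the degree-2 model
# y = 1 + 2x + 0.5x^2
beta = np.array([1.0, 2.0, 0.5])

def poly_predict(x, beta):
    # Build the powers [1, x, x^2, ...] and take the dot product with beta
    powers = np.vander(np.atleast_1d(x), N=len(beta), increasing=True)
    return powers @ beta

print(poly_predict(2.0, beta))  # 1 + 2*2 + 0.5*4 = [7.]
```

This is exactly the computation a fitted polynomial regression model performs at prediction time, just with learned rather than hand-picked coefficients.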
- Data Preparation: As with any regression analysis, it is crucial to preprocess and clean the data before applying polynomial regression. This involves handling missing values, outlier detection, and feature scaling if required.
- Polynomial Feature Transformation: To incorporate polynomial terms, the original independent variable (x) is transformed by adding new columns that represent x raised to different powers (x², x³, etc.). This is typically done using libraries or functions provided by machine learning frameworks like scikit-learn.
- Degree Selection: Selecting the appropriate degree for the polynomial equation is crucial. A degree that is too low may result in underfitting, whereas an excessively high degree can lead to overfitting.
- Model Evaluation: Similar to linear regression, model evaluation metrics such as mean squared error (MSE), R-squared, or adjusted R-squared can be used to assess the performance of the polynomial regression model.
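The degree-selection step above can be sketched with synthetic data: fit models of several degrees and compare their error on a held-out validation set. The quadratic data and the candidate degrees here are illustrative choices, not part of the power-plant example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data with a quadratic trend plus noise (illustrative only)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = 1.0 - 2.0 * X[:, 0] + 0.5 * X[:, 0] ** 2 + rng.normal(0, 0.3, size=200)

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

val_mse = {}
for degree in (1, 2, 5, 10):
    poly = PolynomialFeatures(degree=degree)
    model = LinearRegression().fit(poly.fit_transform(X_tr), y_tr)
    val_mse[degree] = mean_squared_error(y_val, model.predict(poly.transform(X_val)))
    print(f"degree={degree:2d}  validation MSE={val_mse[degree]:.3f}")
```

Degree 1 underfits the curvature and shows a clearly higher validation MSE than degree 2; very high degrees add flexibility without improving (and eventually hurting) validation error.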
In this section, you'll learn how to perform polynomial regression in Python.
We will analyze data from a combined cycle power plant and attempt to build a predictive model for its output power.
Step 1: Importing Python Libraries
The first step is to start your Jupyter notebook and load all the prerequisite libraries. Here are the important libraries that we will need for this polynomial regression.
- NumPy (to perform certain mathematical operations)
- pandas (to store the data in a pandas Data Frames)
- matplotlib.pyplot (you will use matplotlib to plot the data)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
Step 2: Loading the Dataset
Let us now import the data into a DataFrame. A DataFrame is pandas' tabular data structure; the simplest way to understand it is that it stores all your data in rows and labeled columns, much like a spreadsheet.
df = pd.read_csv('Data[1].csv')
df.head()
X = df.iloc[:, :-1].values  # every column except the last -> input features
y = df.iloc[:, -1].values   # the last column -> target (output power)
Step 3: Splitting the dataset into the Training and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
This line imports the function train_test_split from the sklearn.model_selection module. This module provides various methods for splitting data into subsets for model training, evaluation and validation.
Here, X and y represent your input features and corresponding target values, respectively. The test_size parameter specifies the proportion of the data that should be allocated for testing. In this case, test_size=0.25 means that 25% of the data will be used for testing, while the remaining 75% will be used for training.
The random_state parameter is an optional argument that allows you to set a seed value for the random number generator. By providing a specific random_state value (e.g., random_state=42), you ensure that the data is split in a reproducible manner.
The train_test_split function returns four separate arrays: X_train, X_test, y_train, and y_test. X_train and y_train represent the training data, while X_test and y_test represent the testing data.
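A quick self-contained sketch (on toy arrays, not the power-plant data) shows both the reproducibility that random_state buys you and the resulting train/test shapes:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples with 2 features each
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# The same random_state always produces the same split
split_a = train_test_split(X, y, test_size=0.25, random_state=42)
split_b = train_test_split(X, y, test_size=0.25, random_state=42)
assert all(np.array_equal(p, q) for p, q in zip(split_a, split_b))

X_train, X_test, y_train, y_test = split_a
# 25% of 10 samples -> 3 test samples (scikit-learn rounds the test size up)
print(X_train.shape, X_test.shape)
```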
Step 4: Training the Polynomial Linear Regression model on the Training set
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
poly_reg = PolynomialFeatures(degree=4)
X_poly = poly_reg.fit_transform(X_train)
regressor = LinearRegression()
regressor.fit(X_poly,y_train)
The first two lines of code import the necessary modules: PolynomialFeatures from sklearn.preprocessing and LinearRegression from sklearn.linear_model.
Next, you create an instance of the PolynomialFeatures class and assign it to the variable poly_reg. The degree parameter is set to 4, indicating that you want to generate polynomial features up to degree 4. This means that for each feature in X_train, the PolynomialFeatures class will create new features that are powers of the original feature up to degree 4.
The fit_transform() method of poly_reg is then applied to X_train. This method fits the PolynomialFeatures transformer to the training data and transforms the original features into polynomial features. The result is stored in X_poly, which now contains the original features as well as the polynomial features.
Here, you create an instance of the LinearRegression class and assign it to the variable regressor. The fit() method is then called on regressor, with X_poly (the polynomial features) as the input features and y_train as the corresponding target values. This step trains the linear regression model on the polynomial features.
During the training process, the linear regression model will learn the optimal coefficients (slope and intercept) that minimize the difference between the predicted values and the actual target values in the training data. By using polynomial features, the model can capture non-linear relationships between the features and the target variable.
Once the fit() method completes, the linear regression model (regressor) will have learned from the training data and be ready to make predictions.
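What the model "learned" lives in its intercept_ and coef_ attributes. As a self-contained illustration (a noise-free toy dataset, not the power-plant data), a degree-2 fit recovers the coefficients of the function that generated the data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Noise-free toy data generated by y = 3 + 2x^2
X = np.linspace(-2, 2, 20).reshape(-1, 1)
y = 3 + 2 * X[:, 0] ** 2

poly = PolynomialFeatures(degree=2)
model = LinearRegression().fit(poly.fit_transform(X), y)

# The fit recovers the generating coefficients: intercept ~3 and a weight
# of ~2 on the x^2 column. The first coefficient belongs to the constant
# column that PolynomialFeatures adds, so it ends up ~0.
print(round(model.intercept_, 4), np.round(model.coef_, 4))
```

With several input columns (as in the power-plant data), coef_ also contains weights for the interaction terms that PolynomialFeatures generates.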
Step 5: Predicting the Test set results
y_pred = regressor.predict(poly_reg.transform(X_test))
np.set_printoptions(precision=2)
print(y_pred)
This line of code uses the predict() method of the trained model to generate predictions for the test data X_test. The predict() method takes the input features (X_test) as an argument and returns the predicted values for the target variable (y_pred).
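A common way to eyeball how close the predictions are is to print them side by side with the actual test values. The arrays below are made-up stand-ins for y_pred and y_test from the snippet above:

```python
import numpy as np

# Illustrative values standing in for y_pred and y_test
y_pred = np.array([431.23, 459.87, 461.01])
y_test = np.array([431.50, 460.01, 460.90])

# Stack them side by side: column 0 = predicted, column 1 = actual
comparison = np.concatenate(
    (y_pred.reshape(-1, 1), y_test.reshape(-1, 1)), axis=1
)
np.set_printoptions(precision=2)
print(comparison)
```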
Step 6: Evaluating the Model Performance
from sklearn.metrics import r2_score
r2_score(y_test, y_pred)
This code imports the r2_score function from scikit-learn's metrics module. The r2_score function is commonly used as an evaluation metric for regression models, including linear regression. It measures the proportion of the variance in the target variable that is predictable from the input features. Note that r2_score expects the true values as its first argument and the predictions as its second: r2_score(y_test, y_pred).
A higher R-squared score indicates a better fit of the regression model to the data, where 1 represents a perfect fit and 0 represents no relationship between the predicted and actual values.
An R-squared score of 0.939 for the regressor indicates that approximately 93.9% of the variance in the target variable can be explained by the polynomial regression model’s predictions. This suggests a very good fit of the model to the data.
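To demystify the score, R² can be computed by hand as 1 − SS_res / SS_tot and checked against scikit-learn. The four data points below are made up for illustration:

```python
import numpy as np
from sklearn.metrics import r2_score

# Small worked example with made-up values
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.1, 7.2, 8.9])

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
manual_r2 = 1 - ss_res / ss_tot

# Matches scikit-learn (true values first, predictions second)
print(round(manual_r2, 4), round(r2_score(y_true, y_pred), 4))
```

Because SS_tot is computed from the variance of the *true* values, swapping the two arguments generally changes the result, which is why the argument order in r2_score matters.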
- Flexibility: Polynomial regression allows for modeling complex relationships that cannot be adequately captured by simple linear regression. It can capture nonlinear patterns and curvature in the data.
- Improved Fit: By incorporating polynomial terms, polynomial regression can provide a better fit to the data, resulting in improved predictive performance.
- Interpretability: Polynomial regression equations can still be interpreted similarly to linear regression, allowing for insights into the impact of the independent variables on the dependent variable.
- Overfitting: As the degree of the polynomial increases, the model becomes more flexible and can fit the training data more closely, to the point of modeling noise rather than the underlying trend. Such a model generalizes poorly to unseen data.
- Complexity: The complexity of polynomial linear regression increases with the degree of the polynomial. When dealing with higher-degree polynomials, the number of parameters to estimate grows rapidly, leading to a more complex model.
- Increased computational requirements: Polynomial linear regression with high-degree polynomials requires more computational resources and time to train compared to simple linear regression.
- Sensitivity to outliers: Polynomial linear regression can be sensitive to outliers in the data.
Polynomial linear regression offers a flexible approach to modeling relationships between variables by incorporating polynomial terms. It excels at capturing nonlinear patterns, which makes it useful for gaining insights from complex datasets. However, it is essential to carefully select the degree of the polynomial equation to avoid overfitting or underfitting. By understanding and applying polynomial linear regression effectively, you can make more accurate predictions.