![](https://crypto4nerd.com/wp-content/uploads/2024/01/0lESjli8REA0lXYK6-1024x683.jpeg)
In the previous post, we explored the classification task; now we continue our routine with the supervised learning regression approach.
To recap, supervised learning is a machine learning approach in which the algorithm is trained on labeled data in order to predict unseen or new data. Kindly refer to my previous article for more details.
Once you have revised the previous article, you’re ready to go! The dataset and code can be found in my repo.
Here’s our regression analysis process:
- Load the data
- Feature engineering
- Split the data
- Create a Pipeline with the needed regression algorithms.
- Pass the Pipeline into the created GridSearchCV
- Choose the best estimator with its parameters
- Evaluate the regression approach using the R² score and RMSE.
Loading data
# Load the data
import pandas as pd
import numpy as np

df = pd.read_csv("../classify/classification_dataset.csv")
df2 = pd.read_csv("../classify/classification_dataset_two.csv")
df['clusters'] = df2['clusters']
df.sample(random_state=42, n=5)
Import needed libraries for modelling
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, FunctionTransformer, MinMaxScaler, RobustScaler
from sklearn.linear_model import Ridge, Lasso, LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, ShuffleSplit, train_test_split
from sklearn.metrics import mean_squared_error
Data split
# 'age' is the regression target
X = df.drop(columns=['age'])
y = df['age']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=.20)
Creating feature engineering objects
def log_transform(x):
    return np.log1p(x)

columns_to_scale = df.copy()               # a copy of the data (kept for reference)
standard_scaler = StandardScaler()         # z-score standardisation
min_max_scaler = MinMaxScaler()            # scales features to [0, 1]
robust_scaler = RobustScaler()             # scaling that is robust to outliers
function_transformer = FunctionTransformer(log_transform)  # applies log1p
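To see what the function transformer will do, here is a tiny illustration (not part of the original pipeline, with made-up values): np.log1p compresses large, skewed values while leaving zeros at zero.
# Hypothetical example values, purely for illustration
print(np.log1p([0, 9, 99, 999]))
# [0.         2.30258509 4.60517019 6.90775528]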
Creating a column transformer
column_transformer = ColumnTransformer(
    transformers=[
        ('standard_scaling', standard_scaler, [1, 4, 5, 6, 9, 10]),
        ('functional_transformer', function_transformer, [3])
    ],
    remainder='passthrough'
)
Why Column transformer?
This is a useful tool in machine learning when dealing with datasets that have different feature types, such as a combination of numerical and categorical features. It lets you apply different preprocessing steps to different subsets of features, thereby streamlining the data transformation process.
As seen in the code above, I passed column indexes 1, 4, 5, 6, 9 and 10 into the standard_scaler object for feature scaling, while index 3 was passed into the function_transformer for a log transform.
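As a quick sanity check (a sketch, assuming the X_train frame from the split above), you can apply the transformer on its own and confirm that the column count is preserved; note that the transformed columns are moved to the front of the output and the passthrough columns follow.
# Fit and apply the column transformer directly (illustrative only)
transformed = column_transformer.fit_transform(X_train)
print(transformed.shape)  # same number of rows and columns as X_train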
Pipeline Creation
my_pipe = Pipeline(
    [
        ("composer", column_transformer),
        # ('reg', Ridge(random_state=42))
        ('reg', LinearRegression(n_jobs=-2))
    ]
)
GridSearchCV Creation
my_params = [
    {
        # Column transformer: try swapping the whole step for a MinMaxScaler
        'composer': [min_max_scaler]
        # 'composer': [min_max_scaler, robust_scaler, None]
    },
    {
        # Ridge
        "reg": [Ridge(random_state=42)],
        # higher alpha --> stronger L2 regularisation (combats overfitting)
        'reg__alpha': [0.01, 0.1, 0.5, 1, 5, 10],
        "reg__max_iter": [50, 100, 1000, 1500],
    },
    {
        # Lasso
        "reg": [Lasso(random_state=42)],
        # higher alpha --> stronger L1 regularisation (combats overfitting)
        'reg__alpha': [0.01, 0.1, 0.5, 1],
        "reg__max_iter": [50, 100, 1000, 1500],
        "reg__warm_start": [True, False],
    },
]

my_cv = ShuffleSplit(n_splits=5, test_size=.20, random_state=42)
mygrid = GridSearchCV(my_pipe, param_grid=my_params, cv=my_cv)
mygrid.fit(X_train, y_train)
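Before moving on, it is worth counting what the search actually does. The list-of-dicts grid gives one LinearRegression candidate (the composer swap only), 6 × 4 = 24 Ridge candidates and 4 × 4 × 2 = 32 Lasso candidates, i.e. 57 parameter settings, each fitted on the 5 ShuffleSplit folds (285 fits in total). A quick sanity check after fitting:
# Number of parameter settings evaluated by the grid search
print(len(mygrid.cv_results_['params']))  # expect 57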
A comprehensive explanation of Ridge, Lasso and linear regression can be found in the references. But in brief:
- Linear regression is the simplest and most classic linear model for regression analysis.
- Ridge regression is also a linear model; it helps restrict the model from overfitting. The regularisation used by Ridge regression is L2 regularisation.
- Lasso regression is another linear model that also restricts overfitting; it uses L1 regularisation, which can shrink some coefficients to exactly zero and thus performs feature selection.
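To make the L1/L2 contrast concrete, here is a minimal sketch (not part of the original post, assuming the numeric X_train and y_train from the split above): Lasso’s L1 penalty can drive coefficients exactly to zero, while Ridge’s L2 penalty only shrinks them towards zero.
# Fit both regularised models directly on the training data (illustrative)
ridge = Ridge(alpha=1.0, random_state=42).fit(X_train, y_train)
lasso = Lasso(alpha=1.0, random_state=42).fit(X_train, y_train)
print("Ridge coefficients set to zero:", (ridge.coef_ == 0).sum())
print("Lasso coefficients set to zero:", (lasso.coef_ == 0).sum())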
Refer to the references for why we use grid search.
Once the above code is executed, the output below is shown:
GridSearchCV(cv=ShuffleSplit(n_splits=5, random_state=42, test_size=0.2, train_size=None),
             estimator=Pipeline(steps=[('composer',
                                        ColumnTransformer(remainder='passthrough',
                                                          transformers=[('standard_scaling',
                                                                         StandardScaler(),
                                                                         [1, 4, 5, 6, 9, 10]),
                                                                        ('functional_transformer',
                                                                         FunctionTransformer(func=<function log_transform at 0x000001A9905ED940>),
                                                                         [3])])),
                                       ('reg', LinearRegression(n_jobs=-2))]),
             param_grid=[{'composer': [MinMaxScaler()]},
                         {'reg': [Ridge(random_state=42)],
                          'reg__alpha': [0.01, 0.1, 0.5, 1, 5, 10],
                          'reg__max_iter': [50, 100, 1000, 1500]},
                         {'reg': [Lasso(random_state=42)],
                          'reg__alpha': [0.01, 0.1, 0.5, 1],
                          'reg__max_iter': [50, 100, 1000, 1500],
                          'reg__warm_start': [True, False]}])
GridSearchCV inspection
print(f"Best params: {mygrid.best_params_}n")
print(f"Best estimator: {mygrid.best_estimator_}n")
print(f"Best validation score: {mygrid.best_score_}")
Inspection output
Best params: {'composer': MinMaxScaler()}

Best estimator: Pipeline(steps=[('composer', MinMaxScaler()),
                ('reg', LinearRegression(n_jobs=-2))])

Best validation score: 0.5631061061695223
When GridSearchCV is applied to the dataset, the best validation score is 0.56 (note that this is the default R² score for regression, not an accuracy score), with a MinMaxScaler followed by LinearRegression as the best estimator, with its hyperparameters given. Hence we can proceed to the evaluation phase using RMSE and the R² score.
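As a sketch of that evaluation step (assuming the fitted grid and the held-out X_test and y_test from the split above), the refitted best estimator can be scored like this:
# Predict with the best estimator and score the test set
y_pred = mygrid.predict(X_test)
print("R2 score:", mygrid.score(X_test, y_test))
print("RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))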