![](https://crypto4nerd.com/wp-content/uploads/2024/01/0lESjli8REA0lXYK6-1024x683.jpeg)
In the previous post, we explored the classification task; now we continue our routine with the supervised learning regression approach.
To recap, supervised learning is a machine learning approach in which the algorithm is trained on labeled data in order to predict unseen or new data. Kindly refer to my previous article for more details.
Once you have revised the previous article, you’re ready to go! The dataset and code can be found in my repo.
Here’s our regression analysis process:
- Load the data
- Feature engineering
- Split the data
- Create a Pipeline with the needed regression algorithms.
- Pass the Pipeline into the created GridSearchCV
- Choose the best estimator with its parameters
- Evaluate the regression approach using the R² score and RMSE.
Loading data
# Load the data
import pandas as pd
import numpy as np

df = pd.read_csv("../classify/classification_dataset.csv")
df2 = pd.read_csv("../classify/classification_dataset_two.csv")
df['clusters'] = df2['clusters']
df.sample(random_state=42, n=5)
Import needed libraries for modelling
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, FunctionTransformer, MinMaxScaler, RobustScaler
from sklearn.linear_model import Ridge, Lasso, LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, ShuffleSplit, train_test_split
from sklearn.metrics import mean_squared_error
Data split
# 'age' is the regression target
X = df.drop(columns=['age'])
y = df['age']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=.20)
Creating feature engineering objects
def log_transform(x):
    return np.log1p(x)

columns_to_scale = df.copy()               # a copy of the data (kept for reference)
standard_scaler = StandardScaler()         # z-score standardisation
min_max_scaler = MinMaxScaler()            # scales features to [0, 1]
robust_scaler = RobustScaler()             # scaling that is robust to outliers
function_transformer = FunctionTransformer(log_transform)  # applies log1p
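To see what the function transformer will do, here is a tiny illustration (not part of the original pipeline, with made-up values): np.log1p compresses large, skewed values while leaving zeros at zero.
# Hypothetical example values, purely for illustration
print(np.log1p([0, 9, 99, 999]))
# [0.         2.30258509 4.60517019 6.90775528]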
Creating a column transformer
column_transformer = ColumnTransformer(
    transformers=[
        ('standard_scaling', standard_scaler, [1, 4, 5, 6, 9, 10]),
        ('functional_transformer', function_transformer, [3])
    ],
    remainder='passthrough'
)
Why Column transformer?
This is a useful tool in machine learning when dealing with datasets that have different feature types, such as a combination of numerical and categorical features. It lets you apply different preprocessing steps to different subsets of features, thereby streamlining the data transformation process.
As seen in the code above, I passed column indexes 1, 4, 5, 6, 9 and 10 into the standard_scaler object for feature scaling, while index 3 was passed into the function_transformer for a log transform.
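As a quick sanity check (a sketch, assuming the X_train frame from the split above), you can apply the transformer on its own and confirm that the column count is preserved; note that the transformed columns are moved to the front of the output and the passthrough columns follow.
# Fit and apply the column transformer directly (illustrative only)
transformed = column_transformer.fit_transform(X_train)
print(transformed.shape)  # same number of rows and columns as X_train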
Pipeline Creation
my_pipe = Pipeline(
    [
        ("composer", column_transformer),
        # ('reg', Ridge(random_state=42))
        ('reg', LinearRegression(n_jobs=-2))
    ]
)
GridSearchCV Creation
my_params = [
    {
        # Column transformer: try swapping the whole step for a MinMaxScaler
        'composer': [min_max_scaler]
        # 'composer': [min_max_scaler, robust_scaler, None]
    },
    {
        # Ridge
        "reg": [Ridge(random_state=42)],
        # higher alpha --> stronger L2 regularisation (combats overfitting)
        'reg__alpha': [0.01, 0.1, 0.5, 1, 5, 10],
        "reg__max_iter": [50, 100, 1000, 1500],
    },
    {
        # Lasso
        "reg": [Lasso(random_state=42)],
        # higher alpha --> stronger L1 regularisation (combats overfitting)
        'reg__alpha': [0.01, 0.1, 0.5, 1],
        "reg__max_iter": [50, 100, 1000, 1500],
        "reg__warm_start": [True, False],
    },
]

my_cv = ShuffleSplit(n_splits=5, test_size=.20, random_state=42)
mygrid = GridSearchCV(my_pipe, param_grid=my_params, cv=my_cv)
mygrid.fit(X_train, y_train)
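Before moving on, it is worth counting what the search actually does. The list-of-dicts grid gives one LinearRegression candidate (the composer swap only), 6 × 4 = 24 Ridge candidates and 4 × 4 × 2 = 32 Lasso candidates, i.e. 57 parameter settings, each fitted on the 5 ShuffleSplit folds (285 fits in total). A quick sanity check after fitting:
# Number of parameter settings evaluated by the grid search
print(len(mygrid.cv_results_['params']))  # expect 57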
A comprehensive explanation of Ridge, Lasso and linear regression can be found in the references. But in brief:
- Linear regression is the simplest and most classic linear model for regression analysis.
- Ridge regression is also a linear model; it helps restrict the model from overfitting. The regularisation used by Ridge regression is L2 regularisation.
- Lasso regression is another linear model that also restricts overfitting; it uses L1 regularisation, which can shrink some coefficients to exactly zero and thus performs feature selection.
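To make the L1/L2 contrast concrete, here is a minimal sketch (not part of the original post, assuming the numeric X_train and y_train from the split above): Lasso’s L1 penalty can drive coefficients exactly to zero, while Ridge’s L2 penalty only shrinks them towards zero.
# Fit both regularised models directly on the training data (illustrative)
ridge = Ridge(alpha=1.0, random_state=42).fit(X_train, y_train)
lasso = Lasso(alpha=1.0, random_state=42).fit(X_train, y_train)
print("Ridge coefficients set to zero:", (ridge.coef_ == 0).sum())
print("Lasso coefficients set to zero:", (lasso.coef_ == 0).sum())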
Refer to the references for why we use grid search.
Once the above code is executed, the output below is shown:
GridSearchCV(cv=ShuffleSplit(n_splits=5, random_state=42, test_size=0.2, train_size=None),
             estimator=Pipeline(steps=[('composer',
                                        ColumnTransformer(remainder='passthrough',
                                                          transformers=[('standard_scaling',
                                                                         StandardScaler(),
                                                                         [1, 4, 5, 6, 9, 10]),
                                                                        ('functional_transformer',
                                                                         FunctionTransformer(func=<function log_transform at 0x000001A9905ED940>),
                                                                         [3])])),
                                       ('reg', LinearRegression(n_jobs=-2))]),
             param_grid=[{'composer': [MinMaxScaler()]},
                         {'reg': [Ridge(random_state=42)],
                          'reg__alpha': [0.01, 0.1, 0.5, 1, 5, 10],
                          'reg__max_iter': [50, 100, 1000, 1500]},
                         {'reg': [Lasso(random_state=42)],
                          'reg__alpha': [0.01, 0.1, 0.5, 1],
                          'reg__max_iter': [50, 100, 1000, 1500],
                          'reg__warm_start': [True, False]}])
GridSearchCV inspection
print(f"Best params: {mygrid.best_params_}n")
print(f"Best estimator: {mygrid.best_estimator_}n")
print(f"Best validation score: {mygrid.best_score_}")
Inspection output
Best params: {'composer': MinMaxScaler()}

Best estimator: Pipeline(steps=[('composer', MinMaxScaler()),
                ('reg', LinearRegression(n_jobs=-2))])

Best validation score: 0.5631061061695223
When GridSearchCV is applied to the dataset, the best validation score is 0.56 (note that this is the default R² score for regression, not an accuracy score), with a MinMaxScaler followed by LinearRegression as the best estimator, with its hyperparameters given. Hence we can proceed to the evaluation phase using RMSE and the R² score.
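As a sketch of that evaluation step (assuming the fitted grid and the held-out X_test and y_test from the split above), the refitted best estimator can be scored like this:
# Predict with the best estimator and score the test set
y_pred = mygrid.predict(X_test)
print("R2 score:", mygrid.score(X_test, y_test))
print("RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))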