Hello fellow aspiring Data Scientists,
Today, I would like to share some of the knowledge I have gained after studying Machine Learning for 28 days. I understand that many of you may be interested in building your own Machine Learning models but are unsure of the necessary steps. Don’t worry, as I will guide you through the process today with step-by-step instructions and accompanying illustrations.
Let’s get started on this exciting journey!
(Disclaimer: Please ensure that you have a good understanding of statistics, data analysis, and the required libraries for Machine Learning before proceeding.)
Put simply, Machine Learning is a set of algorithms that learn from data and then make predictions.
So, what’s the difference between Artificial Intelligence and Machine Learning?
Artificial Intelligence is a broad field concerned with building computer systems that can perform a wide range of tasks that would normally require human intelligence. Machine Learning, on the other hand, is a subset of Artificial Intelligence that focuses on developing algorithms that can learn from data and make predictions or decisions based on it.
Alright, shall we start?
Machine learning
Today, I will be using a dataset called ‘Churn’. This dataset provides information about the churn behavior of customers in a bank, indicating whether they have churned (left) or not. We will be focusing on a classification case, where our goal is to predict whether a customer will churn or not.
Please note that the dataset is in CSV format.
Dataset Overview:
– It consists of 10000 observations and 12 variables.
– Independent variables contain information about customers.
– Dependent variable refers to customer abandonment.
Features:
– Surname: Surname
– CreditScore: Credit score
– Geography: Country (Germany/ France/ Spain)
– Gender: Gender (Female/ Male)
– Age: Age
– Tenure: Number of years the customer has been with the bank
– Balance: Balance
– NumOfProducts: Number of bank products used
– HasCrCard: Credit card status (0 = No, 1 = Yes)
– IsActiveMember: Active membership status (0 = No, 1 = Yes)
– EstimatedSalary: Estimated salary
– Exited: Churn or not? (0 = No, 1 = Yes)
Business Problem
You work as a data scientist at a bank. You have been tasked with predicting whether a prospective customer will churn (stop using the bank’s services).
Problem:
- Determine whether a customer will churn (stop using the bank’s services).
Metrics:
False Positive (FP):
- The ML model predicts that a prospective customer will churn, but in reality, they do not churn.
False Negative (FN):
- The ML model predicts that a prospective customer will not churn, but in reality, they do churn.
I will focus on the FN metric because I want to reduce the occurrence of falsely predicting that a customer will not churn when they actually do. So, we are going to use the F2 metric.
Note: with F2, recall (which is hurt by FNs) is weighted more heavily than precision (which is hurt by FPs), but both are still taken into account.
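To make the metric concrete, here is a tiny, made-up example (the labels below are invented purely for illustration) showing how the F2 score leans toward recall:
from sklearn.metrics import fbeta_score, precision_score, recall_score
# Made-up labels: 1 = churn, 0 = stay
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]
print(precision_score(y_true, y_pred))      # 0.67 (2 TP, 1 FP)
print(recall_score(y_true, y_pred))         # 0.50 (2 TP, 2 FN)
print(fbeta_score(y_true, y_pred, beta=2))  # ~0.53, pulled toward recall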
In this first step, we create a new Jupyter notebook and make sure we import all the libraries we will need for our machine learning workflow.
# Library
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import missingno
import sklearn

# Data Split
from sklearn.model_selection import train_test_split,cross_val_score,GridSearchCV,cross_validate,RandomizedSearchCV,StratifiedKFold
# Preprocessing
from sklearn.compose import ColumnTransformer
# from sklearn.pipeline import Pipeline
from imblearn.pipeline import Pipeline
# Resampling
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import RandomUnderSampler, NearMiss
# Encoding
from sklearn.preprocessing import OneHotEncoder
from category_encoders import OrdinalEncoder, BinaryEncoder
# Scaling functions
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import KBinsDiscretizer
# Functions for imputing missing values
from sklearn.impute import SimpleImputer # mean, median, most_frequent (mode), constant
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer # regression-based imputation
from sklearn.impute import KNNImputer # KNN regression-based imputation
# modeling
from sklearn.tree import DecisionTreeRegressor,DecisionTreeClassifier
from sklearn.linear_model import LinearRegression , Ridge, Lasso, LogisticRegression
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
# Metric
from sklearn.metrics import mean_squared_error,accuracy_score
from sklearn.metrics import precision_score,recall_score
from sklearn.metrics import confusion_matrix, classification_report, ConfusionMatrixDisplay
from sklearn.metrics import precision_recall_curve, roc_curve, PrecisionRecallDisplay, RocCurveDisplay, auc
from sklearn.metrics import make_scorer, mean_absolute_error, mean_squared_error, mean_absolute_percentage_error,fbeta_score
# Ensemble various type (modeling)
from sklearn.ensemble import VotingClassifier, StackingClassifier
# Ensemble similar type (modeling)
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier # Bagging
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier # Boosting
from xgboost.sklearn import XGBClassifier
import lightgbm as lgb
# Save Model
import pickle
The code above imports all the libraries needed for this machine learning workflow, although I will not end up using all of them.
We are going to read the dataset first and get a brief understanding of it, checking whether it has duplicates or missing values. We must deal with both of these things, because they can quietly distort what our machine learning model learns.
We will use ‘pd.read_csv()’ to read the dataset, save it in a new variable, and then take a quick look at the first 5 rows with the ‘.head()’ function.
Then, in a new cell, we call the ‘.info()’ function, which prints a concise summary of a DataFrame.
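Here’s a minimal sketch of those first cells (the file name ‘Churn.csv’ is an assumption; adjust the path to wherever your copy of the dataset lives):
# Read the dataset (the file name is an assumption)
df = pd.read_csv('Churn.csv')
# First 5 rows
df.head()
# Concise summary: columns, dtypes, non-null counts
df.info()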
In the previous statement, we observed that our dataset doesn’t contain any missing values. However, we are still uncertain about the presence of duplicates in the data. To check for duplicates, we will use the function ‘.duplicated().sum()’.
Ahh, upon examining our dataset, we can see that it does not contain any duplicate records. Additionally, it is important to determine the number of rows in our dataset. We can do this with the ‘.shape’ attribute.
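Both checks are one-liners:
# Count fully duplicated rows
df.duplicated().sum()
# (number of rows, number of columns)
df.shape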
Alright, based on the analysis of our dataset, we have determined that it contains 10,000 rows and 14 features or columns.
In the third step, we will conduct a quick EDA (Exploratory Data Analysis). This is crucial because we need to know whether outliers are present. If we neglect to handle outliers, the model may end up fitting that noise, which shows up as overfitting: the model performs exceptionally well on the training data but fails to generalize to new, unseen data. Therefore, it is necessary to gain a comprehensive understanding of the data before modeling.
We will initiate this step by generating a description of the dataset to obtain a quick understanding. Here’s the code:
pd.set_option('display.max_colwidth', None)
listItem = []
for col in df.columns:
    unique_values = df[col].nunique()
    if unique_values >= 20:
        listItem.append([col, df[col].dtype, df[col].isna().sum(),
                         round((df[col].isna().sum() / len(df[col])) * 100, 2),
                         unique_values, list(df[col].drop_duplicates().sample(20).values)])
    else:
        listItem.append([col, df[col].dtype, df[col].isna().sum(),
                         round((df[col].isna().sum() / len(df[col])) * 100, 2),
                         unique_values, list(df[col].drop_duplicates().values)])

dfDesc = pd.DataFrame(columns=['dataFeatures', 'dataType', 'null', 'nullPct', 'unique', 'uniqueSample'],
                      data=listItem)
dfDesc
Okay, next we also want to check the class proportions of the target. To do that, we simply count the target values and divide by the length of the target, like this:
df['Exited'].value_counts() / len(df['Exited'])
Our target classes are imbalanced, which can bias the model toward the majority class, and that is not desirable. To address this issue, we will use a method called ‘resampling’.
Remove Unused Feature
There are 3 columns that I want to delete, namely: RowNumber, Surname, and Gender. Here are the reasons:
- RowNumber: This column only indicates the row number in a specific dataset, so it will not help in prediction.
- Surname: This column is redundant because customers are already identified by CustomerId, and different customers can share the same surname.
- Gender: This column only indicates the gender of the customer, and in this case, predicting whether a customer will churn or not does not require the gender column as I don’t want to discriminate based on gender.
df.drop(labels=['RowNumber','Surname','Gender'],axis=1,inplace=True)
df.shape
After that, we want to check for outliers. To do that, we will utilize the graphical method using a box plot and a for loop. First, we need to define a new variable to hold the index names for both categorical and numerical data. Then, we will loop through each name in the index. Here’s the code:
# Numeric features
display(df.describe().columns)
Numeric_feat = list(df.describe().columns)
Numeric_feat.remove('Exited')

# Categorical features
display(df.describe(include='object').columns)
Category_feat = list(df.describe(include='object').columns)
Numeric Loop
# Look at outliers in each numeric feature (column)
plt.figure(figsize=(15, 9))
plotnumber = 1

for feature in Numeric_feat:
    ax = plt.subplot(4, 3, plotnumber)
    sns.boxplot(x=feature, data=df)
    plt.title(feature, fontsize=16)
    plt.tight_layout()
    plotnumber += 1
Categorical Loop
plt.figure(figsize=(14, 10))
plotnumber = 1

for feature in Category_feat:
    # subplot (plot position)
    ax = plt.subplot(4, 3, plotnumber)
    # category counts split by churn status
    sns.countplot(x=feature, data=df, hue='Exited')
    plt.xlabel(feature)
    plt.title(feature, fontsize=16)
    plt.tight_layout()
    plotnumber += 1
There are 3 columns that have outliers, namely: CreditScore, Age, and NumOfProducts. I will check how many outlier data points are present in each of these columns.
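A quick way to count them is to apply the same cut-offs used in the drop step below (a fuller check would also inspect NumOfProducts, whose cut-off depends on what the box plot flags):
# Rows outside the cut-offs used below (cut-off values taken from the drop step)
print('CreditScore < 383 :', (df['CreditScore'] < 383).sum())
print('Age > 85          :', (df['Age'] > 85).sum())
Based on these counts, the CreditScore and Age outliers are dropped: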
df.drop(df[df['CreditScore']<383].index, inplace=True)
df.drop(df[df['Age']>85].index, inplace=True)
After reviewing the data, I conclude that there is one outlier column I will tolerate, “NumOfProducts,” as its 60 outlying rows carry information I consider significant and useful. The remaining outliers will be removed, since there are only a few of them.
Now that we are done dealing with the outliers and cleaning our dataset, we need to define the features (X) and the target (y).
X = df.drop(columns='Exited')
y = df['Exited']
Now we need to split the dataset into two subsets, one for training and one for testing.
X_train, X_test, y_train, y_test = train_test_split(
X,
y,
test_size=0.2,
random_state=42,
stratify=y
)
Here, we need to insert the X and y variables, followed by setting the test size to 0.2. This means we want to allocate 80% of our dataset for training and 20% for testing. Additionally, I have defined a random state of 42 to ensure consistent random selection of data for the train set and test set each time this code is executed. Lastly, we set ‘stratify’ to y as a parameter to ensure that the proportions of subsets in the train set and test set remain the same.
Note: Before proceeding to the preprocessing phase, it is crucial to split the data first to avoid data leakage. Data leakage occurs when information from the test set or unseen data inadvertently influences the training process, leading to overly optimistic performance results. Therefore, it is essential to split the data into separate training and testing sets before conducting any machine learning analysis.
In this step we enter the preprocessing phase, where we prepare (clean and organize) the raw data to make it suitable for building and training a machine learning model.
First of all, we need to remove some features that don’t need to go through the pipeline.
Numeric_feat.remove('HasCrCard')
Numeric_feat.remove('IsActiveMember')
Numeric_feat
The features above only have 2 unique values, 0 and 1, so they can pass through untouched. Next, we define a variable for the transformer like this:
transformer = ColumnTransformer([
('Numeric',RobustScaler(),Numeric_feat),
('Category',OneHotEncoder(),Category_feat)
],remainder='passthrough')
Just as a friendly reminder, in our data preprocessing step, we will be applying the RobustScaler to all numeric features. Since the data does not follow a normal distribution, the RobustScaler is an appropriate choice for handling such non-normal data distributions.
For the categorical features, we will be using the OneHotEncoder. This encoding technique is suitable for categorical variables with a relatively small number of unique values.
Now we are going to use cross-validation to find which algorithm best matches our dataset. First, we need to define the scoring, the resampling methods, and the algorithms.
#Scoring
f2 = make_scorer(fbeta_score, beta=2)

# Resampling
random_under = RandomUnderSampler(random_state=42)
random_over = RandomOverSampler(random_state=42)
smote = SMOTE(random_state=42,k_neighbors=5)
nearmiss = NearMiss(version=1)
#algo
logreg = LogisticRegression(random_state=42, max_iter=10000)
knn = KNeighborsClassifier(n_neighbors=5)
tree = DecisionTreeClassifier(random_state=42)
# voting (ensemble)
voting = VotingClassifier([
('clf1', logreg),
('clf2', knn),
('clf3', tree)
])
# stacking (ensemble)
stacking = StackingClassifier(
estimators=[
('clf1', logreg),
('clf2', knn),
('clf3', tree)
],
final_estimator= logreg
)
bagging = BaggingClassifier(random_state=42)
random_forest = RandomForestClassifier(random_state=42)
adaboost = AdaBoostClassifier(random_state=42)
gboost = GradientBoostingClassifier(random_state=42)
xgboost = XGBClassifier(random_state=42)
lgbm = lgb.LGBMClassifier(random_state=42)
List_Resamlping = [None,random_under,random_over,smote,nearmiss]
List_Algo = [logreg,knn,tree,voting,stacking,bagging,random_forest,adaboost,gboost,xgboost,lgbm]
After that, we write a loop for the cross-validation, like this:
cv_resample = []
cv_mean = []
cv_std = []
cv_all = []
cv_algo = []

for resample in List_Resamlping:
    for Algo in List_Algo:
        model_pipe = Pipeline([
            ('prep', transformer),
            ('resampling', resample),
            ('model', Algo)])

        model_cv = cross_val_score(
            estimator=model_pipe,
            X=X_train,
            y=y_train,
            cv=5,
            scoring=f2,
        )

        cv_resample.append(resample)
        cv_algo.append(Algo)
        cv_mean.append(model_cv.mean())
        cv_std.append(model_cv.std())
        cv_all.append(model_cv.round(4))
I created empty lists where the results are collected. After that, I put the results into a DataFrame so they can be displayed and sorted.
df_cv = pd.DataFrame({
'algo': cv_algo,
'resamp': cv_resample,
'mean': cv_mean,
'std': cv_std,
'all': cv_all
})

df_cv.sort_values('mean', ascending=False).head()
After running the code mentioned above, we observed that the Gradient Boosting model, along with the RandomUnderSampler technique for resampling, was selected as the best model for our dataset.
Okay, now that we know which algorithm suits our dataset, we just need to do hyperparameter tuning for it. To do that, we are going to use RandomizedSearchCV.
hyperparam_gboost = {
'model__n_estimators': range(50,1001,20),
'model__max_features': range(1,11),
'model__max_depth': range(2,30,2),
'model__learning_rate': np.arange(0.01, 1.00, 0.02),
'model__min_samples_split': range(2,100,5),
'model__min_samples_leaf': range(2,100,5),
'prep__Numeric':[RobustScaler(),StandardScaler(),MinMaxScaler()],
'prep__Category':[OneHotEncoder(),BinaryEncoder()]
}

model_pipe_gboost = Pipeline([
('prep',transformer),
('resampling',RandomUnderSampler(random_state=42)),
('model', gboost)])
randomsearch_gboost = RandomizedSearchCV(
estimator= model_pipe_gboost,
param_distributions= hyperparam_gboost,
cv= 5,
scoring=f2,
n_jobs=-1,
n_iter= 500,
random_state= 42
)
Here’s a quick recap of the code:
In the hyperparam_gboost section, we define the hyperparameter space from which different combinations will be drawn to find the best configuration for our model. The model’s parameters are addressed with the model keyword, followed by double underscores, the parameter’s name, and its candidate values. The same principle applies to the preprocessing steps, where we use the prep keyword followed by the name of the ColumnTransformer step and the transformers to try.
Moving on to the second code snippet, model_pipe_gboost represents the pipeline that our raw data will go through. It includes the preprocessing steps, the undersampler, and the gradient boosting model.
Lastly, the RandomizedSearchCV function is where the magic happens. Rather than exhaustively trying every combination, it randomly samples parameter settings from the defined hyperparameter space and evaluates each one with cross-validation and the specified scoring metric. Its arguments are:
- estimator: the estimator or model pipeline we want to optimize; in our case, model_pipe_gboost.
- param_distributions: the hyperparameter space to explore during the randomized search, given as a dictionary (or a list of dictionaries) mapping parameter names to the values to sample; here, hyperparam_gboost.
- cv: the cross-validation strategy used to split the training data during the search; cv=5 means 5-fold cross-validation.
- scoring: the metric used to evaluate each parameter combination; here, the f2 scorer we built earlier with make_scorer (it must be a valid scorer or a predefined metric name).
- n_jobs: the number of parallel jobs; n_jobs=-1 parallelizes the computation across all available CPU cores.
- n_iter: the number of parameter settings sampled during the search; n_iter=500 means 500 combinations will be tried.
- random_state: the random seed that makes the search reproducible; random_state=42 guarantees the same samples are drawn each time the code is run.
Now, we will fit our selected model.
randomsearch_gboost.fit(X_train, y_train)
Here, I will display the best parameters, best estimator, and best score. It’s important to note that tuning gives no guarantee that the score will improve; in some cases, the default parameters may produce better results than the tuned ones.
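RandomizedSearchCV stores these in its best_params_, best_estimator_, and best_score_ attributes, so printing them is straightforward:
print('Best params   :', randomsearch_gboost.best_params_)
print('Best estimator:', randomsearch_gboost.best_estimator_)
print('Best F2 score :', randomsearch_gboost.best_score_)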
Now, we are going to make predictions on our test set. Here’s the result:
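A minimal sketch of that prediction step might look like the following; the variable names model_pipe_gboost_best, y_pred_benchmark, and y_pred_best are assumptions, chosen to line up with the evaluation code further down:
# Benchmark: the untuned pipeline from the cross-validation stage
model_pipe_gboost.fit(X_train, y_train)
y_pred_benchmark = model_pipe_gboost.predict(X_test)
print('F2 before tuning:', fbeta_score(y_test, y_pred_benchmark, beta=2))

# Tuned: the best estimator found by RandomizedSearchCV
model_pipe_gboost_best = randomsearch_gboost.best_estimator_
y_pred_best = model_pipe_gboost_best.predict(X_test)
print('F2 after tuning :', fbeta_score(y_test, y_pred_best, beta=2))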
We can observe that our model performed better before tuning compared to after tuning. This discrepancy in performance could be attributed to the issue of data imbalance. To address this issue, we can find the best threshold.
Here, we will be iterating over the thresholds in order to find the best threshold value:
list_threshold = np.arange(0.01, 1.00, 0.01)
list_recall_score = []
list_precision_score = []
list_f2_score = []

for threshold in list_threshold:
    # predict probabilities for the positive class
    y_pred_proba = model_pipe_gboost_best.predict_proba(X_test)[:, 1]
    # assign class 1 to predictions with probability above the threshold
    y_pred_class = np.where(y_pred_proba > threshold, 1, 0)

    # scores at this threshold
    list_f2_score.append(fbeta_score(y_test, y_pred_class, beta=2))
    list_precision_score.append(precision_score(y_test, y_pred_class, zero_division=1))
    list_recall_score.append(recall_score(y_test, y_pred_class))
In this code, we create a list of thresholds ranging from 0.01 to 0.99 with a step of 0.01. Then, we iterate over each threshold and perform the following steps:
- We predict the probabilities of the positive class using the
predict_proba
method of ourmodel_pipe_gboost_best
model. - Based on the threshold, we convert the predicted probabilities into binary class predictions. If the probability is above the threshold, we assign the class label 1; otherwise, we assign the class label 0.
- We calculate the F2 score, precision score, and recall score for each threshold and append them to the respective lists.
This code allows us to evaluate the model’s performance at different threshold values and determine the best threshold to address the data imbalance issue. We then display the best threshold in a DataFrame.
df_f2 = pd.DataFrame()
df_f2['threshold']=list_threshold
df_f2['f2'] = list_f2_score

df_f2.sort_values('f2', ascending=False).head()
The best threshold for our model has been determined to be 0.43. Now, we will refit our model and apply this threshold to its predictions. To extract the corresponding best value from the DataFrame, we can use the ‘.iloc’ indexer.
best_threshold = df_f2.sort_values('f2', ascending=False).head().iloc[0, 0]

# Modeling with the best threshold
# pipeline
pipe_model = randomsearch_gboost.best_estimator_

# fit
pipe_model.fit(X_train, y_train)

# predict
y_pred_proba = pipe_model.predict_proba(X_test)[:, 1]
y_pred_optimized = np.where(y_pred_proba > best_threshold, 1, 0)

# f2 score
fbeta_score(y_test, y_pred_optimized, beta=2)
Now, the F2 score increases quite significantly after choosing the right threshold value. Next, we can display the confusion matrix and classification report to further evaluate the performance of our model.
display('Before', confusion_matrix(y_test, y_pred_benchmark),
        'After', confusion_matrix(y_test, y_pred_best),
        'Best Threshold', confusion_matrix(y_test, y_pred_optimized))
print('Before Tuned')
print(classification_report(y_test, y_pred_benchmark))
print('After Tuned')
print(classification_report(y_test, y_pred_best))
print('After Tuned with Best Threshold')
print(classification_report(y_test, y_pred_optimized))
Here’s the key takeaway from the confusion matrix and classification report:
We can observe a trade-off between false negatives (predicting a customer will not churn when they actually do) and false positives (predicting a customer will churn when they actually don’t). Since we are using the F2 score, our focus is primarily on recall rather than precision, although precision is still considered.
Furthermore, our initial model correctly predicted 83% of customers who would not churn and 71% of customers who would churn. After refining the model and choosing an appropriate threshold value, the percentage of correctly predicted non-churners decreased to 76%, while the percentage of correctly predicted churners increased to 80%.
This demonstrates the impact of selecting the right threshold and the resulting improvement in the model’s performance.
That concludes today’s guidelines on creating a machine learning model. However, it’s important to emphasize that we must still carefully consider the best parameters and thresholds based on our specific business problem.
Every machine learning project is unique, and the optimal parameters and thresholds may vary depending on the nature of the problem, the dataset, and the desired outcome. It’s crucial to fine-tune our models, experiment with different parameter combinations, and evaluate their performance based on the specific business problem we are trying to solve.
By continuously iterating, refining, and evaluating our models, we can strive to achieve the best possible results and make informed decisions to address the specific challenges of our business problem.
Thank you.