
The Challenge: The focus here was on understanding relationships in data and predicting outcomes using both tree-based and non-tree-based models.
1. Data Creation:
After setting up my Python environment with the essential libraries, I created a dataset that models a linear relationship between x and y.
import numpy as np

N = 51
b0 = 0   # intercept
b1 = 2   # slope
x = np.arange(0, N, 1)
y = b0 + b1*x + np.random.normal(0, 5, N)  # linear signal plus Gaussian noise
2. Tree-based Model Building:
Used the powerful RandomForest and GradientBoosting regressors for this.
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
tree_models = [RandomForestRegressor(random_state=100), GradientBoostingRegressor()]
3. Non-Tree-based Model Building:
Linear regression, its regularized versions (Ridge, Lasso, ElasticNet), and SVR were used.
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.svm import SVR
non_tree_models = [LinearRegression(), Ridge(), Lasso(), ElasticNet(), SVR()]
4. Predictions for x=100:
Both tree-based and non-tree-based models were trained on the data and then used to predict the outcome for x=100.
import pandas as pd

# The models expect a 2-D feature array, so wrap x and y in a DataFrame
train_df = pd.DataFrame({'x_values': x, 'y_values': y})

x_new = [[100]]
predictions = []
for model in tree_models:
    model.fit(train_df[['x_values']], train_df['y_values'])
    predictions.append(model.predict(x_new)[0])
for model in non_tree_models:
    model.fit(train_df[['x_values']], train_df['y_values'])
    predictions.append(model.predict(x_new)[0])
5. Visualization (Optional):
Using matplotlib, predictions from all models were visualized across the range x=1 to x=100.
import matplotlib.pyplot as plt

x_plot = np.arange(1, 101, 1)
models = tree_models + non_tree_models
model_names = ['Random Forest',
               'Gradient Boosting',
               'Linear Regression',
               'Ridge',
               'Lasso',
               'ElasticNet',
               'SVR']

# A for loop to predict and plot for each model
for model, model_name in zip(models, model_names):
    model_plot = model.predict(x_plot.reshape(-1, 1))
    plt.plot(x_plot, model_plot, label=model_name)

plt.scatter(train_df['x_values'],
            train_df['y_values'],
            color='black')
plt.legend()
plt.show()
The Challenge: The task was to create a classifier to detect banking fraud.
1. Data Creation:
Using the make_classification function from sklearn, a dataset with a class imbalance (99% to 1%) was created.
## Imports
import numpy as np
import pandas as pd
# Pre-processing
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
# Model selection
from sklearn.metrics import roc_auc_score
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
from sklearn.model_selection import GridSearchCV, cross_validate
from sklearn.model_selection import train_test_split
# Model building
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

################################### Question Data / 2 ###########################################
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=10000, weights=(0.99, 0.01), random_state=42)
################################### Question Data / 2 ###########################################
################################### Question Tasks / 2 ##########################################
# The following classification data is provided for a banking fraud analytics application.
# 1. Fit a classifier
# 2. Choose an appropriate loss function & metric for the problem
# 3. Tune hyperparameters to optimize the selected metric
# 4. Explain the results
################################### Question Tasks ##############################################
Exploratory Data Analysis, in an unconventional way. Before I start, the make_classification function has some parameters that I should know:
# n_samples=100          How many samples will be in the dataset? In my case, 10,000.
# n_features=20          The default; that means I have 20 feature columns in X.
# n_informative=2        The number of informative features; it controls how the useful features are created.
# n_redundant=2          How many features are random linear combinations of the informative ones.
# n_repeated=0           The number of duplicated features.
# n_classes=2            How many labels we have. I have 2, luckily.
# n_clusters_per_class=2 The number of clusters per class. It sounds nice...
# weights=None           The proportions of samples assigned to each class. A true fraud det. classic.
# flip_y=0.01            Noise setting: the fraction of labels assigned at random.
# class_sep=1.0          Class separation; the higher, the easier.
# hypercube=True         If True, the clusters are placed on the vertices of a hypercube.
# shift=0.0              Shift features by the specified value.
# scale=1.0              Multiply features by the specified value.
# shuffle=True           Shuffle the samples and the features.
# random_state=None      We all know.

case_df = pd.DataFrame(X)  # I struggled to turn this array into a dataframe. It turned out to be so easy.
case_df['target'] = y
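Before resampling, it is worth quantifying the imbalance. A quick check like the one below (my own addition, not in the original notebook) confirms the roughly 99/1 split that weights=(0.99, 0.01) asks for:

# Inspect the class balance of the generated target
print(case_df['target'].value_counts())
print(case_df['target'].value_counts(normalize=True))  # proportions, roughly 0.99 vs 0.01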
The data is highly imbalanced. To solve this, I can do undersampling or oversampling. Then, maybe I can apply PCA to see whether it helps or not. I will use SMOTE, an oversampling method used in statistics. I choose SMOTE (oversampling) over undersampling because when the minority class contains valuable information and the class sizes differ hugely, as in my case, undersampling can lead to a serious loss of information.
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Oversampling using SMOTE
oversampler = SMOTE(random_state=42)
X_train_smote, y_train_smote = oversampler.fit_resample(X_train, y_train)
# Undersampling using RUS
undersampler = RandomUnderSampler(random_state=42)
X_train_rus, y_train_rus = undersampler.fit_resample(X_train, y_train)
# Logistic Regression model
model = LogisticRegression(random_state=42)
# Fit model on original data
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Confusion Matrix (Original Data):n", confusion_matrix(y_test, y_pred))
print("Classification Report (Original Data):n", classification_report(y_test, y_pred))
# Fit model on SMOTE data
model.fit(X_train_smote, y_train_smote)
y_pred = model.predict(X_test)
print("Confusion Matrix (SMOTE Data):n", confusion_matrix(y_test, y_pred))
print("Classification Report (SMOTE Data):n", classification_report(y_test, y_pred))
# Fit model on RUS data
model.fit(X_train_rus, y_train_rus)
y_pred = model.predict(X_test)
print("Confusion Matrix (RUS Data):n", confusion_matrix(y_test, y_pred))
print("Classification Report (RUS Data):n", classification_report(y_test, y_pred))
Cost-sensitive learning using XGBoost
By setting scale_pos_weight to the inverse of the class imbalance ratio, I am essentially telling the algorithm to pay more attention to the minority class during training and to try to reduce the number of false negatives. This can result in improved performance on the minority class, which is often the goal in imbalanced classification problems.
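Instead of hard-coding 1/0.01, the same ratio can be derived from the training labels. This is a minimal sketch of that idea (my own addition), using numpy on y_train:

# scale_pos_weight is conventionally set to (number of negatives) / (number of positives)
n_neg = np.sum(y_train == 0)
n_pos = np.sum(y_train == 1)
scale_pos_weight = n_neg / n_pos  # roughly 99 for a 99/1 split
print(f"scale_pos_weight = {scale_pos_weight:.1f}")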
Choosing the right loss function for XGBoost Classifier
Because ROC AUC is calculated from the TPR and FPR, it is less likely to be biased towards the majority class than metrics like accuracy or precision. Those metrics do not take the imbalance in the dataset into account, and may give the appearance of high performance when the majority class is well classified while the minority class is poorly classified. So, I will evaluate the models with AUC and F1 score.
model_xgb = XGBClassifier(scale_pos_weight=(1/0.01), random_state=42)
model_xgb.fit(X_train, y_train)
y_pred = model_xgb.predict(X_test)
print("Confusion Matrix (Cost-Sensitive Learning):n", confusion_matrix(y_test, y_pred))
print("Classification Report (Cost-Sensitive Learning):n", classification_report(y_test, y_pred))
I used 3 different built-in objective (loss) functions via the 'objective' parameter in XGBoost: 1- binary:logistic 2- binary:logitraw 3- binary:hinge
I used the 'eval_metric' parameter to measure the performance of these functions, set to 'auc'. To use 'auc' as the metric, the objective function must be set to 'binary:logistic'. To measure the performance of the other two loss functions, I used 'rmsle' (root mean squared log error) as the 'eval_metric'.
rmsle as metric, binary:logistic as loss function, before hyperparameter optimization: 0.6012
rmsle as metric, binary:logistic as loss function, after hyperparameter optimization: 0.7195
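The mechanics behind that comparison look roughly like the sketch below (my own reconstruction, not the exact notebook code, and assuming a recent xgboost where eval_metric is accepted in the constructor): the objective and metric are passed explicitly, and eval_set lets XGBoost track the metric on a validation set during training.

# Sketch: set the objective and the evaluation metric, then monitor the metric on a validation set
clf = XGBClassifier(objective="binary:logistic",
                    eval_metric="auc",
                    scale_pos_weight=1 / 0.01,
                    random_state=42)
clf.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=False)
print("validation AUC:", clf.evals_result()["validation_0"]["auc"][-1])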
def hyp_op(X, y, model_name, cv=3, scoring="roc_auc"):
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.neighbors import KNeighborsClassifier
    from xgboost import XGBClassifier
    from sklearn.linear_model import LogisticRegression
    from catboost import CatBoostClassifier
    from lightgbm import LGBMClassifier
    from sklearn.ensemble import RandomForestClassifier

    print("Hyperparameter Optimization....")
    best_model = {}

    if model_name == "cart":
        print("########## Decision Tree (CART) ##########")
        classifier = DecisionTreeClassifier()
        params = {
            "max_depth": [5, 10, 20, None],
            "min_samples_split": [2, 5, 10],
            "min_samples_leaf": [1, 2, 4],
            "criterion": ["gini", "entropy"]
        }
    elif model_name == "knn":
        print("########## K-Nearest Neighbors ##########")
        classifier = KNeighborsClassifier()
        params = {
            "n_neighbors": [3, 5, 7, 10],
            "weights": ["uniform", "distance"],
            "p": [1, 2]
        }
    elif model_name == "xgboost":
        print("########## XGBoost ##########")
        classifier = model_xgb
        params = {
            "max_depth": [3, 5, 7],
            "learning_rate": [0.05, 0.1, 0.3],
            "n_estimators": [50, 100, 200],
            "objective": ["binary:logistic"],
            "eval_metric": ["auc"],
        }
    # ...
    # you can see the full code in the GitHub repo at the end of the post.
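The search step itself is elided above (the full code is in the repo), but assuming it follows the standard sklearn pattern with the GridSearchCV import from earlier, the rest of the function body would look roughly like this sketch of mine:

    # (continuation sketch inside hyp_op, my assumption, not the repo code)
    # run the grid search with the requested CV and scoring, then keep the winner
    gs = GridSearchCV(estimator=classifier, param_grid=params,
                      cv=cv, scoring=scoring, n_jobs=-1)
    gs.fit(X, y)
    print(f"Best {scoring}: {round(gs.best_score_, 4)}")
    print(f"Best params: {gs.best_params_}")
    best_model[model_name] = gs.best_estimator_
    return best_model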
1. I began by examining the datasets. Given that the data was fabricated, I was confident there were no missing values. My focus was on identifying outliers and understanding the metrics on which the dataset was built. This analysis provided ample insights into the data.
2. The first significant challenge was addressing the imbalanced data distribution. To tackle this, I used the imbalanced-learn library and specifically employed the SMOTE method. After evaluating both undersampling and oversampling strategies, I opted for oversampling.
3. Next, I constructed my machine learning model using xgboost, a widely recognized tool known for its exceptional performance in classification tasks.
4. I tweaked the model to prioritize the minority class, making it more resilient to False Negatives. Depending on project specifics, such an approach may vary. For instance, if the project demands heightened sensitivity to False Positives, incorrectly flagging a transaction as FRAUD could lead to serious complications.
5. Conversely, I aimed for high accuracy in detecting genuine FRAUD transactions, meaning the algorithm should minimize False Negatives over False Positives. As always, the ideal balance is project-dependent.
6. I employed a custom function for hyperparameter tuning, enabling me to specify model names and parameters more efficiently.
Conclusion
In this analysis, I found that a simple logistic regression can sometimes outshine tree-based methods. My primary goal was to demonstrate my skills in data interpretation, model selection, and hyperparameter tuning. For future steps, I’m considering setting up a pipeline, exporting as a .pkl file, and deploying with tools like MLflow, monitored through platforms like Airflow or Prefect. If you have any questions or need clarity on any aspect, please don’t hesitate to reach out. I’m always available to discuss!
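As a flavor of that future step, here is a minimal sketch (my own, not part of the original work; the pipeline name and file name are made up) of wrapping the SMOTE step and the classifier into a pipeline and exporting it:

from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import SMOTE
from xgboost import XGBClassifier
import joblib

# Resampling must live inside an imblearn Pipeline so it is applied only during fit
fraud_pipeline = ImbPipeline(steps=[
    ("smote", SMOTE(random_state=42)),
    ("clf", XGBClassifier(random_state=42)),
])
fraud_pipeline.fit(X_train, y_train)

# Export the fitted pipeline; joblib handles sklearn-style objects well
joblib.dump(fraud_pipeline, "fraud_pipeline.pkl")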
For any questions or suggestions, you can reach me anytime on LinkedIn, Twitter, or via email at bcsayilar@gmail.com :') See ya.