**The Challenge**: The focus here was on understanding relationships in data and predicting outcomes using both tree-based and non-tree-based models.

After setting up my Python environment with the essential libraries, I created a dataset that models a noisy linear relationship between `x` and `y`.

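    import numpy as np

    N = 51   # number of samples
    b0 = 0   # intercept
    b1 = 2   # slope
    x = np.arange(0, N, 1)
    # Linear signal plus Gaussian noise (mean 0, standard deviation 5)
    y = b0 + b1*x + np.random.normal(0, 5, N)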
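
As a quick sanity check on the generated data, a least-squares line fit should recover a slope close to `b1 = 2`. The sketch below is an illustration only: the fixed seed and the use of `np.polyfit` are my assumptions and were not part of the original case.

```python
import numpy as np

# Assumption: a fixed seed so the check is repeatable (the code above is unseeded)
rng = np.random.default_rng(0)

N = 51
b0, b1 = 0, 2
x = np.arange(0, N, 1)
y = b0 + b1 * x + rng.normal(0, 5, N)

# A degree-1 polyfit returns [slope, intercept] of the least-squares line
slope, intercept = np.polyfit(x, y, 1)
print(f"estimated slope={slope:.2f}, intercept={intercept:.2f}")
```

With only 51 noisy points the estimates will not match `b0` and `b1` exactly, but the slope should land close to 2.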

## 2. Tree-based Model Building:

I used scikit-learn's `RandomForestRegressor` and `GradientBoostingRegressor` for this.

    from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

    tree_models = [RandomForestRegressor(random_state=100), GradientBoostingRegressor()]

## 3. Non-Tree-based Model Building:

For the non-tree side, I used linear regression, its regularized variants (Ridge, Lasso, ElasticNet), and SVR.

    from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
    from sklearn.svm import SVR

    non_tree_models = [LinearRegression(), Ridge(), Lasso(), ElasticNet(), SVR()]

## 4. Predictions for x=100:

Both tree-based and non-tree-based models were trained on the data and then used to predict the outcome for x=100.

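    import pandas as pd

    # Assumption: train_df simply wraps the x and y arrays generated above
    train_df = pd.DataFrame({'x_values': x, 'y_values': y})
    x_new = pd.DataFrame({'x_values': [100]})

    predictions = []

    # Fit each tree-based model and predict y at x = 100
    for model in tree_models:
        model.fit(train_df[['x_values']], train_df['y_values'])
        predictions.append(model.predict(x_new)[0])

    # Same procedure for the non-tree-based models
    for model in non_tree_models:
        model.fit(train_df[['x_values']], train_df['y_values'])
        predictions.append(model.predict(x_new)[0])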
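
It is worth seeing in isolation how differently the two model families behave at x=100, which lies well outside the training range of 0 to 50. The sketch below is a minimal illustration under my own assumptions (seeded noise, default hyperparameters, only two models), not the exact setup above. Tree ensembles predict averages of training targets, so they cannot go beyond the largest `y` they have seen, while a linear model simply extends the fitted line.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)  # assumption: seeded for repeatability
x = np.arange(0, 51, 1)
y = 2 * x + rng.normal(0, 5, x.size)
X = x.reshape(-1, 1)

rf = RandomForestRegressor(random_state=100).fit(X, y)
lr = LinearRegression().fit(X, y)

x_new = np.array([[100]])
rf_pred = rf.predict(x_new)[0]
lr_pred = lr.predict(x_new)[0]

# The forest is capped by the training targets; the line keeps its slope
print(f"max training y:    {y.max():.1f}")
print(f"random forest:     {rf_pred:.1f}")
print(f"linear regression: {lr_pred:.1f}")
```

The forest's prediction stays at or below the largest training target, whereas the linear model lands near the value implied by the true slope.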

## 5. Visualization (Optional):

Using matplotlib, predictions from all models were visualized across the range x=1 to x=100.

    import matplotlib.pyplot as plt

    x_plot = np.arange(1, 101, 1)

    # Fitted models from the previous step, in the same order as the labels
    models = tree_models + non_tree_models
    model_names = ['Random Forest', 'Gradient Boosting', 'Linear Regression',
                   'Ridge', 'Lasso', 'ElasticNet', 'SVR']

    # A for loop to predict and plot for each model
    for model, model_name in zip(models, model_names):
        model_plot = model.predict(x_plot.reshape(-1, 1))
        plt.plot(x_plot, model_plot, label=model_name)

    # Overlay the observed data points
    plt.scatter(x, y, color='black')
    plt.legend()
    plt.show()

**The Challenge**: The task was to create a classifier to detect **banking fraud.**

## 1. Data Creation:

Using the `make_classification` function from sklearn, a dataset with a class imbalance (99% to 1%) was created.

    ## Imports
    import numpy as np
    import pandas as pd

    # Pre-processing
    from imblearn.over_sampling import SMOTE
    from imblearn.under_sampling import RandomUnderSampler

    # Model selection
    from sklearn.metrics import roc_auc_score
    from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
    from sklearn.model_selection import GridSearchCV, cross_validate
    from sklearn.model_selection import train_test_split

    # Model building
    from sklearn.linear_model import LogisticRegression
    from xgboost import XGBClassifier

    ################################### Question Data / 2 ###########################################
    from sklearn.datasets import make_classification
    X, y = make_classification(n_samples=10000, weights=(0.99, 0.01), random_state=42)
    ################################### Question Data / 2 ###########################################

    ################################### Question Tasks / 2 ##########################################
    # The following classification data is provided for a banking fraud analytics application.
    # 1. Fit a classifier
    # 2. Choose an appropriate loss function & metric for the problem
    # 3. Tune hyperparameters to optimize the selected metric
    # 4. Explain the results
    ################################### Question Tasks ##############################################

Exploratory Data Analysis, in an unconventional way. Before I start, the `make_classification` function has some parameters that I should know:

    # n_samples=100            How many samples will be in the dataset? The default is 100; in my case the problem uses 10000.
    # n_features=20            It's the default, which means X has 20 columns.
    # n_informative=2          The number of features that actually carry signal about the class.
    # n_redundant=2            The number of features built as random linear combinations of the informative ones.
    # n_repeated=0             The number of duplicated features.
    # n_classes=2              How many labels do we have. I have 2, luckily.
    # n_clusters_per_class=2   The number of clusters per class. It sounds nice...
    # weights=None             The proportions of samples assigned to each class. A true fraud detection classic.
    # flip_y=0.01              Noise setting: the fraction of samples whose label is assigned at random.
    # class_sep=1.0            Scales the distance between clusters; higher values make the task easier.
    # hypercube=True           If True, clusters sit on the vertices of a hypercube; if False, on a random polytope.
    # shift=0.0                Shift features by the specified value.
    # scale=1.0                Multiply features by the specified value.
    # shuffle=True             Shuffle the samples and the features.
    # random_state=None        We all know.

    case_df = pd.DataFrame(X)  # I struggled to turn this array into a DataFrame. It turned out to be so easy.
    case_df['target'] = y
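
To see what `weights` and `flip_y` do in practice, the class counts can be checked directly. This is a small sketch using the same `make_classification` call as above. Because the default `flip_y=0.01` reassigns a small share of labels at random, the minority share should not be expected to be exactly 1%.

```python
import numpy as np
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=10000, weights=(0.99, 0.01), random_state=42)

# Count samples per class and report the minority share
counts = np.bincount(y)
minority_share = counts[1] / counts.sum()
print("class counts:", counts)
print(f"minority share: {minority_share:.2%}")
```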

The data is highly imbalanced. To deal with this, I can undersample or oversample. Then, maybe I can apply PCA to see whether it helps. I will use SMOTE, an oversampling method that creates synthetic minority-class samples. I chose SMOTE (oversampling) over undersampling because when the minority class contains valuable information and the gap between the class sizes is huge, as in my case, undersampling can lead to serious information loss.

    # Split data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Oversampling using SMOTE (applied to the training split only)
    oversampler = SMOTE(random_state=42)
    X_train_smote, y_train_smote = oversampler.fit_resample(X_train, y_train)

    # Undersampling using RUS
    undersampler = RandomUnderSampler(random_state=42)
    X_train_rus, y_train_rus = undersampler.fit_resample(X_train, y_train)

    # Logistic Regression model
    model = LogisticRegression(random_state=42)

    # Fit model on original data
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print("Confusion Matrix (Original Data):\n", confusion_matrix(y_test, y_pred))
    print("Classification Report (Original Data):\n", classification_report(y_test, y_pred))

    # Fit model on SMOTE data
    model.fit(X_train_smote, y_train_smote)
    y_pred = model.predict(X_test)
    print("Confusion Matrix (SMOTE Data):\n", confusion_matrix(y_test, y_pred))
    print("Classification Report (SMOTE Data):\n", classification_report(y_test, y_pred))

    # Fit model on RUS data
    model.fit(X_train_rus, y_train_rus)
    y_pred = model.predict(X_test)
    print("Confusion Matrix (RUS Data):\n", confusion_matrix(y_test, y_pred))
    print("Classification Report (RUS Data):\n", classification_report(y_test, y_pred))
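
For intuition about what SMOTE does to the minority class, here is a minimal sketch of its core idea: each synthetic point is placed on the line segment between a minority sample and one of its nearest minority neighbours. This is a simplified illustration, not imbalanced-learn's implementation; the helper name `smote_like_samples` and its parameters are hypothetical.

```python
import numpy as np

def smote_like_samples(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances within the minority class
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, len(X_min), n_new)             # random minority samples
    picked = neighbours[base, rng.integers(0, k, n_new)]  # one of the k neighbours of each
    gap = rng.random((n_new, 1))                          # position along the segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])

rng = np.random.default_rng(1)
X_min = rng.normal(size=(20, 2))  # toy minority class with 20 samples
synthetic = smote_like_samples(X_min, n_new=100)
print(synthetic.shape)
```

Because every synthetic point is an interpolation between two real minority samples, the new points stay inside the region the minority class already occupies. That is also why SMOTE adds no genuinely new information, only a denser minority region for the model to learn from.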

## Cost-sensitive learning using XGBoost

By setting `scale_pos_weight` to the inverse of the minority-class share (1/0.01 = 100, roughly the ratio of negative to positive samples), I am essentially telling the algorithm to pay more attention to the minority class during training and to try to reduce the number of false negatives. This can improve performance on the minority class, which is often the goal in imbalanced classification problems.

## Choosing the right loss function for XGBoost Classifier

Because ROC AUC is calculated from the true positive rate (TPR) and the false positive rate (FPR), it is less likely to be biased towards the majority class than a metric like accuracy. Accuracy does not take the imbalance in the dataset into account, so it can give the appearance of high performance when the majority class is well classified while the minority class is poorly classified. So I will evaluate the model with AUC and F1 score. Strictly speaking, these are evaluation metrics rather than loss functions; the loss itself is set through XGBoost's `objective` parameter.
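
A tiny example makes the point about accuracy concrete. The sketch below uses made-up labels with a 99:1 split (an assumption that mirrors the case data) and a "model" that gives every transaction the same score, so everything is predicted as legitimate.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score, roc_auc_score

# 990 legitimate (0) and 10 fraudulent (1) transactions
y_true = np.array([0] * 990 + [1] * 10)

# A useless model: the same score for everyone, so nothing is flagged as fraud
y_score = np.zeros(y_true.size)
y_pred = (y_score >= 0.5).astype(int)

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)
auc = roc_auc_score(y_true, y_score)
print(f"accuracy={acc:.2f}  recall={rec:.2f}  roc_auc={auc:.2f}")
```

Accuracy comes out at 0.99 even though recall on the fraud class is 0 and ROC AUC is 0.5, which is no better than random ranking.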

    # Weight positive (fraud) samples about 100x more than negative ones
    model_xgb = XGBClassifier(scale_pos_weight=(1/0.01), random_state=42)
    model_xgb.fit(X_train, y_train)
    y_pred = model_xgb.predict(X_test)
    print("Confusion Matrix (Cost-Sensitive Learning):\n", confusion_matrix(y_test, y_pred))
    print("Classification Report (Cost-Sensitive Learning):\n", classification_report(y_test, y_pred))
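
Rather than hard-coding the weight, it can be derived from the training labels. A common heuristic, suggested in the XGBoost documentation, is the ratio of negative to positive samples. The sketch below uses a toy label array as a stand-in for `y_train`; the helper name `negative_to_positive_ratio` is hypothetical.

```python
import numpy as np

def negative_to_positive_ratio(labels):
    """Return count(negatives) / count(positives) for binary 0/1 labels."""
    labels = np.asarray(labels)
    n_pos = int((labels == 1).sum())
    n_neg = int((labels == 0).sum())
    if n_pos == 0:
        raise ValueError("no positive samples; the ratio is undefined")
    return n_neg / n_pos

# Toy stand-in for y_train: 990 legitimate and 10 fraudulent transactions
y_toy = np.array([0] * 990 + [1] * 10)
ratio = negative_to_positive_ratio(y_toy)
print(ratio)  # 99.0
```

The result could then be passed as `XGBClassifier(scale_pos_weight=ratio)`. For a 99:1 split this gives 99, close to the value of 100 used above.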

I used three built-in objective functions (loss functions) under the ‘**objective**’ parameter in XGBoost: **1- binary:logistic** **2- binary:logitraw** **3- binary:hinge**

I then used the ‘**eval_metric**’ parameter, set to ‘**auc**’, to **measure the performance** of these functions. To use ‘auc’ as the metric, the objective should be ‘**binary:logistic**’ (or a similar objective that outputs probabilities). To measure the performance of the other two loss functions, I used ‘**rmsle**’ (root mean squared log error) as the ‘**eval_metric**’.

**rmsle as metric**, **binary:logistic as loss function**, **before** hyperparameter optimization: **0.6012**

rmsle as metric, binary:logistic as loss function, **after** hyperparameter optimization: **0.7195**

    def hyp_op(X, y, model_name, cv=3, scoring="roc_auc"):
        from sklearn.tree import DecisionTreeClassifier
        from sklearn.neighbors import KNeighborsClassifier
        from xgboost import XGBClassifier
        from sklearn.linear_model import LogisticRegression
        from catboost import CatBoostClassifier
        from lightgbm import LGBMClassifier
        from sklearn.ensemble import RandomForestClassifier

        print("Hyperparameter Optimization....")
        best_model = {}

        if model_name == "cart":
            print("########## Decision Tree (CART) ##########")
            classifier = DecisionTreeClassifier()
            params = {
                "max_depth": [5, 10, 20, None],
                "min_samples_split": [2, 5, 10],
                "min_samples_leaf": [1, 2, 4],
                "criterion": ["gini", "entropy"]
            }
        elif model_name == "knn":
            print("########## K-Nearest Neighbors ##########")
            classifier = KNeighborsClassifier()
            params = {
                "n_neighbors": [3, 5, 7, 10],
                "weights": ["uniform", "distance"],
                "p": [1, 2]
            }
        elif model_name == "xgboost":
            print("########## XGBoost ##########")
            classifier = model_xgb
            params = {
                "max_depth": [3, 5, 7],
                "learning_rate": [0.05, 0.1, 0.3],
                "n_estimators": [50, 100, 200],
                "objective": ["binary:logistic"],
                'eval_metric': ['auc']...

    # You can see the full code in the GitHub repo at the end of the post.
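
The rest of the function lives in the repository, so it is not reproduced here. As a separate, self-contained illustration of the general pattern (a parameter grid searched with cross-validation and scored by ROC AUC), here is a minimal sketch using scikit-learn's `GridSearchCV` with a logistic regression. The dataset size, grid values and model choice are assumptions picked to keep it quick to run; this is not the body of `hyp_op`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Smaller and less extreme than the case data, so every CV fold has plenty of positives
X, y = make_classification(n_samples=2000, weights=(0.9, 0.1), random_state=42)

param_grid = {"C": [0.01, 0.1, 1, 10]}  # illustrative grid, not a tuned one
search = GridSearchCV(
    LogisticRegression(max_iter=1000, class_weight="balanced"),
    param_grid,
    cv=3,               # an integer cv uses stratified folds for classifiers
    scoring="roc_auc",  # optimise the metric chosen for the problem
)
search.fit(X, y)

print("best params:", search.best_params_)
print(f"best CV ROC AUC: {search.best_score_:.4f}")
```

Passing `scoring="roc_auc"` is what ties the tuning to the chosen metric: the grid point with the highest mean cross-validated AUC is the one kept in `best_params_`.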

1. I began by examining the dataset. Given that the data was synthetic, I was confident there were no missing values. My focus was on identifying outliers and understanding the parameters with which the dataset was built. This analysis provided ample insight into the data.

2. The first significant challenge was addressing the imbalanced class distribution. To tackle this, I used the `imbalanced-learn` library and specifically employed the SMOTE method. After evaluating both undersampling and oversampling strategies, I opted for oversampling.

3. Next, I constructed my machine learning model using `xgboost`, a widely used library known for strong performance in classification tasks.

4. I tweaked the model to prioritize the minority class, so that it produces fewer False Negatives. Whether this is the right choice depends on the project. For instance, if a project is more sensitive to False Positives, incorrectly flagging a transaction as FRAUD could lead to serious complications.

5. In my case, I aimed to catch as many genuine FRAUD transactions as possible, meaning the algorithm should prioritize minimizing False Negatives over False Positives. As always, the ideal balance is project-dependent.

6. I employed a custom function for hyperparameter tuning, enabling me to specify model names and parameters more efficiently.
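
The trade-off between False Negatives and False Positives described in points 4 and 5 can be made concrete by moving the decision threshold. The sketch below uses synthetic scores made up for illustration (they are not outputs of the models above): lowering the threshold catches more fraud at the price of more false alarms.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up scores: fraud cases tend to score higher than legitimate ones
y_true = np.array([0] * 990 + [1] * 10)
scores = np.where(y_true == 1,
                  rng.normal(0.7, 0.15, y_true.size),
                  rng.normal(0.3, 0.15, y_true.size))

def count_errors(threshold):
    """Return (false_negatives, false_positives) at the given threshold."""
    y_pred = scores >= threshold
    fn = int(np.sum((y_true == 1) & ~y_pred))
    fp = int(np.sum((y_true == 0) & y_pred))
    return fn, fp

results = {t: count_errors(t) for t in (0.3, 0.5, 0.7)}
for t, (fn, fp) in results.items():
    print(f"threshold={t:.1f}  false negatives={fn}  false positives={fp}")
```

Raising the threshold can only keep or increase the number of missed frauds while keeping or reducing the number of false alarms, so the right operating point depends on which error is more expensive for the project.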

## Conclusion

In this analysis, I found that a simple logistic regression can sometimes outshine tree-based methods. My primary goal was to demonstrate my skills in data interpretation, model selection, and hyperparameter tuning. For future steps, I’m considering setting up a pipeline, exporting the model as a .pkl file, and deploying with tools like **MLflow**, orchestrated with platforms like **Airflow** or **Prefect**. If you have any questions or need clarity on any aspect, please don’t hesitate to reach out. I’m always available to discuss!

For any questions or suggestions, you can reach me anytime on LinkedIn, Twitter, or via email at bcsayilar@gmail.com :’) See y.