![](https://crypto4nerd.com/wp-content/uploads/2023/10/045P7EG7ssWQ0Ii3O.jpeg)
With the technological development of our society and the modernization of our businesses, new management, monitoring, and analysis techniques based on large amounts of data have been introduced. These techniques help companies minimize risk and improve the quality of the services they offer to customers, both of which matter in a competitive global market.
Fraud is a significant risk faced by financial companies and banks. One example is the remote use of credit cards, which is a common fraud strategy today: only a few pieces of information are required to make a purchase with someone else’s card over the internet. As a result, traditional prevention techniques, such as PINs, passwords, and identification systems, have become inadequate and are no longer suitable for modern banking systems¹.
At the same time, AI is a powerful tool for fraud detection because it can quickly analyze and process large amounts of data, identify patterns and anomalies that are difficult for humans to detect, and adapt to new types of fraud as they emerge². In the literature, various techniques have been used for fraud detection, including credit card fraud detection. These techniques include neural networks, Bayesian networks, Markov chains, logistic regression, and more.
In this article, our goal is to approach the problem of fraud in bank transactions and address it with logistic regression, a technique that has demonstrated its power in various other areas, such as healthcare and psychology³. To accomplish that, we’ll use the Synthetic Financial Datasets For Fraud Detection, available on Kaggle.
Data processing
Before diving into our machine learning solution, we first needed to understand our data. This ensures that our models have the best possible information to work with, resulting in more accurate results. We started by performing an exploratory analysis of our dataset, resulting in the following insights:
- The standard deviation of the transaction values is very high, indicating that the data is not very homogeneous (as shown in Figure 2).
- There is no information about the transaction balance for customers whose names start with the letter M, as mentioned in the dataset documentation. These customers represent 33.81% of the total, a very significant share.
- Fraudulent transactions are a tiny fraction of the dataset: only 0.1291% of all recorded transactions are fraudulent, and the rest are legitimate (as shown in Figure 3).
To address the class imbalance in the dataset, we used the Python imblearn library to apply the under-sampling technique. This was necessary because the number of fraudulent transactions was much smaller than the number of non-fraudulent transactions, which would have interfered with the performance of the model.
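The idea behind under-sampling can be sketched without any third-party library. This is an illustration only, not the imblearn call used for the article: it assumes random under-sampling (the article does not say which sampler was applied), and the label column name `isFraud` and the toy rows are assumptions made for the example.

```python
import random

def random_undersample(rows, label_key="isFraud", seed=0):
    """Keep every minority-class row plus an equally sized random
    sample of the majority class, then shuffle the result."""
    rng = random.Random(seed)
    positives = [r for r in rows if r[label_key] == 1]
    negatives = [r for r in rows if r[label_key] == 0]
    # The shorter list is the minority class.
    minority, majority = sorted((positives, negatives), key=len)
    sampled_majority = rng.sample(majority, k=len(minority))
    balanced = minority + sampled_majority
    rng.shuffle(balanced)
    return balanced

# Toy data (made up): 3 fraudulent rows among 1000 transactions.
toy_rows = [{"amount": float(i), "isFraud": 1 if i < 3 else 0} for i in range(1000)]
balanced_rows = random_undersample(toy_rows)
fraud_count = sum(r["isFraud"] for r in balanced_rows)
print(len(balanced_rows), fraud_count)
```

Under-sampling discards most of the legitimate transactions, so it trades information for balance; it is usually applied to the training data only, so that evaluation still reflects the real class ratio.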
We also had to deal with the missing data for customers whose names start with the letter M, and to normalize the column of transaction values. To handle the missing data, we replaced the balance values of recipient clients with names starting with ‘M’ with the average balance.
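A minimal sketch of this kind of mean imputation is shown below. The column names `nameDest` and `newbalanceDest` are assumptions for illustration, as is the choice to compute the average over the recipients whose balance is known (those not starting with ‘M’); the article does not state exactly which average was used.

```python
from statistics import mean

def impute_merchant_balance(rows, name_key="nameDest", balance_key="newbalanceDest"):
    """Replace the balance of recipients whose name starts with 'M'
    by the mean balance of all other recipients."""
    known = [r[balance_key] for r in rows if not r[name_key].startswith("M")]
    fill_value = mean(known)
    return [
        {**r, balance_key: fill_value} if r[name_key].startswith("M") else dict(r)
        for r in rows
    ]

# Toy rows (made up): one 'M' recipient with no usable balance.
toy_rows = [
    {"nameDest": "C100", "newbalanceDest": 200.0},
    {"nameDest": "M200", "newbalanceDest": 0.0},
    {"nameDest": "C300", "newbalanceDest": 400.0},
]
imputed = impute_merchant_balance(toy_rows)
print([r["newbalanceDest"] for r in imputed])
```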
Finally, we standardized the ‘amount’ column, which represents the transaction value. To do this, we used the StandardScaler class from scikit-learn, which subtracts the column mean and divides by the standard deviation, so the result has zero mean and unit variance. Note that this does not confine the values to a fixed interval such as -1 to 1: extreme transaction amounts still map to large standardized values. Standardizing is beneficial in our case, in which we are using a logistic regression model, because it puts the input features on a comparable scale.
This consistency is important because logistic regression calculates probabilities based on the weighted sum of input features. If the features have very different scales, those with larger scales can dominate the fit, leading to suboptimal results. By transforming the data to a common scale, we prevent any single feature from disproportionately influencing the model.
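For readers who want to see what a standard scaler does to a column, here is a small standard-library sketch of the same arithmetic (subtract the mean, divide by the population standard deviation). The amounts are made up. The standardized values are centered at 0 with unit variance, but they are not bounded, so a large outlier can land well outside -1 to 1.

```python
from statistics import fmean, pstdev

def standardize(values):
    """Z-score a list of numbers: zero mean, unit (population) variance.
    Assumes the values are not all identical (non-zero deviation)."""
    mu = fmean(values)
    sigma = pstdev(values, mu)
    return [(v - mu) / sigma for v in values]

# Toy transaction amounts (made up), with one large outlier.
amounts = [10.0, 20.0, 30.0, 40.0, 1000.0]
scaled = standardize(amounts)
print([round(z, 2) for z in scaled])
```

If a bounded range is actually required, a min-max style rescaling is the usual alternative, at the cost of being more sensitive to outliers.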
Logistic Regression Model
Logistic regression is a statistical model that estimates the probability of a specific categorical outcome (Y) based on one or more predictors (X). In this model, the probability of an event occurring can be estimated directly. In the case where the dependent variable Y has only two possible states (1 or 0) and there is a set of p independent variables X₁, X₂, . . . , Xp, the logistic regression model can be written as follows, where B₀ is the intercept:

$$
P(Y = 1) = \frac{1}{1 + e^{-(B_0 + B_1 X_1 + B_2 X_2 + \dots + B_p X_p)}}
$$
The coefficients B₁, B₂, . . . , Bp are estimated from the dataset using the maximum likelihood method, which finds a combination of coefficients that maximizes the probability of the sample being observed.
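For reference, the quantity being maximized can be written as a log-likelihood. The notation here is introduced for illustration: n is the number of observations, yᵢ is the observed label of the i-th observation, pᵢ is the model’s predicted P(Y=1) for it, and B₀ is the intercept.

$$
\ell(B_0, B_1, \dots, B_p) = \sum_{i=1}^{n} \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right]
$$

When an l2 penalty is used, a regularization term on the coefficients is added to this objective, so the fit is a penalized maximum-likelihood estimate instead of a pure one.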
In our case of bank fraud prediction, where our logistic regression model will be used to estimate two groups (fraud and non-fraud), the classification rule is as follows:
- if P(Y=1) > 0.5 then we classify Y as 1
- if P(Y=1) ≤ 0.5 then we classify Y as 0
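The probability estimate and the classification rule above can be sketched in a few lines. The intercept, coefficients, and feature values are hypothetical numbers chosen for the example, not values fitted on the dataset, and the code is not hardened against overflow for extreme inputs.

```python
import math

def predict_proba(x, intercept, coefs):
    """P(Y=1) from the logistic function applied to a weighted sum."""
    z = intercept + sum(b * xi for b, xi in zip(coefs, x))
    return 1.0 / (1.0 + math.exp(-z))

def classify(probability, threshold=0.5):
    """Return 1 (fraud) if the probability exceeds the threshold, else 0."""
    return 1 if probability > threshold else 0

# Hypothetical coefficients and one hypothetical (already scaled) transaction.
intercept, coefs = -1.0, [2.0, -0.5]
p = predict_proba([1.5, 1.0], intercept, coefs)
print(round(p, 3), classify(p))
```

The 0.5 threshold is a convention, not a requirement; in fraud detection it is common to lower it when missing a fraud is costlier than raising a false alarm.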
Results
The results obtained for bank fraud detection are shown in Table 1, with an overall accuracy of approximately 94%. The parameters of the logistic regression method were left at their default values:
- max iterations: 100
- solver: lbfgs
- regularization: l2
- tolerance: 1e-4
We also explored other evaluation metrics, namely the ROC curve and AUC, for our logistic regression model.
ROC, or Receiver Operating Characteristic, is a graphical representation of a model’s ability to distinguish between two classes, typically used in binary classification problems. It plots the true positive rate (sensitivity) against the false positive rate (1-specificity) for different threshold values. The ROC curve helps us visualize how well our model performs in terms of classifying positive and negative instances, and it provides a tool to evaluate and compare different models.
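To make the construction concrete, the sketch below computes the points of a ROC curve by sweeping the threshold over every distinct score. The labels and scores are made-up toy values, and the function assumes both classes are present in the labels.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) pairs,
    one per distinct threshold, from strictest to loosest."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing flagged
    for threshold in sorted(set(scores), reverse=True):
        predicted = [s >= threshold for s in scores]
        tp = sum(1 for y, flagged in zip(labels, predicted) if flagged and y == 1)
        fp = sum(1 for y, flagged in zip(labels, predicted) if flagged and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

# Toy labels (1 = fraud) and model scores, made up for illustration.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
curve = roc_points(labels, scores)
for fpr, tpr in curve:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```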
AUC, or Area Under the ROC Curve, is a numerical measure derived from the ROC curve. It quantifies the overall performance of a classification model across all possible threshold values. AUC represents the probability that the model will rank a randomly chosen positive example higher than a randomly chosen negative example. In simple terms, the higher the AUC, the better the model is at distinguishing between the two classes. An AUC of 1 indicates a perfect model, while an AUC of 0.5 suggests no discriminative ability (equivalent to random guessing).
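The ranking interpretation of AUC can be checked directly by comparing every positive example with every negative one. This brute-force sketch uses made-up toy labels and scores, counts ties as half a win, and is meant for illustration only; it is quadratic in the number of examples and would be slow on a dataset of this size.

```python
def auc_by_ranking(labels, scores):
    """Fraction of (positive, negative) pairs in which the positive
    example receives the higher score; ties count as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else (0.5 if p == n else 0.0)
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Toy labels (1 = fraud) and model scores, made up for illustration.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
auc = auc_by_ranking(labels, scores)
print(round(auc, 4))
```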
For the problem addressed in this article, we obtained an AUC of 0.9824 (98.24%), which is a very good result.
Conclusion
Fraud detection is a current need for multiple industries, especially banks. In this context, we proposed a system for detecting bank fraud based on the logistic regression technique.
We conducted a complete exploratory analysis of the available data and made some improvements to data quality. The logistic regression model reached an overall accuracy of approximately 94.03% and an AUC of about 98%, which are satisfactory results for the problem at hand.
However, we believe that the results can be improved by studying the influence of the various parameters used by the logistic regression method.
¹ Credit fraud detection in the banking sector in UK: a focus on e-business