Enhancing Financial Security: Bank Fraud Detection using AI | by Caio Emiliano

Introduction

With the technological development of our society and the modernization of our businesses, new management, monitoring, and analysis techniques based on large amounts of data have been introduced. These techniques can help companies minimize risk and improve the quality of the services they offer, both of which matter in a competitive global market.

Fraud is a significant risk faced by financial companies and banks. One example is the remote use of credit cards, which is currently a common fraud strategy: only a few pieces of information are required to make an online purchase with someone else’s card. As a result, traditional prevention techniques, such as PINs, passwords, and identification systems, have become inadequate for modern banking systems¹.

Source: https://www.vesta.io/blog/artificial-intelligence-fraud-prevention

At the same time, AI is a powerful tool for fraud detection because it can quickly process large amounts of data, identify patterns and anomalies that are difficult for humans to detect, and adapt to new types of fraud as they emerge². The literature describes various techniques for fraud detection, including credit card fraud. These include neural networks, Bayesian networks, Markov chains, logistic regression, and more.

In this article, our goal is to approach the problem of fraud in bank transactions using logistic regression, a technique that has demonstrated its power in various other areas, such as healthcare and psychology³. To accomplish that, we’ll use the Synthetic Financial Datasets For Fraud Detection, available on Kaggle.

Data processing

Before diving into our machine learning solution, we first needed to understand our data. This ensures that our models have the best possible information to work with, resulting in more accurate results. We started by performing an exploratory analysis of our dataset, resulting in the following insights:

The standard deviation of the transaction values is very high, indicating that transaction amounts vary widely (as shown in Figure 2).
There is no balance information for customers whose names start with the letter M, as mentioned in the dataset documentation. These customers represent 33.81% of the total, a very significant share.
Fraudulent transactions are extremely rare in the dataset: they make up only 0.1291% of the total, so legitimate transactions vastly outnumber them (as shown in Figure 3).

Figure 2: Amount transacted per transaction type

Figure 3: Number of fraud occurrences vs non-fraud occurrences.
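
The three checks above can be reproduced with a few lines of pandas. The following is a minimal sketch rather than the original analysis code: it uses a tiny hand-made table in place of the Kaggle CSV, and the column names (type, amount, nameDest, isFraud) are assumed to match that file. It also measures shares over rows, which may differ from a share computed over distinct customers.

```python
import pandas as pd

# Tiny hand-made stand-in for the Kaggle CSV. The column names are an
# assumption about that file, and the values are invented for illustration.
df = pd.DataFrame({
    "type": ["PAYMENT", "TRANSFER", "CASH_OUT", "PAYMENT", "TRANSFER", "CASH_OUT"],
    "amount": [120.0, 250000.0, 9800.0, 35.5, 1200000.0, 430.0],
    "nameDest": ["M1979787155", "C553264065", "C38997010",
                 "M2044282225", "C1234567", "C7654321"],
    "isFraud": [0, 1, 0, 0, 1, 0],
})

# Spread of transaction amounts, overall and per transaction type.
amount_std = df["amount"].std()
amount_by_type = df.groupby("type")["amount"].agg(["mean", "std"])

# Share of rows whose recipient name starts with "M".
merchant_share = df["nameDest"].str.startswith("M").mean()

# Share of rows labelled as fraud.
fraud_share = df["isFraud"].mean()

print(amount_by_type)
print(f"std(amount)={amount_std:.2f}  M-share={merchant_share:.2%}  fraud={fraud_share:.2%}")
```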

To address the class imbalance in the dataset, we used the Python imbalanced-learn (imblearn) library to apply under-sampling. This was necessary because fraudulent transactions were far fewer than non-fraudulent ones, and a model trained on such data tends to favor the majority class.
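
To make the idea concrete, here is a hand-rolled sketch of random under-sampling written with NumPy only. It is not the code used for this article, which relied on imblearn (whose RandomUnderSampler class performs this kind of resampling). The helper name random_undersample and the toy data are hypothetical.

```python
import numpy as np

def random_undersample(X, y, seed=0):
    """Keep all rows of the rarest class and an equally sized
    random sample (without replacement) of every other class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    keep = []
    for cls in classes:
        idx = np.flatnonzero(y == cls)
        keep.append(rng.choice(idx, size=n_keep, replace=False))
    keep = np.sort(np.concatenate(keep))
    return X[keep], y[keep]

# Toy data (invented): 1,000 rows, 10 of them labelled as fraud.
y = np.zeros(1000, dtype=int)
y[:10] = 1
X = np.arange(1000, dtype=float).reshape(-1, 1)

X_bal, y_bal = random_undersample(X, y)
print(np.bincount(y), "->", np.bincount(y_bal))
```

Resampling of this kind is usually applied to the training split only, so that evaluation still reflects the real class ratio.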

We also had to deal with the missing balance data for customers whose names start with the letter M. To address this problem, we replaced the monetary balance values of recipient clients whose names start with ‘M’ with the average value.
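
A sketch of this replacement step is shown below. It rests on several assumptions: the balance columns are taken to be named oldbalanceDest and newbalanceDest and the recipient column nameDest, as in the Kaggle file, and "the average" is read as the column mean over recipients whose names do not start with ‘M’. The article does not specify which average was used, so treat this as one plausible reading.

```python
import pandas as pd

# Invented rows. Column names are assumed to match the Kaggle file.
df = pd.DataFrame({
    "nameDest": ["M1", "C1", "C2", "M2"],
    "oldbalanceDest": [0.0, 100.0, 300.0, 0.0],
    "newbalanceDest": [0.0, 150.0, 450.0, 0.0],
})

is_merchant = df["nameDest"].str.startswith("M")
balance_cols = ["oldbalanceDest", "newbalanceDest"]

# Assumption: the "average" is the column mean over non-"M" recipients.
means = df.loc[~is_merchant, balance_cols].mean()

# Overwrite the uninformative balances of "M" recipients with those means.
for col in balance_cols:
    df.loc[is_merchant, col] = means[col]

print(df)
```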

Finally, we rescaled the ‘amount’ column, which represents the transaction value, using scikit-learn’s StandardScaler class. StandardScaler standardizes a feature to zero mean and unit variance; it does not confine values to a fixed interval such as -1 to 1, so large transactions can still map to values well outside that range. Standardizing is beneficial in our case, in which we are using a logistic regression model, because it ensures that all input features have a consistent scale.

This consistency is important because logistic regression calculates probabilities from a weighted sum of the input features. If the features have very different scales, those with larger scales can dominate the optimization (and any regularization penalty), leading to suboptimal results. By transforming the data to a common scale, we prevent any single feature from disproportionately influencing the model.
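
As a point of reference, the short sketch below shows what StandardScaler does to a single column. The amounts are invented. After scaling, the column has zero mean and unit standard deviation, but a large outlier still lands outside the interval from -1 to 1. If a bounded range is needed, scikit-learn’s MinMaxScaler with feature_range=(-1, 1) is the usual alternative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Invented transaction amounts with one large outlier.
amount = np.array([[10.0], [20.0], [30.0], [40.0], [1000.0]])

scaler = StandardScaler()
scaled = scaler.fit_transform(amount)  # (x - mean) / std, column-wise

print("mean:", scaled.mean())
print("std: ", scaled.std())
print("min/max:", scaled.min(), scaled.max())
```

In a real pipeline the scaler would be fitted on the training split and then applied to the test split with transform, so that no test-set statistics leak into training.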

Logistic Regression Model

Logistic regression is a statistical model that estimates the probability of a specific categorical outcome (Y) based on one or more predictors (X). The model outputs this probability directly. When the dependent variable Y has only two possible states (1 or 0) and there is a set of p independent variables X₁, X₂, . . . , Xₚ, the logistic regression model can be written as follows: