How to Detect Online Fraud Using Python: The Importance of Real-World Applications
Fraud lurks in many transaction streams. We can apply machine learning algorithms to learn from past data and predict the probability that a given transaction is fraudulent.
In our example, we will take credit card transactions, analyze the data, create the features and labels, and finally apply an ML algorithm to judge whether each transaction is fraudulent or not. We will then report the accuracy, precision, and F-score of the model we have chosen.
Section 1: Introduction to ML algorithms
Before starting, let us give a little background on machine learning algorithms.
ML is, at its core, applied statistics: you feed an algorithm input data, and the algorithm generates a model that maps inputs to probable outputs. You rarely inspect the model's internals directly, and they can change drastically when the training data changes. The general idea, though, is simple: train a model on a set of labelled examples, then use the trained model to predict the properties of new, unseen examples.
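The train-then-predict loop can be sketched in a few lines with scikit-learn. The tiny dataset below is purely illustrative (the amounts, hours, and labels are made up, not taken from this post's data):

```python
# Minimal sketch of the train-then-predict loop with scikit-learn.
# The toy data is illustrative: [amount, hour_of_day], label 1 = fraud.
from sklearn.linear_model import LogisticRegression

X_train = [[20.0, 14], [15.0, 10], [900.0, 3], [850.0, 2]]
y_train = [0, 0, 1, 1]

model = LogisticRegression()
model.fit(X_train, y_train)  # the algorithm generates a model from the data

# The fitted model predicts labels for inputs it has never seen.
print(model.predict([[25.0, 13], [880.0, 4]]))
```

The model itself is an opaque set of fitted coefficients; what matters is the quality of its predictions on new data.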
The scenario of the blog post
Now that we have the data set, the rules, and the features, we can create a simple machine learning model. Let's look at how it works and what kind of predictions it can make.
Generate the data
We can take a data set of transactions from any domain and build a model from it. Here we assume a simple money payment workflow in which a customer pays the bank and withdraws money.
To generate the data set we mine the transaction records themselves, filtering on numeric fields such as the amount. For fraud detection we are interested mainly in the rare, anomalous cases, so the model we build must also define how we rank the results: a handful of frauds hidden among a large number of legitimate transactions.
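Since the post does not provide real bank data, one way to get started is to generate a synthetic transaction table. The column names, distributions, and 2% fraud rate below are illustrative assumptions, not real statistics:

```python
# A sketch of generating a synthetic transaction dataset.
# Column names and the fraud rate are illustrative assumptions.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1_000
fraud_rate = 0.02  # fraud is rare, so the classes are heavily imbalanced

df = pd.DataFrame({
    "amount": rng.lognormal(mean=3.5, sigma=1.0, size=n).round(2),
    "card_type": rng.choice(["debit", "credit"], size=n),
    "merchant_id": rng.integers(1, 50, size=n),
    "hour": rng.integers(0, 24, size=n),
    "is_fraud": (rng.random(n) < fraud_rate).astype(int),
})

print(df["is_fraud"].value_counts())  # the small minority class is what we care about
```

The imbalance printed here is exactly why ranking matters: a model that always predicts "not fraud" would already be about 98% accurate on this table.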
Data preparation
Before we can run the models, we need to prepare the data. In our case we can use cloud storage such as Amazon S3 or Amazon Redshift; both are cheap, flexible options for hosting the data while we preprocess it. We can also set up local, private, or semi-private database servers to do the preprocessing before we start building features and deciding which algorithm to use for detecting fraudulent transactions.
Using AWS S3 storage
The first thing we need to do is create an S3 bucket to hold the data we are working on, and assign a unique object key every time we update the data so that each snapshot is preserved.
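One way to get a unique key per update is to timestamp it. The sketch below shows this pattern with boto3; the bucket name, key prefix, and file paths are placeholders, and the actual upload assumes AWS credentials are configured:

```python
# A sketch of uploading a dataset snapshot to S3 under a unique,
# timestamped key per update. Bucket name and paths are placeholders.
from datetime import datetime, timezone

def build_versioned_key(prefix: str, filename: str) -> str:
    """Return an object key like 'fraud-data/20240101T120000Z/transactions.csv'."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"{prefix}/{stamp}/{filename}"

def upload_snapshot(local_path: str, bucket: str, key: str) -> None:
    import boto3  # imported here so the key helper works even without boto3 installed
    s3 = boto3.client("s3")
    s3.upload_file(local_path, bucket, key)

key = build_versioned_key("fraud-data", "transactions.csv")
# upload_snapshot("transactions.csv", "my-fraud-bucket", key)  # needs real credentials
print(key)
```

An alternative is to enable S3 bucket versioning and reuse one key, letting S3 keep the history for you.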
Dataset for Fraud Detection
During the fraud detection phase we can draw on multiple datasets. It is best to use data whose transaction activity closely resembles the activity you want to detect; in other words, find datasets that contain plenty of activity similar to real fraud.
There are several online resources offering rich datasets for fraud detection purposes, such as the widely used Credit Card Fraud Detection dataset on Kaggle.
Data Preparation for Fraud Detection
Fraud detection data starts out as a pile of unsorted records; to make it ready for ML, we need to process it first. If your data set is already cleaned and assembled, you can apply algorithms such as linear or logistic regression directly and get reasonably accurate results. If you don't have ready-to-go data, preprocessing tools can save you a lot of time and effort in making it ready. The information you feed into a machine learning algorithm can be expressed in different ways, so check the distribution of values first; in fraud detection, the distribution that matters depends on the question you are asking, and the fraud class is almost always a small minority.
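Two cheap checks cover most of this: look at the class distribution, and standardise skewed numeric columns. A minimal sketch with NumPy, on made-up data:

```python
# A sketch of basic preprocessing: inspect the class distribution and
# standardise the amount column. The data here is synthetic.
import numpy as np

rng = np.random.default_rng(0)
amounts = rng.lognormal(3.5, 1.0, size=500)       # skewed, like real amounts
labels = (rng.random(500) < 0.03).astype(int)      # rare fraud class

# Fraud data is almost always imbalanced; check before modelling.
fraud_share = labels.mean()
print(f"fraud share: {fraud_share:.1%}")

# Standardise so features are on a comparable scale for regression models.
amounts_std = (amounts - amounts.mean()) / amounts.std()
print(amounts_std.mean().round(6), amounts_std.std().round(6))
```

After standardisation the column has mean 0 and standard deviation 1, which is what step 1 of the training procedure below relies on.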
Model Selection and Building
In order to build a model, we will need data sets. Such data may be scattered across different scopes, and different classifications may apply, but in this example we will focus on transactional data. We will use six variables: the transaction amount, the card type, the CVV, the card number, the merchant, and the location, together with a label marking each transaction as fraudulent or not.
Having defined some parameters in earlier parts of this post, we now need a few new ones. Parameter 'Jk' denotes the k-th payment transaction 'J', and 'kk' denotes the amount we are interested in. With these parameters in place we can define the general structure of the model.
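Concretely, these variables become a numeric feature matrix X and a label vector y. The sketch below assumes illustrative column names and values; note that a high-cardinality identifier like the raw card number is dropped rather than fed to the model:

```python
# A sketch of turning the six variables into features and labels.
# Column names and example rows are illustrative.
import pandas as pd

df = pd.DataFrame({
    "amount":      [12.5, 830.0, 40.0],
    "card_type":   ["debit", "credit", "debit"],
    "card_number": ["c001", "c002", "c003"],
    "merchant":    ["grocer", "electronics", "grocer"],
    "location":    ["NY", "remote", "NY"],
    "is_fraud":    [0, 1, 0],
})

y = df["is_fraud"]
# Drop the raw card number (a near-unique identifier) and one-hot
# encode the categorical columns so everything is numeric.
X = pd.get_dummies(df.drop(columns=["is_fraud", "card_number"]),
                   columns=["card_type", "merchant", "location"])
print(X.columns.tolist())
```

One-hot encoding is the simplest choice here; for a merchant column with thousands of values, frequency or target encoding would scale better.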
Train/Test Split
Train/test split is an important concept in machine learning: a large portion of the data is given to the model to learn the task, while a smaller held-out portion is reserved for validating what the model has learned. Splitting the dataset this way is essential for an honest estimate of performance.
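Because fraud is rare, the split should be stratified so that both portions contain the same share of fraud cases. A sketch on synthetic data (the 10% fraud rate and 25% test size are assumptions for illustration):

```python
# A sketch of a stratified train/test split, so the rare fraud class
# is represented proportionally in both sets.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = np.array([1] * 20 + [0] * 180)  # 10% fraud, illustrative

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

print(len(X_train), len(X_test))      # 150 50
print(y_train.mean(), y_test.mean())  # both 0.1
```

Without `stratify=y`, an unlucky split could leave the test set with almost no fraud cases, making precision and F-score meaningless.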
Classification Accuracy using Logistic Regression
The learning algorithm we use is Logistic Regression, which produces predicted labels from which we calculate the accuracy of the predictions. When building a machine learning model it is always important to test it and check its accuracy to see whether it works properly.
Step 1: For the classification accuracy test, the features were standardised (brought close to a normal distribution), which makes them easier for the ML model to work with.
Step 2: To solve the classification task we train the model; a Random Forest classifier can also be trained as a baseline to compare against Logistic Regression.
Step 3: The model is fitted on the training split produced earlier.
Step 4: The model's predictions on the held-out test split are compared with the true labels, and the results are summarised as accuracy, precision, and F-score.
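The four steps above can be sketched end to end on synthetic data. The fraud rule here (the top 5% of amounts are fraudulent) is an assumption made so the example is self-contained, not a property of real fraud:

```python
# A sketch of the full loop: standardise, split, fit Logistic Regression,
# and report accuracy, precision, and F1. The data is synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
n = 1000
amount = rng.lognormal(3.5, 1.0, size=n)
hour = rng.integers(0, 24, size=n)
# Illustrative labelling rule: the largest 5% of amounts are fraud.
y = (amount > np.quantile(amount, 0.95)).astype(int)

# Step 1: standardise the features.
X = np.column_stack([(amount - amount.mean()) / amount.std(),
                     (hour - hour.mean()) / hour.std()])

# Steps 2-3: stratified split, then train the classifier.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)
clf = LogisticRegression().fit(X_train, y_train)

# Step 4: evaluate on the held-out set.
pred = clf.predict(X_test)
print("accuracy :", accuracy_score(y_test, pred))
print("precision:", precision_score(y_test, pred, zero_division=0))
print("f1       :", f1_score(y_test, pred))
```

On imbalanced data like this, accuracy alone is flattering (always predicting "not fraud" would score 95%), which is why the post also reports precision and F-score.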
Conclusion
We have walked through the main steps for building a defence against fraudsters: preparing the data, constructing features and labels, and training and evaluating a classifier. Rather than looking only at the raw outputs of different models, we should look at their applications and how well their predictions generalise. And although these models perform well when tested, they are not yet ready for the real world without further validation on production data.