{"id":1825,"date":"2020-09-23T05:26:14","date_gmt":"2020-09-23T05:26:14","guid":{"rendered":"https:\/\/data-science.gotoauthority.com\/2020\/09\/23\/credit-card-fraud-detection-with-classification-algorithms-in-python\/"},"modified":"2020-09-23T05:26:14","modified_gmt":"2020-09-23T05:26:14","slug":"credit-card-fraud-detection-with-classification-algorithms-in-python","status":"publish","type":"post","link":"https:\/\/wealthrevelation.com\/data-science\/2020\/09\/23\/credit-card-fraud-detection-with-classification-algorithms-in-python\/","title":{"rendered":"Credit Card Fraud Detection With Classification Algorithms In Python"},"content":{"rendered":"<div id=\"tve_editor\" data-post-id=\"6113\">\n<div class=\"thrv_wrapper tve_image_caption\" data-css=\"tve-u-174b6b26410\"><span class=\"tve_image_frame\"><img src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/plugins\/lazy-load\/images\/1x1.trans.gif?ssl=1\" data-lazy-src=\"https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/09\/1-Credit-card-fraud-detection-with-classification-algorithms.png?resize=613%2C368&amp;ssl=1\" class=\"tve_image wp-image-6119\" alt=\"Credit card fraud detection with classification algorithms\" data-id=\"6119\" width=\"613\" data-init-width=\"750\" height=\"368\" data-init-height=\"450\" title=\"Credit card fraud detection with classification algorithms\" loading=\"lazy\" data-width=\"613\" data-height=\"368\" data-recalc-dims=\"1\"><img class=\"tve_image wp-image-6119\" alt=\"Credit card fraud detection with classification algorithms\" data-id=\"6119\" width=\"613\" data-init-width=\"750\" height=\"368\" data-init-height=\"450\" title=\"Credit card fraud detection with classification algorithms\" loading=\"lazy\" src=\"https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/09\/1-Credit-card-fraud-detection-with-classification-algorithms.png?resize=613%2C368&amp;ssl=1\" data-width=\"613\" data-height=\"368\" data-recalc-dims=\"1\"><\/span><\/div>\n<div 
class=\"thrv_wrapper thrv_text_element\" data-css=\"tve-u-174b6afeaf2\">\n<p dir=\"ltr\">Fraud transactions and fraudulent activities are significant issues in many industries, such as <strong>banking and insurance<\/strong>. For the banking industry in particular, credit card fraud detection is a pressing issue to resolve.<\/p>\n<p dir=\"ltr\">Fraudulent activities hurt these industries twice over: they <strong>reduce revenue growth and erode customers\u2019 trust<\/strong>. So these companies need to find fraud transactions before they become a serious problem. \u00a0<\/p>\n<p dir=\"ltr\">Unlike many other machine learning problems, in credit card fraud detection the target classes are <strong>not equally<\/strong> distributed. This is popularly known as the <strong>class imbalance problem<\/strong> or the unbalanced data issue.<\/p>\n<\/div>\n<div class=\"thrv_wrapper thrv_text_element tve-froala fr-box fr-basic\" data-css=\"tve-u-174b6afeb39\">\n<p dir=\"ltr\">This makes the problem even more challenging to solve.<\/p>\n<p dir=\"ltr\">So in this article, we will explain how to build credit card fraud detection using different <a href=\"https:\/\/dataaspirant.com\/classification-clustering-alogrithms\/\" target=\"_blank\" class=\"tve-froala\" rel=\"noopener noreferrer\">machine learning classification algorithms<\/a>.\u00a0<\/p>\n<p dir=\"ltr\">You 
will also get an idea about the impact of unbalanced data on the model\u2019s performance.<\/p>\n<p dir=\"ltr\">Before we dive in, here is a glimpse of the topics you are going to learn from this article.<\/p>\n<\/div>\n<div class=\"thrv_wrapper thrv_text_element\" data-css=\"tve-u-174b6afeb3c\">\n<p>Let\u2019s begin the discussion by understanding why we need to find fraudulent transactions\/activities in any industry.<\/p>\n<h2 class=\"\" id=\"t-1600792769584\">Why do we need to find fraud transactions?<\/h2>\n<\/div>\n<div class=\"thrv_wrapper tve_image_caption\" data-css=\"tve-u-174b6b5df7c\">\n<span class=\"tve_image_frame\"><img src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/plugins\/lazy-load\/images\/1x1.trans.gif?ssl=1\" data-lazy-src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/09\/2-Credit-Card-Fraudulent-Transactions-Percentanges.png?resize=613%2C368&amp;ssl=1\" class=\"tve_image wp-image-6127\" alt=\"Credit Card Fraudulent Transactions Percentages\" data-id=\"6127\" width=\"613\" data-init-width=\"750\" height=\"368\" data-init-height=\"450\" title=\"Credit Card Fraudulent Transactions Percentages\" loading=\"lazy\" data-width=\"613\" data-height=\"368\" data-recalc-dims=\"1\"><img class=\"tve_image wp-image-6127\" alt=\"Credit Card Fraudulent Transactions Percentages\" data-id=\"6127\" width=\"613\" data-init-width=\"750\" height=\"368\" data-init-height=\"450\" title=\"Credit Card Fraudulent Transactions Percentages\" loading=\"lazy\" src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/09\/2-Credit-Card-Fraudulent-Transactions-Percentanges.png?resize=613%2C368&amp;ssl=1\" data-width=\"613\" data-height=\"368\" data-recalc-dims=\"1\"><\/span><\/p>\n<p class=\"thrv-inline-text wp-caption-text\">Credit Card Fraudulent Transactions Percentages<\/p>\n<\/div>\n<div class=\"thrv_wrapper thrv_text_element\">\n<p 
dir=\"ltr\">For many companies, fraud detection is a big problem because they discover fraudulent activities only after they have already suffered a <strong>high loss<\/strong>.\u00a0<\/p>\n<p dir=\"ltr\">Fraud happens in every industry. We can&#8217;t say that only particular companies or industries suffer from fraudulent activities or transactions.\u00a0<\/p>\n<p dir=\"ltr\">But when it comes to <strong>financial-related<\/strong> companies, fraud transactions become an even bigger problem. \u00a0So these companies want to detect fraud transactions before the fraud turns into significant damage.<\/p>\n<p dir=\"ltr\">Even in the current generation, with high-end technology, about <strong>13 out of every 100<\/strong> credit card transactions fall into fraudulent activity, as reported by the creditcards website.<\/p>\n<p dir=\"ltr\">A survey paper mentioned that in the <strong>year 1997, 63%<\/strong> of companies had experienced a fraud in the past two years, and in the <strong>year 1999, <\/strong><strong>57%<\/strong> of companies had experienced at least one fraud in the previous year.\u00a0<\/p>\n<p dir=\"ltr\">The point is that not only are fraud activities increasing, but the ways of committing scams are multiplying as well.\u00a0<\/p>\n<p dir=\"ltr\">Companies struggle to detect fraud, and because of these fraudulent activities, many companies worldwide lose billions of dollars every year.<\/p>\n<p dir=\"ltr\">One more thing: for any company, customers&#8217; trust is essential to reaching a strong position in the business marketplace. 
If a company cannot find these fraudulent activities, it loses its customers&#8217; trust and then suffers from customer churn.<\/p>\n<h3 id=\"t-1600792769585\" class=\"\">Fraud Detection Approaches<\/h3>\n<p dir=\"ltr\">So companies started to detect these fraud activities automatically using smart technologies.\u00a0<\/p>\n<p dir=\"ltr\">First, companies <strong>hired a few people<\/strong> dedicated to detecting these kinds of activities or transactions. But these people must be experts in the field or domain, and the team should know how fraud occurs in that particular domain. This requires significant resources in effort and time.<\/p>\n<p dir=\"ltr\">Second, companies <strong>changed manual<\/strong> processes to rule-based solutions. But these also fail to detect fraud most of the time.\u00a0<\/p>\n<p dir=\"ltr\">In the real world, the ways of committing fraud change drastically day by day. Rule-based systems follow fixed <strong>rules and conditions<\/strong>; if a new fraud pattern differs from the known ones, these systems fail, and the new rule has to be coded and deployed.\u00a0<\/p>\n<p dir=\"ltr\">Now companies are trying to adopt <a href=\"https:\/\/dataaspirant.com\/category\/data-science-2\/\" target=\"_blank\" rel=\"noopener noreferrer\">Artificial Intelligence<\/a> or machine learning algorithms to detect fraud. 
Machine learning algorithms performed very well for this type of problem.\u00a0<\/p>\n<h2 id=\"t-1600792769586\" class=\"\">What is Credit Card Fraud Detection?<\/h2>\n<\/div>\n<div class=\"thrv_wrapper tve_image_caption\" data-css=\"tve-u-174b6b771ec\">\n<span class=\"tve_image_frame\"><img src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/plugins\/lazy-load\/images\/1x1.trans.gif?ssl=1\" data-lazy-src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/09\/3-Credit-Card-Fraud-Detection.png?resize=626%2C376&amp;ssl=1\" class=\"tve_image wp-image-6130\" alt=\"Credit Card Fraud Detection\" data-id=\"6130\" width=\"626\" data-init-width=\"750\" height=\"376\" data-init-height=\"450\" title=\"Credit Card Fraud Detection\" loading=\"lazy\" data-width=\"626\" data-height=\"376\" data-recalc-dims=\"1\"><img class=\"tve_image wp-image-6130\" alt=\"Credit Card Fraud Detection\" data-id=\"6130\" width=\"626\" data-init-width=\"750\" height=\"376\" data-init-height=\"450\" title=\"Credit Card Fraud Detection\" loading=\"lazy\" src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/09\/3-Credit-Card-Fraud-Detection.png?resize=626%2C376&amp;ssl=1\" data-width=\"626\" data-height=\"376\" data-recalc-dims=\"1\"><\/span><\/p>\n<p class=\"thrv-inline-text wp-caption-text\">Credit Card Fraud Detection<\/p>\n<\/div>\n<div class=\"thrv_wrapper thrv_text_element\">\n<p dir=\"ltr\">In the above section, we discussed the need for identifying fraudulent activities. 
The credit card fraud classification problem is used to find fraud transactions or fraudulent activities before they become a major problem to credit card companies.\u00a0<\/p>\n<p dir=\"ltr\">It uses the combination of <strong>fraud and non-fraud transactions<\/strong> from the historical data with different people&#8217;s credit card transaction data to estimate fraud or non-fraud on credit card transactions.<\/p>\n<p dir=\"ltr\">In this article, we are using the <strong>popular credit card dataset<\/strong>. Let\u2019s understand the data before we start building the fraud detection models.<\/p>\n<h2 id=\"t-1600792769587\" class=\"\">Understanding of Credit Card Dataset\u00a0<\/h2>\n<p dir=\"ltr\">For this credit card fraud classification problem, we are using the dataset which was downloaded from the <strong>Kaggle platform.<\/strong>\u00a0<\/p>\n<p dir=\"ltr\">You can find and download the dataset from <a href=\"https:\/\/www.kaggle.com\/mlg-ulb\/creditcardfraud\">here<\/a>.<\/p>\n<p dir=\"ltr\">Before going to the model development part, we should have some knowledge about our dataset.<\/p>\n<\/div>\n<div class=\"thrv_wrapper tve_image_caption\" data-css=\"tve-u-174b6b88f02\">\n<span class=\"tve_image_frame\"><img src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/plugins\/lazy-load\/images\/1x1.trans.gif?ssl=1\" data-lazy-src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/09\/4-Understand-Credit-Card-Dataset.png?resize=626%2C376&amp;ssl=1\" class=\"tve_image wp-image-6133\" alt=\"Understand Credit Card Dataset\" data-id=\"6133\" width=\"626\" data-init-width=\"750\" height=\"376\" data-init-height=\"450\" title=\"Understand Credit Card Dataset\" loading=\"lazy\" data-width=\"626\" data-height=\"376\" data-recalc-dims=\"1\"><img class=\"tve_image wp-image-6133\" alt=\"Understand Credit Card Dataset\" data-id=\"6133\" width=\"626\" data-init-width=\"750\" height=\"376\" data-init-height=\"450\" title=\"Understand Credit Card 
Dataset\" loading=\"lazy\" src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/09\/4-Understand-Credit-Card-Dataset.png?resize=626%2C376&amp;ssl=1\" data-width=\"626\" data-height=\"376\" data-recalc-dims=\"1\"><\/span><\/p>\n<p class=\"thrv-inline-text wp-caption-text\">Understand Credit Card Dataset<\/p>\n<\/div>\n<div class=\"thrv_wrapper thrv_text_element\">\n<p dir=\"ltr\">Such as\u00a0<\/p>\n<ol class=\"\">\n<li>What is the size of the dataset?<\/li>\n<li>How many features does the dataset have?<\/li>\n<li>What are the target values?<\/li>\n<li>How many samples fall under each target value?<\/li>\n<\/ol>\n<p dir=\"ltr\">Once we know this information about the dataset, we can decide what to do next.\u00a0<\/p>\n<p dir=\"ltr\">All of the questions above can be explored using the python <strong>pandas library<\/strong>.\u00a0<\/p>\n<p dir=\"ltr\">Let&#8217;s jump to the data exploration part to find answers to all the questions we have.<\/p>\n<h3 id=\"t-1600792769588\" class=\"\">Data Explorations<\/h3>\n<p dir=\"ltr\">First, we need to load the dataset. After downloading it, extract the data and keep the file in a dataset folder under the project folder.\u00a0<\/p>\n<p dir=\"ltr\">We can quickly load it using pandas.<\/p>\n<\/div>\n<div class=\"thrv_wrapper thrv_text_element\">\n<p dir=\"ltr\">Our dataset is a CSV (Comma Separated Values) file. We can use the <strong>read_csv<\/strong> function from pandas to read the file.\u00a0<\/p>\n<p dir=\"ltr\">Ok, now let&#8217;s find the answers to our dataset-related questions.<\/p>\n<\/div>\n<div class=\"thrv_wrapper thrv_text_element\">\n<p>The dataset has <strong>284807 rows and 31 features<\/strong>. The shape attribute returns a tuple with the dataset&#8217;s number of rows and number of columns.<\/p>\n<p>We can see what the dataset looks like. 
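The loading step described above can be sketched as follows. Since the Kaggle file is not bundled here, a tiny in-memory CSV stands in for it; with the real download you would simply point `pd.read_csv` at the extracted file's path.

```python
import io
import pandas as pd

# Stand-in for the Kaggle file; with the real download you would call
# something like pd.read_csv("dataset/creditcard.csv") (path is an assumption).
csv_data = io.StringIO(
    "V1,V2,Amount,Class\n"
    "-1.36,0.27,149.62,0\n"
    "1.19,0.26,2.69,0\n"
    "-3.04,2.11,378.66,1\n"
)
fraud_df = pd.read_csv(csv_data)

# shape is a (rows, columns) tuple; the full dataset gives (284807, 31).
print(fraud_df.shape)  # (3, 4) for this stand-in
```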
The below command showcases \u00a0only five rows, <strong>head()<\/strong> by default, gives <strong>5<\/strong> samples.\u00a0<\/p>\n<\/div>\n<div class=\"thrv_wrapper tve_image_caption\" data-css=\"tve-u-174b6cc8887\">\n<span class=\"tve_image_frame\"><img src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/plugins\/lazy-load\/images\/1x1.trans.gif?ssl=1\" data-lazy-src=\"https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/09\/credit-card-data-observations.png?resize=626%2C142&amp;ssl=1\" class=\"tve_image wp-image-6140\" alt=\"credit card data observations\" data-id=\"6140\" width=\"626\" data-init-width=\"1358\" height=\"142\" data-init-height=\"308\" title=\"credit card data observations\" loading=\"lazy\" data-width=\"626\" data-height=\"142\" data-recalc-dims=\"1\"><img class=\"tve_image wp-image-6140\" alt=\"credit card data observations\" data-id=\"6140\" width=\"626\" data-init-width=\"1358\" height=\"142\" data-init-height=\"308\" title=\"credit card data observations\" loading=\"lazy\" src=\"https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/09\/credit-card-data-observations.png?resize=626%2C142&amp;ssl=1\" data-width=\"626\" data-height=\"142\" data-recalc-dims=\"1\"><\/span><\/p>\n<p class=\"thrv-inline-text wp-caption-text\">Credit card data observations<\/p>\n<\/div>\n<div class=\"thrv_wrapper thrv_text_element\">\n<p dir=\"ltr\">If you want to see more samples from the top, pass the number representing the number of samples you want to see like <strong>fraud_df.head(10).<\/strong>\u00a0<\/p>\n<p dir=\"ltr\">You can also see bottom samples by using the <strong>tail()<\/strong> function. 
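The head/tail behaviour described above can be sketched with a small stand-in frame (in the article, `fraud_df` comes from `pd.read_csv` on the Kaggle file):

```python
import pandas as pd

# Small stand-in frame with the same column style as the credit card data.
fraud_df = pd.DataFrame({
    "V1": [-1.36, 1.19, -3.04, 0.45, 0.96, -0.73, 1.23],
    "Amount": [149.62, 2.69, 378.66, 123.50, 69.99, 3.67, 40.80],
    "Class": [0, 0, 1, 0, 0, 1, 0],
})

print(fraud_df.head())         # first 5 rows (the default)
print(fraud_df.head(3))        # pass a number for more or fewer rows
print(fraud_df.tail())         # last 5 rows, same idea from the bottom
print(list(fraud_df.columns))  # all column/feature names
```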
Both work in the same way.<\/p>\n<p dir=\"ltr\">We can also get the full list of feature names.<\/p>\n<\/div>\n<div class=\"thrv_wrapper thrv_text_element\">\n<p dir=\"ltr\">From this, we know Class is the target variable, and all the remaining columns are features of our dataset.<\/p>\n<p dir=\"ltr\">Let&#8217;s see what unique values the target variable has.<\/p>\n<\/div>\n<div class=\"thrv_wrapper thrv_text_element\">\n<p dir=\"ltr\">The target variable Class has <strong>0 and 1<\/strong> values. Here<\/p>\n<ul class=\"\">\n<li>0 for <strong>non-fraudulent<\/strong> transactions<\/li>\n<li>1 for <strong>fraudulent<\/strong> transactions<\/li>\n<\/ul>\n<p dir=\"ltr\">Because we aim to find fraudulent transactions, the dataset&#8217;s target assigns the positive value to them.\u00a0<\/p>\n<p dir=\"ltr\">So what is still pending among our data exploration questions?\u00a0<\/p>\n<p dir=\"ltr\">Right, we still have to check how many samples each target class has.<\/p>\n<\/div>\n<div class=\"thrv_wrapper thrv_text_element\">\n<p dir=\"ltr\">We have <strong>284315<\/strong> non-fraudulent transaction samples &amp; <strong>492<\/strong> fraudulent transaction samples.<\/p>\n<p dir=\"ltr\">We will discuss the data in more detail in the later sections of this article.\u00a0<\/p>\n<p dir=\"ltr\">You are going to learn how this distribution of samples affects the <strong>model&#8217;s performance<\/strong> and how we can evaluate model performance for this kind of data.<\/p>\n<p dir=\"ltr\">For now, you know the basics of the dataset, such as<\/p>\n<ul class=\"\">\n<li>Dataset size<\/li>\n<li>Number of samples (rows) and features (columns)<\/li>\n<li>Names of the features<\/li>\n<li>The target variable&#8217;s values<\/li>\n<\/ul>\n<p dir=\"ltr\">Now we will discuss different data preprocessing techniques for our dataset.\u00a0<\/p>\n<p dir=\"ltr\">These data preprocessing techniques are completely different from the text preprocessing techniques we discussed 
in the <a href=\"https:\/\/dataaspirant.com\/nlp-text-preprocessing-techniques-implementation-python\/\" target=\"_blank\" rel=\"noopener noreferrer\">natural language processing data preprocessing techniques<\/a> article\u00a0<\/p>\n<h2 id=\"t-1600793624237\" class=\"\">Credit Card Data Preprocessing<\/h2>\n<\/div>\n<div class=\"thrv_wrapper tve_image_caption\" data-css=\"tve-u-174b6d6310c\">\n<span class=\"tve_image_frame\"><img src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/plugins\/lazy-load\/images\/1x1.trans.gif?ssl=1\" data-lazy-src=\"https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/09\/5-credit-card-data-preprocessing.png?resize=626%2C376&amp;ssl=1\" class=\"tve_image wp-image-6147\" alt=\"Credit card data preprocessing\" data-id=\"6147\" width=\"626\" data-init-width=\"750\" height=\"376\" data-init-height=\"450\" title=\"Credit card data preprocessing\" loading=\"lazy\" data-width=\"626\" data-height=\"376\" data-recalc-dims=\"1\"><img class=\"tve_image wp-image-6147\" alt=\"Credit card data preprocessing\" data-id=\"6147\" width=\"626\" data-init-width=\"750\" height=\"376\" data-init-height=\"450\" title=\"Credit card data preprocessing\" loading=\"lazy\" src=\"https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/09\/5-credit-card-data-preprocessing.png?resize=626%2C376&amp;ssl=1\" data-width=\"626\" data-height=\"376\" data-recalc-dims=\"1\"><\/span><\/p>\n<p class=\"thrv-inline-text wp-caption-text\">Credit card data preprocessing<\/p>\n<\/div>\n<div class=\"thrv_wrapper thrv_text_element\">\n<p dir=\"ltr\">Preprocessing is the process of cleaning the dataset. In this step, we will apply different methods to clean the raw data to feed more meaningful data for the modeling phase. 
These methods include:<\/p>\n<ul class=\"\">\n<li>Removing duplicates or irrelevant samples<\/li>\n<li>Updating missing values with the most relevant values\u00a0<\/li>\n<li>Converting one data type to another (for example, categorical values to integers)<\/li>\n<\/ul>\n<p dir=\"ltr\">Okay, now we will spend a couple of minutes checking the dataset and applying the corresponding techniques to clean the data.\u00a0<\/p>\n<p dir=\"ltr\">This step aims to improve the quality of the data.<\/p>\n<h3 id=\"t-1600793624238\" class=\"\">Removing irrelevant columns\/features<\/h3>\n<p>In our dataset, the only irrelevant or not useful feature is Time. So we can drop that feature from the dataset.<\/p>\n<\/div>\n<div class=\"thrv_wrapper thrv_text_element\">\n<p dir=\"ltr\">If you want to drop more features from the data, call the <strong>drop()<\/strong> method with a list of feature names.\u00a0<\/p>\n<p dir=\"ltr\">We can observe that no feature named Time appears in the list of feature names after dropping the Time feature\/column.<\/p>\n<h3 id=\"t-1600793624239\" class=\"\">Checking null or nan values\u00a0<\/h3>\n<p dir=\"ltr\">We can check the datatypes of all features and, at the same time, the number of non-null values of all features by using <strong>info()<\/strong> from pandas.\u00a0<\/p>\n<p dir=\"ltr\">Null or nan values simply mean there is no value recorded for a particular feature or attribute. <\/p>\n<p dir=\"ltr\">For example, these nan or null values appear when a customer or <strong>user does not <\/strong>fill in all the information on a form. 
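The column-dropping and null-checking steps described above can be sketched as follows, again with a small stand-in frame in place of the real data:

```python
import pandas as pd

# Stand-in frame with a Time column; the real one comes from read_csv.
fraud_df = pd.DataFrame({
    "Time": [0.0, 1.0, 2.0],
    "V1": [-1.36, 1.19, -3.04],
    "Amount": [149.62, 2.69, 378.66],
    "Class": [0, 0, 1],
})

# Drop the irrelevant Time feature; add more names to the list to drop
# several columns at once.
fraud_df = fraud_df.drop(["Time"], axis=1)
print(list(fraud_df.columns))  # Time no longer appears

# info() prints the row count, column names, non-null counts and dtypes,
# which is how we check for null/NaN values in one shot.
fraud_df.info()
```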
Blank values are treated as null or nan values.\u00a0<\/p>\n<p dir=\"ltr\">We can find all of this information just by using info() from pandas.<\/p>\n<\/div>\n<div class=\"thrv_wrapper thrv_text_element\">\n<p dir=\"ltr\">See the result of the dataset&#8217;s info();\u00a0<\/p>\n<p dir=\"ltr\">it provides all the key information about our dataset, such as\u00a0<\/p>\n<ul class=\"\">\n<li>Total number of samples or rows<\/li>\n<li>Column names<\/li>\n<li>Number of non-null values<\/li>\n<li>The data type of each column<\/li>\n<\/ul>\n<p dir=\"ltr\">Our dataset doesn\u2019t have any null values, because every column reports 284807 non-null entries (indexed from 0 to 284806); all features have the same number of samples\/rows. <\/p>\n<h3 id=\"t-1600793624240\" class=\"\">Data Transformation<\/h3>\n<p dir=\"ltr\">Except for the Amount column, all columns\u2019 values already lie within a small range. So let&#8217;s transform the Amount column&#8217;s values to a smaller range of numbers.\u00a0<\/p>\n<p>We can simply do this by using <strong>StandardScaler<\/strong> from the sklearn library.<\/p>\n<\/div>\n<div class=\"thrv_wrapper thrv_text_element\">\n<p dir=\"ltr\">See how the <strong>Amount<\/strong> feature&#8217;s values are in a high range compared to the other features&#8217; values.\u00a0<\/p>\n<p dir=\"ltr\">We will bring these values into a smaller range.<\/p>\n<\/div>\n<div class=\"thrv_wrapper thrv_text_element\">\n<p dir=\"ltr\">The scaled result is added to the data frame as a new column named <strong>norm_amount<\/strong>, after which we drop the Amount column because it is no longer needed.<\/p>\n<h3 id=\"t-1600793624241\" class=\"\">Splitting dataset\u00a0<\/h3>\n<p dir=\"ltr\">Now we will take all the independent columns (the target column is dependent, and all the remaining columns are independent) as <strong>X<\/strong> and the <strong>target<\/strong> variable as <strong>y<\/strong>.<\/p>\n<\/div>\n<div class=\"thrv_wrapper thrv_text_element\">\n<p dir=\"ltr\">Now we need to 
split the whole dataset into train and test sets. The training data is used to build the model, and the test dataset is used to evaluate the trained models.\u00a0<\/p>\n<p dir=\"ltr\">Using the <strong>train_test_split<\/strong> method from the sklearn library, we can split the dataset into train and test sets.<\/p>\n<\/div>\n<div class=\"thrv_wrapper thrv_text_element tve-froala fr-box fr-basic\">\n<p dir=\"ltr\">Now our dataset is ready for building models. Let&#8217;s jump to developing the model using machine learning algorithms such as the decision tree and <a href=\"https:\/\/dataaspirant.com\/random-forest-classifier-python-scikit-learn\/\" target=\"_blank\" class=\"tve-froala\" rel=\"noopener noreferrer\">random forest classification algorithms<\/a> from the sklearn module.<\/p>\n<h2 id=\"t-1600793624242\" class=\"\">Building Credit Card Fraud Detection using Machine Learning algorithms<\/h2>\n<p dir=\"ltr\">Now we can build models using different machine learning algorithms. Before creating a model, we need to identify the type of problem statement, that is, whether it calls for supervised or unsupervised algorithms.\u00a0<\/p>\n<p dir=\"ltr\">Our problem statement falls under supervised learning, meaning the dataset has a target value for each row or sample.\u00a0<\/p>\n<p dir=\"ltr\"><a href=\"https:\/\/dataaspirant.com\/supervised-and-unsupervised-learning\/\" target=\"_blank\" rel=\"noopener noreferrer\">Supervised machine learning algorithms<\/a> come in two types:\u00a0<\/p>\n<ul class=\"\">\n<li>Classification Algorithms<\/li>\n<li>Regression Algorithms<\/li>\n<\/ul>\n<p class=\"\" dir=\"ltr\">Which type of algorithm does our problem statement belong to?\u00a0<\/p>\n<p class=\"\" dir=\"ltr\">Yeah, exactly.<\/p>\n<p class=\"\" dir=\"ltr\">Credit card fraud detection is a <strong>classification<\/strong> problem. 
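Putting the preprocessing steps described above together (scaling Amount into norm_amount, dropping the raw column, and splitting into train and test sets), a minimal sketch on a tiny stand-in frame looks like this; `test_size=0.3` is our assumption, since the article does not state the split it uses:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Tiny stand-in frame; the real one is the Kaggle credit card data.
fraud_df = pd.DataFrame({
    "V1": [-1.36, 1.19, -3.04, 0.45, 0.96, -0.73, 1.23, -1.80, 0.10, 1.50],
    "Amount": [149.62, 2.69, 378.66, 123.50, 69.99, 3.67, 40.80, 93.20, 3.68, 7.80],
    "Class": [0, 0, 1, 0, 0, 1, 0, 0, 0, 1],
})

# Scale Amount into a small range, store it as norm_amount,
# then drop the raw Amount column.
scaler = StandardScaler()
fraud_df["norm_amount"] = scaler.fit_transform(fraud_df[["Amount"]]).ravel()
fraud_df = fraud_df.drop(["Amount"], axis=1)

# Independent columns as X, target column as y, then the train/test split.
X = fraud_df.drop(["Class"], axis=1)
y = fraud_df["Class"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
print(X_train.shape, X_test.shape)  # (7, 2) (3, 2)
```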
Target variables of classification problems have integer (0, 1) or categorical (fraud, non-fraud) values. The target variable of our dataset, \u2018Class\u2019, has only two labels &#8211; 0 (non-fraudulent) and 1 (fraudulent).<\/p>\n<p class=\"\" dir=\"ltr\">Before going further, let us give an introduction to both decision tree classification and random forest classification, as in this article we are going to use these two algorithms to build the credit card fraudulent activity identification model.<\/p>\n<ul class=\"\">\n<li class=\"\">Decision Tree Classification Algorithm<\/li>\n<li class=\"\">Random Forest Classification Algorithm<\/li>\n<\/ul>\n<h3 id=\"t-1600793624243\" class=\"\">Decision Tree Algorithm Overview<\/h3>\n<p dir=\"ltr\">The decision tree is one of the simplest and most popular classification algorithms. To build the model, the decision tree algorithm considers all the provided features of the data and identifies the <strong>important<\/strong> features.<\/p>\n<p dir=\"ltr\">Because of this advantage, decision tree algorithms are also used to measure feature importance, which is useful for <strong>handpicking<\/strong> features.\u00a0<\/p>\n<p dir=\"ltr\">Once the important features are identified, the model trains on the training data and comes up with a <strong>set of rules<\/strong>. These rules are used to predict future cases, or for the test dataset.\u00a0<\/p>\n<p dir=\"ltr\">This is a quick overview of the decision tree algorithm. If you want to learn more about the algorithm and implement it in python, have a look at the articles written by our team.<\/p>\n<p dir=\"ltr\">Now let\u2019s see a quick overview of the random forest algorithm.<\/p>\n<h3 id=\"t-1600793624244\" class=\"\">Random Forest Algorithm Overview<\/h3>\n<p dir=\"ltr\">The random forest algorithm falls under the <strong>ensemble learning algorithm<\/strong> category. 
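The feature-importance idea from the decision tree overview above can be seen directly in sklearn; here synthetic data from `make_classification` stands in for the credit card features:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the credit card features.
X, y = make_classification(n_samples=500, n_features=5, n_informative=2,
                           random_state=42)

tree = DecisionTreeClassifier(random_state=42)
tree.fit(X, y)

# feature_importances_ scores each feature's contribution to the splits;
# higher scores mark the features worth handpicking.
print(tree.feature_importances_)
```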
In the random forest algorithm, we build N decision tree models. \u00a0<\/p>\n<p dir=\"ltr\">All the models predict the target value. Using the <strong>majority voting<\/strong> approach the final target value will be predicted.<\/p>\n<p dir=\"ltr\">For building the individual decision tree, the random forest algorithm randomly creates the sample dataset. These sample datasets are called as the <strong>bootstrap samples<\/strong>.<\/p>\n<p dir=\"ltr\">Suppose we want to build the N decision trees to create the forest, the algorithm first creates N bootstrap samples. Later for each bootstrap sample, one decision tree model will build.<\/p>\n<p dir=\"ltr\">This is a quick overview of the random forest algorithm, If you want to learn more, please have a look at the below articles.<\/p>\n<p dir=\"ltr\">Now let\u2019s go to the implementation part, the crazy one \ud83d\ude42<\/p>\n<h2 id=\"t-1600793624245\" class=\"\">Credit Card Fraud Detection with Decision Tree Algorithm<\/h2>\n<\/div>\n<div class=\"thrv_wrapper tve_image_caption\" data-css=\"tve-u-174b6e529bd\">\n<span class=\"tve_image_frame\"><img src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/plugins\/lazy-load\/images\/1x1.trans.gif?ssl=1\" data-lazy-src=\"https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/09\/6-credit-card-fraud-detection-with-decision-tree.png?resize=626%2C376&amp;ssl=1\" class=\"tve_image wp-image-6161\" alt=\"Credit card fraud detection with decision tree\" data-id=\"6161\" width=\"626\" data-init-width=\"750\" height=\"376\" data-init-height=\"450\" title=\"Credit card fraud detection with decision tree\" loading=\"lazy\" data-width=\"626\" data-height=\"376\" data-recalc-dims=\"1\"><img class=\"tve_image wp-image-6161\" alt=\"Credit card fraud detection with decision tree\" data-id=\"6161\" width=\"626\" data-init-width=\"750\" height=\"376\" data-init-height=\"450\" title=\"Credit card fraud detection with decision tree\" loading=\"lazy\" 
src=\"https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/09\/6-credit-card-fraud-detection-with-decision-tree.png?resize=626%2C376&amp;ssl=1\" data-width=\"626\" data-height=\"376\" data-recalc-dims=\"1\"><\/span><\/p>\n<p class=\"thrv-inline-text wp-caption-text\">Credit card fraud detection with decision tree<\/p>\n<\/div>\n<div class=\"thrv_wrapper thrv_text_element\">\n<p dir=\"ltr\">We will use the <strong>DecisionTreeClassifier<\/strong> class from the sklearn library to train and evaluate models. We use the <strong>X_train and y_train<\/strong> data for training. X_train is the training dataset with features, and y_train holds the target labels.<\/p>\n<h3 id=\"t-1600793624246\" class=\"\">Decision tree algorithm Implementation using python sklearn library<\/h3>\n<\/div>\n<div class=\"thrv_wrapper thrv_text_element\">\n<p>The output for the above code is listed below.<\/p>\n<\/div>\n<div class=\"thrv_wrapper thrv_text_element\">\n<p dir=\"ltr\">Wow, our decision tree classifier gives <strong>99%<\/strong> accuracy on the test data.\u00a0<\/p>\n<p dir=\"ltr\">But why is the <strong>f1-score<\/strong> for label 1 so low?\u00a0<\/p>\n<p dir=\"ltr\">Remember this point; we will discuss these <strong>metrics&#8217; performances<\/strong> in a coming section of this article, where we address the question:<\/p>\n<p dir=\"ltr\">Why is the accuracy evaluation metric not suitable for this problem?<\/p>\n<h2 id=\"t-1600793624247\" class=\"\">Credit Card Fraud Detection with Random Forest Algorithm<\/h2>\n<\/div>\n<div class=\"thrv_wrapper tve_image_caption\" data-css=\"tve-u-174b6ea12b1\">\n<span class=\"tve_image_frame\"><img src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/plugins\/lazy-load\/images\/1x1.trans.gif?ssl=1\" data-lazy-src=\"https:\/\/i0.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/09\/7-credit-card-fraud-detection-with-random-forest.png?resize=626%2C376&amp;ssl=1\" class=\"tve_image wp-image-6166\" alt=\"Credit card fraud detection with 
random forest\" data-id=\"6166\" width=\"626\" data-init-width=\"750\" height=\"376\" data-init-height=\"450\" title=\"Credit card fraud detection with random forest\" loading=\"lazy\" data-width=\"626\" data-height=\"376\" data-recalc-dims=\"1\"><img class=\"tve_image wp-image-6166\" alt=\"Credit card fraud detection with random forest\" data-id=\"6166\" width=\"626\" data-init-width=\"750\" height=\"376\" data-init-height=\"450\" title=\"Credit card fraud detection with random forest\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/09\/7-credit-card-fraud-detection-with-random-forest.png?resize=626%2C376&amp;ssl=1\" data-width=\"626\" data-height=\"376\" data-recalc-dims=\"1\"><\/span><\/p>\n<p class=\"thrv-inline-text wp-caption-text\">Credit card fraud detection with random forest<\/p>\n<\/div>\n<div class=\"thrv_wrapper thrv_text_element\">\n<p dir=\"ltr\">Same as the above decision tree implementation, we use X_train and y_train dataset for training purposes and X_test for evaluation. Here we train the ensemble technique model of <strong>RandomForestClassifier<\/strong> from the sklearn. 
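Both the decision tree and random forest steps described above can be sketched as follows. Since the Kaggle file is not bundled here, a synthetic, heavily imbalanced dataset from `make_classification` (about 99% class 0) stands in for the credit card data, and the hyperparameters shown are our assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced stand-in for the credit card data (~99% class 0).
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.99],
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Decision tree: fit on the training split, evaluate on the test split.
dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(X_train, y_train)
dt_pred = dt_model.predict(X_test)
print("Decision tree accuracy:", accuracy_score(y_test, dt_pred))
print(classification_report(y_test, dt_pred))

# Random forest: an ensemble of trees combined by majority voting.
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
rf_pred = rf_model.predict(X_test)
print("Random forest accuracy:", accuracy_score(y_test, rf_pred))
print(classification_report(y_test, rf_pred))
```

With such skewed classes, both models report very high accuracy while the per-class report often shows a much weaker f1-score on the rare class, which is exactly the behaviour discussed next.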
We can see the variations in the evaluation results.<\/p>\n<h3 id=\"t-1600823366412\" class=\"\">Random forest algorithm implementation using the sklearn library<\/h3>\n<\/div>\n<div class=\"thrv_wrapper thrv_text_element\">\n<p>The output for the above code is listed below.<\/p>\n<\/div>\n<div class=\"thrv_wrapper thrv_text_element\">\n<p dir=\"ltr\">Wow, this model&#8217;s accuracy is also a great <strong>99%<\/strong>, but what about the remaining evaluation metrics, such as precision, recall, and F1-score?\u00a0<\/p>\n<p dir=\"ltr\">Let&#8217;s discuss why these variations happen in the coming section.<\/p>\n<h2 id=\"t-1600823366413\" class=\"\">Why Is Accuracy Not Suitable for Data Imbalance Problems?<\/h2>\n<\/div>\n<div class=\"thrv_wrapper tve_image_caption\" data-css=\"tve-u-174b88996ac\">\n<span class=\"tve_image_frame\"><img src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/plugins\/lazy-load\/images\/1x1.trans.gif?ssl=1\" data-lazy-src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/09\/8-How-to-measure-performance-for-data-imbalance-problems.png?resize=613%2C368&amp;ssl=1\" class=\"tve_image wp-image-6170\" alt=\"How to measure performance for data imbalance problems\" data-id=\"6170\" width=\"613\" data-init-width=\"750\" height=\"368\" data-init-height=\"450\" title=\"How to measure performance for data imbalance problems\" loading=\"lazy\" data-width=\"613\" data-height=\"368\" data-recalc-dims=\"1\"><img class=\"tve_image wp-image-6170\" alt=\"How to measure performance for data imbalance problems\" data-id=\"6170\" width=\"613\" data-init-width=\"750\" height=\"368\" data-init-height=\"450\" title=\"How to measure performance for data imbalance problems\" loading=\"lazy\" src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/09\/8-How-to-measure-performance-for-data-imbalance-problems.png?resize=613%2C368&amp;ssl=1\" data-width=\"613\" data-height=\"368\" data-recalc-dims=\"1\"><\/span><\/p>\n<p 
class=\"thrv-inline-text wp-caption-text\">How to measure performance for data imbalance problems<\/p>\n<\/div>\n<div class=\"thrv_wrapper thrv_text_element tve-froala fr-box fr-basic\">\n<p dir=\"ltr\">Why should we not apply or consider accuracy as the performance metric for this specific problem?<\/p>\n<p dir=\"ltr\">Just take some time and think about it.<\/p>\n<p dir=\"ltr\">Model training is complete; we got 99% accuracy on the test set.\u00a0<\/p>\n<p dir=\"ltr\">So why this section?\u00a0<\/p>\n<p dir=\"ltr\">We have <a href=\"https:\/\/dataaspirant.com\/six-popular-classification-evaluation-metrics-in-machine-learning\/\" target=\"_blank\" class=\"tve-froala\" rel=\"noopener noreferrer\">various classification evaluation metrics<\/a> to quantify the performance of the built model; accuracy is just one of them. What other methods can we apply?<\/p>\n<p dir=\"ltr\">Now we will discuss our dataset and what the <strong>best evaluation metrics<\/strong> are for these kinds of problems.<\/p>\n<p dir=\"ltr\">For this discussion, we have to remember two things that were previously discussed.<\/p>\n<ol class=\"\">\n<li>The number of samples for each Class (target variable) value.<\/li>\n<li>The evaluation metrics of both the decision tree and random forest classification models.<\/li>\n<\/ol>\n<p dir=\"ltr\">Do you remember the number of samples\/rows for each target value?\u00a0<\/p>\n<p dir=\"ltr\">No? Okay, let us check that number.<\/p>\n<\/div>\n<div class=\"thrv_wrapper thrv_text_element\">\n<p dir=\"ltr\">See, the number of samples for <strong>Class-1 (fraudulent)<\/strong> is far smaller than the number of samples for <strong>Class-0 (non-fraudulent)<\/strong>.\u00a0<\/p>\n<p dir=\"ltr\">This kind of dataset is called <strong>unbalanced<\/strong> data. 
This means one class label\u2019s samples are far more numerous and dominate the other class label\u2019s.\u00a0<\/p>\n<p dir=\"ltr\">For a balanced dataset, accuracy is suitable because it is simply the count of correctly predicted samples divided by the total number of samples.\u00a0<\/p>\n<p dir=\"ltr\"><em><strong>Accuracy = number of correctly predicted samples \/ total number of samples<\/strong><\/em><\/p>\n<p dir=\"ltr\">For example:\u00a0<\/p>\n<p dir=\"ltr\">Suppose our dataset has 20 samples, 2 of Class 0 &amp; 18 of Class 1. Our trained model correctly predicted 17 of the 18 Class-1 samples and 0 of the 2 Class-0 samples.\u00a0<\/p>\n<p dir=\"ltr\">What is the accuracy value here? 17 \/ 20 = 85%.<\/p>\n<p dir=\"ltr\">But this is misleading, right? The model doesn\u2019t predict even one Class-0 sample correctly, yet we still got 85% accuracy.\u00a0<\/p>\n<p dir=\"ltr\">For an unbalanced dataset, several better evaluation metrics are available. 
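The 85% figure can be checked directly. This small sketch reproduces the 20-sample example, with the label vectors constructed by hand to match the counts above:

```python
from sklearn.metrics import accuracy_score, recall_score

# 20 samples: 18 of Class-1 and 2 of Class-0
y_true = [1] * 18 + [0] * 2
# The model gets 17 of the 18 Class-1 samples right
# and both Class-0 samples wrong
y_pred = [1] * 17 + [0] + [1, 1]

print(accuracy_score(y_true, y_pred))             # 0.85 -> "85% accuracy"
print(recall_score(y_true, y_pred, pos_label=0))  # 0.0  -> no Class-0 sample caught
```

Accuracy looks fine at 0.85, while the recall on Class-0 is exactly zero, which is the whole point of the example.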
In the next section, we will discuss this.<\/p>\n<h2 id=\"t-1600823366414\" class=\"\">Suitable evaluation metrics for imbalanced data<\/h2>\n<p dir=\"ltr\">So which metrics are suitable for unbalanced data?<\/p>\n<\/div>\n<div class=\"thrv_wrapper tve_image_caption\" data-css=\"tve-u-174b88df3c1\">\n<span class=\"tve_image_frame\"><img src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/plugins\/lazy-load\/images\/1x1.trans.gif?ssl=1\" data-lazy-src=\"https:\/\/i0.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/09\/9-Evaluation-Metrics-for-imbalance-data.png?resize=613%2C368&amp;ssl=1\" class=\"tve_image wp-image-6174\" alt=\"Evaluation Metrics for imbalance data\" data-id=\"6174\" width=\"613\" data-init-width=\"750\" height=\"368\" data-init-height=\"450\" title=\"Evaluation Metrics for imbalance data\" loading=\"lazy\" data-width=\"613\" data-height=\"368\" data-recalc-dims=\"1\"><img class=\"tve_image wp-image-6174\" alt=\"Evaluation Metrics for imbalance data\" data-id=\"6174\" width=\"613\" data-init-width=\"750\" height=\"368\" data-init-height=\"450\" title=\"Evaluation Metrics for imbalance data\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/09\/9-Evaluation-Metrics-for-imbalance-data.png?resize=613%2C368&amp;ssl=1\" data-width=\"613\" data-height=\"368\" data-recalc-dims=\"1\"><\/span><\/p>\n<p class=\"thrv-inline-text wp-caption-text\">Evaluation Metrics for imbalance data<\/p>\n<\/div>\n<div class=\"thrv_wrapper thrv_text_element\">\n<p dir=\"ltr\">We can use any of the below-mentioned metrics for unbalanced or skewed datasets.<\/p>\n<ul class=\"\">\n<li>Recall<\/li>\n<li>Precision<\/li>\n<li>F1-score<\/li>\n<li>Area Under ROC curve.<\/li>\n<\/ul>\n<p dir=\"ltr\">We can see the huge differences among the evaluation metrics for both classification models (decision tree &amp; random forest).\u00a0<\/p>\n<p dir=\"ltr\">Do you remember that at the model development stage we mentioned accuracy, 
classification report, etc.?\u00a0<\/p>\n<p dir=\"ltr\">Okay, let&#8217;s see the results here.<\/p>\n<h3 id=\"t-1600823366415\" class=\"\">Decision Tree Classification model results<\/h3>\n<\/div>\n<h3 class=\"\" id=\"t-1600823366416\">Random Forest Classification model results<\/h3>\n<div class=\"thrv_wrapper thrv_text_element\">\n<p dir=\"ltr\">Here we have to discuss a few terms and formulae related to the confusion matrix, precision, recall &amp; F1-score.<\/p>\n<\/div>\n<div class=\"thrv_wrapper tve_image_caption\" data-css=\"tve-u-174b8f155ec\"><span class=\"tve_image_frame\"><img src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/plugins\/lazy-load\/images\/1x1.trans.gif?ssl=1\" data-lazy-src=\"https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/08\/12-Confusion-matrix-full-representation.png?resize=626%2C408&amp;ssl=1\" class=\"tve_image wp-image-4315\" alt=\"Confusion matrix full representation\" data-id=\"4315\" width=\"626\" data-init-width=\"1024\" height=\"408\" data-init-height=\"667\" title=\"Confusion matrix full representation\" loading=\"lazy\" data-width=\"626\" data-height=\"408\" data-recalc-dims=\"1\"><img class=\"tve_image wp-image-4315\" alt=\"Confusion matrix full representation\" data-id=\"4315\" width=\"626\" data-init-width=\"1024\" height=\"408\" data-init-height=\"667\" title=\"Confusion matrix full representation\" loading=\"lazy\" src=\"https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/08\/12-Confusion-matrix-full-representation.png?resize=626%2C408&amp;ssl=1\" data-width=\"626\" data-height=\"408\" data-recalc-dims=\"1\"><\/span><\/div>\n<div class=\"thrv_wrapper thrv_text_element\">\n<ol class=\"\">\n<li>\n<strong>True Positive (TP):-<\/strong> \u00a0<\/li>\n<\/ol>\n<p dir=\"ltr\">The number of positive labels correctly predicted by trained models.\u00a0 This means the number of Class-1 samples correctly predicted as Class-1.<\/p>\n<ol start=\"2\" class=\"\">\n<li><strong>True Negative (TN):- 
<\/strong><\/li>\n<\/ol>\n<p dir=\"ltr\">The number of negative labels correctly predicted by trained models.\u00a0 This means the number of Class-0 samples correctly predicted as Class-0.<\/p>\n<ol start=\"3\" class=\"\">\n<li><strong>False Positive (FP):- \u00a0<\/strong><\/li>\n<\/ol>\n<p dir=\"ltr\">The number of samples incorrectly predicted as positive by trained models. This means the number of Class-0 samples incorrectly predicted as Class-1.<\/p>\n<ol start=\"4\" class=\"\">\n<li>\n<strong>False Negative (FN):- <\/strong>\u00a0<\/li>\n<\/ol>\n<p dir=\"ltr\">The number of samples incorrectly predicted as negative by trained models.\u00a0 This means the number of Class-1 samples incorrectly predicted as Class-0.<\/p>\n<\/div>\n<div class=\"thrv_wrapper thrv_text_element tve-froala fr-box fr-basic\">\n<h4 class=\"\">Formulae<\/h4>\n<ul class=\"\">\n<li>Recall = TP \/ (TP + FN)<\/li>\n<li>Precision = TP \/ (TP + FP)<\/li>\n<li>F1-Score = 2*P*R \/ (P + R), where P is Precision and R is Recall<\/li>\n<\/ul>\n<p dir=\"ltr\">Both classification models got accuracy scores of 99%.\u00a0<\/p>\n<p dir=\"ltr\">But when we observe the classification reports of both classifiers, the f1-score for Class-0 is 100%, while for Class-1 the F1-scores are significantly lower.\u00a0<\/p>\n<p dir=\"ltr\">All these variations occur due to the unbalanced or skewed dataset.\u00a0<\/p>\n<p dir=\"ltr\">Why is the f1-score for Class-0 100%?\u00a0<\/p>\n<p dir=\"ltr\">Because of the number of samples for Class-0 (about 2 lakh, i.e., 200,000), which is far higher than the number of Class-1 samples.<\/p>\n<p dir=\"ltr\">So what we need to do here is handle the unbalanced dataset. 
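The definitions and formulae above can be verified on a tiny made-up example (the label vectors here are purely illustrative and not from the credit card data):

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# 4 positive (Class-1) and 6 negative (Class-0) samples
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

# For labels (0, 1), confusion_matrix returns [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)                 # 4 2 1 3

recall = tp / (tp + fn)               # 3 / 4 = 0.75
precision = tp / (tp + fp)            # 3 / 5 = 0.6
f1 = 2 * precision * recall / (precision + recall)

print(recall == recall_score(y_true, y_pred))        # the formula matches sklearn
print(precision == precision_score(y_true, y_pred))
```

Note that FP counts Class-0 samples predicted as Class-1, and FN counts Class-1 samples predicted as Class-0, matching the corrected definitions above.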
If you want to learn more about it, check the <a href=\"https:\/\/dataaspirant.com\/handle-imbalanced-data-machine-learning\/\" class=\"tve-froala\">Best ways to handle unbalanced data in the machine learning<\/a> article, which explains various ways to handle imbalanced data.<\/p>\n<p dir=\"ltr\">One more thing is left for discussion in this section: the area under the ROC curve.<\/p>\n<h3 id=\"t-1600823366417\" class=\"\">AUC and ROC Curves<\/h3>\n<p dir=\"ltr\">The Area Under the ROC curve is another evaluation metric for classification problems, and it is well suited to skewed datasets. It tells us about model performance, such as the model&#8217;s capability to distinguish between the target classes.\u00a0<\/p>\n<p dir=\"ltr\">An effective model has a higher Area Under the <strong>ROC curve<\/strong> value; in other words, we measure a model&#8217;s ability to separate the classes by using the Area Under the ROC curve.<\/p>\n<p dir=\"ltr\">Good models have an AUC value <strong>near 1<\/strong>, while a model with no ability to separate the classes has an AUC value <strong>near 0.5.<\/strong><\/p>\n<p dir=\"ltr\">All these performance methods help in measuring a model&#8217;s performance for a given problem, but how do we build the best models when we face a data imbalance issue?<\/p>\n<p dir=\"ltr\">For that, we need to apply different sampling methods to the data before building the models.<\/p>\n<p dir=\"ltr\">Let\u2019s see in the coming section how sampling methods improve model performance, and what AUC score the resulting model achieves.<\/p>\n<h2 id=\"t-1600823366418\" class=\"\">Model Improvement Using Sampling Techniques<\/h2>\n<p dir=\"ltr\">Data sampling is a statistical method for selecting data points (here, a data point is a single row) from the whole dataset. In machine learning problems, there are many sampling techniques available.<\/p>\n<p dir=\"ltr\">Here we take the <strong>undersampling<\/strong> and oversampling strategies for handling imbalanced data. 
\u00a0<\/p>\n<blockquote class=\"\"><p>\n<strong>What is this undersampling and oversampling?<\/strong>\n<\/p><\/blockquote>\n<p dir=\"ltr\">Let us take an example of a dataset that has nine samples.\u00a0<\/p>\n<ul class=\"\">\n<li class=\"\">Six samples belong to Class-0,<\/li>\n<li class=\"\">Three samples belong to Class-1<\/li>\n<\/ul>\n<p dir=\"ltr\"><strong>Oversampling<\/strong> = repeat the 3 Class-1 samples twice so that, like Class-0, Class-1 also has 6 samples (12 samples in total)<\/p>\n<p dir=\"ltr\"><strong>Undersampling<\/strong> = keep the 3 Class-1 samples and randomly pick 3 of the 6 Class-0 samples (6 samples in total)<\/p>\n<p dir=\"ltr\">What we are trying to do here is make the number of samples of both target classes equal.\u00a0<\/p>\n<p dir=\"ltr\">In the oversampling technique, <strong>samples are repeated<\/strong>, and the dataset size is <strong>larger<\/strong> than the original dataset.<\/p>\n<p dir=\"ltr\">In the undersampling technique, <strong>samples are not repeated<\/strong>, and the dataset size is <strong>less<\/strong> than the original dataset.<\/p>\n<h3 id=\"t-1600823366419\" class=\"\">Applying Sampling Techniques\u00a0<\/h3>\n<p dir=\"ltr\">For the undersampling technique, we check the number of samples in each class, take the smaller count, and randomly draw that many samples from the larger class to create a new dataset. 
\u00a0<\/p>\n<p dir=\"ltr\">The new dataset has an equal number of samples for both target classes.<\/p>\n<p dir=\"ltr\">This is the whole process of undersampling, and now we are going to implement this entire process using python.<\/p>\n<\/div>\n<div class=\"thrv_wrapper thrv_text_element\">\n<p>The above is the target class distribution; now let&#8217;s see how we can change it.<\/p>\n<\/div>\n<div class=\"thrv_wrapper thrv_text_element\">\n<p dir=\"ltr\">Here we first take the <strong>indexes<\/strong> of both classes and randomly choose as many Class-0 sample indexes as there are Class-1 samples.\u00a0<\/p>\n<p dir=\"ltr\">In the below code snippet, we combine both classes&#8217; indexes and then <strong>extract<\/strong> all features for the gathered indexes.<\/p>\n<\/div>\n<div class=\"thrv_wrapper thrv_text_element\">\n<p dir=\"ltr\">The above code first divides the features and targets into <strong>x_undersample_data<\/strong> and <strong>y_undersample_data<\/strong> and then splits the new undersampled data into train and test datasets.<\/p>\n<p dir=\"ltr\">Okay, now we will call both classifiers with these new undersampled train and test datasets.<\/p>\n<h3 id=\"t-1600823366420\" class=\"\">Decision tree classification after applying sampling techniques<\/h3>\n<\/div>\n<div class=\"thrv_wrapper thrv_text_element\">\n<p>Below are the model performance details.<\/p>\n<\/div>\n<h3 class=\"\" id=\"t-1600823366421\">Random Forest Classifier after applying the sampling techniques<\/h3>\n<div class=\"thrv_wrapper thrv_text_element\">\n<p>Below are the model performance details after applying the sampling techniques.<\/p>\n<\/div>\n<div class=\"thrv_wrapper thrv_text_element tve-froala fr-box fr-basic\">\n<p dir=\"ltr\">See, the F1-scores for both target values are 95%, and the Area Under ROC curve is near 1.\u00a0<\/p>\n<p dir=\"ltr\">For the best models, the AUROC value is near 1. 
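Since the post's code listings did not survive, the undersampling steps described above can be sketched end-to-end as follows. The synthetic DataFrame stands in for the real credit card dataset (an assumption; the original uses the Kaggle data with a `Class` column), while `x_undersample_data` and `y_undersample_data` follow the names in the text:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for the credit card DataFrame: ~1% of rows are Class-1
X, y = make_classification(n_samples=20000, weights=[0.99, 0.01], random_state=42)
data = pd.DataFrame(X)
data["Class"] = y

# 1. Take the indexes of both classes
fraud_idx = data[data["Class"] == 1].index
normal_idx = data[data["Class"] == 0].index

# 2. Randomly choose as many Class-0 indexes as there are Class-1 samples
rng = np.random.default_rng(42)
normal_sample_idx = rng.choice(normal_idx, size=len(fraud_idx), replace=False)

# 3. Combine both sets of indexes and extract the corresponding rows
undersample = data.loc[np.concatenate([fraud_idx, normal_sample_idx])]
x_undersample_data = undersample.drop("Class", axis=1)
y_undersample_data = undersample["Class"]

# 4. Split the now-balanced data and retrain, then check the AUC
X_tr, X_te, y_tr, y_te = train_test_split(
    x_undersample_data, y_undersample_data, test_size=0.3, random_state=42)
model = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
print("AUC:", roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))
```

After step 3, both classes contribute the same number of rows, so the retrained model can no longer score well by simply predicting the majority class.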
Here we implemented the undersampling technique; you can apply oversampling in a similar way.<\/p>\n<h2 id=\"t-1600823366422\" class=\"\">Conclusion<\/h2>\n<p dir=\"ltr\">Finally, our model gives an Area Under the ROC curve value of 94%. We can improve the model results by adding more trees, applying additional data preprocessing techniques, etc.\u00a0<\/p>\n<p dir=\"ltr\">Decision trees and random forests are not the only classifiers suitable for this problem. You can try other machine learning classification algorithms, such as Support Vector Machines (SVM), k-nearest neighbors, etc., to check how different algorithms perform at classifying fraudulent activities.<\/p>\n<h2 id=\"t-1600823366423\" class=\"\">What next<\/h2>\n<p dir=\"ltr\">Try different classification algorithms to solve the same problem and compare the F1 scores of all the models. For <strong>implementation<\/strong>, you can have a look at the code snippets in the below articles.<\/p>\n<\/div>\n<h4 class=\"\">Recommended Courses<\/h4>\n<div class=\"thrv_wrapper thrv-page-section thrv-lp-block\" data-inherit-lp-settings=\"1\" data-css=\"tve-u-174b6afe9ba\" data-keep-css_id=\"1\">\n<div class=\"tve-page-section-in tve_empty_dropzone  \" data-css=\"tve-u-17481b960b8\">\n<div class=\"thrv_wrapper thrv-columns dynamic-group-kbt3q0q7\" data-css=\"tve-u-17481b95e2b\">\n<div class=\"tcb-flex-row v-2 tcb--cols--3 tcb-medium-no-wrap tcb-mobile-wrap m-edit\" data-css=\"tve-u-174b6afe9bb\">\n<div class=\"tcb-flex-col\">\n<div class=\"tcb-col dynamic-group-kbt3pyfd\" data-css=\"tve-u-17481b95e2d\">\n<div class=\"thrv_wrapper thrv_contentbox_shortcode thrv-content-box tve-elem-default-pad dynamic-group-kbt3pwhk\" data-css=\"tve-u-174b6afe9d1\">\n<div class=\"tve-cb\">\n<div class=\"thrv_wrapper tve_image_caption dynamic-group-kbt3pu4z\" data-css=\"tve-u-174b6afe9d5\"><span class=\"tve_image_frame\"><img 
src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/plugins\/lazy-load\/images\/1x1.trans.gif?ssl=1\" data-lazy-src=\"https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/09\/credit-risk-modeling.jpg?resize=176%2C176&amp;ssl=1\" class=\"tve_image wp-image-6246\" alt=\"credit risk modeling\" data-id=\"6246\" width=\"176\" data-init-width=\"150\" height=\"176\" data-init-height=\"150\" title=\"credit risk modeling\" loading=\"lazy\" data-width=\"176\" data-height=\"176\" data-css=\"tve-u-174b6afe9d6\" data-recalc-dims=\"1\"><img class=\"tve_image wp-image-6246\" alt=\"credit risk modeling\" data-id=\"6246\" width=\"176\" data-init-width=\"150\" height=\"176\" data-init-height=\"150\" title=\"credit risk modeling\" loading=\"lazy\" src=\"https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/09\/credit-risk-modeling.jpg?resize=176%2C176&amp;ssl=1\" data-width=\"176\" data-height=\"176\" data-css=\"tve-u-174b6afe9d6\" data-recalc-dims=\"1\"><br \/>\n<span class=\"tve-image-overlay\"><\/span><\/span><\/div>\n<h4 class=\"\" data-css=\"tve-u-174b6afe9bd\">Credit Risk modelling in Python<\/h4>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"tcb-flex-col\">\n<div class=\"tcb-col dynamic-group-kbt3pyfd\" data-css=\"tve-u-17481b95e2d\">\n<div class=\"thrv_wrapper thrv_contentbox_shortcode thrv-content-box tve-elem-default-pad dynamic-group-kbt3pwhk\" data-css=\"tve-u-174b6afe9d2\">\n<div class=\"tve-cb\">\n<div class=\"thrv_wrapper tve_image_caption dynamic-group-kbt3pu4z\" data-css=\"tve-u-174b6afe9e1\"><span class=\"tve_image_frame\"><img src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/plugins\/lazy-load\/images\/1x1.trans.gif?ssl=1\" data-lazy-src=\"https:\/\/i0.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/09\/data-science-bootcamp.jpg?resize=176%2C176&amp;ssl=1\" class=\"tve_image wp-image-6247\" alt=\"data science bootcamp\" data-id=\"6247\" width=\"176\" data-init-width=\"150\" height=\"176\" data-init-height=\"150\" 
title=\"data science bootcamp\" loading=\"lazy\" data-width=\"176\" data-height=\"176\" data-css=\"tve-u-174b6afe9e2\" data-recalc-dims=\"1\"><img class=\"tve_image wp-image-6247\" alt=\"data science bootcamp\" data-id=\"6247\" width=\"176\" data-init-width=\"150\" height=\"176\" data-init-height=\"150\" title=\"data science bootcamp\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/09\/data-science-bootcamp.jpg?resize=176%2C176&amp;ssl=1\" data-width=\"176\" data-height=\"176\" data-css=\"tve-u-174b6afe9e2\" data-recalc-dims=\"1\"><br \/>\n<span class=\"tve-image-overlay\"><\/span><\/span><\/div>\n<h4 class=\"\" data-css=\"tve-u-174b6afe9c4\">Data Science Bootcamp Course<\/h4>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"tcb-flex-col\">\n<div class=\"tcb-col dynamic-group-kbt3pyfd\" data-css=\"tve-u-17481b95e2d\">\n<div class=\"thrv_wrapper thrv_contentbox_shortcode thrv-content-box tve-elem-default-pad dynamic-group-kbt3pwhk\" data-css=\"tve-u-174b6afe9d3\">\n<div class=\"tve-cb\">\n<div class=\"thrv_wrapper tve_image_caption dynamic-group-kbt3pu4z\" data-css=\"tve-u-174b6afe9e3\"><span class=\"tve_image_frame\"><img src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/plugins\/lazy-load\/images\/1x1.trans.gif?ssl=1\" data-lazy-src=\"https:\/\/i0.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/08\/machine-learning-1.jpg?resize=176%2C176&amp;ssl=1\" class=\"tve_image wp-image-4302\" alt=\"Machine learning\" data-id=\"4302\" width=\"176\" data-init-width=\"150\" height=\"176\" data-init-height=\"150\" title=\"machine learning\" loading=\"lazy\" data-width=\"176\" data-height=\"176\" data-css=\"tve-u-174b6afe9e4\" data-recalc-dims=\"1\"><img class=\"tve_image wp-image-4302\" alt=\"Machine learning\" data-id=\"4302\" width=\"176\" data-init-width=\"150\" height=\"176\" data-init-height=\"150\" title=\"machine learning\" loading=\"lazy\" 
src=\"https:\/\/i0.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/08\/machine-learning-1.jpg?resize=176%2C176&amp;ssl=1\" data-width=\"176\" data-height=\"176\" data-css=\"tve-u-174b6afe9e4\" data-recalc-dims=\"1\"><br \/>\n<span class=\"tve-image-overlay\"><\/span><\/span><\/div>\n<h4 class=\"\" data-css=\"tve-u-174b6afe9cb\">Machine Learning A to Z Course<\/h4>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/dataaspirant.com\/credit-card-fraud-detection-classification-algorithms-python\/<\/p>\n","protected":false},"author":0,"featured_media":1826,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[2],"tags":[],"_links":{"self":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts\/1825"}],"collection":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/comments?post=1825"}],"version-history":[{"count":0,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts\/1825\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/media\/1826"}],"wp:attachment":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/media?parent=1825"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/categories?post=1825"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/tags?post=1825"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}