{"id":258,"date":"2020-08-10T18:11:12","date_gmt":"2020-08-10T18:11:12","guid":{"rendered":"https:\/\/data-science.gotoauthority.com\/2020\/08\/10\/best-ways-to-handle-imbalanced-data-in-machine-learning\/"},"modified":"2020-08-10T18:11:12","modified_gmt":"2020-08-10T18:11:12","slug":"best-ways-to-handle-imbalanced-data-in-machine-learning","status":"publish","type":"post","link":"https:\/\/wealthrevelation.com\/data-science\/2020\/08\/10\/best-ways-to-handle-imbalanced-data-in-machine-learning\/","title":{"rendered":"Best Ways To Handle Imbalanced Data In Machine Learning"},"content":{"rendered":"<div id=\"tve_editor\" data-post-id=\"4519\">\n<div class=\"thrv_wrapper tve_image_caption\" data-css=\"tve-u-173d277f349\"><span class=\"tve_image_frame\"><img loading=\"lazy\" src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/plugins\/lazy-load\/images\/1x1.trans.gif?ssl=1\" data-lazy-src=\"https:\/\/i0.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/08\/1-handle-imbalanced-datasets.png?resize=613%2C394&amp;ssl=1\" class=\"tve_image wp-image-4524\" alt=\"handle imbalanced data\" data-id=\"4524\" width=\"613\" data-init-width=\"700\" height=\"394\" data-init-height=\"450\" title=\"handle imbalanced datasets\" data-width=\"613\" data-height=\"394\" data-recalc-dims=\"1\"><img loading=\"lazy\" class=\"tve_image wp-image-4524\" alt=\"handle imbalanced data\" data-id=\"4524\" width=\"613\" data-init-width=\"700\" height=\"394\" data-init-height=\"450\" title=\"handle imbalanced datasets\" src=\"https:\/\/i0.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/08\/1-handle-imbalanced-datasets.png?resize=613%2C394&amp;ssl=1\" data-width=\"613\" data-height=\"394\" data-recalc-dims=\"1\"><\/span><\/div>\n<div class=\"thrv_wrapper thrv_text_element tve-froala fr-box\">\n<p dir=\"ltr\">When dealing with any <a href=\"https:\/\/dataaspirant.com\/classification-and-prediction\/\" target=\"_blank\" class=\"tve-froala\" rel=\"noopener noreferrer\">classification<\/a> 
problem, the target classes are not always evenly distributed. You will often get data that is heavily imbalanced, i.e., the classes are <strong>not equal<\/strong> in size. In machine learning, we call this the class imbalance problem.<\/p>\n<p dir=\"ltr\"><a href=\"https:\/\/dataaspirant.com\/gaussian-naive-bayes-classifier-implementation-python\/\" target=\"_blank\" rel=\"noopener noreferrer\">Building models<\/a> on balanced target data is more comfortable than handling imbalanced data; classification algorithms also find it <strong>easier<\/strong> to learn from properly balanced data.<\/p>\n<p dir=\"ltr\">But real-world data is rarely this convenient. We need to handle <strong>unstructured<\/strong> data, and we need to <strong>handle imbalanced<\/strong> data.<\/p>\n<p dir=\"ltr\">So as a data scientist or analyst, you need to know how to deal with class imbalance.<\/p>\n<p dir=\"ltr\">In this article, we explain how to deal with this situation. There are various techniques for handling imbalanced data. 
Let&#8217;s learn about them in detail, along with their implementation in Python.<\/p>\n<\/div>\n<div class=\"thrv_wrapper thrv_text_element tve-froala fr-box fr-basic\">\n<h2 id=\"t-1596963627844\" class=\"\">What is class imbalance in machine learning?<\/h2>\n<p>In machine learning, class imbalance is a problem with the <strong>target class<\/strong> distribution. If the target classes are not equally distributed, i.e., they do not occur in roughly equal ratios, we say the data has a class imbalance problem. We will explain shortly why this is a problem.<\/p>\n<h3 class=\"\" id=\"t-1596963627845\">Examples of balanced and imbalanced datasets<\/h3>\n<p dir=\"ltr\">Here are some examples of balanced and imbalanced datasets, which help in understanding class imbalance.<\/p>\n<h4 class=\"\">Balanced datasets:<\/h4>\n<ul class=\"\">\n<li>Random coin tosses<\/li>\n<li>Classifying images as cat or dog<\/li>\n<li>Sentiment analysis of movie reviews<\/li>\n<\/ul>\n<p dir=\"ltr\">In the examples above, the target class distribution is nearly equal.<\/p>\n<p dir=\"ltr\">For example, in random coin tosses, even though some researchers say the probability of getting heads is higher than that of tails, the <strong>distribution<\/strong> of heads and tails is still nearly equal. 
It is the same with the movie review case.<\/p>\n<h4 class=\"\">Class imbalanced datasets:<\/h4>\n<ul class=\"\">\n<li class=\" dir=\">Email spam or ham dataset<\/li>\n<li class=\" dir=\">Credit card fraud detection<\/li>\n<li class=\" dir=\">Machine component failure detection<\/li>\n<li class=\" dir=\">Network failure detection<\/li>\n<\/ul>\n<p dir=\"ltr\">In an imbalanced dataset, by contrast, the target distribution is not equal. Email spam versus ham is a typical case.<\/p>\n<p dir=\"ltr\">Just imagine how many emails we receive every day and how many of them are classified as spam. Google uses its <a href=\"https:\/\/dataaspirant.com\/build-email-spam-classification-model-spacy-python\/\" target=\"_blank\" rel=\"noopener noreferrer\">email classifier<\/a> to do that.<\/p>\n<p dir=\"ltr\">Roughly speaking, out of every 10 emails we receive, one goes to the spam folder and the others go to the inbox. Here the ham to spam ratio is <strong>9:1<\/strong>. In credit card fraud detection, the minority share is even smaller, with a ratio like <strong>95:5<\/strong>.<\/p>\n<p dir=\"ltr\">By now, we are clear about what imbalanced data is. Now, let\u2019s learn why we need to balance the data, in other words, why imbalanced data needs special handling.<\/p>\n<h2 class=\"\" id=\"t-1596963627846\">Why do we have to balance the data?<\/h2>\n<p dir=\"ltr\">The answer is quite simple: to make our predictions more <strong>accurate<\/strong>.<\/p>\n<p dir=\"ltr\">If we have imbalanced data, the model is biased towards the dominant target class and tends to predict that class.<\/p>\n<p dir=\"ltr\">Say that in credit fraud detection, out of 100 credit applications, only 5 fall into the fraud category. Any machine learning model will then be tempted to predict against the fraud class. 
This means the model predicts that the credit applicant is not a fraud. A model that always predicts \u201cnot fraud\u201d is still right for 95 of the 100 applications, yet it never catches a single fraud.<\/p>\n<p dir=\"ltr\">It is understandable that the trained model favours the dominant class: while learning, every machine learning model tries to <strong>reduce the error<\/strong>, and the minority class contributes <strong>very few<\/strong> samples to that error. The model therefore puts little weight on reducing errors for the minority class and concentrates on making fewer errors on the majority class.<\/p>\n<p dir=\"ltr\">So to handle these kinds of issues, we need to balance the data before building the models.<\/p>\n<h2 class=\"\" id=\"t-1596963627847\">How to deal with imbalanced data<\/h2>\n<p dir=\"ltr\">To deal with imbalanced data, we need to convert the <strong>imbalanced data into balanced<\/strong> data in a meaningful way. Then we build the <a href=\"https:\/\/dataaspirant.com\/decision-tree-algorithm-python-with-scikit-learn\/\" target=\"_blank\" class=\"tve-froala\" rel=\"noopener noreferrer\">machine learning model<\/a> on the balanced dataset.<\/p>\n<p dir=\"ltr\">In the later sections of this article, we will learn about different techniques for handling imbalanced data.<\/p>\n<p dir=\"ltr\">Before that, we build a machine learning model on the imbalanced data. Later, we will apply the different balancing techniques.<\/p>\n<p dir=\"ltr\">So let\u2019s get started.<\/p>\n<h2 class=\"\" id=\"t-1596963627848\">Model on imbalanced data<\/h2>\n<h4 class=\"\">About the dataset<\/h4>\n<p dir=\"ltr\">This dataset comes from Kaggle, where it is available for download.<\/p>\n<p dir=\"ltr\">The dataset contains one set of <strong>5,574<\/strong> <strong>SMS<\/strong> messages in English, each tagged as ham (legitimate) or spam.<\/p>\n<p dir=\"ltr\">The files contain one message per line. 
Each line is composed of two columns: v1 contains the label (ham or spam), and v2 contains the raw text.<\/p>\n<p dir=\"ltr\">The main task is to build a prediction model that accurately classifies which texts are <strong>spam<\/strong>.<\/p>\n<\/div>\n<div class=\"thrv_wrapper tve_image_caption\" data-css=\"tve-u-173d27eb838\"><span class=\"tve_image_frame\"><img loading=\"lazy\" src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/plugins\/lazy-load\/images\/1x1.trans.gif?ssl=1\" data-lazy-src=\"https:\/\/i0.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/08\/2-load-dataset.png?resize=613%2C519&amp;ssl=1\" class=\"tve_image wp-image-4535\" alt=\"load dataset\" data-id=\"4535\" width=\"613\" data-init-width=\"1024\" height=\"519\" data-init-height=\"867\" title=\"load dataset\" data-width=\"613\" data-height=\"519\" data-recalc-dims=\"1\"><img loading=\"lazy\" class=\"tve_image wp-image-4535\" alt=\"load dataset\" data-id=\"4535\" width=\"613\" data-init-width=\"1024\" height=\"519\" data-init-height=\"867\" title=\"load dataset\" src=\"https:\/\/i0.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/08\/2-load-dataset.png?resize=613%2C519&amp;ssl=1\" data-width=\"613\" data-height=\"519\" data-recalc-dims=\"1\"><\/span><\/div>\n<div class=\"thrv_wrapper thrv_text_element\">\n<p dir=\"ltr\">Let\u2019s have a look at the loaded data fields.<\/p>\n<\/div>\n<div class=\"thrv_wrapper tve_image_caption\" data-css=\"tve-u-173d27f7fa1\"><span class=\"tve_image_frame\"><img loading=\"lazy\" src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/plugins\/lazy-load\/images\/1x1.trans.gif?ssl=1\" data-lazy-src=\"https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/08\/3-dataset-fields.png?resize=613%2C160&amp;ssl=1\" class=\"tve_image wp-image-4537\" alt=\"dataset fields\" data-id=\"4537\" width=\"613\" data-init-width=\"2246\" height=\"160\" data-init-height=\"588\" title=\"dataset fields\" data-width=\"613\" data-height=\"160\" 
data-recalc-dims=\"1\"><img loading=\"lazy\" class=\"tve_image wp-image-4537\" alt=\"dataset fields\" data-id=\"4537\" width=\"613\" data-init-width=\"2246\" height=\"160\" data-init-height=\"588\" title=\"dataset fields\" src=\"https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/08\/3-dataset-fields.png?resize=613%2C160&amp;ssl=1\" data-width=\"613\" data-height=\"160\" data-recalc-dims=\"1\"><\/span><\/div>\n<div class=\"thrv_wrapper thrv_text_element\">\n<p dir=\"ltr\">The target variable v1 holds the ham or spam label, and v2 holds the actual SMS text. The file also has some unnecessary fields, which we <strong>remove<\/strong> with the code below.<\/p>\n<\/div>\n<div class=\"thrv_wrapper tve_image_caption\" data-css=\"tve-u-173d281a9dc\"><span class=\"tve_image_frame\"><img loading=\"lazy\" src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/plugins\/lazy-load\/images\/1x1.trans.gif?ssl=1\" data-lazy-src=\"https:\/\/i0.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/08\/4-drop-columns.png?resize=613%2C258&amp;ssl=1\" class=\"tve_image wp-image-4541\" alt=\"drop columns\" data-id=\"4541\" width=\"613\" data-init-width=\"1524\" height=\"258\" data-init-height=\"642\" title=\"drop columns\" data-width=\"613\" data-height=\"258\" data-recalc-dims=\"1\"><img loading=\"lazy\" class=\"tve_image wp-image-4541\" alt=\"drop columns\" data-id=\"4541\" width=\"613\" data-init-width=\"1524\" height=\"258\" data-init-height=\"642\" title=\"drop columns\" src=\"https:\/\/i0.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/08\/4-drop-columns.png?resize=613%2C258&amp;ssl=1\" data-width=\"613\" data-height=\"258\" data-recalc-dims=\"1\"><\/span><\/div>\n<div class=\"thrv_wrapper thrv_text_element\">\n<p dir=\"ltr\">We renamed the loaded data fields as follows:<\/p>\n<\/div>\n<div class=\"thrv_wrapper tve_image_caption\" data-css=\"tve-u-173d2830159\"><span class=\"tve_image_frame\"><img loading=\"lazy\" 
src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/plugins\/lazy-load\/images\/1x1.trans.gif?ssl=1\" data-lazy-src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/08\/5-clean-data.png?resize=613%2C131&amp;ssl=1\" class=\"tve_image wp-image-4544\" alt=\"clean data\" data-id=\"4544\" width=\"613\" data-init-width=\"2142\" height=\"131\" data-init-height=\"458\" title=\"clean data\" data-width=\"613\" data-height=\"131\" data-recalc-dims=\"1\"><img loading=\"lazy\" class=\"tve_image wp-image-4544\" alt=\"clean data\" data-id=\"4544\" width=\"613\" data-init-width=\"2142\" height=\"131\" data-init-height=\"458\" title=\"clean data\" src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/08\/5-clean-data.png?resize=613%2C131&amp;ssl=1\" data-width=\"613\" data-height=\"131\" data-recalc-dims=\"1\"><\/span><\/div>\n<div class=\"thrv_wrapper thrv_text_element\">\n<h4 class=\"\">Data ratio<\/h4>\n<p>Using the seaborn countplot, let&#8217;s visualize the ratio of ham to spam targets.<\/p>\n<\/div>\n<div class=\"thrv_wrapper tve_image_caption\" data-css=\"tve-u-173d283bcfc\"><span class=\"tve_image_frame\"><img loading=\"lazy\" src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/plugins\/lazy-load\/images\/1x1.trans.gif?ssl=1\" data-lazy-src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/08\/6-data-ratio.png?resize=613%2C212&amp;ssl=1\" class=\"tve_image wp-image-4546\" alt=\"data ratio\" data-id=\"4546\" width=\"613\" data-init-width=\"1996\" height=\"212\" data-init-height=\"690\" title=\"data ratio\" data-width=\"613\" data-height=\"212\" data-recalc-dims=\"1\"><img loading=\"lazy\" class=\"tve_image wp-image-4546\" alt=\"data ratio\" data-id=\"4546\" width=\"613\" data-init-width=\"1996\" height=\"212\" data-init-height=\"690\" title=\"data ratio\" src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/08\/6-data-ratio.png?resize=613%2C212&amp;ssl=1\" data-width=\"613\" 
data-height=\"212\" data-recalc-dims=\"1\"><\/span><\/div>\n<div class=\"thrv_wrapper thrv_text_element\">\n<ul class=\"\">\n<li class=\"\">Ham messages: 87%<\/li>\n<li class=\"\">Spam messages: 13%<\/li>\n<\/ul>\n<p dir=\"ltr\">We can clearly see that the data is imbalanced. Before creating a model, we need to preprocess the data.<\/p>\n<h3 class=\"\" id=\"t-1596963627849\">Data Preprocessing<\/h3>\n<p dir=\"ltr\">When we are dealing with text data, we first need to preprocess the text and then convert it into vectors.<\/p>\n<\/div>\n<div class=\"thrv_wrapper tve_image_caption\" data-css=\"tve-u-173d28638d8\"><span class=\"tve_image_frame\"><img loading=\"lazy\" src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/plugins\/lazy-load\/images\/1x1.trans.gif?ssl=1\" data-lazy-src=\"https:\/\/i0.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/08\/7-data-preprocessing.png?resize=613%2C671&amp;ssl=1\" class=\"tve_image wp-image-4550\" alt=\"data preprocessing\" data-id=\"4550\" width=\"613\" data-init-width=\"1524\" height=\"671\" data-init-height=\"1668\" title=\"data preprocessing\" data-width=\"613\" data-height=\"671\" data-recalc-dims=\"1\"><img loading=\"lazy\" class=\"tve_image wp-image-4550\" alt=\"data preprocessing\" data-id=\"4550\" width=\"613\" data-init-width=\"1524\" height=\"671\" data-init-height=\"1668\" title=\"data preprocessing\" src=\"https:\/\/i0.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/08\/7-data-preprocessing.png?resize=613%2C671&amp;ssl=1\" data-width=\"613\" data-height=\"671\" data-recalc-dims=\"1\"><\/span><\/div>\n<div class=\"thrv_wrapper thrv_text_element\">\n<ul class=\"\">\n<li>\n<p dir=\"ltr\">Stemming removes the suffix from a word and reduces it to its root form. First, apply stemming to the text to convert each word into its root.<\/p>\n<\/li>\n<li>\n<p dir=\"ltr\">Text usually comes mixed with a lot of special characters, numbers, etc. 
We need to remove this unwanted text from the data. Use regular expressions to replace all the unnecessary characters with spaces.<\/p>\n<\/li>\n<li>\n<p dir=\"ltr\">Convert all the text to lowercase to avoid getting different vectors for the same word, e.g., and, And &#8594; and.<\/p>\n<\/li>\n<li>\n<p dir=\"ltr\">Remove stop words: \u201cstop words\u201d typically refers to the most common words in a language, e.g., he, is, at. We need to filter them out:<\/p>\n<\/li>\n<\/ul>\n<ul class=\"\">\n<li>\n<p dir=\"ltr\">Split the sentence into words<\/p>\n<\/li>\n<li>\n<p dir=\"ltr\">Keep the words that are not stop words<\/p>\n<\/li>\n<li>\n<p dir=\"ltr\">Join them back into a sentence<\/p>\n<\/li>\n<\/ul>\n<ul class=\"\">\n<li>\n<p dir=\"ltr\">Append the cleaned text to a list (corpus)<\/p>\n<\/li>\n<li>\n<p dir=\"ltr\">Now that the text is ready, convert it into vectors using CountVectorizer<\/p>\n<\/li>\n<li>\n<p dir=\"ltr\">Convert the target label into a categorical variable<\/p>\n<\/li>\n<\/ul>\n<h3 class=\"\" id=\"t-1596963627850\">Model Creation<\/h3>\n<p>First, we create the model on the imbalanced data; afterwards, we will try different balancing techniques.<\/p>\n<\/div>\n<div class=\"thrv_wrapper tve_image_caption\" data-css=\"tve-u-173d287a07b\"><span class=\"tve_image_frame\"><img loading=\"lazy\" src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/plugins\/lazy-load\/images\/1x1.trans.gif?ssl=1\" data-lazy-src=\"https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/08\/8-model-building.png?resize=613%2C323&amp;ssl=1\" class=\"tve_image wp-image-4555\" alt=\"model building\" data-id=\"4555\" width=\"613\" data-init-width=\"1524\" height=\"323\" data-init-height=\"804\" title=\"model building\" data-width=\"613\" data-height=\"323\" data-recalc-dims=\"1\"><img loading=\"lazy\" class=\"tve_image wp-image-4555\" alt=\"model building\" data-id=\"4555\" width=\"613\" data-init-width=\"1524\" 
height=\"323\" data-init-height=\"804\" title=\"model building\" src=\"https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/08\/8-model-building.png?resize=613%2C323&amp;ssl=1\" data-width=\"613\" data-height=\"323\" data-recalc-dims=\"1\"><\/span><\/div>\n<div class=\"thrv_wrapper thrv_text_element\">\n<p>Let us check the accuracy of the model.<\/p>\n<\/div>\n<div class=\"thrv_wrapper tve_image_caption\" data-css=\"tve-u-173d288406b\"><span class=\"tve_image_frame\"><img loading=\"lazy\" src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/plugins\/lazy-load\/images\/1x1.trans.gif?ssl=1\" data-lazy-src=\"https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/08\/9-accuracy-of-model.png?resize=613%2C85&amp;ssl=1\" class=\"tve_image wp-image-4558\" alt=\"accuracy of model\" data-id=\"4558\" width=\"613\" data-init-width=\"1994\" height=\"85\" data-init-height=\"278\" title=\"accuracy of model\" data-width=\"613\" data-height=\"85\" data-recalc-dims=\"1\"><img loading=\"lazy\" class=\"tve_image wp-image-4558\" alt=\"accuracy of model\" data-id=\"4558\" width=\"613\" data-init-width=\"1994\" height=\"85\" data-init-height=\"278\" title=\"accuracy of model\" src=\"https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/08\/9-accuracy-of-model.png?resize=613%2C85&amp;ssl=1\" data-width=\"613\" data-height=\"85\" data-recalc-dims=\"1\"><\/span><\/div>\n<div class=\"thrv_wrapper thrv_text_element\">\n<p dir=\"ltr\">We got an accuracy of <strong>0.98<\/strong>, but this number is misleading: with ham making up 87% of the data, accuracy is dominated by the majority class.<\/p>\n<p dir=\"ltr\">In the next section, we will learn how to handle imbalanced data with different balancing techniques.<\/p>\n<h2 class=\"\" id=\"t-1596963627851\">Techniques for handling imbalanced data<\/h2>\n<p>There are many ways to handle imbalanced data. The main families of techniques are:<\/p>\n<ol class=\"\">\n<li class=\" dir=\">Oversampling<\/li>\n<li 
class=\" dir=\">Undersampling<\/li>\n<li class=\" dir=\">Ensemble Techniques<\/li>\n<\/ol>\n<p>In this article, we will focus only on the <strong>first 2 methods<\/strong> for handling imbalanced data, along with their code implementation.<\/p>\n<h3 class=\"\" id=\"t-1596963627852\">OverSampling<\/h3>\n<\/div>\n<div class=\"thrv_wrapper tve_image_caption\" data-css=\"tve-u-173d28a993c\"><span class=\"tve_image_frame\"><img loading=\"lazy\" src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/plugins\/lazy-load\/images\/1x1.trans.gif?ssl=1\" data-lazy-src=\"https:\/\/i0.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/08\/10-oversampling.png?resize=613%2C324&amp;ssl=1\" class=\"tve_image wp-image-4564\" alt=\"oversampling\" data-id=\"4564\" width=\"613\" data-init-width=\"1862\" height=\"324\" data-init-height=\"986\" title=\"oversampling\" data-width=\"613\" data-height=\"324\" data-recalc-dims=\"1\"><img loading=\"lazy\" class=\"tve_image wp-image-4564\" alt=\"oversampling\" data-id=\"4564\" width=\"613\" data-init-width=\"1862\" height=\"324\" data-init-height=\"986\" title=\"oversampling\" src=\"https:\/\/i0.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/08\/10-oversampling.png?resize=613%2C324&amp;ssl=1\" data-width=\"613\" data-height=\"324\" data-recalc-dims=\"1\"><\/span><\/div>\n<div class=\"thrv_wrapper thrv_text_element\">\n<p dir=\"ltr\">In oversampling, we increase the number of samples in the minority class to match the number of samples in the majority class.<\/p>\n<p dir=\"ltr\">In simple terms, you take the <strong>minority class<\/strong> and create new samples until it matches the size of the <strong>majority class<\/strong>.<\/p>\n<p dir=\"ltr\">Let\u2019s look at a concrete example.<\/p>\n<p dir=\"ltr\">Suppose we have data with 100 samples labeled 0 and 900 samples labeled 1. Here 0 is the minority class, and the majority to minority ratio is 9:1. Oversampling adds 800 new minority samples, so the minority class grows to 9 times its original size.<\/p>\n<p dir=\"ltr\"><strong>Mathematically:<\/strong><\/p>\n<p dir=\"ltr\">Label 1: 900 data points<\/p>\n<p dir=\"ltr\">Label 0: 100 data points + 800 new points = 900 data points<\/p>\n<p dir=\"ltr\">Now the data ratio is 1:1:<\/p>\n<p dir=\"ltr\">Label 1: 900 data points<\/p>\n<p dir=\"ltr\">Label 0: 900 data points<\/p>\n<h3 class=\"\" id=\"t-1596963627853\">Oversampling Implementation<\/h3>\n<p dir=\"ltr\">We can implement oversampling in two ways:<\/p>\n<ol class=\"\">\n<li>RandomOverSampler method<\/li>\n<li>SMOTETomek method<\/li>\n<\/ol>\n<p dir=\"ltr\">First, we have to install the imbalanced-learn library (imported as imblearn). To install it, enter the command below in a terminal.<\/p>\n<p dir=\"ltr\">Command: <strong>pip install imbalanced-learn<\/strong><\/p>\n<h4 class=\"\">RandomOverSampler<\/h4>\n<p dir=\"ltr\">It is the simplest method of oversampling: it randomly samples the minority classes and simply duplicates the sampled observations.<\/p>\n<h4 class=\"\">RandomOverSampler implementation in Python<\/h4>\n<\/div>\n<div class=\"thrv_wrapper tve_image_caption\" data-css=\"tve-u-173d28cc59a\"><span class=\"tve_image_frame\"><img loading=\"lazy\" src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/plugins\/lazy-load\/images\/1x1.trans.gif?ssl=1\" data-lazy-src=\"https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/08\/11-random-over-sampler.png?resize=613%2C278&amp;ssl=1\" class=\"tve_image wp-image-4571\" alt=\"random over sampler\" data-id=\"4571\" width=\"613\" data-init-width=\"1296\" height=\"278\" data-init-height=\"588\" title=\"random over sampler\" data-width=\"613\" data-height=\"278\" 
data-recalc-dims=\"1\"><img loading=\"lazy\" class=\"tve_image wp-image-4571\" alt=\"random over sampler\" data-id=\"4571\" width=\"613\" data-init-width=\"1296\" height=\"278\" data-init-height=\"588\" title=\"random over sampler\" src=\"https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/08\/11-random-over-sampler.png?resize=613%2C278&amp;ssl=1\" data-width=\"613\" data-height=\"278\" data-recalc-dims=\"1\"><\/span><\/div>\n<div class=\"thrv_wrapper thrv_text_element\">\n<p>Here,<\/p>\n<ul class=\"\">\n<li class=\"\">\n<strong>x<\/strong> is the set of independent features<\/li>\n<li class=\"\">\n<strong>y<\/strong> is the dependent feature (the target)<\/li>\n<\/ul>\n<p dir=\"ltr\">If you want to check the sample counts before and after oversampling, run the code below.<\/p>\n<\/div>\n<div class=\"thrv_wrapper tve_image_caption\" data-css=\"tve-u-173d28e1761\"><span class=\"tve_image_frame\"><img loading=\"lazy\" src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/plugins\/lazy-load\/images\/1x1.trans.gif?ssl=1\" data-lazy-src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/08\/12-random-over-sampler-output.png?resize=613%2C74&amp;ssl=1\" class=\"tve_image wp-image-4575\" alt=\"random over sampler output\" data-id=\"4575\" width=\"613\" data-init-width=\"1994\" height=\"74\" data-init-height=\"242\" title=\"random over sampler output\" data-width=\"613\" data-height=\"74\" data-recalc-dims=\"1\"><img loading=\"lazy\" class=\"tve_image wp-image-4575\" alt=\"random over sampler output\" data-id=\"4575\" width=\"613\" data-init-width=\"1994\" height=\"74\" data-init-height=\"242\" title=\"random over sampler output\" src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/08\/12-random-over-sampler-output.png?resize=613%2C74&amp;ssl=1\" data-width=\"613\" data-height=\"74\" data-recalc-dims=\"1\"><\/span><\/div>\n<div class=\"thrv_wrapper thrv_text_element\">\n<h4 class=\"\" 
id=\"t-1596963627854\">SMOTETomek<\/h4>\n<p dir=\"ltr\">Synthetic Minority Over-sampling Technique (<strong>SMOTE<\/strong>) is a technique that generates new observations by interpolating between observations in the existing data. SMOTETomek combines SMOTE with a cleaning step that removes samples forming Tomek links (pairs of nearest neighbours that belong to opposite classes).<\/p>\n<p dir=\"ltr\">In simple terms, it is a technique used to generate new data points for the minority classes based on existing data.<\/p>\n<h4 class=\"\">SMOTETomek implementation in Python<\/h4>\n<\/div>\n<div class=\"thrv_wrapper tve_image_caption\" data-css=\"tve-u-173d28f031d\"><span class=\"tve_image_frame\"><img loading=\"lazy\" src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/plugins\/lazy-load\/images\/1x1.trans.gif?ssl=1\" data-lazy-src=\"https:\/\/i0.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/08\/13-smotetomek-code.png?resize=613%2C340&amp;ssl=1\" class=\"tve_image wp-image-4578\" alt=\"smotetomek code\" data-id=\"4578\" width=\"613\" data-init-width=\"1060\" height=\"340\" data-init-height=\"588\" title=\"smotetomek code\" data-width=\"613\" data-height=\"340\" data-recalc-dims=\"1\"><img loading=\"lazy\" class=\"tve_image wp-image-4578\" alt=\"smotetomek code\" data-id=\"4578\" width=\"613\" data-init-width=\"1060\" height=\"340\" data-init-height=\"588\" title=\"smotetomek code\" src=\"https:\/\/i0.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/08\/13-smotetomek-code.png?resize=613%2C340&amp;ssl=1\" data-width=\"613\" data-height=\"340\" data-recalc-dims=\"1\"><\/span><\/div>\n<div class=\"thrv_wrapper thrv_text_element\">\n<p dir=\"ltr\">Here,<\/p>\n<ul class=\"\">\n<li class=\"\">\n<strong>x<\/strong> is the set of independent features<\/li>\n<li class=\"\">\n<strong>y<\/strong> is the dependent feature (the target)<\/li>\n<\/ul>\n<p dir=\"ltr\">If you want to check the sample counts before and after oversampling, run the code below.<\/p>\n<\/div>\n<div class=\"thrv_wrapper tve_image_caption\" data-css=\"tve-u-173d28fbd3e\"><span class=\"tve_image_frame\"><img loading=\"lazy\" 
src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/plugins\/lazy-load\/images\/1x1.trans.gif?ssl=1\" data-lazy-src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/08\/14-smotetomek-output.png?resize=613%2C69&amp;ssl=1\" class=\"tve_image wp-image-4581\" alt=\"smotetomek output\" data-id=\"4581\" width=\"613\" data-init-width=\"1978\" height=\"69\" data-init-height=\"224\" title=\"smotetomek output\" data-width=\"613\" data-height=\"69\" data-recalc-dims=\"1\"><img loading=\"lazy\" class=\"tve_image wp-image-4581\" alt=\"smotetomek output\" data-id=\"4581\" width=\"613\" data-init-width=\"1978\" height=\"69\" data-init-height=\"224\" title=\"smotetomek output\" src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/08\/14-smotetomek-output.png?resize=613%2C69&amp;ssl=1\" data-width=\"613\" data-height=\"69\" data-recalc-dims=\"1\"><\/span><\/div>\n<div class=\"thrv_wrapper thrv_text_element\">\n<p dir=\"ltr\">Now let\u2019s implement the same model, with the oversampled data.<\/p>\n<\/div>\n<div class=\"thrv_wrapper tve_image_caption\" data-css=\"tve-u-173d290bc2d\"><span class=\"tve_image_frame\"><img loading=\"lazy\" src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/plugins\/lazy-load\/images\/1x1.trans.gif?ssl=1\" data-lazy-src=\"https:\/\/i0.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/08\/15-model-with-random-oversampled.png?resize=613%2C367&amp;ssl=1\" class=\"tve_image wp-image-4585\" alt=\"model with random oversampled\" data-id=\"4585\" width=\"613\" data-init-width=\"1524\" height=\"367\" data-init-height=\"912\" title=\"model with random oversampled\" data-width=\"613\" data-height=\"367\" data-recalc-dims=\"1\"><img loading=\"lazy\" class=\"tve_image wp-image-4585\" alt=\"model with random oversampled\" data-id=\"4585\" width=\"613\" data-init-width=\"1524\" height=\"367\" data-init-height=\"912\" title=\"model with random oversampled\" 
src=\"https:\/\/i0.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/08\/15-model-with-random-oversampled.png?resize=613%2C367&amp;ssl=1\" data-width=\"613\" data-height=\"367\" data-recalc-dims=\"1\"><\/span><\/div>\n<div class=\"thrv_wrapper thrv_text_element\">\n<p dir=\"ltr\">Let\u2019s check the accuracy of the model.<\/p>\n<\/div>\n<div class=\"thrv_wrapper tve_image_caption\" data-css=\"tve-u-173d291521a\"><span class=\"tve_image_frame\"><img loading=\"lazy\" src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/plugins\/lazy-load\/images\/1x1.trans.gif?ssl=1\" data-lazy-src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/08\/16-random-oversampling-model-accuracy.png?resize=613%2C86&amp;ssl=1\" class=\"tve_image wp-image-4587\" alt=\"random oversampling model accuracy\" data-id=\"4587\" width=\"613\" data-init-width=\"2004\" height=\"86\" data-init-height=\"280\" title=\"random oversampling model accuracy\" data-width=\"613\" data-height=\"86\" data-recalc-dims=\"1\"><img loading=\"lazy\" class=\"tve_image wp-image-4587\" alt=\"random oversampling model accuracy\" data-id=\"4587\" width=\"613\" data-init-width=\"2004\" height=\"86\" data-init-height=\"280\" title=\"random oversampling model accuracy\" src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/08\/16-random-oversampling-model-accuracy.png?resize=613%2C86&amp;ssl=1\" data-width=\"613\" data-height=\"86\" data-recalc-dims=\"1\"><\/span><\/div>\n<div class=\"thrv_wrapper thrv_text_element tve-froala fr-box fr-basic\">\n<p dir=\"ltr\">We can see that we get very good accuracy on the balanced data, and that the TP and TN counts have increased, where:<\/p>\n<ul class=\"\">\n<li class=\"\">TP: True Positive<\/li>\n<li class=\"\">TN: True Negative<\/li>\n<\/ul>\n<p>TP and TN are components of the <a href=\"https:\/\/dataaspirant.com\/confusion-matrix-sklearn-python\/\" target=\"_blank\" class=\"tve-froala\" rel=\"noopener noreferrer\">confusion matrix<\/a>.<\/p>\n<\/div>\n<div class=\"thrv_wrapper thrv-page-section thrv-lp-block\" data-inherit-lp-settings=\"1\" data-css=\"tve-u-173d292bdd8\">\n<div class=\"tve-page-section-in tve_empty_dropzone  \" data-css=\"tve-u-173d292bf9d\">\n<h3 class=\"\" id=\"t-1596963627856\">Oversampling pros and cons<\/h3>\n<div class=\"thrv_wrapper thrv_text_element\" data-css=\"tve-u-173d292bddc\">\n<p class=\"tcb-global-text-\" data-css=\"tve-u-173d292bddd\">Below are the pros and cons of using the oversampling technique.<\/p>\n<\/div>\n<div class=\"thrv_wrapper thrv-columns dynamic-group-kbuvrect\" data-css=\"tve-u-173d292bdde\">\n<div class=\"tcb-flex-row v-2 tcb--cols--2\" data-css=\"tve-u-173d292bddf\">\n<div class=\"tcb-flex-col\">\n<div class=\"tcb-col dynamic-group-kbuvrcvb\" data-css=\"tve-u-173d292bde0\">\n<div class=\"thrv_wrapper thrv_contentbox_shortcode thrv-content-box tve-elem-default-pad dynamic-group-kbuvrbgn\" data-css=\"tve-u-173d292bde1\">\n<div class=\"tve-cb\">\n<div class=\"thrv_wrapper thrv-styled_list tcb-icon-display dynamic-group-kbuvr94e\" data-icon-code=\"icon-check-circle-solid\" data-css=\"tve-u-173d292bde7\">\n<ul class=\"tcb-styled-list\">\n<li class=\"thrv-styled-list-item dynamic-group-kbksn48l\" data-css=\"tve-u-173d292bde8\">\n<p><span class=\"thrv-advanced-inline-text tve_editable tcb-styled-list-icon-text tcb-no-delete tcb-no-save dynamic-group-kbksmpx3\" data-css=\"tve-u-173d292bdea\">This method doesn\u2019t lead to information loss.<\/span>\n<\/li>\n<li class=\"thrv-styled-list-item dynamic-group-kbksn48l\" data-css=\"tve-u-173d292bdeb\">\n<p><span class=\"thrv-advanced-inline-text tve_editable 
tcb-styled-list-icon-text tcb-no-delete tcb-no-save dynamic-group-kbksmpx3\" data-css=\"tve-u-173d292bded\">Performs well and gives good accuracy.<\/span>\n<\/li>\n<li class=\"thrv-styled-list-item dynamic-group-kbksn48l\" data-css=\"tve-u-173d292bdee\">\n<p><span class=\"thrv-advanced-inline-text tve_editable tcb-styled-list-icon-text tcb-no-delete tcb-no-save dynamic-group-kbksmpx3\" data-css=\"tve-u-173d292bdf0\">SMOTE-based variants create new synthetic data points from the nearest neighbours of existing data.<\/span>\n<\/li>\n<\/ul>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"tcb-flex-col\">\n<div class=\"tcb-col dynamic-group-kbuvrcvb\" data-css=\"tve-u-173d292bdf4\">\n<div class=\"thrv_wrapper thrv_contentbox_shortcode thrv-content-box tve-elem-default-pad dynamic-group-kbuvrbgn\" data-css=\"tve-u-173d292bdf5\">\n<div class=\"tve-cb\">\n<div class=\"thrv_wrapper thrv-styled_list dynamic-group-kbuvr94e\" data-icon-code=\"icon-times-circle-solid\" data-css=\"tve-u-173d292bdfb\">\n<ul class=\"tcb-styled-list\">\n<li class=\"thrv-styled-list-item dynamic-group-kbksn48l\" data-css=\"tve-u-173d292bdfc\">\n<p><span class=\"thrv-advanced-inline-text tve_editable tcb-styled-list-icon-text tcb-no-delete tcb-no-save dynamic-group-kbksmpx3\" data-css=\"tve-u-173d292bdfe\">Increasing the size of the data increases training time.<\/span>\n<\/li>\n<li class=\"thrv-styled-list-item dynamic-group-kbksn48l\" data-css=\"tve-u-173d292bdff\">\n<p><span class=\"thrv-advanced-inline-text tve_editable tcb-styled-list-icon-text tcb-no-delete tcb-no-save dynamic-group-kbksmpx3\" data-css=\"tve-u-173d292be01\">It may also lead to overfitting, since it replicates minority-class samples.<\/span>\n<\/li>\n<li class=\"thrv-styled-list-item dynamic-group-kbksn48l\" data-css=\"tve-u-173d292be02\">\n<p><span class=\"thrv-advanced-inline-text tve_editable tcb-styled-list-icon-text tcb-no-delete tcb-no-save dynamic-group-kbksmpx3\" data-css=\"tve-u-173d292be04\">Needs 
extra storage.<\/span>\n<\/li>\n<\/ul>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<h3 class=\"\" id=\"t-1596963627855\">Undersampling<\/h3>\n<div class=\"thrv_wrapper tve_image_caption\" data-css=\"tve-u-173d296ba31\"><span class=\"tve_image_frame\"><img loading=\"lazy\" src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/plugins\/lazy-load\/images\/1x1.trans.gif?ssl=1\" data-lazy-src=\"https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/08\/17-undersampling.png?resize=613%2C318&amp;ssl=1\" class=\"tve_image wp-image-4595\" alt=\"undersampling\" data-id=\"4595\" width=\"613\" data-init-width=\"2242\" height=\"318\" data-init-height=\"1164\" title=\"undersampling\" data-width=\"613\" data-height=\"318\" data-recalc-dims=\"1\"><img loading=\"lazy\" class=\"tve_image wp-image-4595\" alt=\"undersampling\" data-id=\"4595\" width=\"613\" data-init-width=\"2242\" height=\"318\" data-init-height=\"1164\" title=\"undersampling\" src=\"https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/08\/17-undersampling.png?resize=613%2C318&amp;ssl=1\" data-width=\"613\" data-height=\"318\" data-recalc-dims=\"1\"><\/span><\/div>\n<div class=\"thrv_wrapper thrv_text_element\">\n<p>In undersampling, we <strong>decrease<\/strong> the number of samples in the majority class to match the number of samples of the minority class. <\/p>\n<p>In brief, you take the majority class and keep only a subset of its samples, equal in size to the minority class. <\/p>\n<p>Let\u2019s look at an example.<\/p>\n<p>Suppose we have a dataset with 100 samples labeled 0 and 900 samples labeled 1, so 0 is the minority class. We balance the data from a 9:1 ratio to a 1:1 ratio by randomly selecting 100 of the 900 data points in the majority class. 
This results in a 1:1 ratio, i.e.,<\/p>\n<p dir=\"ltr\">1 label &rarr; 100 data points<\/p>\n<p dir=\"ltr\">0 label &rarr; 100 data points<\/p>\n<h3 class=\"\" id=\"t-1596963627860\">Undersampling Implementation<\/h3>\n<p dir=\"ltr\">We can implement undersampling in two <strong>different<\/strong> ways:<\/p>\n<ol class=\"\">\n<li class=\" dir=\">RandomUnderSampler method<\/li>\n<li class=\" dir=\">NearMiss method<\/li>\n<\/ol>\n<h4 class=\"\">Random undersampling Implementation<\/h4>\n<p dir=\"ltr\">It randomly samples the majority class until it has a similar number of observations as the minority class.<\/p>\n<\/div>\n<div class=\"thrv_wrapper tve_image_caption\" data-css=\"tve-u-173d2991418\"><span class=\"tve_image_frame\"><img loading=\"lazy\" src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/plugins\/lazy-load\/images\/1x1.trans.gif?ssl=1\" data-lazy-src=\"https:\/\/i0.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/08\/18-random-under-sample-code.png?resize=613%2C271&amp;ssl=1\" class=\"tve_image wp-image-4603\" alt=\"random under sample code\" data-id=\"4603\" width=\"613\" data-init-width=\"1332\" height=\"271\" data-init-height=\"588\" title=\"random under sample code\" data-width=\"613\" data-height=\"271\" data-recalc-dims=\"1\"><img loading=\"lazy\" class=\"tve_image wp-image-4603\" alt=\"random under sample code\" data-id=\"4603\" width=\"613\" data-init-width=\"1332\" height=\"271\" data-init-height=\"588\" title=\"random under sample code\" src=\"https:\/\/i0.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/08\/18-random-under-sample-code.png?resize=613%2C271&amp;ssl=1\" data-width=\"613\" data-height=\"271\" data-recalc-dims=\"1\"><\/span><\/div>\n<div class=\"thrv_wrapper thrv_text_element\">\n<p dir=\"ltr\">Here, <\/p>\n<ul class=\"\">\n<li class=\"\">\n<strong>x<\/strong> is the set of independent features.<\/li>\n<li class=\"\">\n<strong>y<\/strong> is the dependent 
feature.<\/li>\n<\/ul>\n<p dir=\"ltr\">To check the sample counts before and after undersampling, run the code below.<\/p>\n<\/div>\n<div class=\"thrv_wrapper tve_image_caption\" data-css=\"tve-u-173d299c5b9\"><span class=\"tve_image_frame\"><img loading=\"lazy\" src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/plugins\/lazy-load\/images\/1x1.trans.gif?ssl=1\" data-lazy-src=\"https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/08\/19-random-under-sampler-output.png?resize=613%2C68&amp;ssl=1\" class=\"tve_image wp-image-4605\" alt=\"random under sampler output\" data-id=\"4605\" width=\"613\" data-init-width=\"1988\" height=\"68\" data-init-height=\"220\" title=\"random under sampler output\" data-width=\"613\" data-height=\"68\" data-recalc-dims=\"1\"><img loading=\"lazy\" class=\"tve_image wp-image-4605\" alt=\"random under sampler output\" data-id=\"4605\" width=\"613\" data-init-width=\"1988\" height=\"68\" data-init-height=\"220\" title=\"random under sampler output\" src=\"https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/08\/19-random-under-sampler-output.png?resize=613%2C68&amp;ssl=1\" data-width=\"613\" data-height=\"68\" data-recalc-dims=\"1\"><\/span><\/div>\n<div class=\"thrv_wrapper thrv_text_element\">\n<h4 class=\"\">NearMiss Implementation<\/h4>\n<p dir=\"ltr\">NearMiss selects the majority-class samples whose average distance to the N closest minority-class samples is smallest (this describes the NearMiss-1 variant).<\/p>\n<\/div>\n<div class=\"thrv_wrapper tve_image_caption\" data-css=\"tve-u-173d29af5cd\"><span class=\"tve_image_frame\"><img loading=\"lazy\" src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/plugins\/lazy-load\/images\/1x1.trans.gif?ssl=1\" data-lazy-src=\"https:\/\/i0.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/08\/20-under-sampling-with-nearness.png?resize=613%2C313&amp;ssl=1\" class=\"tve_image wp-image-4610\" alt=\"under sampling with nearness\" data-id=\"4610\" width=\"613\" 
data-init-width=\"1152\" height=\"313\" data-init-height=\"588\" title=\"under sampling with nearness\" data-width=\"613\" data-height=\"313\" data-recalc-dims=\"1\"><img loading=\"lazy\" class=\"tve_image wp-image-4610\" alt=\"under sampling with nearness\" data-id=\"4610\" width=\"613\" data-init-width=\"1152\" height=\"313\" data-init-height=\"588\" title=\"under sampling with nearness\" src=\"https:\/\/i0.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/08\/20-under-sampling-with-nearness.png?resize=613%2C313&amp;ssl=1\" data-width=\"613\" data-height=\"313\" data-recalc-dims=\"1\"><\/span><\/div>\n<div class=\"thrv_wrapper thrv_text_element\">\n<p dir=\"ltr\">Here, <\/p>\n<ul class=\"\">\n<li class=\"\">x is the set of independent features<\/li>\n<li class=\"\">y is the dependent feature<\/li>\n<\/ul>\n<p dir=\"ltr\">To check the sample counts before and after undersampling, run the code below.<\/p>\n<\/div>\n<div class=\"thrv_wrapper tve_image_caption\" data-css=\"tve-u-173d29b7560\"><span class=\"tve_image_frame\"><img loading=\"lazy\" src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/plugins\/lazy-load\/images\/1x1.trans.gif?ssl=1\" data-lazy-src=\"https:\/\/i0.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/08\/21-under-sampling-nearmiss-output.png?resize=613%2C69&amp;ssl=1\" class=\"tve_image wp-image-4612\" alt=\"under sampling nearmiss output\" data-id=\"4612\" width=\"613\" data-init-width=\"1994\" height=\"69\" data-init-height=\"224\" title=\"under sampling nearmiss output\" data-width=\"613\" data-height=\"69\" data-recalc-dims=\"1\"><img loading=\"lazy\" class=\"tve_image wp-image-4612\" alt=\"under sampling nearmiss output\" data-id=\"4612\" width=\"613\" data-init-width=\"1994\" height=\"69\" data-init-height=\"224\" title=\"under sampling nearmiss output\" src=\"https:\/\/i0.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/08\/21-under-sampling-nearmiss-output.png?resize=613%2C69&amp;ssl=1\" data-width=\"613\" 
data-height=\"69\" data-recalc-dims=\"1\"><\/span><\/div>\n<div class=\"thrv_wrapper thrv_text_element\">\n<p dir=\"ltr\">Now we will train the model on the undersampled data.<\/p>\n<\/div>\n<div class=\"thrv_wrapper tve_image_caption\" data-css=\"tve-u-173d29c415a\"><span class=\"tve_image_frame\"><img loading=\"lazy\" src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/plugins\/lazy-load\/images\/1x1.trans.gif?ssl=1\" data-lazy-src=\"https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/08\/22-model-with-under-sampling-data.png?resize=613%2C323&amp;ssl=1\" class=\"tve_image wp-image-4615\" alt=\"model with under sampling data\" data-id=\"4615\" width=\"613\" data-init-width=\"1524\" height=\"323\" data-init-height=\"804\" title=\"model with under sampling data\" data-width=\"613\" data-height=\"323\" data-recalc-dims=\"1\"><img loading=\"lazy\" class=\"tve_image wp-image-4615\" alt=\"model with under sampling data\" data-id=\"4615\" width=\"613\" data-init-width=\"1524\" height=\"323\" data-init-height=\"804\" title=\"model with under sampling data\" src=\"https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/08\/22-model-with-under-sampling-data.png?resize=613%2C323&amp;ssl=1\" data-width=\"613\" data-height=\"323\" data-recalc-dims=\"1\"><\/span><\/div>\n<div class=\"thrv_wrapper thrv_text_element\">\n<p dir=\"ltr\">Now let\u2019s check the accuracy of the model.<\/p>\n<\/div>\n<div class=\"thrv_wrapper tve_image_caption\" data-css=\"tve-u-173d29cf1f7\"><span class=\"tve_image_frame\"><img loading=\"lazy\" src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/plugins\/lazy-load\/images\/1x1.trans.gif?ssl=1\" data-lazy-src=\"https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/08\/23-unders-sampling-model-accuracy.png?resize=613%2C89&amp;ssl=1\" class=\"tve_image wp-image-4618\" alt=\"unders sampling model accuracy\" data-id=\"4618\" width=\"613\" data-init-width=\"2018\" height=\"89\" data-init-height=\"294\" 
title=\"unders sampling model accuracy\" data-width=\"613\" data-height=\"89\" data-recalc-dims=\"1\"><img loading=\"lazy\" class=\"tve_image wp-image-4618\" alt=\"unders sampling model accuracy\" data-id=\"4618\" width=\"613\" data-init-width=\"2018\" height=\"89\" data-init-height=\"294\" title=\"unders sampling model accuracy\" src=\"https:\/\/i1.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/08\/23-unders-sampling-model-accuracy.png?resize=613%2C89&amp;ssl=1\" data-width=\"613\" data-height=\"89\" data-recalc-dims=\"1\"><\/span><\/div>\n<div class=\"thrv_wrapper thrv_text_element\">\n<p dir=\"ltr\">Undersampling tends to give lower accuracy on smaller datasets because it discards information. Use this method only if you have a huge dataset.<\/p>\n<\/div>\n<div class=\"thrv_wrapper thrv-page-section thrv-lp-block\" data-inherit-lp-settings=\"1\" data-css=\"tve-u-173d29dc90d\">\n<div class=\"tve-page-section-in tve_empty_dropzone  \" data-css=\"tve-u-173d29dcb26\">\n<h3 class=\"\" id=\"t-1596963627862\">Undersampling pros and cons<\/h3>\n<div class=\"thrv_wrapper thrv_text_element\" data-css=\"tve-u-173d29dc911\">\n<p class=\"tcb-global-text-\" data-css=\"tve-u-173d29dc912\">Below are the pros and cons of using undersampling techniques.<\/p>\n<\/div>\n<div class=\"thrv_wrapper thrv-columns dynamic-group-kbuvrect\" data-css=\"tve-u-173d29dc913\">\n<div class=\"tcb-flex-row v-2 tcb--cols--2\" data-css=\"tve-u-173d29dc914\">\n<div class=\"tcb-flex-col\">\n<div class=\"tcb-col dynamic-group-kbuvrcvb\" data-css=\"tve-u-173d29dc915\">\n<div class=\"thrv_wrapper thrv_contentbox_shortcode thrv-content-box tve-elem-default-pad dynamic-group-kbuvrbgn\" data-css=\"tve-u-173d29dc916\">\n<div class=\"tve-cb\">\n<div class=\"thrv_wrapper thrv-styled_list tcb-icon-display dynamic-group-kbuvr94e\" data-icon-code=\"icon-check-circle-solid\" data-css=\"tve-u-173d29dc91c\">\n<ul class=\"tcb-styled-list\">\n<li class=\"thrv-styled-list-item 
dynamic-group-kbksn48l\" data-css=\"tve-u-173d29dc91d\">\n<p><span class=\"thrv-advanced-inline-text tve_editable tcb-styled-list-icon-text tcb-no-delete tcb-no-save dynamic-group-kbksmpx3\" data-css=\"tve-u-173d29dc91f\">Reduces storage requirements and makes training faster.<\/span>\n<\/li>\n<li class=\"thrv-styled-list-item dynamic-group-kbksn48l\" data-css=\"tve-u-173d29dc920\">\n<p><span class=\"thrv-advanced-inline-text tve_editable tcb-styled-list-icon-text tcb-no-delete tcb-no-save dynamic-group-kbksmpx3\" data-css=\"tve-u-173d29dc922\">In most cases it creates a balanced subset that carries the greatest potential for representing the larger group as a whole.<\/span>\n<\/li>\n<li class=\"thrv-styled-list-item dynamic-group-kbksn48l\" data-css=\"tve-u-173d29dc923\">\n<p><span class=\"thrv-advanced-inline-text tve_editable tcb-styled-list-icon-text tcb-no-delete tcb-no-save dynamic-group-kbksmpx3\" data-css=\"tve-u-173d29dc925\">Random undersampling produces a simple random sample, which is much simpler than other techniques.<\/span>\n<\/li>\n<\/ul>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"tcb-flex-col\">\n<div class=\"tcb-col dynamic-group-kbuvrcvb\" data-css=\"tve-u-173d29dc929\">\n<div class=\"thrv_wrapper thrv_contentbox_shortcode thrv-content-box tve-elem-default-pad dynamic-group-kbuvrbgn\" data-css=\"tve-u-173d29dc92a\">\n<div class=\"tve-cb\">\n<div class=\"thrv_wrapper thrv-styled_list dynamic-group-kbuvr94e\" data-icon-code=\"icon-times-circle-solid\" data-css=\"tve-u-173d29dc930\">\n<ul class=\"tcb-styled-list\">\n<li class=\"thrv-styled-list-item dynamic-group-kbksn48l\" data-css=\"tve-u-173d29dc931\">\n<p><span class=\"thrv-advanced-inline-text tve_editable tcb-styled-list-icon-text tcb-no-delete tcb-no-save dynamic-group-kbksmpx3\" data-css=\"tve-u-173d29dc933\">It can discard potentially useful information that could be important for building classifiers.<\/span>\n<\/li>\n<li class=\"thrv-styled-list-item dynamic-group-kbksn48l\" 
data-css=\"tve-u-173d29dc934\">\n<p><span class=\"thrv-advanced-inline-text tve_editable tcb-styled-list-icon-text tcb-no-delete tcb-no-save dynamic-group-kbksmpx3\" data-css=\"tve-u-173d29dc936\">The sample chosen by random undersampling may be biased, resulting in inaccurate results on the actual test data.<\/span>\n<\/li>\n<li class=\"thrv-styled-list-item dynamic-group-kbksn48l\" data-css=\"tve-u-173d29dc937\">\n<p><span class=\"thrv-advanced-inline-text tve_editable tcb-styled-list-icon-text tcb-no-delete tcb-no-save dynamic-group-kbksmpx3\" data-css=\"tve-u-173d29dc939\">Loss of useful information of the majority class.<\/span>\n<\/li>\n<\/ul>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"thrv_wrapper thrv_text_element tve-froala fr-box fr-basic\">\n<h2 class=\"\" id=\"t-1597029099964\">When to use oversampling vs. undersampling<\/h2>\n<p>We now have a fair understanding of these two techniques. Both address the imbalanced data issue, so which one should we use? <\/p>\n<ul>\n<li>\n<strong>Oversampling:<\/strong> Use oversampling when we have a limited amount of data.<\/li>\n<li>\n<strong>Undersampling:<\/strong> Use undersampling when we have a huge amount of data and undersampling the majority class won&#8217;t affect the data.\n<\/li>\n<\/ul>\n<h2 class=\"\" id=\"t-1596963627866\">Complete Code<\/h2>\n<p>The complete code is available in our <a href=\"https:\/\/github.com\/saimadhu-polamuri\/DataAspirant_codes\/tree\/master\/handle_imbalance_data\" target=\"_blank\" class=\"tve-froala fr-basic\" rel=\"noopener noreferrer\">GitHub repo<\/a>, which you can also fork.<\/p>\n<\/div>\n<div class=\"thrv_wrapper thrv_text_element\">\n<h2 class=\"\" id=\"t-1596963627861\">Conclusion<\/h2>\n<p dir=\"ltr\">When handling imbalanced datasets, there is no single right solution for improving the accuracy of the prediction model. 
We need to try out multiple methods to figure out the best-suited sampling techniques for the dataset. <\/p>\n<p dir=\"ltr\">Depending on the characteristics of the imbalanced data set, the most effective techniques will vary. In most cases, synthetic techniques like SMOTE will outperform conventional oversampling and undersampling methods.<\/p>\n<p dir=\"ltr\">For better results, we can use synthetic sampling methods like SMOTE and advanced boosting and ensemble algorithms.<\/p>\n<h4 class=\"\">Recommended Courses<\/h4>\n<\/div>\n<div class=\"thrv_wrapper thrv-page-section thrv-lp-block\" data-inherit-lp-settings=\"1\" data-css=\"tve-u-173d2aee198\">\n<div class=\"tve-page-section-in tve_empty_dropzone  \" data-css=\"tve-u-173d2aee40e\">\n<div class=\"thrv_wrapper thrv-columns dynamic-group-kbt3q0q7\" data-css=\"tve-u-173d2aee19a\">\n<div class=\"tcb-flex-row v-2 tcb--cols--3 tcb-medium-no-wrap tcb-mobile-wrap m-edit\" data-css=\"tve-u-173d2aee19b\">\n<div class=\"tcb-flex-col\">\n<div class=\"tcb-col dynamic-group-kbt3pyfd\" data-css=\"tve-u-173d2aee19c\">\n<div class=\"thrv_wrapper thrv_contentbox_shortcode thrv-content-box tve-elem-default-pad dynamic-group-kbt3pwhk\" data-css=\"tve-u-173d2aee19d\">\n<div class=\"tve-cb\">\n<div class=\"thrv_wrapper tve_image_caption dynamic-group-kbt3pu4z\" data-css=\"tve-u-173d2aee1a8\"><span class=\"tve_image_frame\"><a href=\"https:\/\/dataaspirant.com\/recommends\/ds-courses\/educative-machine-learning-course\/\" target=\"_blank\" class=\"hasimg thirstylinkimg\" title=\"Educative Machine Learning course\" rel=\"nofollow noopener noreferrer\" data-linkid=\"4692\"><img loading=\"lazy\" src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/plugins\/lazy-load\/images\/1x1.trans.gif?ssl=1\" data-lazy-src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/08\/educative-machine-learning.png?resize=172%2C172&amp;ssl=1\" class=\"tve_image wp-image-4693\" alt=\"educative-machine-learning\" data-id=\"4693\" 
width=\"172\" data-init-width=\"150\" height=\"172\" data-init-height=\"150\" title=\"educative-machine-learning\" data-width=\"172\" data-height=\"172\" data-css=\"tve-u-173d2aee1a9\" data-link-wrap=\"true\" data-recalc-dims=\"1\"><img loading=\"lazy\" class=\"tve_image wp-image-4693\" alt=\"educative-machine-learning\" data-id=\"4693\" width=\"172\" data-init-width=\"150\" height=\"172\" data-init-height=\"150\" title=\"educative-machine-learning\" src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/08\/educative-machine-learning.png?resize=172%2C172&amp;ssl=1\" data-width=\"172\" data-height=\"172\" data-css=\"tve-u-173d2aee1a9\" data-link-wrap=\"true\" data-recalc-dims=\"1\"><\/a><span class=\"tve-image-overlay\"><\/span><\/span><\/div>\n<h4 class=\"\" data-css=\"tve-u-173d2aee1ab\">Machine Learning For Engineers<\/h4>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"tcb-flex-col\">\n<div class=\"tcb-col dynamic-group-kbt3pyfd\" data-css=\"tve-u-173d2aee19c\">\n<div class=\"thrv_wrapper thrv_contentbox_shortcode thrv-content-box tve-elem-default-pad dynamic-group-kbt3pwhk\" data-css=\"tve-u-173d2aee1b8\">\n<div class=\"tve-cb\">\n<div class=\"thrv_wrapper tve_image_caption dynamic-group-kbt3pu4z\" data-css=\"tve-u-173d2aee1b9\"><span class=\"tve_image_frame\"><a href=\"https:\/\/dataaspirant.com\/recommends\/data-science-courses\/superviesed-learning-scikit-learn-datacamp\/\" target=\"_blank\" class=\"hasimg thirstylinkimg\" title=\"Superviesed-Learning-With-Scikit-learn-Datacamp\" rel=\"nofollow noopener noreferrer\" data-linkid=\"3157\"><img loading=\"lazy\" src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/plugins\/lazy-load\/images\/1x1.trans.gif?ssl=1\" data-lazy-src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/08\/supervised-learning.png?resize=172%2C172&amp;ssl=1\" class=\"tve_image wp-image-4696\" alt=\"supervised learning\" data-id=\"4696\" width=\"172\" data-init-width=\"464\" height=\"172\" 
data-init-height=\"482\" title=\"supervised learning\" data-width=\"172\" data-height=\"172\" data-css=\"tve-u-173d2aee1ba\" data-link-wrap=\"true\" data-recalc-dims=\"1\"><img loading=\"lazy\" class=\"tve_image wp-image-4696\" alt=\"supervised learning\" data-id=\"4696\" width=\"172\" data-init-width=\"464\" height=\"172\" data-init-height=\"482\" title=\"supervised learning\" src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/08\/supervised-learning.png?resize=172%2C172&amp;ssl=1\" data-width=\"172\" data-height=\"172\" data-css=\"tve-u-173d2aee1ba\" data-link-wrap=\"true\" data-recalc-dims=\"1\"><\/a><span class=\"tve-image-overlay\"><\/span><\/span><\/div>\n<h4 class=\"\" data-css=\"tve-u-173d2aee1bc\">Supervised Learning Algorithms<\/h4>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"tcb-flex-col\">\n<div class=\"tcb-col dynamic-group-kbt3pyfd\" data-css=\"tve-u-173d2aee19c\">\n<div class=\"thrv_wrapper thrv_contentbox_shortcode thrv-content-box tve-elem-default-pad dynamic-group-kbt3pwhk\" data-css=\"tve-u-173d2aee1c7\">\n<div class=\"tve-cb\">\n<div class=\"thrv_wrapper tve_image_caption dynamic-group-kbt3pu4z\" data-css=\"tve-u-173d2aee1c8\"><span class=\"tve_image_frame\"><a href=\"https:\/\/dataaspirant.com\/recommends\/data-science-courses\/machine-learning-z-hands-python-r-data-science-course-udemy\/\" target=\"_blank\" class=\"hasimg thirstylinkimg\" title=\"Machine-Learning-A-Z-Hands-On-Python-&amp;amp;-R-In-Data-Science-Course-Udemy\" rel=\"nofollow noopener noreferrer\" data-linkid=\"1587\"><img loading=\"lazy\" src=\"https:\/\/i2.wp.com\/dataaspirant.com\/wp-content\/plugins\/lazy-load\/images\/1x1.trans.gif?ssl=1\" data-lazy-src=\"https:\/\/i0.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/08\/machine-learning-1.jpg?resize=172%2C172&amp;ssl=1\" class=\"tve_image wp-image-4302\" alt=\"Machine learning\" data-id=\"4302\" width=\"172\" data-init-width=\"150\" height=\"172\" data-init-height=\"150\" title=\"machine 
learning 1\" data-width=\"172\" data-height=\"172\" data-css=\"tve-u-173d2aee1c9\" data-link-wrap=\"true\" data-recalc-dims=\"1\"><img loading=\"lazy\" class=\"tve_image wp-image-4302\" alt=\"Machine learning\" data-id=\"4302\" width=\"172\" data-init-width=\"150\" height=\"172\" data-init-height=\"150\" title=\"machine learning 1\" src=\"https:\/\/i0.wp.com\/dataaspirant.com\/wp-content\/uploads\/2020\/08\/machine-learning-1.jpg?resize=172%2C172&amp;ssl=1\" data-width=\"172\" data-height=\"172\" data-css=\"tve-u-173d2aee1c9\" data-link-wrap=\"true\" data-recalc-dims=\"1\"><\/a><span class=\"tve-image-overlay\"><\/span><\/span><\/div>\n<h4 class=\"\" data-css=\"tve-u-173d2aee1cb\">Machine Learning with Python<\/h4>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/dataaspirant.com\/handle-imbalanced-data-machine-learning\/<\/p>\n","protected":false},"author":0,"featured_media":259,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[2],"tags":[],"_links":{"self":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts\/258"}],"collection":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/comments?post=258"}],"version-history":[{"count":0,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts\/258\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/media\/259"}],"wp:attachment":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/media?parent=258"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-j
son\/wp\/v2\/categories?post=258"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/tags?post=258"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}