{"id":7296,"date":"2020-10-28T06:01:26","date_gmt":"2020-10-28T06:01:26","guid":{"rendered":"https:\/\/data-science.gotoauthority.com\/2020\/10\/28\/loan-default-detection\/"},"modified":"2020-10-28T06:01:26","modified_gmt":"2020-10-28T06:01:26","slug":"loan-default-detection","status":"publish","type":"post","link":"https:\/\/wealthrevelation.com\/data-science\/2020\/10\/28\/loan-default-detection\/","title":{"rendered":"Loan Default Detection"},"content":{"rendered":"<div>\n<h3>Motivation<\/h3>\n<p>In 2005, a Taiwanese bank conducted a study on the likelihood of clients defaulting on their loan payments. The motivation behind the study was the increase in amounts of credit being offered by banks to customers, regardless of their repayment capabilities. This led to customers accumulating significant amounts of debt, which in turn resulted in defaults.<\/p>\n<p>The goal was to use basic information about customers along with their past repayment history to predict their likelihood of default. 
Our objective is to use the previous 6 months of repayment history to try and predict whether the customer will default the following month.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large\"><img data-srcset=\"https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2020\/10\/shamroz-qureshi\/cc-default-454575-KrvOFQXN-300x169.jpg 300w, https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2020\/10\/shamroz-qureshi\/cc-default-454575-KrvOFQXN-600x338.jpg 600w, https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2020\/10\/shamroz-qureshi\/cc-default-454575-KrvOFQXN-768x432.jpg 768w, https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2020\/10\/shamroz-qureshi\/cc-default-454575-KrvOFQXN-1024x576.jpg 1024w\" loading=\"lazy\" width=\"1024\" height=\"576\" alt=\"\" data-src=\"https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2020\/10\/shamroz-qureshi\/cc-default-454575-KrvOFQXN-1024x576.jpg\" data-sizes=\"(max-width: 1024px) 100vw, 1024px\" class=\"wp-image-68013 lazyload\" src=\"image\/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==\"><img loading=\"lazy\" width=\"1024\" height=\"576\" src=\"https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2020\/10\/shamroz-qureshi\/cc-default-454575-KrvOFQXN-1024x576.jpg\" alt=\"\" class=\"wp-image-68013\"><\/figure>\n<\/div>\n<h3>Dataset<\/h3>\n<p>The dataset used was collected by a Taiwanese bank in October 2005 and can be downloaded from UCI&#8217;s Machine Learning Repository. 
<\/p>\n<ul id=\"block-1702ff19-acbe-475d-aef4-b5043d14fc01\">\n<li>\n<strong>Sex:<\/strong> Gender<\/li>\n<li>\n<strong>Age:<\/strong> Client&#8217;s age<\/li>\n<li>\n<strong>Marriage:<\/strong> Marital status<\/li>\n<li>\n<strong>Education:<\/strong> Level of education<\/li>\n<li>\n<strong>Limit balance:<\/strong> Amount of credit (NT Dollars)<\/li>\n<li>\n<strong>Payment status (month):<\/strong> Current repayment status<\/li>\n<li>\n<strong>Bill statement (month):<\/strong> The amount of bill statements (NT Dollars)<\/li>\n<li>\n<strong>Previous payment (month):<\/strong> Previous payment amount (NT Dollars)<\/li>\n<li>\n<strong>Default payment next month: <\/strong>The target variable indicating whether the customer defaulted on the payment the following month.<\/li>\n<\/ul>\n<div class=\"wp-block-group\">\n<div class=\"wp-block-group__inner-container\">\n<h3>Exploratory Data Analysis<\/h3>\n<p>Apart from age appearing to be uncorrelated with the other features, the correlation matrix doesn&#8217;t provide us with much additional information.<\/p>\n<\/div>\n<\/div>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large\"><img data-srcset=\"https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2020\/10\/shamroz-qureshi\/corr-130337-nbNamYbv-300x245.png 300w, https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2020\/10\/shamroz-qureshi\/corr-130337-nbNamYbv-600x491.png 600w, https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2020\/10\/shamroz-qureshi\/corr-130337-nbNamYbv-768x628.png 768w, https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2020\/10\/shamroz-qureshi\/corr-130337-nbNamYbv.png 792w\" loading=\"lazy\" width=\"792\" height=\"648\" alt=\"\" data-src=\"https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2020\/10\/shamroz-qureshi\/corr-130337-nbNamYbv.png\" data-sizes=\"(max-width: 792px) 100vw, 792px\" class=\"wp-image-68003 lazyload\" src=\"image\/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==\"><img 
loading=\"lazy\" width=\"792\" height=\"648\" src=\"https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2020\/10\/shamroz-qureshi\/corr-130337-nbNamYbv.png\" alt=\"\" class=\"wp-image-68003\"><figcaption><strong>CORRELATION MATRIX<\/strong><\/figcaption><\/figure>\n<\/div>\n<p>The pairplot doesn&#8217;t show much of a difference in the shape of the distribution per gender. We can also see a decrease in limit balances as age increases beyond 55.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large\"><img data-srcset=\"https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2020\/10\/shamroz-qureshi\/pair-685011-j3wQvOzM-300x262.jpg 300w, https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2020\/10\/shamroz-qureshi\/pair-685011-j3wQvOzM-600x523.jpg 600w, https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2020\/10\/shamroz-qureshi\/pair-685011-j3wQvOzM.jpg 641w\" loading=\"lazy\" width=\"641\" height=\"559\" alt=\"\" data-src=\"https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2020\/10\/shamroz-qureshi\/pair-685011-j3wQvOzM.jpg\" data-sizes=\"(max-width: 641px) 100vw, 641px\" class=\"wp-image-68012 lazyload\" src=\"image\/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==\"><img loading=\"lazy\" width=\"641\" height=\"559\" src=\"https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2020\/10\/shamroz-qureshi\/pair-685011-j3wQvOzM.jpg\" alt=\"\" class=\"wp-image-68012\"><figcaption><strong>PAIRPLOT<\/strong><\/figcaption><\/figure>\n<\/div>\n<h3>Model Selection<\/h3>\n<h5><em>CLASSIFICATION<\/em><\/h5>\n<p>We will fit a decision tree classifier for our binary classification problem. 
A confusion matrix summarizes all possible combinations of the predicted values against the actual target:<\/p>\n<ul>\n<li>True positive (TP): The model predicts a default, and the client defaulted.<\/li>\n<li>False positive (FP): The model predicts a default, but the client did not default.<\/li>\n<li>True negative (TN): The model predicts a good customer, and the client did not default.<\/li>\n<li>False negative (FN): The model predicts a good customer, but the client defaulted.<\/li>\n<\/ul>\n<p>We can use these values to create additional evaluation criteria for our model. 
<\/p>\n<div class=\"wp-block-group\">\n<div class=\"wp-block-group__inner-container\">\n<ul>\n<li>Accuracy: Out of all observations, how many the model classified correctly: (TP + TN) \/ (TP + FP + TN + FN).<\/li>\n<li>Precision: Out of all default predictions, how many observations actually defaulted: TP \/ (TP + FP).<\/li>\n<li>Recall: Out of all actual defaults, how many were predicted correctly: TP \/ (TP + FN).<\/li>\n<li>Specificity: Out of all clients who did not default, how many were correctly predicted as non-default: TN \/ (TN + FP).<\/li>\n<li>F-1 Score: The harmonic mean of precision and recall.<\/li>\n<\/ul>\n<p>Understanding these metrics is critical for the proper evaluation of our model&#8217;s performance. Which criterion to optimize depends on the bank&#8217;s priorities. In terms of risk management, the bank may prefer to mitigate risk by declining more applications, as opposed to taking on riskier loans, which may result in larger losses. In this case, we would try to achieve as high a recall as possible. This yields fewer false negatives, at the cost of more false positives. Conversely, if the bank believes it can aggressively hand out loans and still profit regardless of additional defaults, then it can aim for higher precision. This yields fewer false positives, at the cost of more false negatives. 
Ultimately, the metric on which we try to optimize should be selected based on the use case.<\/p>\n<h4>Models<\/h4>\n<\/div>\n<\/div>\n<p>We will start with a basic decision tree model and follow with a more sophisticated random forest model.<\/p>\n<h5><em>BASE TREE<\/em><\/h5>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large\"><img data-srcset=\"https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2020\/10\/shamroz-qureshi\/base-cf-278338-BvNo1YmM-300x200.png 300w, https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2020\/10\/shamroz-qureshi\/base-cf-278338-BvNo1YmM.png 432w\" loading=\"lazy\" width=\"432\" height=\"288\" alt=\"\" data-src=\"https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2020\/10\/shamroz-qureshi\/base-cf-278338-BvNo1YmM.png\" data-sizes=\"(max-width: 432px) 100vw, 432px\" class=\"wp-image-68685 lazyload\" src=\"image\/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==\"><img loading=\"lazy\" width=\"432\" height=\"288\" src=\"https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2020\/10\/shamroz-qureshi\/base-cf-278338-BvNo1YmM.png\" alt=\"\" class=\"wp-image-68685\"><\/figure>\n<\/div>\n<h5>\n<em>RANDOM FOREST<\/em>\u200b<\/h5>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large\"><img data-srcset=\"https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2020\/10\/shamroz-qureshi\/rf-cf-971395-FOTMlyIh-300x200.png 300w, https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2020\/10\/shamroz-qureshi\/rf-cf-971395-FOTMlyIh.png 432w\" loading=\"lazy\" width=\"432\" height=\"288\" alt=\"\" data-src=\"https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2020\/10\/shamroz-qureshi\/rf-cf-971395-FOTMlyIh.png\" data-sizes=\"(max-width: 432px) 100vw, 432px\" class=\"wp-image-68686 lazyload\" src=\"image\/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==\"><img loading=\"lazy\" width=\"432\" height=\"288\" 
src=\"https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2020\/10\/shamroz-qureshi\/rf-cf-971395-FOTMlyIh.png\" alt=\"\" class=\"wp-image-68686\"><\/figure>\n<\/div>\n<p>The results above are from a base tree model and a random forest model. We see a drastic increase in accuracy and precision when we apply the more sophisticated model.<\/p>\n<h3>Hyperparameter Tuning<\/h3>\n<p>We will apply a grid search to tune the hyperparameters of the model in order to achieve better performance. The idea is to create a grid of candidate hyperparameter values and train the model on every combination of them. The search identifies the best-performing combination within the grid.<\/p>\n<h5><em>TUNED BASE TREE<\/em><\/h5>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large\"><img data-srcset=\"https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2020\/10\/shamroz-qureshi\/baseht-cf-780324-7YRrQmVQ-300x200.png 300w, https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2020\/10\/shamroz-qureshi\/baseht-cf-780324-7YRrQmVQ.png 432w\" loading=\"lazy\" width=\"432\" height=\"288\" alt=\"\" data-src=\"https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2020\/10\/shamroz-qureshi\/baseht-cf-780324-7YRrQmVQ.png\" data-sizes=\"(max-width: 432px) 100vw, 432px\" class=\"wp-image-68701 lazyload\" src=\"image\/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==\"><img loading=\"lazy\" width=\"432\" height=\"288\" src=\"https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2020\/10\/shamroz-qureshi\/baseht-cf-780324-7YRrQmVQ.png\" alt=\"\" class=\"wp-image-68701\"><\/figure>\n<\/div>\n<h5><em>TUNED RANDOM FOREST<\/em><\/h5>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large\"><img data-srcset=\"https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2020\/10\/shamroz-qureshi\/rfrs-cf-107332-81nZEH4d-300x200.png 300w, 
https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2020\/10\/shamroz-qureshi\/rfrs-cf-107332-81nZEH4d.png 432w\" loading=\"lazy\" width=\"432\" height=\"288\" alt=\"\" data-src=\"https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2020\/10\/shamroz-qureshi\/rfrs-cf-107332-81nZEH4d.png\" data-sizes=\"(max-width: 432px) 100vw, 432px\" class=\"wp-image-68700 lazyload\" src=\"image\/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==\"><img loading=\"lazy\" width=\"432\" height=\"288\" src=\"https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2020\/10\/shamroz-qureshi\/rfrs-cf-107332-81nZEH4d.png\" alt=\"\" class=\"wp-image-68700\"><\/figure>\n<\/div>\n<p>Tuning the hyperparameters improved accuracy and precision over our previous models. This was the result of optimizing our model with the help of an exhaustive grid search.<\/p>\n<h3>Conclusion<\/h3>\n<h5><em>MODEL SUMMARY<\/em><\/h5>\n<figure class=\"wp-block-image size-large\"><img data-srcset=\"https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2020\/10\/shamroz-qureshi\/screenshot-2020-10-11-221054-081035-Up5wZl0r-300x69.png 300w, https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2020\/10\/shamroz-qureshi\/screenshot-2020-10-11-221054-081035-Up5wZl0r-600x139.png 600w, https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2020\/10\/shamroz-qureshi\/screenshot-2020-10-11-221054-081035-Up5wZl0r.png 645w\" loading=\"lazy\" width=\"645\" height=\"149\" alt=\"\" data-src=\"https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2020\/10\/shamroz-qureshi\/screenshot-2020-10-11-221054-081035-Up5wZl0r.png\" data-sizes=\"(max-width: 645px) 100vw, 645px\" class=\"wp-image-68070 lazyload\" src=\"image\/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==\"><img loading=\"lazy\" width=\"645\" height=\"149\" src=\"https:\/\/nycdsa-blog-files.s3.us-east-2.amazonaws.com\/2020\/10\/shamroz-qureshi\/screenshot-2020-10-11-221054-081035-Up5wZl0r.png\" alt=\"\" 
class=\"wp-image-68070\"><\/figure>\n<p>We decided to select the best performing decision tree model based on recall: the percentage of all defaults correctly identified by the model. This evaluation metric makes the most sense because of the target imbalance, i.e., the uneven ratio of default to non-default observations. To predict defaults, we decided that we could accept the cost of more false positives in return for reducing the number of false negatives.<\/p>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/nycdatascience.com\/blog\/student-works\/loan-default-detection\/<\/p>\n","protected":false},"author":0,"featured_media":7297,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[2],"tags":[],"_links":{"self":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts\/7296"}],"collection":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/comments?post=7296"}],"version-history":[{"count":0,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts\/7296\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/media\/7297"}],"wp:attachment":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/media?parent=7296"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/categories?post=7296"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/tags?post=7296"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}