
Introduction
Lending Club is an American peer-to-peer lending company: borrowers set up a profile and request a loan from investors who are willing to fund it. Loan amounts range from $1,000 to $40,000, and loans are repaid over either 36 or 60 months. In this transaction, Lending Club acts as a platform, filtering out bad borrower profiles and assigning grades to the rest, and then acts as a servicer once the transaction is finalized.
In this project, my team and I built a machine learning model that predicts which loans will default based on the borrower’s profile, with the goal of increasing the return on investment for investors.
Data
Size
We started with a dataset that contains all the loans issued on the Lending Club platform from Q1 2007 to Q4 2018. This amounts to 2.6 million rows and 153 columns that describe loan origination and performance.
Imbalanced Dataset
As expected, the dataset is imbalanced: loans that default are the minority class. If an investor invests blindly in a random selection of loans, we expect 14.4% of them to default. From this, we can calculate the expected returns, which are discussed more closely in later sections. The breakdown is shown here:
Methodology
After identifying the features that were available to investors at loan origination, we performed feature engineering to better suit the models that we wanted to run.
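Restricting the data to origination-time features amounts to dropping every column that is only known after the loan is issued. Below is a minimal sketch of that step, assuming the data sits in a pandas DataFrame. The column names and the `origination_features` helper are illustrative assumptions, not the exact list or code used in this project.

```python
import pandas as pd

# Hypothetical examples of columns that are only known after a loan is issued.
# The real list depends on the dataset's data dictionary.
POST_ORIGINATION_COLUMNS = ["total_pymnt", "recoveries", "last_pymnt_d"]


def origination_features(loans: pd.DataFrame, target: str = "loan_status"):
    """Split a loan table into origination-time features and the target."""
    leaky = [col for col in POST_ORIGINATION_COLUMNS if col in loans.columns]
    features = loans.drop(columns=leaky + [target])
    return features, loans[target]


# Tiny made-up table, for illustration only.
loans = pd.DataFrame(
    {
        "loan_amnt": [10000, 5000, 20000],
        "int_rate": [11.5, 7.9, 18.2],
        "grade": ["B", "A", "D"],
        "total_pymnt": [11200.0, 5400.0, 6100.0],
        "recoveries": [0.0, 0.0, 350.0],
        "last_pymnt_d": ["Jan-2018", "Mar-2017", "Aug-2016"],
        "loan_status": ["Fully Paid", "Fully Paid", "Default"],
    }
)

X, y = origination_features(loans)
print(list(X.columns))
```

Keeping the post-origination columns in a single explicit list makes it easy to audit for leakage: any column that describes payments or recoveries would let a model "predict" an outcome it has effectively already seen.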
Each model had specific requirements to perform well. For example, for Logistic Regression we upsampled the minority class, and for Linear Discriminant Analysis we changed the class priors.
The models that we ended up using are Logistic Regression, Linear Discriminant Analysis, Random Forest, and CatBoost. They each performed differently, and the confusion matrices are shown below:
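The two adjustments mentioned above can be sketched with scikit-learn as follows. This is an illustration on synthetic data, not the project's actual pipeline: the feature matrix, the 50/50 target balance, and the equal priors are all assumptions chosen for the example.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample

rng = np.random.default_rng(0)

# Synthetic stand-in for the loan data: roughly 14% of rows are defaults (label 1).
n = 2000
y = (rng.random(n) < 0.144).astype(int)
X = rng.normal(size=(n, 3)) + 0.8 * y[:, None]  # defaults are shifted slightly

# Logistic Regression: upsample the minority class (sampling with replacement)
# until both classes have the same number of rows.
# In practice this should be applied to the training split only.
X_maj, X_min = X[y == 0], X[y == 1]
X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=0)
X_bal = np.vstack([X_maj, X_min_up])
y_bal = np.concatenate(
    [np.zeros(len(X_maj), dtype=int), np.ones(len(X_min_up), dtype=int)]
)
log_reg = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)

# Linear Discriminant Analysis: keep the original rows but override the class
# priors instead of letting them be estimated from the imbalanced data.
lda = LinearDiscriminantAnalysis(priors=[0.5, 0.5]).fit(X, y)
```

Both approaches push the decision boundary toward the majority class so that more defaults are caught. Upsampling changes the data the model sees, while changing the priors changes only the model's assumption about class frequencies.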
Logistic Regression
Linear Discriminant Analysis
As expected, the models with linear decision boundaries performed similarly.
Random Forest
Random Forest slightly outperformed Logistic Regression and Linear Discriminant Analysis.
CatBoost
CatBoost outperformed all the other models significantly.
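For readers who want to reproduce this kind of comparison, a confusion matrix for a two-class loan problem can be computed as in the sketch below. The labels and predictions are made up for illustration, and treating "Default" as the positive class is an assumption.

```python
from collections import Counter


def confusion_counts(y_true, y_pred, positive="Default"):
    """Count true/false positives and negatives for a binary problem."""
    counts = Counter(
        (truth == positive, pred == positive) for truth, pred in zip(y_true, y_pred)
    )
    return {
        "tp": counts[(True, True)],    # defaults correctly flagged
        "fn": counts[(True, False)],   # defaults predicted as Fully Paid
        "fp": counts[(False, True)],   # good loans flagged as defaults
        "tn": counts[(False, False)],  # good loans correctly passed
    }


# Made-up labels, for illustration only.
y_true = ["Fully Paid", "Fully Paid", "Default", "Default", "Fully Paid", "Default"]
y_pred = ["Fully Paid", "Default", "Default", "Fully Paid", "Fully Paid", "Default"]

cm = confusion_counts(y_true, y_pred)
recall = cm["tp"] / (cm["tp"] + cm["fn"])      # share of defaults that were caught
precision = cm["tp"] / (cm["tp"] + cm["fp"])   # share of flagged loans that defaulted
print(cm, recall, precision)
```

On an imbalanced dataset, recall and precision on the default class are more informative than raw accuracy, since a model that predicts "Fully Paid" for every loan would still be right most of the time.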
Return On Investment
When dealing with investment strategies, getting a decent confusion matrix is not enough. Naturally, higher-risk loans provide higher returns. Unfortunately, all the models except CatBoost turned out to be fairly risk-averse and would not predict those higher-risk loans as Fully Paid.
Had we invested blindly (Null Model), we would expect a Compounded Annual Return of 4.57%. If we predicted the class of each loan with 100% accuracy, our (idealized) return would be 8.12%. Let’s examine the returns for each model:
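As a point of reference, the sketch below shows one common way to compute a compounded annual return for a portfolio of loans. It is an assumption about the calculation, not necessarily the formula used for the figures above: it treats each portfolio as a single lump sum invested and received back over a shared term, and all numbers are made up.

```python
def compound_annual_return(invested, received, years):
    """Annualized return r such that invested * (1 + r) ** years == received."""
    if invested <= 0 or years <= 0:
        raise ValueError("invested and years must be positive")
    return (received / invested) ** (1.0 / years) - 1.0


def portfolio_return(loans, years):
    """Compounded annual return of a set of loans that share the same term."""
    invested = sum(loan["funded"] for loan in loans)
    received = sum(loan["received"] for loan in loans)
    return compound_annual_return(invested, received, years)


# Made-up 36-month loans: two are repaid in full, one defaults early.
loans = [
    {"funded": 1000.0, "received": 1157.625, "predicted": "Fully Paid"},
    {"funded": 1000.0, "received": 400.0, "predicted": "Default"},
    {"funded": 1000.0, "received": 1157.625, "predicted": "Fully Paid"},
]

# Null model: invest in every loan.
blind = portfolio_return(loans, years=3)

# Model-driven strategy: invest only in loans predicted to be Fully Paid.
chosen = [loan for loan in loans if loan["predicted"] == "Fully Paid"]
selected = portfolio_return(chosen, years=3)

print(f"blind: {blind:.2%}, selected: {selected:.2%}")
```

A lump-sum calculation like this ignores the timing of monthly payments and reinvestment, so it is only a rough stand-in for a cash-flow-based return.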
Conclusion
We would like to point out that there are many investment opportunities with lower risk and potentially higher returns.
Therefore, we would recommend that investors steer clear of Lending Club. Furthermore, Lending Club has terminated its peer-to-peer platform and now operates as a regular bank.