In the lending industry, investors provide loans to borrowers in exchange for the promise of repayment with interest. If the borrower repays the loan, then the lender profits from the interest. However, if the borrower is unable to repay the loan, then the lender loses money. Therefore, lenders face the problem of predicting the risk of a borrower being unable to repay a loan.
To address this analysis, I’ll use publicly available data from LendingClub.com, a website that connects borrowers and investors over the Internet. This dataset represents 9,578 3-year loans that were funded through the LendingClub.com platform between May 2007 and February 2010.
The binary dependent variable not.fully.paid indicates that the loan was not paid back in full (the borrower either defaulted or the loan was “charged off,” meaning the borrower was deemed unlikely to ever pay it back).
To predict this dependent variable, I’ll use the following independent variables available to the investor when deciding whether to fund a loan:
- credit.policy: 1 if the customer meets the credit underwriting criteria of LendingClub.com, and 0 otherwise. purpose: The purpose of the loan (takes values “credit_card”, “debt_consolidation”, “educational”, “major_purchase”, “small_business”, and “all_other”).
- int.rate: The interest rate of the loan, as a proportion (a rate of 11% would be stored as 0.11). Borrowers judged by LendingClub.com to be more risky are assigned higher interest rates. installment: The monthly installments ($) owed by the borrower if the loan is funded.
- log.annual.inc: The natural log of the self-reported annual income of the borrower.
- dti: The debt-to-income ratio of the borrower (amount of debt divided by annual income).
- fico: The FICO credit score of the borrower. days.with.cr.line: The number of days the borrower has had a credit line. revol.bal: The borrower’s revolving balance (amount unpaid at the end of the credit card billing cycle).
- revol.util: The borrower’s revolving line utilization rate (the amount of the credit line used relative to total credit available).
- inq.last.6mths: The borrower’s number of inquiries by creditors in the last 6 months.
- delinq.2yrs: The number of times the borrower had been 30+ days past due on a payment in the past 2 years.
- pub.rec: The borrower’s number of derogatory public records (bankruptcy filings, tax liens, or judgments).
Problem 1.1 - Preparing the Dataset
Load the dataset loans.csv into a dataframe called loans, and explore it using the str() and summary() functions.
loans <- read.csv("loans.csv")
str(loans)
'data.frame': 9578 obs. of 14 variables:
$ credit.policy : int 1 1 1 1 1 1 1 1 1 1 ...
$ purpose : Factor w/ 7 levels "all_other","credit_card",..: 3 2 3 3 2 2 3 1 5 3 ...
$ int.rate : num 0.119 0.107 0.136 0.101 0.143 ...
$ installment : num 829 228 367 162 103 ...
$ log.annual.inc : num 11.4 11.1 10.4 11.4 11.3 ...
$ dti : num 19.5 14.3 11.6 8.1 15 ...
$ fico : int 737 707 682 712 667 727 667 722 682 707 ...
$ days.with.cr.line: num 5640 2760 4710 2700 4066 ...
$ revol.bal : int 28854 33623 3511 33667 4740 50807 3839 24220 69909 5630 ...
$ revol.util : num 52.1 76.7 25.6 73.2 39.5 51 76.8 68.6 51.1 23 ...
$ inq.last.6mths : int 0 0 1 1 0 0 0 0 1 1 ...
$ delinq.2yrs : int 0 0 0 0 1 0 0 0 0 0 ...
$ pub.rec : int 0 0 0 0 0 0 1 0 0 0 ...
$ not.fully.paid : int 0 0 0 0 0 0 1 1 0 0 ...
summary(loans)
credit.policy purpose int.rate
Min. :0.000 all_other :2331 Min. :0.0600
1st Qu.:1.000 credit_card :1262 1st Qu.:0.1039
Median :1.000 debt_consolidation:3957 Median :0.1221
Mean :0.805 educational : 343 Mean :0.1226
3rd Qu.:1.000 home_improvement : 629 3rd Qu.:0.1407
Max. :1.000 major_purchase : 437 Max. :0.2164
small_business : 619
installment log.annual.inc dti fico
Min. : 15.67 Min. : 7.548 Min. : 0.000 Min. :612.0
1st Qu.:163.77 1st Qu.:10.558 1st Qu.: 7.213 1st Qu.:682.0
Median :268.95 Median :10.928 Median :12.665 Median :707.0
Mean :319.09 Mean :10.932 Mean :12.607 Mean :710.8
3rd Qu.:432.76 3rd Qu.:11.290 3rd Qu.:17.950 3rd Qu.:737.0
Max. :940.14 Max. :14.528 Max. :29.960 Max. :827.0
NA's :4
days.with.cr.line revol.bal revol.util inq.last.6mths
Min. : 179 Min. : 0 Min. : 0.00 Min. : 0.000
1st Qu.: 2820 1st Qu.: 3187 1st Qu.: 22.70 1st Qu.: 0.000
Median : 4140 Median : 8596 Median : 46.40 Median : 1.000
Mean : 4562 Mean : 16914 Mean : 46.87 Mean : 1.572
3rd Qu.: 5730 3rd Qu.: 18250 3rd Qu.: 71.00 3rd Qu.: 2.000
Max. :17640 Max. :1207359 Max. :119.00 Max. :33.000
NA's :29 NA's :62 NA's :29
delinq.2yrs pub.rec not.fully.paid
Min. : 0.0000 Min. :0.0000 Min. :0.0000
1st Qu.: 0.0000 1st Qu.:0.0000 1st Qu.:0.0000
Median : 0.0000 Median :0.0000 Median :0.0000
Mean : 0.1638 Mean :0.0621 Mean :0.1601
3rd Qu.: 0.0000 3rd Qu.:0.0000 3rd Qu.:0.0000
Max. :13.0000 Max. :5.0000 Max. :1.0000
NA's :29 NA's :29
What proportion of the loans in the dataset were not paid in full?
table(loans$not.fully.paid)
0 1
8045 1533
1533 / (8045 + 1533)
[1] 0.1600543
Problem 1.2 - Preparing the Dataset
Which of the following variables has at least one missing observation?
- log.annual.inc
- days.with.cr.line
- revol.util
- inq.last.6mths
- delinq.2yrs
- pub.rec
Problem 1.3 - Preparing the Dataset
Which of the following is the best reason to fill in the missing values for these variables instead of removing observations with missing data? (Hint: you can use the subset() function to build a dataframe with the observations missing at least one value. To test if a variable, for example pub.rec, is missing a value, use is.na(pub.rec).)
loansNA <- subset(loans, is.na(log.annual.inc) | is.na(days.with.cr.line)
| is.na(revol.util) | is.na(inq.last.6mths)
| is.na(delinq.2yrs) | is.na(pub.rec))
str(loansNA)
'data.frame': 62 obs. of 14 variables:
$ credit.policy : int 1 1 1 1 1 1 1 1 1 1 ...
$ purpose : Factor w/ 7 levels "all_other","credit_card",..: 1 4 3 3 6 2 1 4 1 1 ...
$ int.rate : num 0.113 0.11 0.113 0.123 0.106 ...
$ installment : num 98.7 52.4 263.2 23.4 182.4 ...
$ log.annual.inc : num 10.53 10.53 10.71 9.85 11.26 ...
$ dti : num 7.72 15.84 8.75 12.38 4.26 ...
$ fico : int 677 682 682 662 697 667 687 687 722 752 ...
$ days.with.cr.line: num 1680 1830 2490 1200 4141 ...
$ revol.bal : int 0 0 0 0 0 0 0 0 0 0 ...
$ revol.util : num NA NA NA NA NA NA NA NA NA NA ...
$ inq.last.6mths : int 1 0 1 1 0 0 1 0 1 0 ...
$ delinq.2yrs : int 0 0 1 0 0 0 0 0 0 0 ...
$ pub.rec : int 0 0 0 0 1 0 0 0 0 0 ...
$ not.fully.paid : int 1 0 1 0 0 1 0 0 0 0 ...
summary(loansNA)
credit.policy purpose int.rate
Min. :0.0000 all_other :41 Min. :0.0712
1st Qu.:0.0000 credit_card : 3 1st Qu.:0.0933
Median :0.0000 debt_consolidation: 8 Median :0.1122
Mean :0.3871 educational : 3 Mean :0.1187
3rd Qu.:1.0000 home_improvement : 1 3rd Qu.:0.1456
Max. :1.0000 major_purchase : 5 Max. :0.1913
small_business : 1
installment log.annual.inc dti fico
Min. : 23.35 Min. : 8.294 Min. : 0.000 Min. :642.0
1st Qu.: 78.44 1st Qu.:10.096 1st Qu.: 5.147 1st Qu.:682.0
Median :145.91 Median :10.639 Median :10.000 Median :707.0
Mean :159.19 Mean :10.558 Mean : 9.184 Mean :711.5
3rd Qu.:192.73 3rd Qu.:11.248 3rd Qu.:11.540 3rd Qu.:740.8
Max. :859.57 Max. :13.004 Max. :22.720 Max. :802.0
NA's :4
days.with.cr.line revol.bal revol.util inq.last.6mths
Min. : 179 Min. : 0 Min. : NA Min. :0.000
1st Qu.:1830 1st Qu.: 0 1st Qu.: NA 1st Qu.:0.000
Median :2580 Median : 0 Median : NA Median :1.000
Mean :3158 Mean : 5476 Mean :NaN Mean :1.182
3rd Qu.:4621 3rd Qu.: 0 3rd Qu.: NA 3rd Qu.:2.000
Max. :7890 Max. :290291 Max. : NA Max. :6.000
NA's :29 NA's :62 NA's :29
delinq.2yrs pub.rec not.fully.paid
Min. :0.0000 Min. :0.0000 Min. :0.0000
1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
Median :0.0000 Median :0.0000 Median :0.0000
Mean :0.2121 Mean :0.0303 Mean :0.1935
3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.0000
Max. :4.0000 Max. :1.0000 Max. :1.0000
NA's :29 NA's :29
table(loansNA$not.fully.paid)
0 1
50 12
12 / (50 + 12)
[1] 0.1935484
We want to be able to predict risk for all borrowers, instead of just the ones with all data reported.
Problem 1.4 - Preparing the Dataset
For the rest of this problem, I’ll be using a revised version of the dataset that has the missing values filled in with multiple imputation. To ensure everybody has the same dataframe going forward, you can either run the code below in your R console (if you haven’t already, run the code install.packages(“mice”) first), or you can download and load into R the dataset we created after running the imputation: loans_imputed.csv.
IMPORTANT NOTE: On certain operating systems, the imputation results are not the same even if you set the random seed. If you decide to do the imputation yourself, please still read the provided imputed dataset (loans_imputed.csv) into R and compare your results, using the summary function. If the results are different, please make sure to use the data in loans_imputed.csv for the rest of the problem.
- library(mice)
- set.seed(144)
- vars.for.imputation = setdiff(names(loans), “not.fully.paid”)
- imputed = complete(mice(loans[vars.for.imputation]))
- loans[vars.for.imputation] = imputed
loans <- read.csv("loans_imputed.csv")
Note, that to do this imputation, we set vars.for.imputation to all variables in the dataframe except for not.fully.paid, to impute the values using all of the other independent variables.
What best describes the process we just used to handle missing values? #### We predicted missing variable values using the available independent variables for each observation.
Problem 2.1 - Prediction Models
Now that we have prepared the dataset, we need to split it into a training and testing set. To ensure everybody obtains the same split, set the random seed to 144 (even though you already did so earlier in the problem) and use the sample.split function to select the 70% of observations for the training set (the dependent variable for sample.split is not.fully.paid). Name the dataframes train and test.
set.seed(144)
split = sample.split(loans$not.fully.paid, SplitRatio = 0.7)
train = subset(loans, split == TRUE)
test = subset(loans, split == FALSE)
Now, use logistic regression trained on the training set to predict the dependent variable not.fully.paid using all the independent variables.
LoansLog <- glm(not.fully.paid ~ ., data = train, family = binomial)
Which independent variables are significant in our model? (Significant variables have at least one star, or a Pr(>|z|) value less than 0.05.)
summary(LoansLog)
Call:
glm(formula = not.fully.paid ~ ., family = binomial, data = train)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.2049 -0.6205 -0.4951 -0.3606 2.6397
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 9.187e+00 1.554e+00 5.910 3.42e-09 ***
credit.policy -3.368e-01 1.011e-01 -3.332 0.000861 ***
purposecredit_card -6.141e-01 1.344e-01 -4.568 4.93e-06 ***
purposedebt_consolidation -3.212e-01 9.183e-02 -3.498 0.000469 ***
purposeeducational 1.347e-01 1.753e-01 0.768 0.442201
purposehome_improvement 1.727e-01 1.480e-01 1.167 0.243135
purposemajor_purchase -4.830e-01 2.009e-01 -2.404 0.016203 *
purposesmall_business 4.120e-01 1.419e-01 2.905 0.003678 **
int.rate 6.110e-01 2.085e+00 0.293 0.769446
installment 1.275e-03 2.092e-04 6.093 1.11e-09 ***
log.annual.inc -4.337e-01 7.148e-02 -6.067 1.30e-09 ***
dti 4.638e-03 5.502e-03 0.843 0.399288
fico -9.317e-03 1.710e-03 -5.448 5.08e-08 ***
days.with.cr.line 2.371e-06 1.588e-05 0.149 0.881343
revol.bal 3.085e-06 1.168e-06 2.641 0.008273 **
revol.util 1.839e-03 1.535e-03 1.199 0.230722
inq.last.6mths 8.437e-02 1.600e-02 5.275 1.33e-07 ***
delinq.2yrs -8.320e-02 6.561e-02 -1.268 0.204762
pub.rec 3.300e-01 1.139e-01 2.898 0.003756 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 5896.6 on 6704 degrees of freedom
Residual deviance: 5485.2 on 6686 degrees of freedom
AIC: 5523.2
Number of Fisher Scoring iterations: 5
- credit.policy
- purposecredit_card
- purposedebt_consolidation
- purposemajor_purchase
- purposesmall_business
- installment
- log.annual.inc
- fico
- revol.bal
- inq.last.6mths
- pub.rec
Problem 2.2 - Prediction Models
Consider two loan applications, which are identical other than the fact that the borrower in Application A has FICO credit score 700 while the borrower in Application B has FICO credit score 710.
Let’s Logit(A) be the log odds of loan A not being paid back in full, according to our logistic regression model, and define Logit(B) similarly for loan B.
What is the value of Logit(A) - Logit(B)?
applicationA <- train[1, ]
applicationB <- applicationA
applicationA$fico = 700
applicationB$fico = 710
applicationA
credit.policy purpose int.rate installment log.annual.inc
1 1 debt_consolidation 0.1189 829.1 11.35041
dti fico days.with.cr.line revol.bal revol.util inq.last.6mths
1 19.48 700 5639.958 28854 52.1 0
delinq.2yrs pub.rec not.fully.paid
1 0 0 0
applicationB
credit.policy purpose int.rate installment log.annual.inc
1 1 debt_consolidation 0.1189 829.1 11.35041
dti fico days.with.cr.line revol.bal revol.util inq.last.6mths
1 19.48 710 5639.958 28854 52.1 0
delinq.2yrs pub.rec not.fully.paid
1 0 0 0
applications <- rbind(applicationA, applicationB)
applications
credit.policy purpose int.rate installment log.annual.inc
1 1 debt_consolidation 0.1189 829.1 11.35041
2 1 debt_consolidation 0.1189 829.1 11.35041
dti fico days.with.cr.line revol.bal revol.util inq.last.6mths
1 19.48 700 5639.958 28854 52.1 0
2 19.48 710 5639.958 28854 52.1 0
delinq.2yrs pub.rec not.fully.paid
1 0 0 0
2 0 0 0
PredApplications <- predict(LoansLog, type = "response", newdata = applications)
PredApplications
1 2
0.1828795 0.1693660
PredApplications[1] - PredApplications[2]
1
0.01351347
CalcApplicationA <-
1 / (1 + exp(-1 * (9.187e+00 + -9.317e-03 * 700)))
CalcApplicationB <-
1 / (1 + exp(-1 * (9.187e+00 + -9.317e-03 * 710)))
CalcApplicationA - CalcApplicationB
[1] 0.005902546
Now, let O(A) be the odds of loan A not being paid back in full, according to our logistic regression model, and define O(B) similarly for loan B.
What is the value of O(A)/O(B)? (HINT: Use the mathematical rule that exp(A + B + C) = exp(A)exp(B)exp(C). Also, remember that exp() is the exponential function in R.)
OddsApplicationA <- PredApplications[1] / (1 - PredApplications[1])
OddsApplicationA
1
0.2238097
OddsApplicationB <- PredApplications[2] / (1 - PredApplications[2])
OddsApplicationB
2
0.2038997
OddsApplicationA / OddsApplicationB
1
1.097646
1.097646 => OK!
After computing the logs, try log(odds) for previous problem
log(OddsApplicationA)
1
-1.496959
log(OddsApplicationB)
2
-1.590127
log(OddsApplicationA) - log(OddsApplicationB)
1
0.0931679
0.0931679 => OK!
Problem 2.3 - Prediction Models
Predict the probability of the test-set loans not being paid back in full (remember type=“response” for the predict function). Store these predicted probabilities in a variable named predicted.risk and add it to our test-set (we will use this variable in later parts of the problem). Compute the confusion matrix using a threshold of 0.5.
predicted.risk <- predict(LoansLog, type = "response", newdata = test)
str(predicted.risk)
Named num [1:2873] 0.0771 0.1728 0.1087 0.1016 0.0681 ...
- attr(*, "names")= chr [1:2873] "2" "3" "10" "12" ...
str(test)
'data.frame': 2873 obs. of 14 variables:
$ credit.policy : int 1 1 1 1 1 1 1 1 1 1 ...
$ purpose : Factor w/ 7 levels "all_other","credit_card",..: 2 3 3 3 1 3 1 2 7 2 ...
$ int.rate : num 0.107 0.136 0.122 0.132 0.08 ...
$ installment : num 228.2 366.9 84.1 253.6 188 ...
$ log.annual.inc : num 11.1 10.4 10.2 11.8 11.2 ...
$ dti : num 14.29 11.63 10 9.16 16.08 ...
$ fico : int 707 682 707 662 772 662 772 797 712 682 ...
$ days.with.cr.line: num 2760 4710 2730 4298 4889 ...
$ revol.bal : int 33623 3511 5630 5122 29797 4175 3660 6844 3534 43039 ...
$ revol.util : num 76.7 25.6 23 18.2 23.2 51.5 6.8 14.4 54.4 93.4 ...
$ inq.last.6mths : int 0 1 1 2 1 0 0 0 0 3 ...
$ delinq.2yrs : int 0 0 0 1 0 1 0 0 0 0 ...
$ pub.rec : int 0 0 0 0 0 0 0 0 0 0 ...
$ not.fully.paid : int 0 0 0 0 0 0 0 0 0 0 ...
test$predicted.risk <- predicted.risk
summary(test)
credit.policy purpose int.rate
Min. :0.0000 all_other : 688 Min. :0.0600
1st Qu.:1.0000 credit_card : 411 1st Qu.:0.1028
Median :1.0000 debt_consolidation:1206 Median :0.1221
Mean :0.8047 educational : 93 Mean :0.1225
3rd Qu.:1.0000 home_improvement : 186 3rd Qu.:0.1393
Max. :1.0000 major_purchase : 105 Max. :0.2121
small_business : 184
installment log.annual.inc dti fico
Min. : 15.67 Min. : 8.102 Min. : 0.00 Min. :612.0
1st Qu.:163.57 1st Qu.:10.560 1st Qu.: 7.16 1st Qu.:682.0
Median :267.74 Median :10.933 Median :12.85 Median :707.0
Mean :316.99 Mean :10.928 Mean :12.75 Mean :710.8
3rd Qu.:421.89 3rd Qu.:11.290 3rd Qu.:18.30 3rd Qu.:737.0
Max. :926.83 Max. :13.459 Max. :29.96 Max. :822.0
days.with.cr.line revol.bal revol.util inq.last.6mths
Min. : 179 Min. : 0 Min. : 0.00 Min. : 0.000
1st Qu.: 2795 1st Qu.: 3362 1st Qu.: 23.40 1st Qu.: 0.000
Median : 4140 Median : 8712 Median : 46.90 Median : 1.000
Mean : 4494 Mean : 17198 Mean : 47.02 Mean : 1.576
3rd Qu.: 5670 3rd Qu.: 18728 3rd Qu.: 70.40 3rd Qu.: 2.000
Max. :17640 Max. :1207359 Max. :108.80 Max. :24.000
delinq.2yrs pub.rec not.fully.paid predicted.risk
Min. : 0.0000 Min. :0.00000 Min. :0.0000 Min. :0.02114
1st Qu.: 0.0000 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.09461
Median : 0.0000 Median :0.00000 Median :0.0000 Median :0.13697
Mean : 0.1605 Mean :0.05743 Mean :0.1601 Mean :0.15785
3rd Qu.: 0.0000 3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:0.19658
Max. :13.0000 Max. :3.00000 Max. :1.0000 Max. :0.95373
table(test$not.fully.paid, test$predicted.risk > 0.5)
FALSE TRUE
0 2400 13
1 457 3
What is the accuracy of the logistic regression model?
(2400 + 3) / nrow(test)
[1] 0.8364079
What is the accuracy of the baseline model?
table(test$not.fully.paid)
0 1
2413 460
2413 / (2413 + 460)
[1] 0.8398886
Problem 2.4 - Prediction Models
Use the ROCR package to compute the test-set AUC.
ROCRpred = prediction(test$predicted.risk, test$not.fully.paid)
as.numeric(performance(ROCRpred, "auc")@y.values)
[1] 0.6720995
The model has poor accuracy at the threshold 0.5. But, despite the poor accuracy, we will see later how an investor can still leverage this logistic regression model to make profitable investments.
Problem 3.1 - A “Smart Baseline”
In the previous problem, we built a logistic regression model that has an AUC significantly higher than the AUC of 0.5 that would be obtained by randomly ordering observations. However, LendingClub.com assigns the interest rate to a loan based on their estimate of that loan’s risk.
This variable, int.rate, is an independent variable in our dataset. In this part, we will investigate using the loan’s interest rate as a “smart baseline” to order the loans according to risk.
Using the training set, build a bivariate logistic regression model (aka a logistic regression model with a single independent variable) that predicts the dependent variable not.fully.paid using only the variable int.rate.
LoansLog2 <- glm(not.fully.paid ~ int.rate, data = train, family = binomial)
summary(LoansLog2)
Call:
glm(formula = not.fully.paid ~ int.rate, family = binomial, data = train)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.0547 -0.6271 -0.5442 -0.4361 2.2914
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.6726 0.1688 -21.76 <2e-16 ***
int.rate 15.9214 1.2702 12.54 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 5896.6 on 6704 degrees of freedom
Residual deviance: 5734.8 on 6703 degrees of freedom
AIC: 5738.8
Number of Fisher Scoring iterations: 4
The variable int.rate is highly significant in the bivariate model, but it is not significant at the 0.05 level in the model trained with all the independent variables.
What is the most likely explanation for this difference? #### int.rate is correlated with other risk-related variables, and therefore does not incrementally improve the model when those other variables are included.
Problem 3.2 - A “Smart Baseline”
Make test-set predictions for the bivariate model.
What is the highest predicted probability of a loan not being paid in full on the testing set?
predicted.risk2 <- predict(LoansLog2, type = "response", newdata = test)
max(predicted.risk2)
[1] 0.426624
0.426624
With a logistic regression cut-off of 0.5, how many loans would be predicted as not being paid in full on the testing set?
table(test$not.fully.paid, predicted.risk2 > 0.5)
FALSE
0 2413
1 460
str(test)
'data.frame': 2873 obs. of 15 variables:
$ credit.policy : int 1 1 1 1 1 1 1 1 1 1 ...
$ purpose : Factor w/ 7 levels "all_other","credit_card",..: 2 3 3 3 1 3 1 2 7 2 ...
$ int.rate : num 0.107 0.136 0.122 0.132 0.08 ...
$ installment : num 228.2 366.9 84.1 253.6 188 ...
$ log.annual.inc : num 11.1 10.4 10.2 11.8 11.2 ...
$ dti : num 14.29 11.63 10 9.16 16.08 ...
$ fico : int 707 682 707 662 772 662 772 797 712 682 ...
$ days.with.cr.line: num 2760 4710 2730 4298 4889 ...
$ revol.bal : int 33623 3511 5630 5122 29797 4175 3660 6844 3534 43039 ...
$ revol.util : num 76.7 25.6 23 18.2 23.2 51.5 6.8 14.4 54.4 93.4 ...
$ inq.last.6mths : int 0 1 1 2 1 0 0 0 0 3 ...
$ delinq.2yrs : int 0 0 0 1 0 1 0 0 0 0 ...
$ pub.rec : int 0 0 0 0 0 0 0 0 0 0 ...
$ not.fully.paid : int 0 0 0 0 0 0 0 0 0 0 ...
$ predicted.risk : num 0.0771 0.1728 0.1087 0.1016 0.0681 ...
0
Problem 3.3 - A “Smart Baseline”
What is the test-set AUC of the bivariate model?
ROCRpred2 = prediction(predicted.risk2, test$not.fully.paid)
as.numeric(performance(ROCRpred2, "auc")@y.values)
[1] 0.6239081
Problem 4.1 - Computing the Profitability of an Investment
While thus far we have predicted if a loan will be paid back or not, an investor needs to identify loans that are expected to be profitable.
If the loan is paid back in full, then the investor makes interest on the loan. However, if the loan is not paid back, the investor loses the money invested. Therefore, the investor should seek loans that best balance this risk and reward.
To compute interest revenue, consider a $c investment in a loan that has an annual interest rate r over a period of t years.
Using continuous compounding of interest, this investment pays back c * exp(rt) dollars by the end of the t years, where exp(rt) is e raised to the r*t power.
How much does a $10 investment with an annual interest rate of 6% pay back after 3 years, using continuous compounding of interest? Hint: remember to convert the percentage to a proportion before doing the math.
10 * exp(0.06 * 3)
[1] 11.97217
Problem 4.2 - Computing the Profitability of an Investment
While the investment has value c * exp(rt) dollars after collecting interest, the investor had to pay $c for the investment.
What is the profit to the investor if the investment is paid back in full?
10 * exp(0.06 * 3) - 10
[1] 1.972174
Problem 4.3 - Computing the Profitability of an Investment
Now, consider the case where the investor made a $c investment, but it was not paid back in full. Assume, conservatively, that no money was received from the borrower (often a lender will receive some but not all of the value of the loan, making this a pessimistic assumption of how much is received).
What is the loss to the investor in this scenario? #### -10
Problem 5.1 - A Simple Investment Strategy
In the previous subproblem, we concluded that an investor who invested c dollars in a loan with interest rate r for t years makes c * (exp(rt) - 1) dollars of profit if the loan is paid back in full and -c dollars of profit if the loan is not paid back in full (pessimistically).
In order to evaluate the quality of an investment strategy, we need to compute this profit for each loan in the test-set.
For this variable, we will assume a $1 investment (aka c=1). To create the variable, we first assign to the profit for a fully paid loan, exp(rt)-1, to every observation, and we then replace this value with -1 in the cases where the loan was not paid in full.
All the loans in our dataset are 3-year loans, meaning t=3 in our calculations.
test$profit = exp(test$int.rate*3) - 1
test$profit[test$not.fully.paid == 1] = -1
What is the maximum profit of a $10 investment in any loan in the testing set?
max(test$profit) * 10
[1] 8.894769
Problem 6.1 - An Investment Strategy Based on Risk
A simple investment strategy of equally investing in all the loans would yield profit $20.94 for a $100 investment. But this simple investment strategy does not leverage the prediction model we built earlier in this problem.
As stated earlier, investors seek loans that balance reward with risk, in that they simultaneously have high interest rates and a low risk of not being paid back.
To meet this objective, I’ll analyze an investment strategy in which the investor only purchases loans with a high interest rate (a rate of at least 15%), but amongst these loans selects the ones with the lowest predicted risk of not being paid back in full.
We will model an investor who invests $1 in each of the most promising 100 loans.
First, use the subset() function to build a dataframe called highInterest consisting of the test-set loans with an interest rate of at least 15%.
highInterest <- subset(test, int.rate >= 0.15)
What is the average profit of a $1 investment in one of these high-interest loans?
mean(highInterest$profit)
[1] 0.2251015
What proportion of the high-interest loans were not paid back in full?
table(highInterest$not.fully.paid)
0 1
327 110
110 / (327 + 110)
[1] 0.2517162
Problem 6.2 - An Investment Strategy Based on Risk
Next, I’ll determine the 100th smallest predicted probability of not paying in full by sorting the predicted risks in increasing order and selecting the 100th element of this sorted list.
cutoff = sort(highInterest$predicted.risk, decreasing=FALSE)[100]
cutoff
[1] 0.1763305
Use the subset() function to build a dataframe called selectedLoans consisting of the high-interest loans with predicted risk not exceeding the cut-off we just computed. Check to make sure you have selected 100 loans for investment.
selectedLoans <- subset(highInterest, highInterest$predicted.risk <= cutoff)
summary(selectedLoans)
credit.policy purpose int.rate installment
Min. :0.00 all_other : 8 Min. :0.1501 Min. : 48.79
1st Qu.:1.00 credit_card :17 1st Qu.:0.1533 1st Qu.:176.16
Median :1.00 debt_consolidation:60 Median :0.1570 Median :309.37
Mean :0.93 educational : 1 Mean :0.1610 Mean :358.30
3rd Qu.:1.00 home_improvement : 1 3rd Qu.:0.1645 3rd Qu.:473.10
Max. :1.00 major_purchase : 7 Max. :0.2052 Max. :907.60
small_business : 6
log.annual.inc dti fico days.with.cr.line
Min. : 9.575 Min. : 0.00 Min. :642.0 Min. : 1140
1st Qu.:10.776 1st Qu.: 6.05 1st Qu.:662.0 1st Qu.: 2162
Median :11.127 Median :12.35 Median :672.0 Median : 3630
Mean :11.203 Mean :12.19 Mean :680.5 Mean : 3911
3rd Qu.:11.670 3rd Qu.:18.23 3rd Qu.:692.0 3rd Qu.: 5010
Max. :13.305 Max. :28.15 Max. :782.0 Max. :13170
revol.bal revol.util inq.last.6mths delinq.2yrs
Min. : 0 Min. : 0.00 Min. : 0.00 Min. :0.00
1st Qu.: 3768 1st Qu.:45.92 1st Qu.: 0.00 1st Qu.:0.00
Median : 9691 Median :71.65 Median : 0.00 Median :0.00
Mean : 19923 Mean :65.79 Mean : 0.89 Mean :0.33
3rd Qu.: 24534 3rd Qu.:93.80 3rd Qu.: 1.00 3rd Qu.:0.00
Max. :168496 Max. :99.70 Max. :10.00 Max. :4.00
pub.rec not.fully.paid predicted.risk profit
Min. :0.00 Min. :0.00 Min. :0.06871 Min. :-1.0000
1st Qu.:0.00 1st Qu.:0.00 1st Qu.:0.13596 1st Qu.: 0.5823
Median :0.00 Median :0.00 Median :0.15327 Median : 0.5992
Mean :0.03 Mean :0.19 Mean :0.14794 Mean : 0.3128
3rd Qu.:0.00 3rd Qu.:0.00 3rd Qu.:0.16514 3rd Qu.: 0.6317
Max. :1.00 Max. :1.00 Max. :0.17633 Max. : 0.8508
str(selectedLoans)
'data.frame': 100 obs. of 16 variables:
$ credit.policy : int 1 1 1 1 1 1 1 1 1 1 ...
$ purpose : Factor w/ 7 levels "all_other","credit_card",..: 7 2 3 1 3 5 2 3 2 3 ...
$ int.rate : num 0.15 0.153 0.158 0.159 0.156 ...
$ installment : num 225 444 420 246 245 ...
$ log.annual.inc : num 12.3 11 11.5 11.5 10.8 ...
$ dti : num 6.45 19.52 18.55 24.19 2.72 ...
$ fico : int 677 667 667 667 672 702 667 672 662 682 ...
$ days.with.cr.line: num 6240 2701 4560 5376 3010 ...
$ revol.bal : int 56411 33074 34841 590 3273 4980 15977 16473 22783 87502 ...
$ revol.util : num 75.3 68.8 89.6 84.3 69.6 55.3 83.6 94.1 93.7 96.4 ...
$ inq.last.6mths : int 0 2 0 0 1 1 0 2 3 0 ...
$ delinq.2yrs : int 0 0 0 0 0 0 0 2 1 1 ...
$ pub.rec : int 0 0 0 0 0 0 0 0 0 0 ...
$ not.fully.paid : int 1 0 0 0 1 0 0 0 0 0 ...
$ predicted.risk : num 0.164 0.169 0.158 0.162 0.147 ...
$ profit : num -1 0.584 0.604 0.61 -1 ...
What is the profit of the investor, who invested $1 in each of these 100 loans?
sum(selectedLoans$profit)
[1] 31.27825
How many of 100 selected loans were not paid back in full?
table(selectedLoans$not.fully.paid)
0 1
81 19
19
Conclusion
We have now seen how analytics can be used to select a subset of the high-interest loans that were paid back at only a slightly lower rate than average, resulting in a significant increase in the profit from our investor’s $100 investment. Although the logistic regression models developed in this analysis did not have large AUC values, we see that they still provided the edge needed to improve the profitability of an investment portfolio.
We conclude with a note of warning. Throughout this analysis I’ve assume that the loans we invest in will perform in the same way as the loans we used to train our model, even though our training set covers a relatively short period of time. If there is an economic shock like a large financial downturn, default rates might be significantly higher than those observed in the training set and we might end up losing money instead of profiting. Investors must pay careful attention to such risk when making investment decisions.