Predict Loan Repayments

In the lending industry, investors provide loans to borrowers in exchange for the promise of repayment with interest. If the borrower repays the loan, then the lender profits from the interest. However, if the borrower is unable to repay the loan, then the lender loses money. Therefore, lenders face the problem of predicting the risk of a borrower being unable to repay a loan.

To address this analysis, I’ll use publicly available data from LendingClub.com, a website that connects borrowers and investors over the Internet. This dataset represents 9,578 3-year loans that were funded through the LendingClub.com platform between May 2007 and February 2010.

The binary dependent variable not.fully.paid indicates that the loan was not paid back in full (the borrower either defaulted or the loan was “charged off,” meaning the borrower was deemed unlikely to ever pay it back).

To predict this dependent variable, I’ll use the following independent variables available to the investor when deciding whether to fund a loan:

  • credit.policy: 1 if the customer meets the credit underwriting criteria of LendingClub.com, and 0 otherwise. purpose: The purpose of the loan (takes values “credit_card”, “debt_consolidation”, “educational”, “major_purchase”, “small_business”, and “all_other”).
  • int.rate: The interest rate of the loan, as a proportion (a rate of 11% would be stored as 0.11). Borrowers judged by LendingClub.com to be more risky are assigned higher interest rates. installment: The monthly installments ($) owed by the borrower if the loan is funded.
  • log.annual.inc: The natural log of the self-reported annual income of the borrower.
  • dti: The debt-to-income ratio of the borrower (amount of debt divided by annual income).
  • fico: The FICO credit score of the borrower. days.with.cr.line: The number of days the borrower has had a credit line. revol.bal: The borrower’s revolving balance (amount unpaid at the end of the credit card billing cycle).
  • revol.util: The borrower’s revolving line utilization rate (the amount of the credit line used relative to total credit available).
  • inq.last.6mths: The borrower’s number of inquiries by creditors in the last 6 months.
  • delinq.2yrs: The number of times the borrower had been 30+ days past due on a payment in the past 2 years.
  • pub.rec: The borrower’s number of derogatory public records (bankruptcy filings, tax liens, or judgments).

Problem 1.1 - Preparing the Dataset

Load the dataset loans.csv into a dataframe called loans, and explore it using the str() and summary() functions.

loans <- read.csv("loans.csv")
str(loans)
'data.frame':   9578 obs. of  14 variables:
 $ credit.policy    : int  1 1 1 1 1 1 1 1 1 1 ...
 $ purpose          : Factor w/ 7 levels "all_other","credit_card",..: 3 2 3 3 2 2 3 1 5 3 ...
 $ int.rate         : num  0.119 0.107 0.136 0.101 0.143 ...
 $ installment      : num  829 228 367 162 103 ...
 $ log.annual.inc   : num  11.4 11.1 10.4 11.4 11.3 ...
 $ dti              : num  19.5 14.3 11.6 8.1 15 ...
 $ fico             : int  737 707 682 712 667 727 667 722 682 707 ...
 $ days.with.cr.line: num  5640 2760 4710 2700 4066 ...
 $ revol.bal        : int  28854 33623 3511 33667 4740 50807 3839 24220 69909 5630 ...
 $ revol.util       : num  52.1 76.7 25.6 73.2 39.5 51 76.8 68.6 51.1 23 ...
 $ inq.last.6mths   : int  0 0 1 1 0 0 0 0 1 1 ...
 $ delinq.2yrs      : int  0 0 0 0 1 0 0 0 0 0 ...
 $ pub.rec          : int  0 0 0 0 0 0 1 0 0 0 ...
 $ not.fully.paid   : int  0 0 0 0 0 0 1 1 0 0 ...
summary(loans)
 credit.policy                 purpose        int.rate     
 Min.   :0.000   all_other         :2331   Min.   :0.0600  
 1st Qu.:1.000   credit_card       :1262   1st Qu.:0.1039  
 Median :1.000   debt_consolidation:3957   Median :0.1221  
 Mean   :0.805   educational       : 343   Mean   :0.1226  
 3rd Qu.:1.000   home_improvement  : 629   3rd Qu.:0.1407  
 Max.   :1.000   major_purchase    : 437   Max.   :0.2164  
                 small_business    : 619                   
  installment     log.annual.inc        dti              fico      
 Min.   : 15.67   Min.   : 7.548   Min.   : 0.000   Min.   :612.0  
 1st Qu.:163.77   1st Qu.:10.558   1st Qu.: 7.213   1st Qu.:682.0  
 Median :268.95   Median :10.928   Median :12.665   Median :707.0  
 Mean   :319.09   Mean   :10.932   Mean   :12.607   Mean   :710.8  
 3rd Qu.:432.76   3rd Qu.:11.290   3rd Qu.:17.950   3rd Qu.:737.0  
 Max.   :940.14   Max.   :14.528   Max.   :29.960   Max.   :827.0  
                  NA's   :4                                        
 days.with.cr.line   revol.bal         revol.util     inq.last.6mths  
 Min.   :  179     Min.   :      0   Min.   :  0.00   Min.   : 0.000  
 1st Qu.: 2820     1st Qu.:   3187   1st Qu.: 22.70   1st Qu.: 0.000  
 Median : 4140     Median :   8596   Median : 46.40   Median : 1.000  
 Mean   : 4562     Mean   :  16914   Mean   : 46.87   Mean   : 1.572  
 3rd Qu.: 5730     3rd Qu.:  18250   3rd Qu.: 71.00   3rd Qu.: 2.000  
 Max.   :17640     Max.   :1207359   Max.   :119.00   Max.   :33.000  
 NA's   :29                          NA's   :62       NA's   :29      
  delinq.2yrs         pub.rec       not.fully.paid  
 Min.   : 0.0000   Min.   :0.0000   Min.   :0.0000  
 1st Qu.: 0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
 Median : 0.0000   Median :0.0000   Median :0.0000  
 Mean   : 0.1638   Mean   :0.0621   Mean   :0.1601  
 3rd Qu.: 0.0000   3rd Qu.:0.0000   3rd Qu.:0.0000  
 Max.   :13.0000   Max.   :5.0000   Max.   :1.0000  
 NA's   :29        NA's   :29                       

What proportion of the loans in the dataset were not paid in full?

table(loans$not.fully.paid)

   0    1 
8045 1533 
1533 / (8045 + 1533)
[1] 0.1600543

Problem 1.2 - Preparing the Dataset

Which of the following variables has at least one missing observation?

  • log.annual.inc
  • days.with.cr.line
  • revol.util
  • inq.last.6mths
  • delinq.2yrs
  • pub.rec

Problem 1.3 - Preparing the Dataset

Which of the following is the best reason to fill in the missing values for these variables instead of removing observations with missing data? (Hint: you can use the subset() function to build a dataframe with the observations missing at least one value. To test if a variable, for example pub.rec, is missing a value, use is.na(pub.rec).)

loansNA <- subset(loans, is.na(log.annual.inc) | is.na(days.with.cr.line)
                  | is.na(revol.util) | is.na(inq.last.6mths)
                  | is.na(delinq.2yrs) | is.na(pub.rec))
str(loansNA)
'data.frame':   62 obs. of  14 variables:
 $ credit.policy    : int  1 1 1 1 1 1 1 1 1 1 ...
 $ purpose          : Factor w/ 7 levels "all_other","credit_card",..: 1 4 3 3 6 2 1 4 1 1 ...
 $ int.rate         : num  0.113 0.11 0.113 0.123 0.106 ...
 $ installment      : num  98.7 52.4 263.2 23.4 182.4 ...
 $ log.annual.inc   : num  10.53 10.53 10.71 9.85 11.26 ...
 $ dti              : num  7.72 15.84 8.75 12.38 4.26 ...
 $ fico             : int  677 682 682 662 697 667 687 687 722 752 ...
 $ days.with.cr.line: num  1680 1830 2490 1200 4141 ...
 $ revol.bal        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ revol.util       : num  NA NA NA NA NA NA NA NA NA NA ...
 $ inq.last.6mths   : int  1 0 1 1 0 0 1 0 1 0 ...
 $ delinq.2yrs      : int  0 0 1 0 0 0 0 0 0 0 ...
 $ pub.rec          : int  0 0 0 0 1 0 0 0 0 0 ...
 $ not.fully.paid   : int  1 0 1 0 0 1 0 0 0 0 ...
summary(loansNA)
 credit.policy                  purpose      int.rate     
 Min.   :0.0000   all_other         :41   Min.   :0.0712  
 1st Qu.:0.0000   credit_card       : 3   1st Qu.:0.0933  
 Median :0.0000   debt_consolidation: 8   Median :0.1122  
 Mean   :0.3871   educational       : 3   Mean   :0.1187  
 3rd Qu.:1.0000   home_improvement  : 1   3rd Qu.:0.1456  
 Max.   :1.0000   major_purchase    : 5   Max.   :0.1913  
                  small_business    : 1                   
  installment     log.annual.inc        dti              fico      
 Min.   : 23.35   Min.   : 8.294   Min.   : 0.000   Min.   :642.0  
 1st Qu.: 78.44   1st Qu.:10.096   1st Qu.: 5.147   1st Qu.:682.0  
 Median :145.91   Median :10.639   Median :10.000   Median :707.0  
 Mean   :159.19   Mean   :10.558   Mean   : 9.184   Mean   :711.5  
 3rd Qu.:192.73   3rd Qu.:11.248   3rd Qu.:11.540   3rd Qu.:740.8  
 Max.   :859.57   Max.   :13.004   Max.   :22.720   Max.   :802.0  
                  NA's   :4                                        
 days.with.cr.line   revol.bal        revol.util  inq.last.6mths 
 Min.   : 179      Min.   :     0   Min.   : NA   Min.   :0.000  
 1st Qu.:1830      1st Qu.:     0   1st Qu.: NA   1st Qu.:0.000  
 Median :2580      Median :     0   Median : NA   Median :1.000  
 Mean   :3158      Mean   :  5476   Mean   :NaN   Mean   :1.182  
 3rd Qu.:4621      3rd Qu.:     0   3rd Qu.: NA   3rd Qu.:2.000  
 Max.   :7890      Max.   :290291   Max.   : NA   Max.   :6.000  
 NA's   :29                         NA's   :62    NA's   :29     
  delinq.2yrs        pub.rec       not.fully.paid  
 Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
 1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
 Median :0.0000   Median :0.0000   Median :0.0000  
 Mean   :0.2121   Mean   :0.0303   Mean   :0.1935  
 3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0.0000  
 Max.   :4.0000   Max.   :1.0000   Max.   :1.0000  
 NA's   :29       NA's   :29                       
table(loansNA$not.fully.paid)

 0  1 
50 12 
12 / (50 + 12)
[1] 0.1935484

We want to be able to predict risk for all borrowers, instead of just the ones with all data reported.

Problem 1.4 - Preparing the Dataset

For the rest of this problem, I’ll be using a revised version of the dataset that has the missing values filled in with multiple imputation. To ensure everybody has the same dataframe going forward, you can either run the code below in your R console (if you haven’t already, run the code install.packages(“mice”) first), or you can download and load into R the dataset we created after running the imputation: loans_imputed.csv.

IMPORTANT NOTE: On certain operating systems, the imputation results are not the same even if you set the random seed. If you decide to do the imputation yourself, please still read the provided imputed dataset (loans_imputed.csv) into R and compare your results, using the summary function. If the results are different, please make sure to use the data in loans_imputed.csv for the rest of the problem.

  • library(mice)
  • set.seed(144)
  • vars.for.imputation = setdiff(names(loans), “not.fully.paid”)
  • imputed = complete(mice(loans[vars.for.imputation]))
  • loans[vars.for.imputation] = imputed
loans <- read.csv("loans_imputed.csv")

Note, that to do this imputation, we set vars.for.imputation to all variables in the dataframe except for not.fully.paid, to impute the values using all of the other independent variables.

What best describes the process we just used to handle missing values? #### We predicted missing variable values using the available independent variables for each observation.

Problem 2.1 - Prediction Models

Now that we have prepared the dataset, we need to split it into a training and testing set. To ensure everybody obtains the same split, set the random seed to 144 (even though you already did so earlier in the problem) and use the sample.split function to select the 70% of observations for the training set (the dependent variable for sample.split is not.fully.paid). Name the dataframes train and test.

set.seed(144)
split = sample.split(loans$not.fully.paid, SplitRatio = 0.7)
train = subset(loans, split == TRUE)
test = subset(loans, split == FALSE)

Now, use logistic regression trained on the training set to predict the dependent variable not.fully.paid using all the independent variables.

LoansLog <- glm(not.fully.paid ~ ., data = train, family = binomial)

Which independent variables are significant in our model? (Significant variables have at least one star, or a Pr(>|z|) value less than 0.05.)

summary(LoansLog)

Call:
glm(formula = not.fully.paid ~ ., family = binomial, data = train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.2049  -0.6205  -0.4951  -0.3606   2.6397  

Coefficients:
                            Estimate Std. Error z value Pr(>|z|)    
(Intercept)                9.187e+00  1.554e+00   5.910 3.42e-09 ***
credit.policy             -3.368e-01  1.011e-01  -3.332 0.000861 ***
purposecredit_card        -6.141e-01  1.344e-01  -4.568 4.93e-06 ***
purposedebt_consolidation -3.212e-01  9.183e-02  -3.498 0.000469 ***
purposeeducational         1.347e-01  1.753e-01   0.768 0.442201    
purposehome_improvement    1.727e-01  1.480e-01   1.167 0.243135    
purposemajor_purchase     -4.830e-01  2.009e-01  -2.404 0.016203 *  
purposesmall_business      4.120e-01  1.419e-01   2.905 0.003678 ** 
int.rate                   6.110e-01  2.085e+00   0.293 0.769446    
installment                1.275e-03  2.092e-04   6.093 1.11e-09 ***
log.annual.inc            -4.337e-01  7.148e-02  -6.067 1.30e-09 ***
dti                        4.638e-03  5.502e-03   0.843 0.399288    
fico                      -9.317e-03  1.710e-03  -5.448 5.08e-08 ***
days.with.cr.line          2.371e-06  1.588e-05   0.149 0.881343    
revol.bal                  3.085e-06  1.168e-06   2.641 0.008273 ** 
revol.util                 1.839e-03  1.535e-03   1.199 0.230722    
inq.last.6mths             8.437e-02  1.600e-02   5.275 1.33e-07 ***
delinq.2yrs               -8.320e-02  6.561e-02  -1.268 0.204762    
pub.rec                    3.300e-01  1.139e-01   2.898 0.003756 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 5896.6  on 6704  degrees of freedom
Residual deviance: 5485.2  on 6686  degrees of freedom
AIC: 5523.2

Number of Fisher Scoring iterations: 5
  • credit.policy
  • purposecredit_card
  • purposedebt_consolidation
  • purposemajor_purchase
  • purposesmall_business
  • installment
  • log.annual.inc
  • fico
  • revol.bal
  • inq.last.6mths
  • pub.rec

Problem 2.2 - Prediction Models

Consider two loan applications, which are identical other than the fact that the borrower in Application A has FICO credit score 700 while the borrower in Application B has FICO credit score 710.

Let’s Logit(A) be the log odds of loan A not being paid back in full, according to our logistic regression model, and define Logit(B) similarly for loan B.

What is the value of Logit(A) - Logit(B)?

applicationA <- train[1, ]
applicationB <- applicationA
applicationA$fico = 700
applicationB$fico = 710
applicationA
  credit.policy            purpose int.rate installment log.annual.inc
1             1 debt_consolidation   0.1189       829.1       11.35041
    dti fico days.with.cr.line revol.bal revol.util inq.last.6mths
1 19.48  700          5639.958     28854       52.1              0
  delinq.2yrs pub.rec not.fully.paid
1           0       0              0
applicationB
  credit.policy            purpose int.rate installment log.annual.inc
1             1 debt_consolidation   0.1189       829.1       11.35041
    dti fico days.with.cr.line revol.bal revol.util inq.last.6mths
1 19.48  710          5639.958     28854       52.1              0
  delinq.2yrs pub.rec not.fully.paid
1           0       0              0
applications <- rbind(applicationA, applicationB)
applications
  credit.policy            purpose int.rate installment log.annual.inc
1             1 debt_consolidation   0.1189       829.1       11.35041
2             1 debt_consolidation   0.1189       829.1       11.35041
    dti fico days.with.cr.line revol.bal revol.util inq.last.6mths
1 19.48  700          5639.958     28854       52.1              0
2 19.48  710          5639.958     28854       52.1              0
  delinq.2yrs pub.rec not.fully.paid
1           0       0              0
2           0       0              0
PredApplications <- predict(LoansLog, type = "response", newdata = applications)
PredApplications
        1         2 
0.1828795 0.1693660 
PredApplications[1] - PredApplications[2]
         1 
0.01351347 
CalcApplicationA <- 
    1 / (1 + exp(-1 * (9.187e+00 + -9.317e-03 * 700)))
CalcApplicationB <- 
    1 / (1 + exp(-1 * (9.187e+00 + -9.317e-03 * 710)))
CalcApplicationA - CalcApplicationB
[1] 0.005902546

Now, let O(A) be the odds of loan A not being paid back in full, according to our logistic regression model, and define O(B) similarly for loan B.

What is the value of O(A)/O(B)? (HINT: Use the mathematical rule that exp(A + B + C) = exp(A)exp(B)exp(C). Also, remember that exp() is the exponential function in R.)

OddsApplicationA <- PredApplications[1] / (1 - PredApplications[1])
OddsApplicationA
        1 
0.2238097 
OddsApplicationB <- PredApplications[2] / (1 - PredApplications[2])
OddsApplicationB
        2 
0.2038997 
OddsApplicationA / OddsApplicationB
       1 
1.097646 

1.097646 => OK!

After computing the logs, try log(odds) for previous problem

log(OddsApplicationA)
        1 
-1.496959 
log(OddsApplicationB)
        2 
-1.590127 
log(OddsApplicationA) - log(OddsApplicationB)
        1 
0.0931679 

0.0931679 => OK!

Problem 2.3 - Prediction Models

Predict the probability of the test-set loans not being paid back in full (remember type=“response” for the predict function). Store these predicted probabilities in a variable named predicted.risk and add it to our test-set (we will use this variable in later parts of the problem). Compute the confusion matrix using a threshold of 0.5.

predicted.risk <- predict(LoansLog, type = "response", newdata = test)
str(predicted.risk)
 Named num [1:2873] 0.0771 0.1728 0.1087 0.1016 0.0681 ...
 - attr(*, "names")= chr [1:2873] "2" "3" "10" "12" ...
str(test)
'data.frame':   2873 obs. of  14 variables:
 $ credit.policy    : int  1 1 1 1 1 1 1 1 1 1 ...
 $ purpose          : Factor w/ 7 levels "all_other","credit_card",..: 2 3 3 3 1 3 1 2 7 2 ...
 $ int.rate         : num  0.107 0.136 0.122 0.132 0.08 ...
 $ installment      : num  228.2 366.9 84.1 253.6 188 ...
 $ log.annual.inc   : num  11.1 10.4 10.2 11.8 11.2 ...
 $ dti              : num  14.29 11.63 10 9.16 16.08 ...
 $ fico             : int  707 682 707 662 772 662 772 797 712 682 ...
 $ days.with.cr.line: num  2760 4710 2730 4298 4889 ...
 $ revol.bal        : int  33623 3511 5630 5122 29797 4175 3660 6844 3534 43039 ...
 $ revol.util       : num  76.7 25.6 23 18.2 23.2 51.5 6.8 14.4 54.4 93.4 ...
 $ inq.last.6mths   : int  0 1 1 2 1 0 0 0 0 3 ...
 $ delinq.2yrs      : int  0 0 0 1 0 1 0 0 0 0 ...
 $ pub.rec          : int  0 0 0 0 0 0 0 0 0 0 ...
 $ not.fully.paid   : int  0 0 0 0 0 0 0 0 0 0 ...
test$predicted.risk <- predicted.risk
summary(test)
 credit.policy                  purpose        int.rate     
 Min.   :0.0000   all_other         : 688   Min.   :0.0600  
 1st Qu.:1.0000   credit_card       : 411   1st Qu.:0.1028  
 Median :1.0000   debt_consolidation:1206   Median :0.1221  
 Mean   :0.8047   educational       :  93   Mean   :0.1225  
 3rd Qu.:1.0000   home_improvement  : 186   3rd Qu.:0.1393  
 Max.   :1.0000   major_purchase    : 105   Max.   :0.2121  
                  small_business    : 184                   
  installment     log.annual.inc        dti             fico      
 Min.   : 15.67   Min.   : 8.102   Min.   : 0.00   Min.   :612.0  
 1st Qu.:163.57   1st Qu.:10.560   1st Qu.: 7.16   1st Qu.:682.0  
 Median :267.74   Median :10.933   Median :12.85   Median :707.0  
 Mean   :316.99   Mean   :10.928   Mean   :12.75   Mean   :710.8  
 3rd Qu.:421.89   3rd Qu.:11.290   3rd Qu.:18.30   3rd Qu.:737.0  
 Max.   :926.83   Max.   :13.459   Max.   :29.96   Max.   :822.0  
                                                                  
 days.with.cr.line   revol.bal         revol.util     inq.last.6mths  
 Min.   :  179     Min.   :      0   Min.   :  0.00   Min.   : 0.000  
 1st Qu.: 2795     1st Qu.:   3362   1st Qu.: 23.40   1st Qu.: 0.000  
 Median : 4140     Median :   8712   Median : 46.90   Median : 1.000  
 Mean   : 4494     Mean   :  17198   Mean   : 47.02   Mean   : 1.576  
 3rd Qu.: 5670     3rd Qu.:  18728   3rd Qu.: 70.40   3rd Qu.: 2.000  
 Max.   :17640     Max.   :1207359   Max.   :108.80   Max.   :24.000  
                                                                      
  delinq.2yrs         pub.rec        not.fully.paid   predicted.risk   
 Min.   : 0.0000   Min.   :0.00000   Min.   :0.0000   Min.   :0.02114  
 1st Qu.: 0.0000   1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.09461  
 Median : 0.0000   Median :0.00000   Median :0.0000   Median :0.13697  
 Mean   : 0.1605   Mean   :0.05743   Mean   :0.1601   Mean   :0.15785  
 3rd Qu.: 0.0000   3rd Qu.:0.00000   3rd Qu.:0.0000   3rd Qu.:0.19658  
 Max.   :13.0000   Max.   :3.00000   Max.   :1.0000   Max.   :0.95373  
                                                                       
table(test$not.fully.paid, test$predicted.risk > 0.5)
   
    FALSE TRUE
  0  2400   13
  1   457    3

What is the accuracy of the logistic regression model?

(2400 + 3) / nrow(test)
[1] 0.8364079

What is the accuracy of the baseline model?

table(test$not.fully.paid)

   0    1 
2413  460 
2413 / (2413 + 460)
[1] 0.8398886

Problem 2.4 - Prediction Models

Use the ROCR package to compute the test-set AUC.

ROCRpred = prediction(test$predicted.risk, test$not.fully.paid)
as.numeric(performance(ROCRpred, "auc")@y.values)
[1] 0.6720995

The model has poor accuracy at the threshold 0.5. But, despite the poor accuracy, we will see later how an investor can still leverage this logistic regression model to make profitable investments.

Problem 3.1 - A “Smart Baseline”

In the previous problem, we built a logistic regression model that has an AUC significantly higher than the AUC of 0.5 that would be obtained by randomly ordering observations. However, LendingClub.com assigns the interest rate to a loan based on their estimate of that loan’s risk.

This variable, int.rate, is an independent variable in our dataset. In this part, we will investigate using the loan’s interest rate as a “smart baseline” to order the loans according to risk.

Using the training set, build a bivariate logistic regression model (aka a logistic regression model with a single independent variable) that predicts the dependent variable not.fully.paid using only the variable int.rate.

LoansLog2 <- glm(not.fully.paid ~ int.rate, data = train, family = binomial)
summary(LoansLog2)

Call:
glm(formula = not.fully.paid ~ int.rate, family = binomial, data = train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.0547  -0.6271  -0.5442  -0.4361   2.2914  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -3.6726     0.1688  -21.76   <2e-16 ***
int.rate     15.9214     1.2702   12.54   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 5896.6  on 6704  degrees of freedom
Residual deviance: 5734.8  on 6703  degrees of freedom
AIC: 5738.8

Number of Fisher Scoring iterations: 4

The variable int.rate is highly significant in the bivariate model, but it is not significant at the 0.05 level in the model trained with all the independent variables.

What is the most likely explanation for this difference? #### int.rate is correlated with other risk-related variables, and therefore does not incrementally improve the model when those other variables are included.

Problem 3.2 - A “Smart Baseline”

Make test-set predictions for the bivariate model.

What is the highest predicted probability of a loan not being paid in full on the testing set?

predicted.risk2 <- predict(LoansLog2, type = "response", newdata = test)
max(predicted.risk2)
[1] 0.426624

0.426624

With a logistic regression cut-off of 0.5, how many loans would be predicted as not being paid in full on the testing set?

table(test$not.fully.paid, predicted.risk2 > 0.5)
   
    FALSE
  0  2413
  1   460
str(test)
'data.frame':   2873 obs. of  15 variables:
 $ credit.policy    : int  1 1 1 1 1 1 1 1 1 1 ...
 $ purpose          : Factor w/ 7 levels "all_other","credit_card",..: 2 3 3 3 1 3 1 2 7 2 ...
 $ int.rate         : num  0.107 0.136 0.122 0.132 0.08 ...
 $ installment      : num  228.2 366.9 84.1 253.6 188 ...
 $ log.annual.inc   : num  11.1 10.4 10.2 11.8 11.2 ...
 $ dti              : num  14.29 11.63 10 9.16 16.08 ...
 $ fico             : int  707 682 707 662 772 662 772 797 712 682 ...
 $ days.with.cr.line: num  2760 4710 2730 4298 4889 ...
 $ revol.bal        : int  33623 3511 5630 5122 29797 4175 3660 6844 3534 43039 ...
 $ revol.util       : num  76.7 25.6 23 18.2 23.2 51.5 6.8 14.4 54.4 93.4 ...
 $ inq.last.6mths   : int  0 1 1 2 1 0 0 0 0 3 ...
 $ delinq.2yrs      : int  0 0 0 1 0 1 0 0 0 0 ...
 $ pub.rec          : int  0 0 0 0 0 0 0 0 0 0 ...
 $ not.fully.paid   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ predicted.risk   : num  0.0771 0.1728 0.1087 0.1016 0.0681 ...

0

Problem 3.3 - A “Smart Baseline”

What is the test-set AUC of the bivariate model?

ROCRpred2 = prediction(predicted.risk2, test$not.fully.paid)
as.numeric(performance(ROCRpred2, "auc")@y.values)
[1] 0.6239081

Problem 4.1 - Computing the Profitability of an Investment

While thus far we have predicted if a loan will be paid back or not, an investor needs to identify loans that are expected to be profitable.

If the loan is paid back in full, then the investor makes interest on the loan. However, if the loan is not paid back, the investor loses the money invested. Therefore, the investor should seek loans that best balance this risk and reward.

To compute interest revenue, consider a $c investment in a loan that has an annual interest rate r over a period of t years.

Using continuous compounding of interest, this investment pays back c * exp(rt) dollars by the end of the t years, where exp(rt) is e raised to the r*t power.

How much does a $10 investment with an annual interest rate of 6% pay back after 3 years, using continuous compounding of interest? Hint: remember to convert the percentage to a proportion before doing the math.

10 * exp(0.06 * 3)
[1] 11.97217

Problem 4.2 - Computing the Profitability of an Investment

While the investment has value c * exp(rt) dollars after collecting interest, the investor had to pay $c for the investment.

What is the profit to the investor if the investment is paid back in full?

10 * exp(0.06 * 3) - 10
[1] 1.972174

Problem 4.3 - Computing the Profitability of an Investment

Now, consider the case where the investor made a $c investment, but it was not paid back in full. Assume, conservatively, that no money was received from the borrower (often a lender will receive some but not all of the value of the loan, making this a pessimistic assumption of how much is received).

What is the loss to the investor in this scenario? #### -10

Problem 5.1 - A Simple Investment Strategy

In the previous subproblem, we concluded that an investor who invested c dollars in a loan with interest rate r for t years makes c * (exp(rt) - 1) dollars of profit if the loan is paid back in full and -c dollars of profit if the loan is not paid back in full (pessimistically).

In order to evaluate the quality of an investment strategy, we need to compute this profit for each loan in the test-set.

For this variable, we will assume a $1 investment (aka c=1). To create the variable, we first assign to the profit for a fully paid loan, exp(rt)-1, to every observation, and we then replace this value with -1 in the cases where the loan was not paid in full.

All the loans in our dataset are 3-year loans, meaning t=3 in our calculations.

test$profit = exp(test$int.rate*3) - 1
test$profit[test$not.fully.paid == 1] = -1

What is the maximum profit of a $10 investment in any loan in the testing set?

max(test$profit) * 10
[1] 8.894769

Problem 6.1 - An Investment Strategy Based on Risk

A simple investment strategy of equally investing in all the loans would yield profit $20.94 for a $100 investment. But this simple investment strategy does not leverage the prediction model we built earlier in this problem.

As stated earlier, investors seek loans that balance reward with risk, in that they simultaneously have high interest rates and a low risk of not being paid back.

To meet this objective, I’ll analyze an investment strategy in which the investor only purchases loans with a high interest rate (a rate of at least 15%), but amongst these loans selects the ones with the lowest predicted risk of not being paid back in full.

We will model an investor who invests $1 in each of the most promising 100 loans.

First, use the subset() function to build a dataframe called highInterest consisting of the test-set loans with an interest rate of at least 15%.

highInterest <- subset(test, int.rate >= 0.15)

What is the average profit of a $1 investment in one of these high-interest loans?

mean(highInterest$profit)
[1] 0.2251015

What proportion of the high-interest loans were not paid back in full?

table(highInterest$not.fully.paid)

  0   1 
327 110 
110 / (327 + 110)
[1] 0.2517162

Problem 6.2 - An Investment Strategy Based on Risk

Next, I’ll determine the 100th smallest predicted probability of not paying in full by sorting the predicted risks in increasing order and selecting the 100th element of this sorted list.

cutoff = sort(highInterest$predicted.risk, decreasing=FALSE)[100]
cutoff
[1] 0.1763305

Use the subset() function to build a dataframe called selectedLoans consisting of the high-interest loans with predicted risk not exceeding the cut-off we just computed. Check to make sure you have selected 100 loans for investment.

selectedLoans <- subset(highInterest, highInterest$predicted.risk <= cutoff)
summary(selectedLoans)
 credit.policy                purpose      int.rate       installment    
 Min.   :0.00   all_other         : 8   Min.   :0.1501   Min.   : 48.79  
 1st Qu.:1.00   credit_card       :17   1st Qu.:0.1533   1st Qu.:176.16  
 Median :1.00   debt_consolidation:60   Median :0.1570   Median :309.37  
 Mean   :0.93   educational       : 1   Mean   :0.1610   Mean   :358.30  
 3rd Qu.:1.00   home_improvement  : 1   3rd Qu.:0.1645   3rd Qu.:473.10  
 Max.   :1.00   major_purchase    : 7   Max.   :0.2052   Max.   :907.60  
                small_business    : 6                                    
 log.annual.inc        dti             fico       days.with.cr.line
 Min.   : 9.575   Min.   : 0.00   Min.   :642.0   Min.   : 1140    
 1st Qu.:10.776   1st Qu.: 6.05   1st Qu.:662.0   1st Qu.: 2162    
 Median :11.127   Median :12.35   Median :672.0   Median : 3630    
 Mean   :11.203   Mean   :12.19   Mean   :680.5   Mean   : 3911    
 3rd Qu.:11.670   3rd Qu.:18.23   3rd Qu.:692.0   3rd Qu.: 5010    
 Max.   :13.305   Max.   :28.15   Max.   :782.0   Max.   :13170    
                                                                   
   revol.bal        revol.util    inq.last.6mths   delinq.2yrs  
 Min.   :     0   Min.   : 0.00   Min.   : 0.00   Min.   :0.00  
 1st Qu.:  3768   1st Qu.:45.92   1st Qu.: 0.00   1st Qu.:0.00  
 Median :  9691   Median :71.65   Median : 0.00   Median :0.00  
 Mean   : 19923   Mean   :65.79   Mean   : 0.89   Mean   :0.33  
 3rd Qu.: 24534   3rd Qu.:93.80   3rd Qu.: 1.00   3rd Qu.:0.00  
 Max.   :168496   Max.   :99.70   Max.   :10.00   Max.   :4.00  
                                                                
    pub.rec     not.fully.paid predicted.risk        profit       
 Min.   :0.00   Min.   :0.00   Min.   :0.06871   Min.   :-1.0000  
 1st Qu.:0.00   1st Qu.:0.00   1st Qu.:0.13596   1st Qu.: 0.5823  
 Median :0.00   Median :0.00   Median :0.15327   Median : 0.5992  
 Mean   :0.03   Mean   :0.19   Mean   :0.14794   Mean   : 0.3128  
 3rd Qu.:0.00   3rd Qu.:0.00   3rd Qu.:0.16514   3rd Qu.: 0.6317  
 Max.   :1.00   Max.   :1.00   Max.   :0.17633   Max.   : 0.8508  
                                                                  
str(selectedLoans)
'data.frame':   100 obs. of  16 variables:
 $ credit.policy    : int  1 1 1 1 1 1 1 1 1 1 ...
 $ purpose          : Factor w/ 7 levels "all_other","credit_card",..: 7 2 3 1 3 5 2 3 2 3 ...
 $ int.rate         : num  0.15 0.153 0.158 0.159 0.156 ...
 $ installment      : num  225 444 420 246 245 ...
 $ log.annual.inc   : num  12.3 11 11.5 11.5 10.8 ...
 $ dti              : num  6.45 19.52 18.55 24.19 2.72 ...
 $ fico             : int  677 667 667 667 672 702 667 672 662 682 ...
 $ days.with.cr.line: num  6240 2701 4560 5376 3010 ...
 $ revol.bal        : int  56411 33074 34841 590 3273 4980 15977 16473 22783 87502 ...
 $ revol.util       : num  75.3 68.8 89.6 84.3 69.6 55.3 83.6 94.1 93.7 96.4 ...
 $ inq.last.6mths   : int  0 2 0 0 1 1 0 2 3 0 ...
 $ delinq.2yrs      : int  0 0 0 0 0 0 0 2 1 1 ...
 $ pub.rec          : int  0 0 0 0 0 0 0 0 0 0 ...
 $ not.fully.paid   : int  1 0 0 0 1 0 0 0 0 0 ...
 $ predicted.risk   : num  0.164 0.169 0.158 0.162 0.147 ...
 $ profit           : num  -1 0.584 0.604 0.61 -1 ...

What is the profit of the investor, who invested $1 in each of these 100 loans?

sum(selectedLoans$profit)
[1] 31.27825

How many of 100 selected loans were not paid back in full?

table(selectedLoans$not.fully.paid)

 0  1 
81 19 

19

Conclusion

We have now seen how analytics can be used to select a subset of the high-interest loans that were paid back at only a slightly lower rate than average, resulting in a significant increase in the profit from our investor’s $100 investment. Although the logistic regression models developed in this analysis did not have large AUC values, we see that they still provided the edge needed to improve the profitability of an investment portfolio.

We conclude with a note of warning. Throughout this analysis I’ve assume that the loans we invest in will perform in the same way as the loans we used to train our model, even though our training set covers a relatively short period of time. If there is an economic shock like a large financial downturn, default rates might be significantly higher than those observed in the training set and we might end up losing money instead of profiting. Investors must pay careful attention to such risk when making investment decisions.

Avatar
Rihad Variawa
Data Scientist

I am the Sr. Data Scientist at Malastare AI and head of global Fintech Research, responsible for overall vision and strategy, investment priorities and offering development. Working in the financial services industry, helping clients adopt new technologies that can transform the way they transact and engage with their customers. I am passionate about data science, super inquisitive and challenge seeker; looking at everything through a lens of numbers and problem-solver at the core. From understanding a business problem to collecting and visualizing data, until the stage of prototyping, fine-tuning and deploying models to real-world applications, I find the fulfillment of tackling challenges to solve complex problems using data.

Next
Previous