In many criminal justice systems around the world, inmates deemed not to be a threat to society are released from prison under the parole system prior to completing their sentence. They are still considered to be serving their sentence while on parole, and they can be returned to prison if they violate the terms of their parole.
Parole boards are charged with identifying which inmates are good candidates for release on parole. They seek to release inmates who will not commit additional crimes after release. In this analysis, I’ll build and validate a model that predicts whether an inmate will violate the terms of his or her parole.
Such a model could be useful to a parole board when deciding to approve or deny an application for parole.
For this prediction task, I’ll use data from the U.S. 2004 National Corrections Reporting Program, a nationwide census of parole releases that occurred during 2004.
I’ve limited my focus to parolees who served no more than 6 months in prison and whose maximum sentence for all charges did not exceed 18 months.
The dataset contains all such parolees who either successfully completed their term of parole during 2004 or violated the terms of their parole during that year. It contains the following variables:
- male: 1 if the parolee is male, 0 if female
- race: 1 if the parolee is white, 2 otherwise
- age: the parolee’s age (in years) when he or she was released from prison
- state: a code for the parolee’s state. 2 is Kentucky, 3 is Louisiana, 4 is Virginia, and 1 is any other state. These three states were singled out because they are heavily represented in the dataset.
- time.served: the number of months the parolee served in prison (limited by the inclusion criteria to not exceed 6 months).
- max.sentence: the maximum sentence length for all charges, in months (limited by the inclusion criteria to not exceed 18 months).
- multiple.offenses: 1 if the parolee was incarcerated for multiple offenses, 0 otherwise.
- crime: a code for the parolee’s main crime leading to incarceration. 2 is larceny, 3 is drug-related crime, 4 is driving-related crime, and 1 is any other crime.
- violator: 1 if the parolee violated the parole, and 0 if the parolee completed the parole without violation.
Loading the Dataset & EDA
parole <- read.csv("parole.csv")
str(parole)
'data.frame': 675 obs. of 9 variables:
$ male : int 1 0 1 1 1 1 1 0 0 1 ...
$ race : int 1 1 2 1 2 2 1 1 1 2 ...
$ age : num 33.2 39.7 29.5 22.4 21.6 46.7 31 24.6 32.6 29.1 ...
$ state : int 1 1 1 1 1 1 1 1 1 1 ...
$ time.served : num 5.5 5.4 5.6 5.7 5.4 6 6 4.8 4.5 4.7 ...
$ max.sentence : int 18 12 12 18 12 18 18 12 13 12 ...
$ multiple.offenses: int 0 0 0 0 0 0 0 0 0 0 ...
$ crime : int 4 3 3 1 1 4 3 1 3 2 ...
$ violator : int 0 0 0 0 0 0 0 0 0 0 ...
summary(parole)
male race age state
Min. :0.0000 Min. :1.000 Min. :18.40 Min. :1.000
1st Qu.:1.0000 1st Qu.:1.000 1st Qu.:25.35 1st Qu.:2.000
Median :1.0000 Median :1.000 Median :33.70 Median :3.000
Mean :0.8074 Mean :1.424 Mean :34.51 Mean :2.887
3rd Qu.:1.0000 3rd Qu.:2.000 3rd Qu.:42.55 3rd Qu.:4.000
Max. :1.0000 Max. :2.000 Max. :67.00 Max. :4.000
time.served max.sentence multiple.offenses crime
Min. :0.000 Min. : 1.00 Min. :0.0000 Min. :1.000
1st Qu.:3.250 1st Qu.:12.00 1st Qu.:0.0000 1st Qu.:1.000
Median :4.400 Median :12.00 Median :1.0000 Median :2.000
Mean :4.198 Mean :13.06 Mean :0.5363 Mean :2.059
3rd Qu.:5.200 3rd Qu.:15.00 3rd Qu.:1.0000 3rd Qu.:3.000
Max. :6.000 Max. :18.00 Max. :1.0000 Max. :4.000
violator
Min. :0.0000
1st Qu.:0.0000
Median :0.0000
Mean :0.1156
3rd Qu.:0.0000
Max. :1.0000
How many parolees are contained in the dataset? #### 675
Problem 1.1 - Preparing the Dataset
Which variables in this dataset are unordered factors with at least three levels? #### state, crime
Problem 1.2 - Preparing the Dataset
In the last subproblem, we identified state and crime as unordered factors with at least 3 levels, so we need to convert them to factors before building our model.
Using the as.factor() function, we convert these variables to factors. Keep in mind that we are not changing the values, just the way R understands them (the values are still numbers).
parole$state <- as.factor(parole$state)
parole$crime <- as.factor(parole$crime)
How does the output of summary() change for a factor variable as compared to a numerical variable?
summary(parole)
male race age state time.served
Min. :0.0000 Min. :1.000 Min. :18.40 1:143 Min. :0.000
1st Qu.:1.0000 1st Qu.:1.000 1st Qu.:25.35 2:120 1st Qu.:3.250
Median :1.0000 Median :1.000 Median :33.70 3: 82 Median :4.400
Mean :0.8074 Mean :1.424 Mean :34.51 4:330 Mean :4.198
3rd Qu.:1.0000 3rd Qu.:2.000 3rd Qu.:42.55 3rd Qu.:5.200
Max. :1.0000 Max. :2.000 Max. :67.00 Max. :6.000
max.sentence multiple.offenses crime violator
Min. : 1.00 Min. :0.0000 1:315 Min. :0.0000
1st Qu.:12.00 1st Qu.:0.0000 2:106 1st Qu.:0.0000
Median :12.00 Median :1.0000 3:153 Median :0.0000
Mean :13.06 Mean :0.5363 4:101 Mean :0.1156
3rd Qu.:15.00 3rd Qu.:1.0000 3rd Qu.:0.0000
Max. :18.00 Max. :1.0000 Max. :1.0000
The output becomes similar to that of the table() function applied to that variable
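This can be sketched with toy data (made up here, not from parole.csv): summary() applied to a factor reports per-level counts, just like table().

```r
# Toy illustration: summary() on a factor counts observations per level
x <- factor(c(1, 1, 2, 3, 3, 3))
summary(x)  # per-level counts: 2, 1, 3
table(x)    # the same counts
```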
Problem 2.1 - Splitting into a Training and Testing Set
To ensure consistent training/testing set splits, run the following code. The sample.split() function comes from the caTools package.
library(caTools)  # provides sample.split()
set.seed(144)
# 70% to the training set, 30% to the testing set
split = sample.split(parole$violator, SplitRatio = 0.7)
train = subset(parole, split == TRUE)
test = subset(parole, split == FALSE)
Roughly what proportion of parolees have been allocated to the training and testing sets?
str(train)
'data.frame': 473 obs. of 9 variables:
$ male : int 1 1 1 1 1 0 0 1 1 1 ...
$ race : int 1 1 2 2 1 1 2 1 1 1 ...
$ age : num 33.2 22.4 21.6 46.7 31 32.6 28.4 20.5 30.1 37.8 ...
$ state : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
$ time.served : num 5.5 5.7 5.4 6 6 4.5 4.5 5.9 5.3 5.3 ...
$ max.sentence : int 18 18 12 18 18 13 12 12 16 8 ...
$ multiple.offenses: int 0 0 0 0 0 0 1 0 0 0 ...
$ crime : Factor w/ 4 levels "1","2","3","4": 4 1 1 4 3 3 1 1 3 3 ...
$ violator : int 0 0 0 0 0 0 0 0 0 0 ...
473 / 675
[1] 0.7007407
str(test)
'data.frame': 202 obs. of 9 variables:
$ male : int 0 1 0 1 1 1 1 1 1 1 ...
$ race : int 1 2 1 2 2 1 1 2 1 1 ...
$ age : num 39.7 29.5 24.6 29.1 24.5 32.8 36.7 36.5 33.5 37.3 ...
$ state : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
$ time.served : num 5.4 5.6 4.8 4.7 6 5.9 0.9 3.9 4.2 4.6 ...
$ max.sentence : int 12 12 12 12 16 16 16 12 12 12 ...
$ multiple.offenses: int 0 0 0 0 0 0 0 1 1 1 ...
$ crime : Factor w/ 4 levels "1","2","3","4": 3 3 1 2 3 3 3 4 1 1 ...
$ violator : int 0 0 0 0 0 0 0 1 1 1 ...
202 / 675
[1] 0.2992593
Problem 2.2 - Splitting into a Training and Testing Set
Now, suppose you re-ran all of the splitting code above (including the set.seed(144) call). What would you expect? #### The exact same training/testing set split as the first execution
If you instead ONLY re-ran the sample.split() and subset() lines, what would you expect? #### A different training/testing set split from the first execution
If you instead called set.seed() with a different number and then re-ran the sample.split() and subset() lines, what would you expect? #### A different training/testing set split from the first execution
?sample.split
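The behavior behind these answers can be sketched in base R: set.seed() pins down the pseudo-random number stream, so re-seeding with the same value reproduces the same "random" draw, while drawing again without re-seeding does not.

```r
# Re-seeding with the same value reproduces the same pseudo-random draw
set.seed(144)
a <- sample(10)
set.seed(144)
b <- sample(10)
identical(a, b)  # TRUE: same seed, same draw

# Without re-seeding, the stream continues and the draw differs
c <- sample(10)
```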
Problem 3.1 - Building a Logistic Regression Model
If you tested other training/testing set splits in the previous section, please re-run the original splitting code to obtain the original split. Using glm (and remembering the parameter family="binomial"), train a logistic regression model on the training set. Your dependent variable is "violator", and you should use all of the other variables as independent variables.
What variables are significant in this model? Significant variables should have at least one star, i.e. a probability less than 0.05 (the column Pr(>|z|) in the summary output).
ParoleViolatorLog <- glm(violator ~ ., data = train, family = binomial)
summary(ParoleViolatorLog)
Call:
glm(formula = violator ~ ., family = binomial, data = train)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.7041 -0.4236 -0.2719 -0.1690 2.8375
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.2411574 1.2938852 -3.278 0.00105 **
male 0.3869904 0.4379613 0.884 0.37690
race 0.8867192 0.3950660 2.244 0.02480 *
age -0.0001756 0.0160852 -0.011 0.99129
state2 0.4433007 0.4816619 0.920 0.35739
state3 0.8349797 0.5562704 1.501 0.13335
state4 -3.3967878 0.6115860 -5.554 2.79e-08 ***
time.served -0.1238867 0.1204230 -1.029 0.30359
max.sentence 0.0802954 0.0553747 1.450 0.14705
multiple.offenses 1.6119919 0.3853050 4.184 2.87e-05 ***
crime2 0.6837143 0.5003550 1.366 0.17180
crime3 -0.2781054 0.4328356 -0.643 0.52054
crime4 -0.0117627 0.5713035 -0.021 0.98357
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 340.04 on 472 degrees of freedom
Residual deviance: 251.48 on 460 degrees of freedom
AIC: 277.48
Number of Fisher Scoring iterations: 6
The significant variables are race, state4, and multiple.offenses.
Problem 3.2 - Building a Logistic Regression Model
What can we say based on the coefficient of the multiple.offenses variable? The following two properties might be useful to you when exploring this question:
- If we have a coefficient c for a variable, then that means the log odds (or Logit) are increased by c for a unit increase in the variable.
- If we have a coefficient c for a variable, then that means the odds are multiplied by e^c for a unit increase in the variable.
exp(1.6119919)
[1] 5.012786
Our model predicts that a parolee who committed multiple offenses has 5.01 times higher odds of being a violator than a parolee who did not commit multiple offenses but is otherwise identical.
Problem 3.3 - Building a Logistic Regression Model
Consider a parolee who is male, of white race, aged 50 years at prison release, from the state of Maryland, served 3 months, had a maximum sentence of 12 months, did not commit multiple offenses, and committed a larceny.
Explore the following questions based on the model’s predictions for this individual. (HINT: We should use the coefficients of our model, the Logistic Response Function, and the Odds equation to solve this problem.) According to the model, what are the odds this individual is a violator?
exp(-4.2411574 + # intercept
0.3869904 * 1 + # male
0.8867192 * 1 + # white race
-0.0001756 * 50 + # aged 50
0.4433007*0 + 0.8349797*0 + -3.3967878*0 + # Maryland
-0.1238867 * 3 + # served 3 months
0.0802954 * 12 + # max sentence of 12 months
1.6119919 * 0 + # did not commit multiple offenses
0.6837143*1 + -0.2781054*0 + -0.0117627*0
)
[1] 0.1825687
## Odds this individual is a violator = 0.1825687
# according to the model, what is the probability this individual is a violator?
1 / (1 + exp(-1 * (-4.2411574 + # intercept
0.3869904 * 1 + # male
0.8867192 * 1 + # white race
-0.0001756 * 50 + # aged 50
0.4433007*0 + 0.8349797*0 + -3.3967878*0 + # Maryland
-0.1238867 * 3 + # served 3 months
0.0802954 * 12 + # max sentence of 12 months
1.6119919 * 0 + # did not commit multiple offenses
0.6837143*1 + -0.2781054*0 + -0.0117627*0
)))
[1] 0.1543832
## Logistic Response Function -> P(y = 1) = 0.1543832
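As a cross-check (a sketch using the odds value computed above), the probability can also be recovered directly from the odds via p = odds / (1 + odds), without re-evaluating the full logistic response function:

```r
# Probability from odds: p = odds / (1 + odds)
odds <- 0.1825687
odds / (1 + odds)  # ≈ 0.1543832, matching the logistic response function
```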
Problem 4.1 - Evaluating the Model on the Testing Set
Use the predict() function to obtain the model’s predicted probabilities for parolees in the testing set, remembering to pass type=“response”.
What is the maximum predicted probability of a violation?
ParolePredTest <- predict(ParoleViolatorLog, type = "response", newdata = test)
max(ParolePredTest)
[1] 0.9072791
Problem 4.2 - Evaluating the Model on the Testing Set
In the following questions, we evaluate the model’s predictions on the test set using a threshold of 0.5.
table(test$violator, ParolePredTest > 0.5)
FALSE TRUE
0 167 12
1 11 12
# what is the model's sensitivity?
12 / (11 + 12) # TP / (TP + FN)
[1] 0.5217391
# what is the model's specificity?
167 / (167 + 12) # TN / (TN + FP)
[1] 0.9329609
# what is the model's accuracy?
(167 + 12) / nrow(test) # (TN + TP) / N
[1] 0.8861386
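The three metrics above can be wrapped in a small helper (a sketch; confusion_metrics is a hypothetical function, not from any package), assuming a 0/1 actual vector and a logical predicted vector:

```r
# Sensitivity, specificity, and accuracy from raw confusion-matrix counts
confusion_metrics <- function(actual, predicted) {
  tp <- sum(actual == 1 & predicted)   # true positives
  tn <- sum(actual == 0 & !predicted)  # true negatives
  fp <- sum(actual == 0 & predicted)   # false positives
  fn <- sum(actual == 1 & !predicted)  # false negatives
  c(sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp),
    accuracy    = (tp + tn) / length(actual))
}

# Toy example (made-up vectors, not the test set)
confusion_metrics(c(0, 0, 1, 1), c(FALSE, TRUE, TRUE, TRUE))
```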
Problem 4.3 - Evaluating the Model on the Testing Set
What is the accuracy of a simple model that predicts that every parolee is a non-violator?
table(test$violator)
0 1
179 23
179 / (179 + 23)
[1] 0.8861386
Problem 4.4 - Evaluating the Model on the Testing Set
Consider a parole board using the model to predict whether parolees will be violators or not.
The job of a parole board is to make sure that a prisoner is ready to be released into free society, and therefore parole boards tend to be particularly concerned about releasing prisoners who will violate their parole.
Which of the following most likely describes their preferences and best course of action? #### The board assigns more cost to a false negative than a false positive, and should therefore use a logistic regression cut-off less than 0.5.
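The trade-off behind this answer can be sketched with toy numbers (made up here, not the test set): lowering the cut-off catches more violators (fewer false negatives) at the price of more false positives.

```r
# Hypothetical predicted probabilities and outcomes
probs  <- c(0.10, 0.20, 0.35, 0.45, 0.60, 0.80)
actual <- c(0, 0, 1, 0, 1, 1)

table(actual, probs > 0.5)  # cut-off 0.5: one violator missed
table(actual, probs > 0.3)  # cut-off 0.3: all violators flagged, one extra false positive
```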
Problem 4.5 - Evaluating the Model on the Testing Set
Which of the following is the most accurate assessment of the value of the logistic regression model with a cut-off 0.5 to a parole board, based on the model’s accuracy as compared to the simple baseline model? #### The model is likely of value to the board, and using a different logistic regression cut-off is likely to improve the model’s value.
Problem 4.6 - Evaluating the Model on the Testing Set
Using the ROCR package, what is the AUC value for the model?
library(ROCR)
ROCRpred = prediction(ParolePredTest, test$violator)
as.numeric(performance(ROCRpred, "auc")@y.values)
[1] 0.8945834
Problem 4.7 - Evaluating the Model on the Testing Set
Describe the meaning of AUC in this context. #### The probability the model can correctly differentiate between a randomly selected parole violator and a randomly selected parole non-violator.
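That interpretation can be computed directly as a pairwise ranking proportion (a sketch with toy numbers; auc_by_pairs is a hypothetical helper, not part of ROCR): the AUC is the fraction of (violator, non-violator) pairs where the violator receives the higher predicted probability, with ties counted as half.

```r
# AUC as the proportion of correctly ranked violator/non-violator pairs
auc_by_pairs <- function(pred, actual) {
  pos <- pred[actual == 1]
  neg <- pred[actual == 0]
  pairs <- outer(pos, neg, ">") + 0.5 * outer(pos, neg, "==")
  mean(pairs)
}

# Toy example: 3 of the 4 pairs are ranked correctly
auc_by_pairs(c(0.9, 0.3, 0.6, 0.2), c(1, 1, 0, 0))  # 0.75
```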
Problem 5.1 - Identifying Bias in Observational Data
Our goal has been to predict the outcome of a parole decision, and we used a publicly available dataset of parole releases for predictions.
In this final problem, we’ll evaluate a potential source of bias associated with our analysis. It is always important to evaluate a dataset for possible sources of bias.
The dataset contains all individuals released from parole in 2004, either due to completing their parole term or violating the terms of their parole. However, it does not contain parolees who neither violated their parole nor completed their term in 2004, causing non-violators to be underrepresented.
This is called “selection bias” or “selecting on the dependent variable,” because only a subset of all relevant parolees was included in our analysis, chosen on the basis of our dependent variable (parole violation).
How could we improve our dataset to best address selection bias? #### We should use a dataset tracking a group of parolees from the start of their parole until either they violated parole or they completed their term.