In many criminal justice systems around the world, inmates deemed not to be a threat to society are released from prison under the parole system prior to completing their sentence. They are still considered to be serving their sentence while on parole, and they can be returned to prison if they violate the terms of their parole.
Parole boards are charged with identifying which inmates are good candidates for release on parole. They seek to release inmates who will not commit additional crimes after release. In this analysis, I’ll build and validate a model that predicts whether an inmate will violate the terms of his or her parole.
Such a model could be useful to a parole board when deciding to approve or deny an application for parole.
For this prediction task, I’ll use data from the U.S. 2004 National Corrections Reporting Program, a nationwide census of parole releases that occurred during 2004.
I’ve limited my focus to parolees who served no more than 6 months in prison and whose maximum sentence for all charges did not exceed 18 months.
The dataset contains all such parolees who either successfully completed their term of parole during 2004 or violated the terms of their parole during that year. It contains the following variables:
- male: 1 if the parolee is male, 0 if female
- race: 1 if the parolee is white, 2 otherwise
- age: the parolee’s age (in years) when he or she was released from prison
- state: a code for the parolee’s state. 2 is Kentucky, 3 is Louisiana, 4 is Virginia, and 1 is any other state. These three states were singled out because they are heavily represented in the dataset.
- time.served: the number of months the parolee served in prison (limited by the inclusion criteria to not exceed 6 months).
- max.sentence: the maximum sentence length for all charges, in months (limited by the inclusion criteria to not exceed 18 months).
- multiple.offenses: 1 if the parolee was incarcerated for multiple offenses, 0 otherwise.
- crime: a code for the parolee’s main crime leading to incarceration. 2 is larceny, 3 is drug-related crime, 4 is driving-related crime, and 1 is any other crime.
- violator: 1 if the parolee violated the parole, and 0 if the parolee completed the parole without violation.
Loading the Dataset & EDA
parole <- read.csv("parole.csv")
str(parole)
'data.frame': 675 obs. of 9 variables:
$ male : int 1 0 1 1 1 1 1 0 0 1 ...
$ race : int 1 1 2 1 2 2 1 1 1 2 ...
$ age : num 33.2 39.7 29.5 22.4 21.6 46.7 31 24.6 32.6 29.1 ...
$ state : int 1 1 1 1 1 1 1 1 1 1 ...
$ time.served : num 5.5 5.4 5.6 5.7 5.4 6 6 4.8 4.5 4.7 ...
$ max.sentence : int 18 12 12 18 12 18 18 12 13 12 ...
$ multiple.offenses: int 0 0 0 0 0 0 0 0 0 0 ...
$ crime : int 4 3 3 1 1 4 3 1 3 2 ...
$ violator : int 0 0 0 0 0 0 0 0 0 0 ...
summary(parole)
male race age state
Min. :0.0000 Min. :1.000 Min. :18.40 Min. :1.000
1st Qu.:1.0000 1st Qu.:1.000 1st Qu.:25.35 1st Qu.:2.000
Median :1.0000 Median :1.000 Median :33.70 Median :3.000
Mean :0.8074 Mean :1.424 Mean :34.51 Mean :2.887
3rd Qu.:1.0000 3rd Qu.:2.000 3rd Qu.:42.55 3rd Qu.:4.000
Max. :1.0000 Max. :2.000 Max. :67.00 Max. :4.000
time.served max.sentence multiple.offenses crime
Min. :0.000 Min. : 1.00 Min. :0.0000 Min. :1.000
1st Qu.:3.250 1st Qu.:12.00 1st Qu.:0.0000 1st Qu.:1.000
Median :4.400 Median :12.00 Median :1.0000 Median :2.000
Mean :4.198 Mean :13.06 Mean :0.5363 Mean :2.059
3rd Qu.:5.200 3rd Qu.:15.00 3rd Qu.:1.0000 3rd Qu.:3.000
Max. :6.000 Max. :18.00 Max. :1.0000 Max. :4.000
violator
Min. :0.0000
1st Qu.:0.0000
Median :0.0000
Mean :0.1156
3rd Qu.:0.0000
Max. :1.0000
How many parolees are contained in the dataset? #### 675
Problem 1.1 - Preparing the Dataset
Which variables in this dataset are unordered factors with at least three levels? #### state, crime
Problem 1.2 - Preparing the Dataset
In the last subproblem, we identified state and crime as unordered factors with at least 3 levels, so we need to convert them to factors before building our model.
Using the as.factor() function, we convert these variables to factors. Keep in mind that we are not changing the values, just the way R understands them (the values are still numbers).
parole$state <- as.factor(parole$state)
parole$crime <- as.factor(parole$crime)
How does the output of summary() change for a factor variable as compared to a numerical variable?
summary(parole)
male race age state time.served
Min. :0.0000 Min. :1.000 Min. :18.40 1:143 Min. :0.000
1st Qu.:1.0000 1st Qu.:1.000 1st Qu.:25.35 2:120 1st Qu.:3.250
Median :1.0000 Median :1.000 Median :33.70 3: 82 Median :4.400
Mean :0.8074 Mean :1.424 Mean :34.51 4:330 Mean :4.198
3rd Qu.:1.0000 3rd Qu.:2.000 3rd Qu.:42.55 3rd Qu.:5.200
Max. :1.0000 Max. :2.000 Max. :67.00 Max. :6.000
max.sentence multiple.offenses crime violator
Min. : 1.00 Min. :0.0000 1:315 Min. :0.0000
1st Qu.:12.00 1st Qu.:0.0000 2:106 1st Qu.:0.0000
Median :12.00 Median :1.0000 3:153 Median :0.0000
Mean :13.06 Mean :0.5363 4:101 Mean :0.1156
3rd Qu.:15.00 3rd Qu.:1.0000 3rd Qu.:0.0000
Max. :18.00 Max. :1.0000 Max. :1.0000
The output becomes similar to that of the table() function applied to that variable
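This can be sketched with toy data (made up here, not from parole.csv): summary() applied to a factor reports per-level counts, just like table().

```r
# Toy illustration: summary() on a factor counts observations per level
x <- factor(c(1, 1, 2, 3, 3, 3))
summary(x)  # per-level counts: 2, 1, 3
table(x)    # the same counts
```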
Problem 2.1 - Splitting into a Training and Testing Set
To ensure consistent training/testing set splits, run the following code. The sample.split() function comes from the caTools package.
library(caTools)  # provides sample.split()
set.seed(144)
# 70% to the training set, 30% to the testing set
split = sample.split(parole$violator, SplitRatio = 0.7)
train = subset(parole, split == TRUE)
test = subset(parole, split == FALSE)
Roughly what proportion of parolees have been allocated to the training and testing sets?
str(train)
'data.frame': 473 obs. of 9 variables:
$ male : int 1 1 1 1 1 0 0 1 1 1 ...
$ race : int 1 1 2 2 1 1 2 1 1 1 ...
$ age : num 33.2 22.4 21.6 46.7 31 32.6 28.4 20.5 30.1 37.8 ...
$ state : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
$ time.served : num 5.5 5.7 5.4 6 6 4.5 4.5 5.9 5.3 5.3 ...
$ max.sentence : int 18 18 12 18 18 13 12 12 16 8 ...
$ multiple.offenses: int 0 0 0 0 0 0 1 0 0 0 ...
$ crime : Factor w/ 4 levels "1","2","3","4": 4 1 1 4 3 3 1 1 3 3 ...
$ violator : int 0 0 0 0 0 0 0 0 0 0 ...
473 / 675
[1] 0.7007407
str(test)
'data.frame': 202 obs. of 9 variables:
$ male : int 0 1 0 1 1 1 1 1 1 1 ...
$ race : int 1 2 1 2 2 1 1 2 1 1 ...
$ age : num 39.7 29.5 24.6 29.1 24.5 32.8 36.7 36.5 33.5 37.3 ...
$ state : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
$ time.served : num 5.4 5.6 4.8 4.7 6 5.9 0.9 3.9 4.2 4.6 ...
$ max.sentence : int 12 12 12 12 16 16 16 12 12 12 ...
$ multiple.offenses: int 0 0 0 0 0 0 0 1 1 1 ...
$ crime : Factor w/ 4 levels "1","2","3","4": 3 3 1 2 3 3 3 4 1 1 ...
$ violator : int 0 0 0 0 0 0 0 1 1 1 ...
202 / 675
[1] 0.2992593
Problem 2.2 - Splitting into a Training and Testing Set
Now, suppose you re-ran all of the splitting code above (including the set.seed(144) call). What would you expect? #### The exact same training/testing set split as the first execution
If you instead ONLY re-ran the sample.split() and subset() lines, what would you expect? #### A different training/testing set split from the first execution
If you instead called set.seed() with a different number and then re-ran the sample.split() and subset() lines, what would you expect? #### A different training/testing set split from the first execution
?sample.split
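The behavior behind these answers can be sketched in base R: set.seed() pins down the pseudo-random number stream, so re-seeding with the same value reproduces the same "random" draw, while drawing again without re-seeding does not.

```r
# Re-seeding with the same value reproduces the same pseudo-random draw
set.seed(144)
a <- sample(10)
set.seed(144)
b <- sample(10)
identical(a, b)  # TRUE: same seed, same draw

# Without re-seeding, the stream continues and the draw differs
c <- sample(10)
```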
Problem 3.1 - Building a Logistic Regression Model
If you tested other training/testing set splits in the previous section, please re-run the original splitting code to obtain the original split. Using glm (and remembering the parameter family="binomial"), train a logistic regression model on the training set. Your dependent variable is "violator", and you should use all of the other variables as independent variables.
What variables are significant in this model? Significant variables should have at least one star, i.e. a probability less than 0.05 (the column Pr(>|z|) in the summary output).
ParoleViolatorLog <- glm(violator ~ ., data = train, family = binomial)
summary(ParoleViolatorLog)
Call:
glm(formula = violator ~ ., family = binomial, data = train)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.7041 -0.4236 -0.2719 -0.1690 2.8375
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.2411574 1.2938852 -3.278 0.00105 **
male 0.3869904 0.4379613 0.884 0.37690
race 0.8867192 0.3950660 2.244 0.02480 *
age -0.0001756 0.0160852 -0.011 0.99129
state2 0.4433007 0.4816619 0.920 0.35739
state3 0.8349797 0.5562704 1.501 0.13335
state4 -3.3967878 0.6115860 -5.554 2.79e-08 ***
time.served -0.1238867 0.1204230 -1.029 0.30359
max.sentence 0.0802954 0.0553747 1.450 0.14705
multiple.offenses 1.6119919 0.3853050 4.184 2.87e-05 ***
crime2 0.6837143 0.5003550 1.366 0.17180
crime3 -0.2781054 0.4328356 -0.643 0.52054
crime4 -0.0117627 0.5713035 -0.021 0.98357
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 340.04 on 472 degrees of freedom
Residual deviance: 251.48 on 460 degrees of freedom
AIC: 277.48
Number of Fisher Scoring iterations: 6
The significant variables are race, state4, and multiple.offenses.
Problem 3.2 - Building a Logistic Regression Model
What can we say based on the coefficient of the multiple.offenses variable? The following two properties might be useful to you when exploring this question:
- If we have a coefficient c for a variable, then that means the log odds (or Logit) are increased by c for a unit increase in the variable.
- If we have a coefficient c for a variable, then that means the odds are multiplied by e^c for a unit increase in the variable.
exp(1.6119919)
[1] 5.012786
Our model predicts that a parolee who committed multiple offenses has 5.01 times higher odds of being a violator than a parolee who did not commit multiple offenses but is otherwise identical.
Problem 3.3 - Building a Logistic Regression Model
Consider a parolee who is male, of white race, aged 50 years at prison release, from the state of Maryland, served 3 months, had a maximum sentence of 12 months, did not commit multiple offenses, and committed a larceny.
Explore the following questions based on the model’s predictions for this individual. (HINT: We should use the coefficients of our model, the Logistic Response Function, and the Odds equation to solve this problem.) According to the model, what are the odds this individual is a violator?
exp(-4.2411574 + # intercept
0.3869904 * 1 + # male
0.8867192 * 1 + # white race
-0.0001756 * 50 + # aged 50
0.4433007*0 + 0.8349797*0 + -3.3967878*0 + # Maryland
-0.1238867 * 3 + # served 3 months
0.0802954 * 12 + # max sentence of 12 months
1.6119919 * 0 + # did not commit multiple offenses
0.6837143*1 + -0.2781054*0 + -0.0117627*0
)
[1] 0.1825687
## Odds this individual is a violator = 0.1825687
# according to the model, what is the probability this individual is a violator?
1 / (1 + exp(-1 * (-4.2411574 + # intercept
0.3869904 * 1 + # male
0.8867192 * 1 + # white race
-0.0001756 * 50 + # aged 50
0.4433007*0 + 0.8349797*0 + -3.3967878*0 + # Maryland
-0.1238867 * 3 + # served 3 months
0.0802954 * 12 + # max sentence of 12 months
1.6119919 * 0 + # did not commit multiple offenses
0.6837143*1 + -0.2781054*0 + -0.0117627*0
)))
[1] 0.1543832
## Logistic Response Function -> P(y = 1) = 0.1543832
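As a cross-check (a sketch using the odds value computed above), the probability can also be recovered directly from the odds via p = odds / (1 + odds), without re-evaluating the full logistic response function:

```r
# Probability from odds: p = odds / (1 + odds)
odds <- 0.1825687
odds / (1 + odds)  # ≈ 0.1543832, matching the logistic response function
```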
Problem 4.1 - Evaluating the Model on the Testing Set
Use the predict() function to obtain the model’s predicted probabilities for parolees in the testing set, remembering to pass type=“response”.
What is the maximum predicted probability of a violation?
ParolePredTest <- predict(ParoleViolatorLog, type = "response", newdata = test)
max(ParolePredTest)
[1] 0.9072791
Problem 4.2 - Evaluating the Model on the Testing Set
In the following questions, we evaluate the model’s predictions on the test set using a threshold of 0.5.
table(test$violator, ParolePredTest > 0.5)
FALSE TRUE
0 167 12
1 11 12
# what is the model's sensitivity?
12 / (11 + 12) # TP / (TP + FN)
[1] 0.5217391
# what is the model's specificity?
167 / (167 + 12) # TN / (TN + FP)
[1] 0.9329609
# what is the model's accuracy?
(167 + 12) / nrow(test) # (TN + TP) / N
[1] 0.8861386
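The three metrics above can be wrapped in a small helper (a sketch; confusion_metrics is a hypothetical function, not from any package), assuming a 0/1 actual vector and a logical predicted vector:

```r
# Sensitivity, specificity, and accuracy from raw confusion-matrix counts
confusion_metrics <- function(actual, predicted) {
  tp <- sum(actual == 1 & predicted)   # true positives
  tn <- sum(actual == 0 & !predicted)  # true negatives
  fp <- sum(actual == 0 & predicted)   # false positives
  fn <- sum(actual == 1 & !predicted)  # false negatives
  c(sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp),
    accuracy    = (tp + tn) / length(actual))
}

# Toy example (made-up vectors, not the test set)
confusion_metrics(c(0, 0, 1, 1), c(FALSE, TRUE, TRUE, TRUE))
```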
Problem 4.3 - Evaluating the Model on the Testing Set
What is the accuracy of a simple model that predicts that every parolee is a non-violator?
table(test$violator)
0 1
179 23
179 / (179 + 23)
[1] 0.8861386
Problem 4.4 - Evaluating the Model on the Testing Set
Consider a parole board using the model to predict whether parolees will be violators or not.
The job of a parole board is to make sure that a prisoner is ready to be released into free society, and therefore parole boards tend to be particularly concerned about releasing prisoners who will violate their parole.
Which of the following most likely describes their preferences and best course of action? #### The board assigns more cost to a false negative than a false positive, and should therefore use a logistic regression cut-off less than 0.5.
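The trade-off behind this answer can be sketched with toy numbers (made up here, not the test set): lowering the cut-off catches more violators (fewer false negatives) at the price of more false positives.

```r
# Hypothetical predicted probabilities and outcomes
probs  <- c(0.10, 0.20, 0.35, 0.45, 0.60, 0.80)
actual <- c(0, 0, 1, 0, 1, 1)

table(actual, probs > 0.5)  # cut-off 0.5: one violator missed
table(actual, probs > 0.3)  # cut-off 0.3: all violators flagged, one extra false positive
```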
Problem 4.5 - Evaluating the Model on the Testing Set
Which of the following is the most accurate assessment of the value of the logistic regression model with a cut-off 0.5 to a parole board, based on the model’s accuracy as compared to the simple baseline model? #### The model is likely of value to the board, and using a different logistic regression cut-off is likely to improve the model’s value.
Problem 4.6 - Evaluating the Model on the Testing Set
Using the ROCR package, what is the AUC value for the model?
library(ROCR)
ROCRpred = prediction(ParolePredTest, test$violator)
as.numeric(performance(ROCRpred, "auc")@y.values)
[1] 0.8945834
Problem 4.7 - Evaluating the Model on the Testing Set
Describe the meaning of AUC in this context. #### The probability the model can correctly differentiate between a randomly selected parole violator and a randomly selected parole non-violator.
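That interpretation can be computed directly as a pairwise ranking proportion (a sketch with toy numbers; auc_by_pairs is a hypothetical helper, not part of ROCR): the AUC is the fraction of (violator, non-violator) pairs where the violator receives the higher predicted probability, with ties counted as half.

```r
# AUC as the proportion of correctly ranked violator/non-violator pairs
auc_by_pairs <- function(pred, actual) {
  pos <- pred[actual == 1]
  neg <- pred[actual == 0]
  pairs <- outer(pos, neg, ">") + 0.5 * outer(pos, neg, "==")
  mean(pairs)
}

# Toy example: 3 of the 4 pairs are ranked correctly
auc_by_pairs(c(0.9, 0.3, 0.6, 0.2), c(1, 1, 0, 0))  # 0.75
```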
Problem 5.1 - Identifying Bias in Observational Data
Our goal has been to predict the outcome of a parole decision, and we used a publicly available dataset of parole releases for predictions.
In this final problem, we’ll evaluate a potential source of bias associated with our analysis. It is always important to evaluate a dataset for possible sources of bias.
The dataset contains all individuals released from parole in 2004, either due to completing their parole term or violating the terms of their parole. However, it does not contain parolees who neither violated their parole nor completed their term in 2004, causing non-violators to be underrepresented.
This is called “selection bias” or “selecting on the dependent variable,” because only a subset of all relevant parolees was included in our analysis, chosen on the basis of our dependent variable (parole violation).
How could we improve our dataset to best address selection bias? #### We should use a dataset tracking a group of parolees from the start of their parole until either they violated parole or they completed their term.