Patterns In Renewable Energy Generation

Apr 11, 2019

The use of coal in the United States peaked in 2005, and since then has decreased by 25%, being replaced by renewable energy sources and more efficient use (Lovins, 2014). As the US pursues a portfolio of more diverse, sustainable and secure energy sources, there are many questions to consider.

What are effective factors in incentivizing states to adopt more environmentally friendly energy generation methods?
How do these factors vary by state?
How can we direct resources to different places in the country and ensure that they effectively drive renewable energy sources adoption?

To derive insights and explore these questions, I’ll take a combination of generation, usage, and greenhouse emission data by state and combine it with macro-economic and political information.

For this analysis, I’ve gathered data from various sources to include the following information for each state within the U.S. for the years spanning year 2000 to year 2013. The aggregated dataset energy.csv.xz results in a total of 27 variables and 699 observations. Each observation contains one record per state per year. Here’s a detailed description of the variables:

GenTotal: Annual generation of energy using all types of energy sources (coal, nuclear, hydroelectric, solar, etc.) normalized by the state population at a given year.
GenTotalRenewable: Annual generation of energy using all renewable energy sources normalized by the state population at a given year.
GenHydro, GenSolar: Annual generation of energy using each type of energy source as a percent of the total energy generation.
GenTotalRenewableBinary, GenSolarBinary: 1 if generation from solar or other renewable energy sources increased between a year n and a year n+1. 0 if it did not increase.
AllSourcesCO2, AllSourcesSO2 and AllSourcesNOx: Annual emissions per state in metric tons, normalized by the respective state population at a given year and caused by all energy generation sources.
EPriceResidential, EPriceCommercial, EPriceIndustrial, EPriceTransportation, EPriceTotal: Average electricity price per state, per sector (residential, industrial, commercial, etc.)
ESalesResidential, ESalesCommercial, ESalesIndustrial, ESalesTransportation, ESalesTotal: Annual normalized sales of electricity per state, per sector.
CumlRegulatory, CumlFinancial: Number of energy-related financial incentives and regulations created by a state per year.
Demographic data such as annual wages per capita and presidential results (0 if a state voted republican, 1 is democrat).

Problem 1 - Total Renewable Energy Generation

Load energy.csv into a data frame called energy.

energy <- read.csv("energy.csv.xz")
str(energy)

'data.frame':   699 obs. of  27 variables:
 $ STATE                  : Factor w/ 50 levels "AK","AL","AR",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ YEAR                   : int  2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 ...
 $ GenTotal               : num  9.81 10.64 10.54 9.79 9.9 ...
 $ GenHydro               : num  0.163 0.2 0.213 0.25 0.23 ...
 $ GenSolar               : num  0 0 0 0 0 0 0 0 0 0 ...
 $ GenTotalRenewable      : num  0.163 0.2 0.214 0.25 0.231 ...
 $ GenSolarBinary         : int  0 0 0 0 0 0 0 0 0 0 ...
 $ GenTotalRenewableBinary: int  1 1 1 0 0 0 1 0 1 1 ...
 $ AllSourcesCO2          : num  7.26 7.31 6.94 6.15 7.25 ...
 $ AllSourcesSO2          : num  0.0224 0.01193 0.01148 0.00675 0.00659 ...
 $ AllSourcesNOx          : num  0.0289 0.0277 0.0294 0.0242 0.0377 ...
 $ EPriceResidential      : num  11.4 12.1 12.1 12 12.4 ...
 $ EPriceCommercial       : num  9.77 10.29 10.13 10.49 10.99 ...
 $ EPriceIndustrial       : num  7.56 7.61 7.65 7.86 8.33 ...
 $ EPriceTransportation   : num  0 0 0 0 0 0 0 0 0 0 ...
 $ EPriceTotal            : num  10.1 10.5 10.5 10.5 11 ...
 $ EsalesResidential      : num  0.349 0.347 0.354 0.357 0.356 ...
 $ EsalesCommercial       : num  0.421 0.42 0.41 0.444 0.449 ...
 $ EsalesIndustrial       : num  0.195 0.198 0.199 0.198 0.195 ...
 $ EsalesTransportation   : num  0 0 0 0 0 0 0 0 0 0 ...
 $ EsalesOther            : num  0.0343 0.0356 0.0379 0 0 ...
 $ EsalesTotal            : num  8.46 8.61 8.51 8.59 8.78 ...
 $ CumlFinancial          : int  1 1 1 1 1 2 7 7 10 10 ...
 $ CumlRegulatory         : int  1 1 1 1 1 1 2 2 2 3 ...
 $ Total.salary           : num  17.6 18.7 19.6 20.4 21.4 ...
 $ presidential.results   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Import                 : int  0 0 0 0 0 0 0 0 0 0 ...

summary(energy)

     STATE          YEAR         GenTotal         GenHydro      
 AK     : 14   Min.   :2000   Min.   : 4.591   Min.   :0.00000  
 AL     : 14   1st Qu.:2003   1st Qu.:10.037   1st Qu.:0.00888  
 AR     : 14   Median :2006   Median :14.647   Median :0.02291  
 AZ     : 14   Mean   :2006   Mean   :16.956   Mean   :0.09853  
 CA     : 14   3rd Qu.:2010   3rd Qu.:18.018   3rd Qu.:0.07041  
 CO     : 14   Max.   :2013   Max.   :92.097   Max.   :0.92076  
 (Other):615                                                    
    GenSolar          GenTotalRenewable GenSolarBinary  
 Min.   :-1.300e-08   Min.   :0.00000   Min.   :0.0000  
 1st Qu.: 0.000e+00   1st Qu.:0.02027   1st Qu.:0.0000  
 Median : 0.000e+00   Median :0.04475   Median :0.0000  
 Mean   : 3.344e-04   Mean   :0.12245   Mean   :0.2318  
 3rd Qu.: 0.000e+00   3rd Qu.:0.10771   3rd Qu.:0.0000  
 Max.   : 2.045e-02   Max.   :0.92076   Max.   :1.0000  
                                                        
 GenTotalRenewableBinary AllSourcesCO2      AllSourcesSO2    
 Min.   :0.0000          Min.   : 0.01059   Min.   :0.00003  
 1st Qu.:0.0000          1st Qu.: 4.84645   1st Qu.:0.00925  
 Median :1.0000          Median : 8.34005   Median :0.02375  
 Mean   :0.5923          Mean   :11.61430   Mean   :0.03687  
 3rd Qu.:1.0000          3rd Qu.:12.99889   3rd Qu.:0.04197  
 Max.   :1.0000          Max.   :93.96429   Max.   :0.34568  
                         NA's   :50         NA's   :50       
 AllSourcesNOx     EPriceResidential EPriceCommercial EPriceIndustrial
 Min.   :0.00067   Min.   : 5.130    Min.   : 4.240   Min.   : 3.010  
 1st Qu.:0.00604   1st Qu.: 7.965    1st Qu.: 6.760   1st Qu.: 4.695  
 Median :0.01428   Median : 9.490    Median : 8.100   Median : 5.820  
 Mean   :0.02058   Mean   :10.424    Mean   : 8.969   Mean   : 6.660  
 3rd Qu.:0.02288   3rd Qu.:11.825    3rd Qu.:10.065   3rd Qu.: 7.500  
 Max.   :0.35610   Max.   :37.340    Max.   :34.880   Max.   :30.820  
 NA's   :50                                                           
 EPriceTransportation  EPriceTotal     EsalesResidential EsalesCommercial
 Min.   : 0.000       Min.   : 4.170   Min.   :0.1594    Min.   :0.1690  
 1st Qu.: 0.000       1st Qu.: 6.515   1st Qu.:0.3297    1st Qu.:0.2816  
 Median : 1.615       Median : 7.860   Median :0.3563    Median :0.3385  
 Mean   : 4.131       Mean   : 8.809   Mean   :0.3595    Mean   :0.3411  
 3rd Qu.: 7.940       3rd Qu.:10.095   3rd Qu.:0.3926    3rd Qu.:0.3877  
 Max.   :15.440       Max.   :34.040   Max.   :0.5287    Max.   :0.5381  
 NA's   :1                                                               
 EsalesIndustrial  EsalesTransportation  EsalesOther      EsalesTotal    
 Min.   :0.06372   Min.   :0.0000000    Min.   :0.0000   Min.   : 6.745  
 1st Qu.:0.21331   1st Qu.:0.0000000    1st Qu.:0.0000   1st Qu.:10.336  
 Median :0.29978   Median :0.0000000    Median :0.0092   Median :13.153  
 Mean   :0.29232   Mean   :0.0010349    Mean   :0.0170   Mean   :13.145  
 3rd Qu.:0.35791   3rd Qu.:0.0002786    3rd Qu.:0.0216   3rd Qu.:15.353  
 Max.   :0.59561   Max.   :0.0229279    Max.   :0.1074   Max.   :31.336  
                                        NA's   :449                      
 CumlFinancial    CumlRegulatory    Total.salary   presidential.results
 Min.   :  0.00   Min.   : 0.000   Min.   :10.65   Min.   :0.0000      
 1st Qu.:  2.00   1st Qu.: 3.000   1st Qu.:16.84   1st Qu.:0.0000      
 Median :  8.00   Median : 6.000   Median :19.51   Median :0.0000      
 Mean   : 17.58   Mean   : 6.838   Mean   :20.01   Mean   :0.4578      
 3rd Qu.: 23.50   3rd Qu.:10.000   3rd Qu.:22.83   3rd Qu.:1.0000      
 Max.   :154.00   Max.   :42.000   Max.   :33.85   Max.   :1.0000      
                                                                       
     Import      
 Min.   :0.0000  
 1st Qu.:0.0000  
 Median :0.0000  
 Mean   :0.3433  
 3rd Qu.:1.0000  
 Max.   :1.0000

Renewable energy sources are considered to include geothermal, hydroelectric, biomass, solar and wind.

Which state in the US seems to have the highest total generation of energy from renewable sources (using the variable GenTotalRenewable)?

energy[which.max(energy$GenTotalRenewable), ]

    STATE YEAR GenTotal  GenHydro GenSolar GenTotalRenewable
169    ID 2000 9.165016 0.9207631        0         0.9207631
    GenSolarBinary GenTotalRenewableBinary AllSourcesCO2 AllSourcesSO2
169              0                       0     0.6368657   0.004590802
    AllSourcesNOx EPriceResidential EPriceCommercial EPriceIndustrial
169   0.002714775              5.39             4.24             3.11
    EPriceTransportation EPriceTotal EsalesResidential EsalesCommercial
169                    0        4.17         0.3068433        0.3095307
    EsalesIndustrial EsalesTransportation EsalesOther EsalesTotal
169         0.368208                    0  0.01541804    17.57071
    CumlFinancial CumlRegulatory Total.salary presidential.results Import
169             2              2     12.81243                    0      1

Idaho (ID)

Which year did the above state produce the highest energy generation from renewable resources? #### 2000

Problem 2 - Relationship Between Politics and Greenhouse Emissions

What is the average CO2 emissions from all sources of energy for: Note Using na.rm = TRUE in our calculations!

states during years in which they voted republican?

mean(subset(energy, presidential.results == 0)$AllSourcesCO2, na.rm = TRUE)

[1] 16.44296

states during years in which they voted democrat?

mean(subset(energy, presidential.results == 1)$AllSourcesCO2, na.rm = TRUE)

[1] 5.783781

States that voted democrat have on average higher NOx emissions than states that voted republican across all years.

Is this statement true or false?

mean(subset(energy, presidential.results == 0)$AllSourcesNOx, na.rm = TRUE)

[1] 0.02985461

mean(subset(energy, presidential.results == 1)$AllSourcesNOx, na.rm = TRUE)

[1] 0.009377028

False

Problem 3 - Relationship Between Greenhouse Emissions and Energy Sales

What is the correlation between overall CO2 emissions and energy sales made to industrial facilities?

Note that the variables AllSourcesCO2 and EsalesIndustrial contain NAs. Using the parameter: use=“complete” to handle NAs in this question!

cor(energy$AllSourcesCO2, energy$EsalesIndustrial, use = "complete")

[1] 0.5385867

Choose the correct answers from the following statements:

cor(energy$AllSourcesSO2, energy$EsalesIndustrial, use = "complete")

[1] 0.4812317

cor(energy$AllSourcesNOx, energy$EsalesResidential, use = "complete")

[1] -0.5038829

cor(energy$AllSourcesCO2, energy$EsalesCommercial, use = "complete")

[1] -0.373383

Overall SO2 emissions are likely higher with increased industrial energy sales

Problem 4 - Boxplot of Energy Prices per State

Creating a boxplot of the total energy price (EPriceTotal) by State across the data, and a table summarizing the mean of EPriceTotal by State.

priceBoxplot <- ggplot(energy, aes(STATE, EPriceTotal)) 
priceBoxplot + geom_boxplot()

What observations do we make? #### - The boxplot shows a clear outlier, the state of Hawaii, with much higher energy price compared to the rest of the U.S. #### - When looking at the average energy prices, there seems to be three price tiers ($5-$9, $10-$14, and $20+)

Which state has the lowest average energy price of all? Using a table to explore this question.

sum(is.na(energy$EPriceTotal))

[1] 0

mean(energy$EPriceTotal)

[1] 8.809213

avgPriceByState <- aggregate(EPriceTotal ~ STATE, energy, mean)
avgPriceByState[which.min(avgPriceByState$EPriceTotal), ]

   STATE EPriceTotal
50    WY        5.51

Wyoming (WY)

Is this state associated with the highest mean total energy generation (GenTotal)?

avgGenByState <- aggregate(GenTotal ~ STATE, energy, mean)
avgGenByState

   STATE  GenTotal
1     AK  9.810218
2     AL 30.582000
3     AR 19.090133
4     AZ 17.211604
5     CA  5.576285
6     CO 10.367634
7     CT  9.358409
8     DE  8.256761
9     FL 12.027089
10    GA 13.969187
11    HI  8.623563
12    IA 16.270149
13    ID  8.358471
14    IL 15.193858
15    IN 19.664150
16    KS 16.732968
17    KY 22.399999
18    LA 21.384516
19    MA  6.434295
20    MD  8.352530
21    ME 13.148792
22    MI 11.201083
23    MN 10.199283
24    MO 15.062760
25    MS 16.377409
26    MT 28.942351
27    NC 13.915444
28    ND 49.909462
29    NE 18.470352
30    NH 15.671056
31    NJ  7.108700
32    NM 18.130563
33    NV 14.311625
34    NY  7.197383
35    OH 12.664461
36    OK 18.787146
37    OR 14.537263
38    PA 17.326080
39    RI  6.488628
40    SC 22.476570
41    SD 10.719863
42    TN 14.751132
43    TX 17.020082
44    UT 15.587630
45    VA  9.720367
46    VT 10.177005
47    WA 16.426394
48    WI 11.079302
49    WV 47.417141
50    WY 88.393083

avgGenByState[which.max(avgGenByState$GenTotal), ]

   STATE GenTotal
50    WY 88.39308

True

Problem 5 - Prediction Model for Solar Generation

We are interested in predicting whether states are going to increase their solar energy generation over the next year.

Let’s subset our dataset into a training and a testing set by using the following code:

set.seed(144)
spl <- sample(1 : nrow(energy), size = 0.7 * nrow(energy))
train <- energy[spl, ]
test <- energy[-spl, ]

Let’s create a logistic regression model “mod” using the train set to predict the binary variable GenSolarBinary.

To do so, consider the following as potential predictive variables: GenHydro, GenSolar, CumlFinancial, CumlRegulatory, Total.salary, Import.

mod <- glm(GenSolarBinary ~ GenHydro + GenSolar + CumlFinancial + 
             CumlRegulatory + Total.salary + Import, 
           data = train, 
           family = binomial)

Which variable is most predictive in the model?

summary(mod)


Call:
glm(formula = GenSolarBinary ~ GenHydro + GenSolar + CumlFinancial + 
    CumlRegulatory + Total.salary + Import, family = binomial, 
    data = train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-4.0165  -0.4374  -0.2398  -0.0719   2.6342  

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)    
(Intercept)    -6.345e+00  9.148e-01  -6.935 4.05e-12 ***
GenHydro       -3.652e+00  1.114e+00  -3.279  0.00104 ** 
GenSolar        1.076e+03  3.322e+02   3.238  0.00121 ** 
CumlFinancial   2.814e-02  9.563e-03   2.943  0.00325 ** 
CumlRegulatory  1.996e-01  5.045e-02   3.955 7.65e-05 ***
Total.salary    1.393e-01  4.465e-02   3.119  0.00181 ** 
Import         -3.142e-01  3.447e-01  -0.912  0.36190    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 528.69  on 488  degrees of freedom
Residual deviance: 282.82  on 482  degrees of freedom
AIC: 296.82

Number of Fisher Scoring iterations: 7

CumlRegulatory

Problem 6 - Performance on the Test Set

Computing the predictions on the test-set. Using a threshold of 0.5, what is the accuracy of our model on the test-set?

predTest <- predict(mod, newdata = test, type = "response")
table(test$GenSolarBinary, predTest > 0.5)

   
    FALSE TRUE
  0   154    7
  1    31   18

(154 + 18) / nrow(test)

[1] 0.8190476

Cool!

What is the accuracy for states voting republican?

testWithPred <- test
testWithPred$predTest <- predTest
str(testWithPred)

'data.frame':   210 obs. of  28 variables:
 $ STATE                  : Factor w/ 50 levels "AK","AL","AR",..: 1 1 1 1 2 2 2 2 3 3 ...
 $ YEAR                   : int  2001 2007 2008 2012 2001 2006 2007 2013 2000 2001 ...
 $ GenTotal               : num  10.64 10.03 9.88 9.5 28.08 ...
 $ GenHydro               : num  0.1995 0.1893 0.173 0.2267 0.0667 ...
 $ GenSolar               : num  0 0 0 0 0 0 0 0 0 0 ...
 $ GenTotalRenewable      : num  0.1997 0.1909 0.1737 0.2325 0.0668 ...
 $ GenSolarBinary         : int  0 0 0 0 0 0 0 1 0 0 ...
 $ GenTotalRenewableBinary: int  1 0 1 1 0 0 1 1 0 1 ...
 $ AllSourcesCO2          : num  7.31 6.39 6.38 5.89 17.65 ...
 $ AllSourcesSO2          : num  0.01193 0.00631 0.00514 0.0037 0.10361 ...
 $ AllSourcesNOx          : num  0.0277 0.0253 0.0215 0.0233 0.0384 ...
 $ EPriceResidential      : num  12.12 15.18 16.56 17.88 7.01 ...
 $ EPriceCommercial       : num  10.29 12.19 13.64 14.93 6.53 ...
 $ EPriceIndustrial       : num  7.61 12.63 14.17 16.82 3.79 ...
 $ EPriceTransportation   : num  0 0 0 0 0 0 0 0 0 0 ...
 $ EPriceTotal            : num  10.5 13.3 14.7 16.3 5.6 ...
 $ EsalesResidential      : num  0.347 0.334 0.337 0.337 0.35 ...
 $ EsalesCommercial       : num  0.42 0.447 0.451 0.448 0.238 ...
 $ EsalesIndustrial       : num  0.198 0.219 0.213 0.215 0.403 ...
 $ EsalesTransportation   : num  0 0 0 0 0 0 0 0 0 0 ...
 $ EsalesOther            : num  0.0356 NA NA NA 0.00931 ...
 $ EsalesTotal            : num  8.61 9.31 9.23 8.78 17.78 ...
 $ CumlFinancial          : int  1 7 10 12 1 8 8 16 0 0 ...
 $ CumlRegulatory         : int  1 2 2 4 0 1 1 2 0 1 ...
 $ Total.salary           : num  18.7 25.7 27.1 31.2 13.2 ...
 $ presidential.results   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Import                 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ predTest               : num  0.01417 0.0541 0.07459 0.15572 0.00879 ...

testWithPredRep <- subset(testWithPred, presidential.results == 0)
table(testWithPredRep$GenSolarBinary, testWithPredRep$predTest > 0.5)

   
    FALSE TRUE
  0    90    0
  1    18    2

(90 + 2) / nrow(testWithPredRep)

[1] 0.8363636

What is the accuracy for states voting democrat?

testWithPredDem <- subset(testWithPred, presidential.results == 1)
table(testWithPredDem$GenSolarBinary, testWithPredDem$predTest > 0.5)

   
    FALSE TRUE
  0    64    7
  1    13   16

(64 + 16) / nrow(testWithPredDem)

[1] 0.8

Problem 7 - Clustering of the Observations

We can perhaps improve our accuracy if we implement a cluster-the-predict approach. I’m interested in clustering the observations based on information about the regulatory and financial incentives, the elections outcome and the population wealth in each state across the years, in addition to whether the state was an energy importer or not.

Let’s create a train.limited and test.limited datasets, where we only keep the variables CumlRegulatory, CumlFinancial, presidential.results, Total.salary, and Import.

str(train)

'data.frame':   489 obs. of  27 variables:
 $ STATE                  : Factor w/ 50 levels "AK","AL","AR",..: 3 36 37 25 16 32 2 37 4 41 ...
 $ YEAR                   : int  2006 2003 2002 2012 2010 2009 2011 2000 2011 2007 ...
 $ GenTotal               : num  18.5 17.3 13.4 18.3 16.9 ...
 $ GenHydro               : num  0.029722 0.029664 0.73065 0 0.000276 ...
 $ GenSolar               : num  0 0 0 0 0 ...
 $ GenTotalRenewable      : num  0.030363 0.030562 0.739667 0.000301 0.07246 ...
 $ GenSolarBinary         : int  0 0 0 0 0 1 0 0 1 0 ...
 $ GenTotalRenewableBinary: int  1 1 0 0 1 1 0 0 0 0 ...
 $ AllSourcesCO2          : num  10.28 14.09 2.05 8.13 12.78 ...
 $ AllSourcesSO2          : num  0.0291 0.0316 0.004 0.0144 0.0144 ...
 $ AllSourcesNOx          : num  0.01358 0.02472 0.00324 0.00786 0.01617 ...
 $ EPriceResidential      : num  8.85 7.47 7.12 10.26 10.03 ...
 $ EPriceCommercial       : num  6.96 6.38 6.59 9.33 8.25 ...
 $ EPriceIndustrial       : num  5.24 4.59 4.72 6.24 6.23 5.72 6.25 3.56 6.55 5.09 ...
 $ EPriceTransportation   : num  0 0 6.5 0 0 0 0 6.68 0 0 ...
 $ EPriceTotal            : num  6.99 6.35 6.32 8.6 8.35 8.09 9.1 4.89 9.71 6.89 ...
 $ EsalesResidential      : num  0.366 0.4 0.388 0.372 0.355 ...
 $ EsalesCommercial       : num  0.248 0.336 0.329 0.281 0.382 ...
 $ EsalesIndustrial       : num  0.386 0.264 0.272 0.347 0.263 ...
 $ EsalesTransportation   : num  0 0 0 0 0 0 0 0 0 0 ...
 $ EsalesOther            : num  NA 0 0.0111 NA NA ...
 $ EsalesTotal            : num  16.6 14.4 12.9 16.2 14.2 ...
 $ CumlFinancial          : int  4 1 12 19 3 23 15 5 42 5 ...
 $ CumlRegulatory         : int  4 1 9 1 6 8 2 5 21 1 ...
 $ Total.salary           : num  15.4 13.4 16.3 15.3 20.8 ...
 $ presidential.results   : int  0 0 1 0 0 1 0 1 0 0 ...
 $ Import                 : int  0 0 0 0 0 0 0 0 0 1 ...

train.limited <- subset(train, select = CumlFinancial:Import)
str(test)

'data.frame':   210 obs. of  27 variables:
 $ STATE                  : Factor w/ 50 levels "AK","AL","AR",..: 1 1 1 1 2 2 2 2 3 3 ...
 $ YEAR                   : int  2001 2007 2008 2012 2001 2006 2007 2013 2000 2001 ...
 $ GenTotal               : num  10.64 10.03 9.88 9.5 28.08 ...
 $ GenHydro               : num  0.1995 0.1893 0.173 0.2267 0.0667 ...
 $ GenSolar               : num  0 0 0 0 0 0 0 0 0 0 ...
 $ GenTotalRenewable      : num  0.1997 0.1909 0.1737 0.2325 0.0668 ...
 $ GenSolarBinary         : int  0 0 0 0 0 0 0 1 0 0 ...
 $ GenTotalRenewableBinary: int  1 0 1 1 0 0 1 1 0 1 ...
 $ AllSourcesCO2          : num  7.31 6.39 6.38 5.89 17.65 ...
 $ AllSourcesSO2          : num  0.01193 0.00631 0.00514 0.0037 0.10361 ...
 $ AllSourcesNOx          : num  0.0277 0.0253 0.0215 0.0233 0.0384 ...
 $ EPriceResidential      : num  12.12 15.18 16.56 17.88 7.01 ...
 $ EPriceCommercial       : num  10.29 12.19 13.64 14.93 6.53 ...
 $ EPriceIndustrial       : num  7.61 12.63 14.17 16.82 3.79 ...
 $ EPriceTransportation   : num  0 0 0 0 0 0 0 0 0 0 ...
 $ EPriceTotal            : num  10.5 13.3 14.7 16.3 5.6 ...
 $ EsalesResidential      : num  0.347 0.334 0.337 0.337 0.35 ...
 $ EsalesCommercial       : num  0.42 0.447 0.451 0.448 0.238 ...
 $ EsalesIndustrial       : num  0.198 0.219 0.213 0.215 0.403 ...
 $ EsalesTransportation   : num  0 0 0 0 0 0 0 0 0 0 ...
 $ EsalesOther            : num  0.0356 NA NA NA 0.00931 ...
 $ EsalesTotal            : num  8.61 9.31 9.23 8.78 17.78 ...
 $ CumlFinancial          : int  1 7 10 12 1 8 8 16 0 0 ...
 $ CumlRegulatory         : int  1 2 2 4 0 1 1 2 0 1 ...
 $ Total.salary           : num  18.7 25.7 27.1 31.2 13.2 ...
 $ presidential.results   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Import                 : int  0 0 0 0 0 0 0 0 0 0 ...

test.limited <- subset(test, select = CumlFinancial:Import)

Using the “preProcess” function on the train.limited set, I can compute the train.norm and test.norm.

preprocTrain <- preProcess(train.limited)
train.norm <- predict(preprocTrain, train.limited)
preprocTest <- preProcess(test.limited)
test.norm <- predict(preprocTest, test.limited)

Why didn’t I include the dependent variable GenSolarBinary in this clustering phase? #### - Needing to know the dependent variable value to assign an observation to a cluster defeats the purpose of the cluster-then-predict methodology

Let’s use kmeans clustering for this problem with a seed of 144, k=2 and keep the maximum number of iterations at 1,000.

set.seed(144)
k <- 2
trainKMC <- kmeans(train.norm, centers = k, iter.max = 1000)
set.seed(144)
testKMC <- kmeans(test.norm, centers = k, iter.max = 1000)

Using the flexclust package, identifying the clusters and call train1 the subset of train corresponding to the first cluster, and train2 the subset of train corresponding to the second cluster.

trainKMC.kcca <- as.kcca(trainKMC, train.norm)

# clusters(trainKMC.kcca)
trainClusters <- predict(trainKMC.kcca)
table(trainClusters)

trainClusters
  1   2 
308 181

train1 <- subset(train, trainClusters == 1)
train2 <- subset(train, trainClusters == 2)
str(train1)

'data.frame':   308 obs. of  27 variables:
 $ STATE                  : Factor w/ 50 levels "AK","AL","AR",..: 3 36 37 25 16 32 2 37 41 28 ...
 $ YEAR                   : int  2006 2003 2002 2012 2010 2009 2011 2000 2007 2009 ...
 $ GenTotal               : num  18.5 17.3 13.4 18.3 16.9 ...
 $ GenHydro               : num  0.029722 0.029664 0.73065 0 0.000276 ...
 $ GenSolar               : num  0 0 0 0 0 0 0 0 0 0 ...
 $ GenTotalRenewable      : num  0.030363 0.030562 0.739667 0.000301 0.07246 ...
 $ GenSolarBinary         : int  0 0 0 0 0 1 0 0 0 0 ...
 $ GenTotalRenewableBinary: int  1 1 0 0 1 1 0 0 0 1 ...
 $ AllSourcesCO2          : num  10.28 14.09 2.05 8.13 12.78 ...
 $ AllSourcesSO2          : num  0.0291 0.0316 0.004 0.0144 0.0144 ...
 $ AllSourcesNOx          : num  0.01358 0.02472 0.00324 0.00786 0.01617 ...
 $ EPriceResidential      : num  8.85 7.47 7.12 10.26 10.03 ...
 $ EPriceCommercial       : num  6.96 6.38 6.59 9.33 8.25 ...
 $ EPriceIndustrial       : num  5.24 4.59 4.72 6.24 6.23 5.72 6.25 3.56 5.09 5.25 ...
 $ EPriceTransportation   : num  0 0 6.5 0 0 0 0 6.68 0 0 ...
 $ EPriceTotal            : num  6.99 6.35 6.32 8.6 8.35 8.09 9.1 4.89 6.89 6.63 ...
 $ EsalesResidential      : num  0.366 0.4 0.388 0.372 0.355 ...
 $ EsalesCommercial       : num  0.248 0.336 0.329 0.281 0.382 ...
 $ EsalesIndustrial       : num  0.386 0.264 0.272 0.347 0.263 ...
 $ EsalesTransportation   : num  0 0 0 0 0 0 0 0 0 0 ...
 $ EsalesOther            : num  NA 0 0.0111 NA NA ...
 $ EsalesTotal            : num  16.6 14.4 12.9 16.2 14.2 ...
 $ CumlFinancial          : int  4 1 12 19 3 23 15 5 5 8 ...
 $ CumlRegulatory         : int  4 1 9 1 6 8 2 5 1 4 ...
 $ Total.salary           : num  15.4 13.4 16.3 15.3 20.8 ...
 $ presidential.results   : int  0 0 1 0 0 1 0 1 0 0 ...
 $ Import                 : int  0 0 0 0 0 0 0 0 1 0 ...

str(train2)

'data.frame':   181 obs. of  27 variables:
 $ STATE                  : Factor w/ 50 levels "AK","AL","AR",..: 4 9 8 19 14 8 30 10 48 22 ...
 $ YEAR                   : int  2011 2011 2002 2007 2006 2006 2010 2011 2004 2011 ...
 $ GenTotal               : num  16.7 11.61 7.46 7.24 15.13 ...
 $ GenHydro               : num  0.08485 0.00082 0 0.01694 0.0009 ...
 $ GenSolar               : num  0.000771 0.000567 0 0 0 ...
 $ GenTotalRenewable      : num  0.08841 0.01191 0 0.04075 0.00531 ...
 $ GenSolarBinary         : int  1 1 0 1 0 0 0 1 0 0 ...
 $ GenTotalRenewableBinary: int  0 0 0 1 1 1 1 0 0 1 ...
 $ AllSourcesCO2          : num  8.27 5.99 7.54 3.96 7.95 ...
 $ AllSourcesSO2          : num  0.00459 0.00592 0.03869 0.00787 0.0243 ...
 $ AllSourcesNOx          : num  0.00815 0.00434 0.01426 0.00307 0.00962 ...
 $ EPriceResidential      : num  11.08 11.51 8.7 16.23 8.42 ...
 $ EPriceCommercial       : num  9.5 9.85 7.15 15.2 7.95 ...
 $ EPriceIndustrial       : num  6.55 8.55 4.85 13.03 4.69 ...
 $ EPriceTransportation   : num  0 8.81 0 9.24 5.59 0 0 7.94 0 8.53 ...
 $ EPriceTotal            : num  9.71 10.61 6.91 15.16 7.07 ...
 $ EsalesResidential      : num  0.441 0.517 0.335 0.352 0.326 ...
 $ EsalesCommercial       : num  0.394 0.408 0.315 0.475 0.355 ...
 $ EsalesIndustrial       : num  0.165 0.075 0.345 0.165 0.315 ...
 $ EsalesTransportation   : num  0 0.00038 0 0.00705 0.00364 ...
 $ EsalesOther            : num  NA NA 0.00498 NA NA ...
 $ EsalesTotal            : num  11.58 11.78 14.94 8.79 11.2 ...
 $ CumlFinancial          : int  42 59 2 33 6 2 32 39 3 37 ...
 $ CumlRegulatory         : int  21 14 4 12 6 7 9 7 7 12 ...
 $ Total.salary           : num  22 20.7 21.3 28.9 22.7 ...
 $ presidential.results   : int  0 1 1 1 1 1 1 0 1 1 ...
 $ Import                 : int  0 1 1 1 0 1 0 1 1 0 ...

testKMC.kcca <- as.kcca(testKMC, test.norm)
# clusters(testKMC.kcca)
testClusters <- predict(testKMC.kcca)
table(testClusters)

testClusters
  1   2 
130  80

test1 <- subset(test, testClusters == 1)
test2 <- subset(test, testClusters == 2)
str(test1)

'data.frame':   130 obs. of  27 variables:
 $ STATE                  : Factor w/ 50 levels "AK","AL","AR",..: 1 1 1 1 2 2 2 2 3 3 ...
 $ YEAR                   : int  2001 2007 2008 2012 2001 2006 2007 2013 2000 2001 ...
 $ GenTotal               : num  10.64 10.03 9.88 9.5 28.08 ...
 $ GenHydro               : num  0.1995 0.1893 0.173 0.2267 0.0667 ...
 $ GenSolar               : num  0 0 0 0 0 0 0 0 0 0 ...
 $ GenTotalRenewable      : num  0.1997 0.1909 0.1737 0.2325 0.0668 ...
 $ GenSolarBinary         : int  0 0 0 0 0 0 0 1 0 0 ...
 $ GenTotalRenewableBinary: int  1 0 1 1 0 0 1 1 0 1 ...
 $ AllSourcesCO2          : num  7.31 6.39 6.38 5.89 17.65 ...
 $ AllSourcesSO2          : num  0.01193 0.00631 0.00514 0.0037 0.10361 ...
 $ AllSourcesNOx          : num  0.0277 0.0253 0.0215 0.0233 0.0384 ...
 $ EPriceResidential      : num  12.12 15.18 16.56 17.88 7.01 ...
 $ EPriceCommercial       : num  10.29 12.19 13.64 14.93 6.53 ...
 $ EPriceIndustrial       : num  7.61 12.63 14.17 16.82 3.79 ...
 $ EPriceTransportation   : num  0 0 0 0 0 0 0 0 0 0 ...
 $ EPriceTotal            : num  10.5 13.3 14.7 16.3 5.6 ...
 $ EsalesResidential      : num  0.347 0.334 0.337 0.337 0.35 ...
 $ EsalesCommercial       : num  0.42 0.447 0.451 0.448 0.238 ...
 $ EsalesIndustrial       : num  0.198 0.219 0.213 0.215 0.403 ...
 $ EsalesTransportation   : num  0 0 0 0 0 0 0 0 0 0 ...
 $ EsalesOther            : num  0.0356 NA NA NA 0.00931 ...
 $ EsalesTotal            : num  8.61 9.31 9.23 8.78 17.78 ...
 $ CumlFinancial          : int  1 7 10 12 1 8 8 16 0 0 ...
 $ CumlRegulatory         : int  1 2 2 4 0 1 1 2 0 1 ...
 $ Total.salary           : num  18.7 25.7 27.1 31.2 13.2 ...
 $ presidential.results   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Import                 : int  0 0 0 0 0 0 0 0 0 0 ...

str(test2)

'data.frame':   80 obs. of  27 variables:
 $ STATE                  : Factor w/ 50 levels "AK","AL","AR",..: 4 5 5 6 6 6 7 7 7 7 ...
 $ YEAR                   : int  2009 2004 2010 2008 2010 2013 2001 2003 2005 2007 ...
 $ GenTotal               : num  17 5.48 5.48 10.84 9.95 ...
 $ GenHydro               : num  0.0574 0.1753 0.1638 0.0382 0.0311 ...
 $ GenSolar               : num  0.000126 0.002931 0.003769 0.000343 0.000838 ...
 $ GenTotalRenewable      : num  0.058 0.2787 0.2711 0.0996 0.1012 ...
 $ GenSolarBinary         : int  1 0 1 1 1 1 0 0 0 0 ...
 $ GenTotalRenewableBinary: int  1 1 1 1 1 1 0 0 1 1 ...
 $ AllSourcesCO2          : num  8.12 1.69 1.49 8.45 7.95 ...
 $ AllSourcesSO2          : num  4.99e-03 6.28e-04 6.77e-05 1.11e-02 8.81e-03 ...
 $ AllSourcesNOx          : num  0.00935 0.00276 0.00214 0.01272 0.01081 ...
 $ EPriceResidential      : num  10.7 12.2 14.8 10.1 11 ...
 $ EPriceCommercial       : num  9.35 11.64 13.09 8.57 9.13 ...
 $ EPriceIndustrial       : num  6.65 9.27 9.8 6.65 6.9 ...
 $ EPriceTransportation   : num  0 6.42 8.27 8.32 9.34 ...
 $ EPriceTotal            : num  9.56 11.35 13.01 8.59 9.15 ...
 $ EsalesResidential      : num  0.447 0.331 0.338 0.34 0.342 ...
 $ EsalesCommercial       : num  0.4 0.472 0.469 0.394 0.37 ...
 $ EsalesIndustrial       : num  0.153 0.194 0.191 0.265 0.287 ...
 $ EsalesTransportation   : num  0 0.003572 0.003177 0.000932 0.000876 ...
 $ EsalesOther            : num  NA 0 NA NA NA ...
 $ EsalesTotal            : num  11.15 7.09 6.94 10.58 10.39 ...
 $ CumlFinancial          : int  33 17 138 30 58 70 3 3 4 10 ...
 $ CumlRegulatory         : int  17 20 28 18 19 20 5 5 7 12 ...
 $ Total.salary           : num  21.2 20.8 24 27 26.3 ...
 $ presidential.results   : int  0 1 1 1 1 1 1 1 1 1 ...
 $ Import                 : int  0 1 1 0 1 1 1 1 0 1 ...

Select the correct statement(s) below:

table(train1$presidential.results)


  0   1 
248  60

table(train2$presidential.results)


  0   1 
 21 160

mean(train1$CumlFinancial)

[1] 7.766234

mean(train2$CumlFinancial)

[1] 34.87845

mean(train1$CumlRegulatory)

[1] 3.983766

mean(train2$CumlRegulatory)

[1] 12.28729

mean(train1$AllSourcesCO2, na.rm = TRUE)

[1] 15.09418

mean(train2$AllSourcesCO2, na.rm = TRUE)

[1] 5.841604

mean(train1$AllSourcesSO2, na.rm = TRUE)

[1] 0.04923232

mean(train2$AllSourcesSO2, na.rm = TRUE)

[1] 0.01533239

mean(train1$AllSourcesNOx, na.rm = TRUE)

[1] 0.0282891

mean(train2$AllSourcesNOx, na.rm = TRUE)

[1] 0.007871944

- On average, train1 contains more republican states than train2

- On average, train1 contains states that have recorded more CO2, SO2 and NOx emissions than train2

Problem 8 - Creating the Model on the First Cluster

Using the variables GenHydro, GenSolar, CumlFinancial, CumlRegulatory, Total.salary and Import, creating mod1 using a logistic regression on train1.

mod1 <- glm(GenSolarBinary ~ GenHydro + GenSolar + CumlFinancial + 
               CumlRegulatory + Total.salary + Import, 
           data = train1,
           family = binomial)

What variable is most predictive?

summary(mod1)


Call:
glm(formula = GenSolarBinary ~ GenHydro + GenSolar + CumlFinancial + 
    CumlRegulatory + Total.salary + Import, family = binomial, 
    data = train1)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.8654  -0.2715  -0.1692  -0.1194   2.8802  

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)    
(Intercept)    -7.574e+00  1.945e+00  -3.893 9.89e-05 ***
GenHydro       -1.201e+00  2.082e+00  -0.577  0.56398    
GenSolar        2.324e+04  1.104e+04   2.106  0.03524 *  
CumlFinancial   6.836e-02  2.159e-02   3.166  0.00154 ** 
CumlRegulatory  1.589e-01  1.145e-01   1.388  0.16521    
Total.salary    1.549e-01  9.190e-02   1.686  0.09188 .  
Import         -3.323e-01  9.628e-01  -0.345  0.72995    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 131.538  on 307  degrees of freedom
Residual deviance:  90.376  on 301  degrees of freedom
AIC: 104.38

Number of Fisher Scoring iterations: 7

CumlFinancial

Problem 9 - Evaluating the Model Obtained Using the First Cluster

What is the accuracy on test1, the subset of test corresponding to the first cluster?

predTest1 <- predict(mod1, newdata = test1, type = "response")
table(test1$GenSolarBinary, predTest1 > 0.5)

   
    FALSE TRUE
  0   114    1
  1    11    4

(114 + 4) / nrow(test1)

[1] 0.9076923

Wow!

I’d like to know if mod1 gives us an edge over mod on the dataset test1.

Using mod, predicting GenSolarBinary for the observation in test1 and report the accuracy below:

predTest1WithMod <- predict(mod, newdata = test1, type = "response")
table(test1$GenSolarBinary, predTest1WithMod > 0.5)

115 / nrow(test1)

[1] 0.8846154

Problem 10 - Creating the Model on the Second Cluster

Using the variables GenHydro, GenSolar, CumlFinancial, CumlRegulatory, Total.salary and Import, creating mod2 using a logistic regression on train2.

mod2 <- glm(GenSolarBinary ~ GenHydro + GenSolar + CumlFinancial + 
                CumlRegulatory + Total.salary + Import, 
            data = train2,
            family = binomial)

Select the correct statement(s) below?

summary(mod2)


Call:
glm(formula = GenSolarBinary ~ GenHydro + GenSolar + CumlFinancial + 
    CumlRegulatory + Total.salary + Import, family = binomial, 
    data = train2)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-3.6321  -0.7964   0.0083   0.8536   1.7143  

Coefficients:
                Estimate Std. Error z value Pr(>|z|)   
(Intercept)     -4.14627    1.43969  -2.880  0.00398 **
GenHydro        -3.88698    1.23196  -3.155  0.00160 **
GenSolar       951.59397  322.86360   2.947  0.00321 **
CumlFinancial    0.02077    0.01008   2.060  0.03939 * 
CumlRegulatory   0.17315    0.06541   2.647  0.00812 **
Total.salary     0.08472    0.05576   1.519  0.12869   
Import          -0.72189    0.40427  -1.786  0.07416 . 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 250.25  on 180  degrees of freedom
Residual deviance: 173.65  on 174  degrees of freedom
AIC: 187.65

Number of Fisher Scoring iterations: 7

summary(mod1)


Call:
glm(formula = GenSolarBinary ~ GenHydro + GenSolar + CumlFinancial + 
    CumlRegulatory + Total.salary + Import, family = binomial, 
    data = train1)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.8654  -0.2715  -0.1692  -0.1194   2.8802  

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)    
(Intercept)    -7.574e+00  1.945e+00  -3.893 9.89e-05 ***
GenHydro       -1.201e+00  2.082e+00  -0.577  0.56398    
GenSolar        2.324e+04  1.104e+04   2.106  0.03524 *  
CumlFinancial   6.836e-02  2.159e-02   3.166  0.00154 ** 
CumlRegulatory  1.589e-01  1.145e-01   1.388  0.16521    
Total.salary    1.549e-01  9.190e-02   1.686  0.09188 .  
Import         -3.323e-01  9.628e-01  -0.345  0.72995    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 131.538  on 307  degrees of freedom
Residual deviance:  90.376  on 301  degrees of freedom
AIC: 104.38

Number of Fisher Scoring iterations: 7

- Unlike mod1, the number of regulatory policies is more predictive than the number of financial incentives in mod2

Problem 11 - Evaluating the Model Obtained Using the Second Cluster

Using the threshold of 0.5, what is the accuracy on test2, the subset of test corresponding to the second cluster?

predTest2 <- predict(mod2, newdata = test2, type = "response")
table(test2$GenSolarBinary, predTest2 > 0.5)

   
    FALSE TRUE
  0    40    6
  1    14   20

(40 + 20) / nrow(test2)

[1] 0.75

We would like to know if mod2 gives us an edge over mod on the dataset test2.

Using mod, predicting GenSolarBinary for the observation in test2 and report the accuracy below:

predTest2WithMod <- predict(mod, newdata = test2, type = "response")
table(test2$GenSolarBinary, predTest2WithMod > 0.5)

   
    FALSE TRUE
  0    39    7
  1    16   18

(39 + 18) / nrow(test2)

[1] 0.7125

Problem 12 - Evaluating the Performance of the Cluster-the-Predict Algorithm

To compute the overall test-set accuracy of the cluster-the-predict approach, I can combine all the test-set predictions into a single vector “AllPredictions” and all the true outcomes into a single vector “AllOutcomes”.

AllPredictions <- c(predTest1, predTest2)
AllOutcomes <- c(test1$GenSolarBinary, test2$GenSolarBinary)

What is the overall accuracy on the test-set, using the cluster-then-predict approach, again using a threshold of 0.5?

table(AllOutcomes, AllPredictions > 0.5)

           
AllOutcomes FALSE TRUE
          0   154    7
          1    25   24

length(AllOutcomes)

[1] 210

length(AllPredictions)

[1] 210

(154 + 24) / length(AllOutcomes)

[1] 0.847619

R Data Analytics Machine Learning

Rihad Variawa

Data Scientist

I am the Sr. Data Scientist at Malastare AI and head of global Fintech Research, responsible for overall vision and strategy, investment priorities and offering development. Working in the financial services industry, helping clients adopt new technologies that can transform the way they transact and engage with their customers. I am passionate about data science, super inquisitive and challenge seeker; looking at everything through a lens of numbers and problem-solver at the core. From understanding a business problem to collecting and visualizing data, until the stage of prototyping, fine-tuning and deploying models to real-world applications, I find the fulfillment of tackling challenges to solve complex problems using data.