NYC Subway

May 1, 2019

Regression is a statistical technique to analyze the relationship between a response variable and one or more predictors. If the relationship is linear, a straight line can be fit to the data using the following equation- y = mx + c Here, y is the outcome variable, m is the gradient or slope of the line, x is the predictor variable and c is the intercept on the line.

For example, studing the relationship between size of a house in square feet and its price. Linear modelling helps in predicting the price of a house based on the size. Thus, the above equation can be written as- price = m * size + intercept

Implementation

Dataset: Turnstile Data of NYC Subway

df_turnstile <- read.csv("turnstile_data_master_with_weather.csv")
head(df_turnstile)

  X UNIT      DATEn    TIMEn Hour   DESCn ENTRIESn_hourly EXITSn_hourly
1 0 R001 2011-05-01 01:00:00    1 REGULAR               0             0
2 1 R001 2011-05-01 05:00:00    5 REGULAR             217           553
3 2 R001 2011-05-01 09:00:00    9 REGULAR             890          1262
4 3 R001 2011-05-01 13:00:00   13 REGULAR            2451          3708
5 4 R001 2011-05-01 17:00:00   17 REGULAR            4400          2501
6 5 R001 2011-05-01 21:00:00   21 REGULAR            3372          2122
  maxpressurei maxdewpti mindewpti minpressurei meandewpti meanpressurei
1        30.31        42        35        30.23         39         30.27
2        30.31        42        35        30.23         39         30.27
3        30.31        42        35        30.23         39         30.27
4        30.31        42        35        30.23         39         30.27
5        30.31        42        35        30.23         39         30.27
6        30.31        42        35        30.23         39         30.27
  fog rain meanwindspdi mintempi meantempi maxtempi precipi thunder
1   0    0            5       50        60       69       0       0
2   0    0            5       50        60       69       0       0
3   0    0            5       50        60       69       0       0
4   0    0            5       50        60       69       0       0
5   0    0            5       50        60       69       0       0
6   0    0            5       50        60       69       0       0

str(df_turnstile)

## 'data.frame':    131951 obs. of  22 variables:
##  $ X              : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ UNIT           : Factor w/ 465 levels "R001","R002",..: 1 1 1 1 1 1 2 2 2 2 ...
##  $ DATEn          : Factor w/ 30 levels "2011-05-01","2011-05-02",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ TIMEn          : Factor w/ 32261 levels "00:00:00","00:00:03",..: 1253 6369 11995 17547 23222 28407 1253 6369 11995 17547 ...
##  $ Hour           : int  1 5 9 13 17 21 1 5 9 13 ...
##  $ DESCn          : Factor w/ 1 level "REGULAR": 1 1 1 1 1 1 1 1 1 1 ...
##  $ ENTRIESn_hourly: num  0 217 890 2451 4400 ...
##  $ EXITSn_hourly  : num  0 553 1262 3708 2501 ...
##  $ maxpressurei   : num  30.3 30.3 30.3 30.3 30.3 ...
##  $ maxdewpti      : num  42 42 42 42 42 42 42 42 42 42 ...
##  $ mindewpti      : num  35 35 35 35 35 35 35 35 35 35 ...
##  $ minpressurei   : num  30.2 30.2 30.2 30.2 30.2 ...
##  $ meandewpti     : num  39 39 39 39 39 39 39 39 39 39 ...
##  $ meanpressurei  : num  30.3 30.3 30.3 30.3 30.3 ...
##  $ fog            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ rain           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ meanwindspdi   : num  5 5 5 5 5 5 5 5 5 5 ...
##  $ mintempi       : num  50 50 50 50 50 50 50 50 50 50 ...
##  $ meantempi      : num  60 60 60 60 60 60 60 60 60 60 ...
##  $ maxtempi       : num  69 69 69 69 69 69 69 69 69 69 ...
##  $ precipi        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ thunder        : num  0 0 0 0 0 0 0 0 0 0 ...

summary(df_turnstile)

##        X               UNIT              DATEn             TIMEn      
##  Min.   :     0   R549   :12198   2011-05-05:  4510   00:00:00: 7144  
##  1st Qu.: 32988   R550   : 6881   2011-05-12:  4503   04:00:00: 7141  
##  Median : 65975   R541   : 5922   2011-05-19:  4503   20:00:00: 7112  
##  Mean   : 65975   R540   : 4420   2011-05-16:  4493   12:00:00: 7105  
##  3rd Qu.: 98962   R543   : 4146   2011-05-23:  4479   16:00:00: 7067  
##  Max.   :131950   R552   : 3509   2011-05-26:  4479   09:00:00: 6087  
##                   (Other):94875   (Other)   :104984   (Other) :90295  
##       Hour          DESCn        ENTRIESn_hourly EXITSn_hourly    
##  Min.   : 0.0   REGULAR:131951   Min.   :    0   Min.   :    0.0  
##  1st Qu.: 5.0                    1st Qu.:   39   1st Qu.:   32.0  
##  Median :12.0                    Median :  279   Median :  232.0  
##  Mean   :10.9                    Mean   : 1095   Mean   :  886.9  
##  3rd Qu.:17.0                    3rd Qu.: 1109   3rd Qu.:  847.0  
##  Max.   :23.0                    Max.   :51839   Max.   :45249.0  
##                                                                   
##   maxpressurei     maxdewpti       mindewpti      minpressurei  
##  Min.   :29.74   Min.   :39.00   Min.   :22.00   Min.   :29.54  
##  1st Qu.:29.96   1st Qu.:50.00   1st Qu.:38.00   1st Qu.:29.84  
##  Median :30.03   Median :57.00   Median :51.00   Median :29.91  
##  Mean   :30.03   Mean   :57.24   Mean   :48.26   Mean   :29.89  
##  3rd Qu.:30.10   3rd Qu.:64.00   3rd Qu.:55.00   3rd Qu.:29.97  
##  Max.   :30.31   Max.   :70.00   Max.   :66.00   Max.   :30.23  
##                                                                 
##    meandewpti   meanpressurei        fog              rain       
##  Min.   :31.0   Min.   :29.64   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:45.0   1st Qu.:29.91   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :54.0   Median :29.96   Median :0.0000   Median :0.0000  
##  Mean   :52.7   Mean   :29.97   Mean   :0.1671   Mean   :0.3342  
##  3rd Qu.:60.0   3rd Qu.:30.05   3rd Qu.:0.0000   3rd Qu.:1.0000  
##  Max.   :68.0   Max.   :30.27   Max.   :1.0000   Max.   :1.0000  
##                                                                  
##   meanwindspdi       mintempi       meantempi        maxtempi    
##  Min.   : 1.000   Min.   :46.00   Min.   :55.00   Min.   :58.00  
##  1st Qu.: 5.000   1st Qu.:52.00   1st Qu.:60.00   1st Qu.:65.00  
##  Median : 5.000   Median :54.00   Median :63.00   Median :71.00  
##  Mean   : 5.543   Mean   :56.17   Mean   :64.27   Mean   :71.77  
##  3rd Qu.: 6.000   3rd Qu.:60.00   3rd Qu.:68.00   3rd Qu.:78.00  
##  Max.   :12.000   Max.   :70.00   Max.   :78.00   Max.   :86.00  
##                                                                  
##     precipi          thunder 
##  Min.   :0.0000   Min.   :0  
##  1st Qu.:0.0000   1st Qu.:0  
##  Median :0.0000   Median :0  
##  Mean   :0.1723   Mean   :0  
##  3rd Qu.:0.1000   3rd Qu.:0  
##  Max.   :2.1800   Max.   :0  
##

The aim is to predict Entriesn_hourly i.e. number of hourly entries at the NYC Subway based on other predictor variables.

Some of the predictor variables are not of any use; X is the row number which won’t be useful in analysis.

DESCn only has one value in the entire dataset and can be omitted.

df_turnstile$X <- NULL
df_turnstile$DESCn <- NULL

Exploratory Data Analysis:

Scatterplots are a good way of understand the linear relationship between two variables. Following is a scatterplot between the Hour of the day and the number of entries, colored by whether there was rain or not.

ggplot(df_turnstile, aes(x = as.Date(DATEn), y = ENTRIESn_hourly)) + geom_point() + scale_x_date(breaks = seq(as.Date("2011-05-01"), as.Date("2011-05-30"), by = "1 day"),labels=date_format("%d")) +
  theme_classic()

This first graph shows the distribution of the number of entries over the month of May 2011. The distribution shows 5 days of high entries and 2 days where it is comparitively less; explaining the weekdays and weekend entries.

Studying the number of turnstile entries based on the Hour of the day.

ggplot(df_turnstile, aes(x = Hour, y = ENTRIESn_hourly)) + geom_point(color = "red", alpha = 0.1) + scale_x_continuous(limits = c(0,25), breaks = seq(0,25,1)) +
  theme_classic()

This graph shows the number of entries throughout the day. (Hour 1 is the hour from Midnight to 12.59 am, Hour 2 is 1am to 1.59 am and so on)

The transparency in the graph shows the level of number of entries; for example, the darker the red color, the more number of data points lie at that point in the graph.

There are some spikes in the data like at Hour 1, hour 13 and Hour 22.

The next step would be to investigate this relationship further and see how other factors come into play.

df_turnstile$DATEn <- as.Date(df_turnstile$DATEn)

df_turnstile$dayofweek <- weekdays(df_turnstile$DATEn)

ggplot(data = df_turnstile, aes(x = Hour, y = ENTRIESn_hourly, color = dayofweek)) + geom_point() + scale_x_continuous(limits = c(0,25), breaks = seq(0,25,1)) +
  theme_classic()

The above graph is an add on to the previous one. The graph is colored by the day of the week. Now, it can be seen that the spike at Hour 1 is on Saturday. (Maybe people returning home after Saturday night partying).

However, the spikes at Hour 13 and Hour 22 are on Weekdays (Hour 13 is Lunch hour and people going home from work between 10 and 11 pm may be the explanation).

ggplot(data = df_turnstile, aes(x = factor(dayofweek), y = ENTRIESn_hourly)) + geom_point() +
  theme_classic()

This graph shows the distribution of the number of entries over the week. Weekdays have a similar number of entries whereas weekends are less than the rest. This makes sense intuitively since people take the subway on weekdays to commute to work; whereas the frequency of trains is usually less on weekends hence the lesser number of entries.

The next step is to study how weather conditions affect these entires over the week.

ggplot(df_turnstile, aes(x = Hour, y = ENTRIESn_hourly, color = factor(rain))) + geom_point() + scale_x_continuous(limits = c(0,25), breaks = seq(0,25,1)) +
  theme_classic()

This graph explains the spikes at Hour 13 and Hour 22; the reason being more people using the subway due to the rain.

ggplot(data = df_turnstile, aes(x = maxtempi, y = ENTRIESn_hourly, shape = factor(rain))) + geom_point(aes(color = Hour)) + scale_x_continuous(limits = c(55,90), breaks = seq(55,90,5)) +
  theme_classic()

Building a correlation matrix:

A correlation matrix is the best way to understand if two factors are collinear. It calculates the correlation coefficient between each factor and others. Correlation basically shows how change in one factor is likely to affect another.

The correlation coefficient is a number between -1 and 1. A value of 0 denotes no correlation between two factors; coeffiecient 1 denotes complete position correlation and a value of -1 denotes complete negative correlation.

Thus, if two predictor variables are highly correlated, positively or negatively (i.e. >=0.5 or <=(-0.5)), considering both in the analyses will only increase complexity without giving any real value. In such cases, any one of the factors can be omitted.

df_turnstile$rain <- as.numeric(df_turnstile$rain)
df_turnstile$Hour <- as.numeric(df_turnstile$Hour)
nums <- sapply(df_turnstile, is.numeric)
correlation_matrix <- cor(df_turnstile[, nums], method = "spearman")

Warning in cor(df_turnstile[, nums], method = "spearman"): the standard
deviation is zero

require(corrplot)
corrplot(correlation_matrix, method = "color", type = "lower")

From the above plot, it is evident that rain and fog are highly correlated(0.44) and hence either one of them can be neglected while building the model.

Significance Tests

Chi Square Test/ Pearsons Chi square test

Goodness of fit

Independence of variables (if dependent and independent are statistically related)

Mann Whitney U test

Building a model:

Split data into train and test set.

df_turnstile$rain <- as.factor(df_turnstile$rain)
df_turnstile$Hour <- as.numeric(df_turnstile$Hour)
set.seed(1234)

require(caTools)

Loading required package: caTools

sample <- sample.split(df_turnstile, SplitRatio = 0.75)

df_train <- subset(df_turnstile, sample == TRUE)
df_test <- subset(df_turnstile, sample == FALSE)


f = ENTRIESn_hourly ~ Hour + meandewpti + meanpressurei + meandewpti + rain + precipi + meantempi

linear_model1 <- lm(formula = f, data = df_train)

summary(linear_model1)


Call:
lm(formula = f, data = df_train)

Residuals:
   Min     1Q Median     3Q    Max 
 -1927   -963   -595     29  50217 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   10114.589   1796.834   5.629 1.82e-08 ***
Hour             59.503      1.088  54.693  < 2e-16 ***
meandewpti       -1.167      1.272  -0.917 0.358964    
meanpressurei  -306.446     59.626  -5.139 2.76e-07 ***
rain1           -36.115     23.274  -1.552 0.120718    
precipi          58.822     21.466   2.740 0.006142 ** 
meantempi        -6.551      1.808  -3.623 0.000292 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2301 on 94244 degrees of freedom
Multiple R-squared:  0.03157,   Adjusted R-squared:  0.03151 
F-statistic: 512.1 on 6 and 94244 DF,  p-value: < 2.2e-16

f = ENTRIESn_hourly ~ Hour + maxdewpti + maxpressurei + maxdewpti + rain + precipi + maxtempi

linear_model2 <- lm(formula = f, data = df_train)

summary(linear_model2)


Call:
lm(formula = f, data = df_train)

Residuals:
   Min     1Q Median     3Q    Max 
 -1928   -962   -594     27  50189 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  10390.409   1861.345   5.582 2.38e-08 ***
Hour            59.505      1.088  54.687  < 2e-16 ***
maxdewpti       -2.352      1.220  -1.928  0.05387 .  
maxpressurei  -320.621     61.591  -5.206 1.94e-07 ***
rain1          -16.255     22.766  -0.714  0.47522    
precipi         56.349     21.568   2.613  0.00899 ** 
maxtempi        -2.561      1.329  -1.927  0.05396 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2301 on 94244 degrees of freedom
Multiple R-squared:  0.03131,   Adjusted R-squared:  0.03125 
F-statistic: 507.7 on 6 and 94244 DF,  p-value: < 2.2e-16

f = ENTRIESn_hourly ~ Hour + maxdewpti + maxpressurei + maxdewpti + rain + precipi + maxtempi + 
                      mindewpti + minpressurei + mintempi + 
                      meandewpti + meanpressurei + meantempi
linear_model3 <- lm(formula = f, data = df_train)

summary(linear_model3)


Call:
lm(formula = f, data = df_train)

Residuals:
   Min     1Q Median     3Q    Max 
 -2169   -977   -579     36  50233 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   12754.788   2104.365   6.061 1.36e-09 ***
Hour             59.561      1.086  54.823  < 2e-16 ***
maxdewpti        21.654      6.591   3.285  0.00102 ** 
maxpressurei  -2259.680    285.300  -7.920 2.39e-15 ***
rain1           -81.456     26.213  -3.107  0.00189 ** 
precipi          22.064     23.186   0.952  0.34130    
maxtempi         35.348     16.576   2.132  0.03297 *  
mindewpti         5.976      6.068   0.985  0.32470    
minpressurei  -1710.971    227.800  -7.511 5.93e-14 ***
mintempi        -29.708     17.853  -1.664  0.09611 .  
meandewpti       -6.414     11.166  -0.574  0.56567    
meanpressurei  3589.144    404.039   8.883  < 2e-16 ***
meantempi       -44.866     33.218  -1.351  0.17681    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2298 on 94238 degrees of freedom
Multiple R-squared:  0.03437,   Adjusted R-squared:  0.03425 
F-statistic: 279.5 on 12 and 94238 DF,  p-value: < 2.2e-16

f = ENTRIESn_hourly ~ rain + precipi + maxtempi + 
                      mindewpti + mintempi + 
                      meandewpti + meantempi
linear_model3 <- lm(formula = f, data = df_train)

summary(linear_model3)


Call:
lm(formula = f, data = df_train)

Residuals:
   Min     1Q Median     3Q    Max 
 -1389  -1027   -798     24  50810 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1421.505     96.171  14.781  < 2e-16 ***
rain1       -121.264     25.048  -4.841 1.29e-06 ***
precipi       65.162     21.948   2.969  0.00299 ** 
maxtempi      16.947     16.116   1.052  0.29298    
mindewpti    -24.309      4.339  -5.602 2.13e-08 ***
mintempi     -32.259     17.815  -1.811  0.07018 .  
meandewpti    41.276      5.303   7.784 7.10e-15 ***
meantempi    -10.937     32.776  -0.334  0.73861    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2336 on 94243 degrees of freedom
Multiple R-squared:  0.00199,   Adjusted R-squared:  0.001916 
F-statistic: 26.85 on 7 and 94243 DF,  p-value: < 2.2e-16

Predict

The predict() function is used to test the model fit against the test set. The difference between the predicted values and actual values will help determine the accuracy of the model.

predict_linear <- predict(linear_model3, newdata = df_test[, c("rain", "precipi", "maxtempi", 
                                                              "mindewpti", "mintempi","meandewpti", "meantempi")])

mse <- mean((df_test$ENTRIESn_hourly - predict_linear)^2)
sqrt(mse)

[1] 2331.393

Stepwise Regression

Stepwise Regression can be used to find the best subset of factors instead of manually modifying the formula each time. Stepwise regression performs feature selection and builds the model. The subset is selected based on a parameter called AIC(Akaike Information Criterion) and RSS(Residual Sum of Squares)

f = ENTRIESn_hourly ~ Hour + maxdewpti + maxpressurei + maxdewpti + rain + precipi + maxtempi + 
                      mindewpti + minpressurei + mintempi + 
                      meandewpti + meanpressurei + meantempi
model_stepwise <- lm(f, data = df_train)

fit <- step(model_stepwise)

Start:  AIC=1458969
ENTRIESn_hourly ~ Hour + maxdewpti + maxpressurei + maxdewpti + 
    rain + precipi + maxtempi + mindewpti + minpressurei + mintempi + 
    meandewpti + meanpressurei + meantempi

                Df  Sum of Sq        RSS     AIC
- meandewpti     1 1.7424e+06 4.9760e+11 1458967
- precipi        1 4.7816e+06 4.9760e+11 1458968
- mindewpti      1 5.1214e+06 4.9760e+11 1458968
- meantempi      1 9.6324e+06 4.9761e+11 1458969
<none>                        4.9760e+11 1458969
- mintempi       1 1.4621e+07 4.9761e+11 1458970
- maxtempi       1 2.4011e+07 4.9762e+11 1458972
- rain           1 5.0989e+07 4.9765e+11 1458977
- maxdewpti      1 5.6986e+07 4.9765e+11 1458978
- minpressurei   1 2.9787e+08 4.9789e+11 1459024
- maxpressurei   1 3.3124e+08 4.9793e+11 1459030
- meanpressurei  1 4.1666e+08 4.9801e+11 1459046
- Hour           1 1.5870e+10 5.1347e+11 1461926

Step:  AIC=1458967
ENTRIESn_hourly ~ Hour + maxdewpti + maxpressurei + rain + precipi + 
    maxtempi + mindewpti + minpressurei + mintempi + meanpressurei + 
    meantempi

                Df  Sum of Sq        RSS     AIC
- precipi        1 4.3774e+06 4.9760e+11 1458966
- mindewpti      1 6.1022e+06 4.9760e+11 1458967
- meantempi      1 9.8265e+06 4.9761e+11 1458967
<none>                        4.9760e+11 1458967
- mintempi       1 1.4824e+07 4.9761e+11 1458968
- maxtempi       1 2.4880e+07 4.9762e+11 1458970
- rain           1 5.3047e+07 4.9765e+11 1458976
- maxdewpti      1 1.3688e+08 4.9774e+11 1458991
- minpressurei   1 2.9613e+08 4.9789e+11 1459022
- maxpressurei   1 3.5576e+08 4.9795e+11 1459033
- meanpressurei  1 4.2195e+08 4.9802e+11 1459045
- Hour           1 1.5870e+10 5.1347e+11 1461924

Step:  AIC=1458966
ENTRIESn_hourly ~ Hour + maxdewpti + maxpressurei + rain + maxtempi + 
    mindewpti + minpressurei + mintempi + meanpressurei + meantempi

                Df  Sum of Sq        RSS     AIC
- mindewpti      1 6.5889e+06 4.9761e+11 1458966
- meantempi      1 8.7840e+06 4.9761e+11 1458966
<none>                        4.9760e+11 1458966
- mintempi       1 1.7097e+07 4.9762e+11 1458968
- maxtempi       1 2.3165e+07 4.9763e+11 1458969
- rain           1 4.8873e+07 4.9765e+11 1458974
- maxdewpti      1 1.4351e+08 4.9775e+11 1458991
- minpressurei   1 2.9223e+08 4.9789e+11 1459020
- maxpressurei   1 4.2009e+08 4.9802e+11 1459044
- meanpressurei  1 4.4393e+08 4.9805e+11 1459048
- Hour           1 1.5870e+10 5.1347e+11 1461923

Step:  AIC=1458966
ENTRIESn_hourly ~ Hour + maxdewpti + maxpressurei + rain + maxtempi + 
    minpressurei + mintempi + meanpressurei + meantempi

                Df  Sum of Sq        RSS     AIC
- meantempi      1 6.6694e+06 4.9762e+11 1458965
<none>                        4.9761e+11 1458966
- mintempi       1 1.8191e+07 4.9763e+11 1458967
- maxtempi       1 1.8969e+07 4.9763e+11 1458967
- rain           1 5.0144e+07 4.9766e+11 1458973
- minpressurei   1 3.0167e+08 4.9791e+11 1459021
- maxpressurei   1 4.1365e+08 4.9802e+11 1459042
- meanpressurei  1 4.4210e+08 4.9805e+11 1459047
- maxdewpti      1 4.4868e+08 4.9806e+11 1459048
- Hour           1 1.5869e+10 5.1348e+11 1461922

Step:  AIC=1458965
ENTRIESn_hourly ~ Hour + maxdewpti + maxpressurei + rain + maxtempi + 
    minpressurei + mintempi + meanpressurei

                Df  Sum of Sq        RSS     AIC
<none>                        4.9762e+11 1458965
- rain           1 5.6258e+07 4.9767e+11 1458973
- maxtempi       1 2.9480e+08 4.9791e+11 1459019
- minpressurei   1 2.9848e+08 4.9791e+11 1459019
- maxpressurei   1 4.0799e+08 4.9802e+11 1459040
- meanpressurei  1 4.3771e+08 4.9805e+11 1459046
- maxdewpti      1 5.1002e+08 4.9813e+11 1459059
- mintempi       1 1.1998e+09 4.9882e+11 1459190
- Hour           1 1.5869e+10 5.1348e+11 1461921

summary(fit)


Call:
lm(formula = ENTRIESn_hourly ~ Hour + maxdewpti + maxpressurei + 
    rain + maxtempi + minpressurei + mintempi + meanpressurei, 
    data = df_train)

Residuals:
   Min     1Q Median     3Q    Max 
 -2188   -978   -580     37  50211 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   12084.136   1885.501   6.409 1.47e-10 ***
Hour             59.558      1.086  54.821  < 2e-16 ***
maxdewpti        22.558      2.295   9.828  < 2e-16 ***
maxpressurei  -2316.325    263.511  -8.790  < 2e-16 ***
rain1           -79.208     24.266  -3.264   0.0011 ** 
maxtempi         12.496      1.672   7.472 7.96e-14 ***
minpressurei  -1580.716    210.243  -7.519 5.59e-14 ***
mintempi        -52.348      3.473 -15.074  < 2e-16 ***
meanpressurei  3536.013    388.370   9.105  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2298 on 94242 degrees of freedom
Multiple R-squared:  0.03433,   Adjusted R-squared:  0.03425 
F-statistic: 418.8 on 8 and 94242 DF,  p-value: < 2.2e-16

Residual Sum of Squares

It is the sum of the squared difference between the predicted and the actual values. Lower the RSS, better the fit.

AIC- Akaike Information Criterion

According to Wikipedia, IC provides a means for model selection. It estimates the quality of the model, relative to other models.

AIC enables comparison of different models built on the same dataset and selects the best. A lower AIC value signifies a better model. However, it is a relative comparison technique i.e. it gives the best model with respect to others; however there is no way to determine if the best model is actually applicable and useful.

In addition, the AIC also penalises extra variables and hence gives preference to simpler models.

However, AIC does not help in solving the problem of multicollinearity. Thus, the factors that are collinear must first be dealt with before performing stepwise regression and using AIC to select best fit.

From the above summary, the model with best fit is the one that uses the formula- ENTRIESn_hourly ~ Hour + maxdewpti + maxpressurei + rain + maxtempi + minpressurei + mintempi + meanpressurei + meantempi

f = ENTRIESn_hourly ~ Hour + maxdewpti + maxpressurei + rain + maxtempi + 
    minpressurei + mintempi + meanpressurei + meantempi

model_final  <- lm(f, data = df_train)

predict_final <- predict(model_final, newdata = df_test[, c("Hour", "maxdewpti", "maxpressurei",                                                                 "rain", "maxtempi", "minpressurei",                                                                "mintempi", "meanpressurei", "meantempi")])

mse <- mean((df_test$ENTRIESn_hourly - predict_final)^2)
sqrt(mse)

[1] 2292.802

Thus, the RMSE went down from 2331 to 2292 which shows that this model is better than the previous one.

R regression Machine Learning

Rihad Variawa

Data Scientist

I am the Sr. Data Scientist at Malastare AI and head of global Fintech Research, responsible for overall vision and strategy, investment priorities and offering development. Working in the financial services industry, helping clients adopt new technologies that can transform the way they transact and engage with their customers. I am passionate about data science, super inquisitive and challenge seeker; looking at everything through a lens of numbers and problem-solver at the core. From understanding a business problem to collecting and visualizing data, until the stage of prototyping, fine-tuning and deploying models to real-world applications, I find the fulfillment of tackling challenges to solve complex problems using data.