Popularity of music records
The music industry has a well-developed market with a global annual revenue around $15 billion. The recording industry is highly competitive and is dominated by three big production companies which make up nearly 82% of the total annual album sales.
Artists are at the core of the music industry and record labels provide them with the necessary resources to sell their music on a large scale. A record label incurs numerous costs (studio recording, marketing, distribution, and touring) in exchange for a percentage of the profits from album sales, singles and concert tickets.
Unfortunately, the success of an artist’s release is highly uncertain: a single may be extremely popular, resulting in widespread radio play and digital downloads, while another single may turn out quite unpopular, and therefore unprofitable.
Knowing the competitive nature of the recording industry, record labels face the fundamental decision problem of which musical releases to support to maximize their financial success.
How can we use analytics to predict the popularity of a song? In this project, we challenge ourselves to predict whether a song will reach a spot in the Top 10 of the Billboard Hot 100 Chart.
Taking an analytics approach, we aim to use information about a song’s properties to predict its popularity. The dataset songs.csv consists of all songs that made it to the Top 10 of the Billboard Hot 100 Chart from 1990 to 2010, plus a sample of additional songs that didn’t make the Top 10. This data comes from three sources: Wikipedia, Billboard.com, and EchoNest.
The variables included in the dataset either describe the artist or the song, or they are associated with the following song attributes: time signature, loudness, key, pitch, tempo, and timbre.
Here’s a detailed description of the variables:
- year = the year the song was released
- songtitle = the title of the song
- artistname = the name of the artist of the song
- songID and artistID = identifying variables for the song and artist
- timesignature and timesignature_confidence = a variable estimating the time signature of the song, and the confidence in the estimate
- loudness = a continuous variable indicating the average amplitude of the audio in decibels
- tempo and tempo_confidence = a variable indicating the estimated beats per minute of the song, and the confidence in the estimate
- key and key_confidence = a variable with twelve levels indicating the estimated key of the song (C, C#, . . ., B), and the confidence in the estimate
- energy = a variable that represents the overall acoustic energy of the song, using a mix of features such as loudness
- pitch = a continuous variable that indicates the pitch of the song
- timbre_0_min, timbre_0_max, timbre_1_min, timbre_1_max, . . . , timbre_11_min, and timbre_11_max = variables that indicate the minimum/maximum values over all segments for each of the twelve values in the timbre vector (resulting in 24 continuous variables)
- Top10 = a binary variable indicating whether or not the song made it to the Top 10 of the Billboard Hot 100 Chart (1 if it was in the top 10, and 0 if it was not)
Problem 1.1 - Understanding the Data
Use the read.csv function to load the dataset “songs.csv” into R. How many observations (songs) are from the year 2010?
songs <- read.csv("songs.csv")
str(songs)
'data.frame': 7574 obs. of 39 variables:
$ year : int 2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
$ songtitle : Factor w/ 7141 levels "̈́ l'or_e des bois",..: 6204 5522 241 3115 48 608 255 4419 2886 6756 ...
$ artistname : Factor w/ 1032 levels "50 Cent","98 Degrees",..: 3 3 3 3 3 3 3 3 3 12 ...
$ songID : Factor w/ 7549 levels "SOAACNI1315CD4AC42",..: 595 5439 5252 1716 3431 1020 1831 3964 6904 2473 ...
$ artistID : Factor w/ 1047 levels "AR00B1I1187FB433EB",..: 671 671 671 671 671 671 671 671 671 507 ...
$ timesignature : int 3 4 4 4 4 4 4 4 4 4 ...
$ timesignature_confidence: num 0.853 1 1 1 0.788 1 0.968 0.861 0.622 0.938 ...
$ loudness : num -4.26 -4.05 -3.57 -3.81 -4.71 ...
$ tempo : num 91.5 140 160.5 97.5 140.1 ...
$ tempo_confidence : num 0.953 0.921 0.489 0.794 0.286 0.347 0.273 0.83 0.018 0.929 ...
$ key : int 11 10 2 1 6 4 10 5 9 11 ...
$ key_confidence : num 0.453 0.469 0.209 0.632 0.483 0.627 0.715 0.423 0.751 0.602 ...
$ energy : num 0.967 0.985 0.99 0.939 0.988 ...
$ pitch : num 0.024 0.025 0.026 0.013 0.063 0.038 0.026 0.033 0.027 0.004 ...
$ timbre_0_min : num 0.002 0 0.003 0 0 ...
$ timbre_0_max : num 57.3 57.4 57.4 57.8 56.9 ...
$ timbre_1_min : num -6.5 -37.4 -17.2 -32.1 -223.9 ...
$ timbre_1_max : num 171 171 171 221 171 ...
$ timbre_2_min : num -81.7 -149.6 -72.9 -138.6 -147.2 ...
$ timbre_2_max : num 95.1 180.3 157.9 173.4 166 ...
$ timbre_3_min : num -285 -380.1 -204 -73.5 -128.1 ...
$ timbre_3_max : num 259 384 251 373 389 ...
$ timbre_4_min : num -40.4 -48.7 -66 -55.6 -43.9 ...
$ timbre_4_max : num 73.6 100.4 152.1 119.2 99.3 ...
$ timbre_5_min : num -104.7 -87.3 -98.7 -77.5 -96.1 ...
$ timbre_5_max : num 183.1 42.8 141.4 141.2 38.3 ...
$ timbre_6_min : num -88.8 -86.9 -88.9 -70.8 -110.8 ...
$ timbre_6_max : num 73.5 75.5 66.5 64.5 72.4 ...
$ timbre_7_min : num -71.1 -65.8 -67.4 -63.7 -55.9 ...
$ timbre_7_max : num 82.5 106.9 80.6 96.7 110.3 ...
$ timbre_8_min : num -52 -61.3 -59.8 -78.7 -56.5 ...
$ timbre_8_max : num 39.1 35.4 46 41.1 37.6 ...
$ timbre_9_min : num -35.4 -81.9 -46.3 -49.2 -48.6 ...
$ timbre_9_max : num 71.6 74.6 59.9 95.4 67.6 ...
$ timbre_10_min : num -126.4 -103.8 -108.3 -102.7 -52.8 ...
$ timbre_10_max : num 18.7 121.9 33.3 46.4 22.9 ...
$ timbre_11_min : num -44.8 -38.9 -43.7 -59.4 -50.4 ...
$ timbre_11_max : num 26 22.5 25.7 37.1 32.8 ...
$ Top10 : int 0 0 0 0 0 0 0 0 0 1 ...
summary(songs)
year songtitle artistname
Min. :1990 Intro : 15 Various artists: 162
1st Qu.:1997 Forever : 8 Anal Cunt : 49
Median :2002 Home : 7 Various Artists: 44
Mean :2001 Goodbye : 6 Tori Amos : 41
3rd Qu.:2006 Again : 5 Eels : 37
Max. :2010 Beautiful: 5 Napalm Death : 37
(Other) :7528 (Other) :7204
songID artistID timesignature
SOALSZJ1370F1A7C75: 2 ARAGWS81187FB3F768: 222 Min. :0.000
SOANPAC13936E0B640: 2 ARL14X91187FB4CF14: 49 1st Qu.:4.000
SOBDGMX12B0B80808E: 2 AR4KS8C1187FB4CF3D: 41 Median :4.000
SOBUDCZ12A58A80013: 2 AR0JZZ01187B9B2C99: 37 Mean :3.894
SODFRLK13134387FB5: 2 ARZGTK71187B9AC7F5: 37 3rd Qu.:4.000
SOEJPOK12A6D4FAFE4: 2 AR95XYH1187FB53951: 31 Max. :7.000
(Other) :7562 (Other) :7157
timesignature_confidence loudness tempo
Min. :0.0000 Min. :-42.451 Min. : 0.00
1st Qu.:0.8193 1st Qu.:-10.847 1st Qu.: 88.86
Median :0.9790 Median : -7.649 Median :103.27
Mean :0.8533 Mean : -8.817 Mean :107.35
3rd Qu.:1.0000 3rd Qu.: -5.640 3rd Qu.:124.80
Max. :1.0000 Max. : 1.305 Max. :244.31
tempo_confidence key key_confidence energy
Min. :0.0000 Min. : 0.000 Min. :0.0000 Min. :0.00002
1st Qu.:0.3720 1st Qu.: 2.000 1st Qu.:0.2040 1st Qu.:0.50014
Median :0.7015 Median : 6.000 Median :0.4515 Median :0.71816
Mean :0.6229 Mean : 5.385 Mean :0.4338 Mean :0.67547
3rd Qu.:0.8920 3rd Qu.: 9.000 3rd Qu.:0.6460 3rd Qu.:0.88740
Max. :1.0000 Max. :11.000 Max. :1.0000 Max. :0.99849
pitch timbre_0_min timbre_0_max timbre_1_min
Min. :0.00000 Min. : 0.000 Min. :12.58 Min. :-333.72
1st Qu.:0.00300 1st Qu.: 0.000 1st Qu.:53.12 1st Qu.:-160.12
Median :0.00700 Median : 0.027 Median :55.53 Median :-107.75
Mean :0.01082 Mean : 4.123 Mean :54.46 Mean :-110.79
3rd Qu.:0.01400 3rd Qu.: 2.772 3rd Qu.:57.08 3rd Qu.: -59.71
Max. :0.54100 Max. :48.353 Max. :64.01 Max. : 123.73
timbre_1_max timbre_2_min timbre_2_max timbre_3_min
Min. :-74.37 Min. :-324.86 Min. : -0.832 Min. :-495.36
1st Qu.:171.13 1st Qu.:-167.64 1st Qu.:100.519 1st Qu.:-226.87
Median :194.40 Median :-136.60 Median :129.908 Median :-170.61
Mean :212.34 Mean :-136.89 Mean :136.673 Mean :-186.11
3rd Qu.:239.24 3rd Qu.:-106.51 3rd Qu.:166.121 3rd Qu.:-131.56
Max. :549.97 Max. : 34.57 Max. :397.095 Max. : -21.55
timbre_3_max timbre_4_min timbre_4_max timbre_5_min
Min. : 12.85 Min. :-207.07 Min. : -0.651 Min. :-262.48
1st Qu.:127.14 1st Qu.: -77.69 1st Qu.: 83.966 1st Qu.:-113.58
Median :189.50 Median : -63.83 Median :107.422 Median : -95.47
Mean :211.81 Mean : -65.28 Mean :108.227 Mean :-104.00
3rd Qu.:290.72 3rd Qu.: -51.34 3rd Qu.:130.286 3rd Qu.: -81.02
Max. :499.62 Max. : 51.43 Max. :257.801 Max. : -42.17
timbre_5_max timbre_6_min timbre_6_max timbre_7_min
Min. :-22.41 Min. :-152.170 Min. : 12.70 Min. :-214.791
1st Qu.: 84.64 1st Qu.: -94.792 1st Qu.: 59.04 1st Qu.:-101.171
Median :119.90 Median : -80.418 Median : 70.47 Median : -81.797
Mean :127.04 Mean : -80.944 Mean : 72.17 Mean : -84.313
3rd Qu.:162.34 3rd Qu.: -66.521 3rd Qu.: 83.19 3rd Qu.: -64.301
Max. :350.94 Max. : 4.503 Max. :208.39 Max. : 5.153
timbre_7_max timbre_8_min timbre_8_max timbre_9_min
Min. : 15.70 Min. :-158.756 Min. :-25.95 Min. :-149.51
1st Qu.: 76.50 1st Qu.: -73.051 1st Qu.: 40.58 1st Qu.: -70.28
Median : 94.63 Median : -62.661 Median : 49.22 Median : -58.65
Mean : 95.65 Mean : -63.704 Mean : 50.06 Mean : -59.52
3rd Qu.:112.71 3rd Qu.: -52.983 3rd Qu.: 58.46 3rd Qu.: -47.70
Max. :214.82 Max. : -2.382 Max. :144.99 Max. : 1.14
timbre_9_max timbre_10_min timbre_10_max timbre_11_min
Min. : 8.415 Min. :-208.82 Min. : -6.359 Min. :-145.599
1st Qu.: 53.037 1st Qu.:-105.13 1st Qu.: 39.196 1st Qu.: -58.058
Median : 65.935 Median : -83.07 Median : 50.895 Median : -50.892
Mean : 68.028 Mean : -87.34 Mean : 55.521 Mean : -50.868
3rd Qu.: 81.267 3rd Qu.: -64.52 3rd Qu.: 66.593 3rd Qu.: -43.292
Max. :161.518 Max. : -10.64 Max. :192.417 Max. : -6.497
timbre_11_max Top10
Min. : 7.20 Min. :0.0000
1st Qu.: 38.98 1st Qu.:0.0000
Median : 46.44 Median :0.0000
Mean : 47.49 Mean :0.1477
3rd Qu.: 55.03 3rd Qu.:0.0000
Max. :110.27 Max. :1.0000
table(songs$year)
1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004
328 196 186 324 198 258 178 329 380 357 363 282 518 434 479
2005 2006 2007 2008 2009 2010
392 479 622 415 483 373
373
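As a cross-check on the frequency table, the count for a single year can also be computed with a logical comparison. The sketch below runs on a small made-up data frame (toy_songs is a hypothetical stand-in, not the real songs.csv) so that it is self-contained.

```r
# Toy stand-in for the songs data frame (hypothetical values, not from songs.csv)
toy_songs <- data.frame(
  year  = c(2008, 2009, 2010, 2010, 2010),
  Top10 = c(0, 1, 0, 0, 1)
)

# A logical comparison yields TRUE/FALSE values; sum() counts the TRUEs
n_2010 <- sum(toy_songs$year == 2010)

# The same count read off the frequency table, indexed by the year as a string
n_2010_table <- as.integer(table(toy_songs$year)["2010"])

print(c(n_2010, n_2010_table))
```

On the real data, sum(songs$year == 2010) should agree with the 2010 entry of table(songs$year).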
Problem 1.2 - Understanding the Data
How many songs does the dataset include for which the artist name is “Michael Jackson”?
nrow(subset(songs, artistname == "Michael Jackson"))
[1] 18
Problem 1.3 - Understanding the Data
Which of these songs by Michael Jackson made it to the Top 10? Select all that apply.
subset(songs,
artistname == "Michael Jackson" & Top10 == 1,
select = c(artistname, songtitle))
artistname songtitle
4329 Michael Jackson You Rock My World
6207 Michael Jackson You Are Not Alone
6210 Michael Jackson Black or White
6218 Michael Jackson Remember the Time
6915 Michael Jackson In The Closet
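You Rock My World and You Are Not Alone (the output above also lists Black or White, Remember the Time, and In The Closet as Top 10 hits)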
Problem 1.4 - Understanding the Data
The variable corresponding to the estimated time signature (timesignature) is discrete, meaning that it only takes integer values (0, 1, 2, 3, . . . ). What are the values of this variable that occur in our dataset?
summary(songs$timesignature)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 4.000 4.000 3.894 4.000 7.000
table(songs$timesignature)
0 1 3 4 5 7
10 143 503 6787 112 19
Which timesignature value is the most frequent among songs in our dataset?
4
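The most frequent value can also be picked out programmatically rather than by eye. This sketch rebuilds the printed frequency table as a named vector (ts_counts is a name introduced here for illustration) and applies which.max to it.

```r
# Counts copied from the table(songs$timesignature) output above
ts_counts <- c("0" = 10, "1" = 143, "3" = 503, "4" = 6787, "5" = 112, "7" = 19)

# which.max() returns the position of the largest count; names() recovers the value
most_frequent <- names(ts_counts)[which.max(ts_counts)]
print(most_frequent)

# Share of all songs in the dataset that carry this time signature
share <- max(ts_counts) / sum(ts_counts)
print(round(share, 3))
```

On the real data, names(which.max(table(songs$timesignature))) would be the equivalent one-liner.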
Problem 1.5 - Understanding the Data
Out of all of the songs in our dataset, the song with the highest tempo is one of the following songs.
Which one is it?
summary(songs$tempo)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 88.86 103.27 107.35 124.80 244.31
which.max(songs$tempo)
[1] 6206
songs$tempo[6206]
[1] 244.307
nrow(subset(songs, tempo == 244.307))
[1] 1
songs$songtitle[6206]
[1] Wanna Be Startin' Somethin'
7141 Levels: ̈́ l'or_e des bois _\x84_ _\x84\x8d ... Zumbi
Wanna Be Startin’ Somethin’
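The lookup above can be collapsed into a single step by using the index from which.max directly. The sketch below uses a made-up data frame (toy_songs and its titles are hypothetical) to show the pattern.

```r
# Toy stand-in for songs (hypothetical titles and tempos)
toy_songs <- data.frame(
  songtitle = c("Song A", "Song B", "Song C"),
  tempo     = c(91.5, 244.3, 140.0),
  stringsAsFactors = FALSE
)

# which.max() gives the row index of the maximum tempo;
# that index then selects the matching title
fastest <- toy_songs$songtitle[which.max(toy_songs$tempo)]
print(fastest)
```

When the title column is a factor, as in the output above, wrapping the result in as.character() avoids printing the list of levels.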
Problem 2.1 - Creating Our Prediction Model
We wish to predict whether or not a song will make it to the Top 10. To do this, first use the subset function to split the data into a training set “SongsTrain” consisting of all the observations up to and including 2009 song releases, and a testing set “SongsTest”, consisting of the 2010 song releases.
How many observations (songs) are in the training set?
SongsTrain <- subset(songs, year <= 2009)
SongsTest <- subset(songs, year == 2010)
nrow(songs)
[1] 7574
nrow(SongsTrain) + nrow(SongsTest)
[1] 7574
7201 (the 7574 songs in total minus the 373 songs from 2010)
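A split by year is only valid if every row lands in exactly one of the two sets and no test-set year appears in the training set. A minimal sketch of both checks, on a hypothetical toy data frame rather than the real data:

```r
# Toy stand-in for songs (hypothetical values)
toy_songs <- data.frame(
  year  = c(2008, 2009, 2009, 2010, 2010),
  Top10 = c(0, 1, 0, 0, 1)
)

toy_train <- subset(toy_songs, year <= 2009)
toy_test  <- subset(toy_songs, year == 2010)

# Every row should land in exactly one of the two sets
stopifnot(nrow(toy_train) + nrow(toy_test) == nrow(toy_songs))

# The training years should all precede the test year
stopifnot(max(toy_train$year) < min(toy_test$year))

print(c(train = nrow(toy_train), test = nrow(toy_test)))
```

Splitting on time rather than at random mimics the real use case, where a model trained on past releases is applied to future ones.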
Problem 2.2 - Creating Our Prediction Model
In this problem, our outcome variable is “Top10” - we are trying to predict whether or not a song will make it to the Top 10 of the Billboard Hot 100 Chart.
Since the outcome variable is binary, we will build a logistic regression model. We’ll start by using all song attributes as our independent variables, which we’ll call Model 1. We will only use the variables in our dataset that describe the numerical attributes of the song in our logistic regression model.
So we won’t use the variables “year”, “songtitle”, “artistname”, “songID” or “artistID”. We have seen in the lecture that, to build the logistic regression model, we would normally explicitly input the formula including all the independent variables in R. However, in this case, this is a tedious amount of work since we have a large number of independent variables. There is a nice trick to avoid doing so. Let’s suppose that, except for the outcome variable Top10, all other variables in the training set are inputs to Model 1. Then, we can use the formula SongsLog1 = glm(Top10 ~ ., data=SongsTrain, family=binomial) to build our model. Notice that the “.” is used in place of enumerating all the independent variables. (Also, keep in mind that you can choose to put quotes around binomial, or leave out the quotes. R can understand this argument either way.) However, in our case, we want to exclude some of the variables in our dataset from being used as independent variables (“year”, “songtitle”, “artistname”, “songID”, and “artistID”).
To do this, we can use the following trick. First define a vector of variable names called nonvars - these are the variables that we won’t use in our model.
nonvars = c("year", "songtitle", "artistname", "songID", "artistID")
Then remove these variables from the training and testing sets:
SongsTrain = SongsTrain[ , !(names(SongsTrain) %in% nonvars) ]
SongsTest = SongsTest[ , !(names(SongsTest) %in% nonvars) ]
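To see what the %in% expression does, it can help to look at the intermediate logical vector. The sketch below uses a hypothetical four-column data frame (toy and nonvars_toy are names made up for this example).

```r
# Toy data frame with two identifier columns and two modelling columns
toy <- data.frame(
  year      = c(2009, 2010),
  songtitle = c("A", "B"),
  loudness  = c(-4.2, -7.6),
  Top10     = c(1, 0)
)
nonvars_toy <- c("year", "songtitle")

# names(toy) %in% nonvars_toy is TRUE for columns to drop; ! flips it to "keep"
keep <- !(names(toy) %in% nonvars_toy)
print(keep)

# Logical column indexing keeps only the columns marked TRUE
toy_model <- toy[, keep, drop = FALSE]
print(names(toy_model))
```

The drop = FALSE argument is an addition here: it keeps the result a data frame even if only one column survives.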
Now, use the glm function to build a logistic regression model to predict Top10 using all of the other variables as the independent variables. You should use SongsTrain to build the model.
Looking at the summary of your model, what is the value of the Akaike Information Criterion (AIC)?
SongsLog1 <- glm(Top10 ~ ., data = SongsTrain, family=binomial)
summary(SongsLog1)
Call:
glm(formula = Top10 ~ ., family = binomial, data = SongsTrain)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.9220 -0.5399 -0.3459 -0.1845 3.0770
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.470e+01 1.806e+00 8.138 4.03e-16 ***
timesignature 1.264e-01 8.674e-02 1.457 0.145050
timesignature_confidence 7.450e-01 1.953e-01 3.815 0.000136 ***
loudness 2.999e-01 2.917e-02 10.282 < 2e-16 ***
tempo 3.634e-04 1.691e-03 0.215 0.829889
tempo_confidence 4.732e-01 1.422e-01 3.329 0.000873 ***
key 1.588e-02 1.039e-02 1.529 0.126349
key_confidence 3.087e-01 1.412e-01 2.187 0.028760 *
energy -1.502e+00 3.099e-01 -4.847 1.25e-06 ***
pitch -4.491e+01 6.835e+00 -6.570 5.02e-11 ***
timbre_0_min 2.316e-02 4.256e-03 5.441 5.29e-08 ***
timbre_0_max -3.310e-01 2.569e-02 -12.882 < 2e-16 ***
timbre_1_min 5.881e-03 7.798e-04 7.542 4.64e-14 ***
timbre_1_max -2.449e-04 7.152e-04 -0.342 0.732087
timbre_2_min -2.127e-03 1.126e-03 -1.889 0.058843 .
timbre_2_max 6.586e-04 9.066e-04 0.726 0.467571
timbre_3_min 6.920e-04 5.985e-04 1.156 0.247583
timbre_3_max -2.967e-03 5.815e-04 -5.103 3.34e-07 ***
timbre_4_min 1.040e-02 1.985e-03 5.237 1.63e-07 ***
timbre_4_max 6.110e-03 1.550e-03 3.942 8.10e-05 ***
timbre_5_min -5.598e-03 1.277e-03 -4.385 1.16e-05 ***
timbre_5_max 7.736e-05 7.935e-04 0.097 0.922337
timbre_6_min -1.686e-02 2.264e-03 -7.445 9.66e-14 ***
timbre_6_max 3.668e-03 2.190e-03 1.675 0.093875 .
timbre_7_min -4.549e-03 1.781e-03 -2.554 0.010661 *
timbre_7_max -3.774e-03 1.832e-03 -2.060 0.039408 *
timbre_8_min 3.911e-03 2.851e-03 1.372 0.170123
timbre_8_max 4.011e-03 3.003e-03 1.336 0.181620
timbre_9_min 1.367e-03 2.998e-03 0.456 0.648356
timbre_9_max 1.603e-03 2.434e-03 0.659 0.510188
timbre_10_min 4.126e-03 1.839e-03 2.244 0.024852 *
timbre_10_max 5.825e-03 1.769e-03 3.292 0.000995 ***
timbre_11_min -2.625e-02 3.693e-03 -7.108 1.18e-12 ***
timbre_11_max 1.967e-02 3.385e-03 5.811 6.21e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 6017.5 on 7200 degrees of freedom
Residual deviance: 4759.2 on 7167 degrees of freedom
AIC: 4827.2
Number of Fisher Scoring iterations: 6
AIC: 4827.2
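For a logistic regression on 0/1 outcomes, the reported AIC equals the residual deviance plus twice the number of estimated coefficients. The following sketch checks this against the printed summary; the variable names are introduced here for illustration.

```r
# Values copied from the summary(SongsLog1) output above
residual_deviance <- 4759.2
null_df           <- 7200
residual_df       <- 7167

# Degrees of freedom used by the predictors, plus one for the intercept
n_coef <- (null_df - residual_df) + 1

# AIC = residual deviance + 2 * number of estimated coefficients
aic_by_hand <- residual_deviance + 2 * n_coef
print(aic_by_hand)
```

On a fitted model object, the AIC() function returns the same quantity directly, for example AIC(SongsLog1).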
Problem 2.3 - Creating Our Prediction Model
Let’s now think about the variables in our dataset related to the confidence of the time signature, key and tempo (timesignature_confidence, key_confidence, and tempo_confidence). Our model seems to indicate that these confidence variables are significant (rather than the variables timesignature, key and tempo themselves). What does the model suggest?
The higher our confidence about time signature, key and tempo, the more likely the song is to be in the Top 10.
Problem 2.4 - Creating Our Prediction Model
In general, if the confidence is low for the time signature, tempo, and key, then the song is more likely to be complex.
What does Model 1 suggest in terms of complexity?
Mainstream listeners tend to prefer less complex songs.
Problem 2.5 - Creating Our Prediction Model
Songs with heavier instrumentation tend to be louder (have higher values in the variable “loudness”) and more energetic (have higher values in the variable “energy”). By inspecting the coefficient of the variable “loudness”, what does Model 1 suggest?
Mainstream listeners prefer songs with heavy instrumentation.
By inspecting the coefficient of the variable “energy”, do we draw the same conclusions as above?
No.
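The sign of a logistic regression coefficient gives the direction of the association with the log-odds of the outcome, and exponentiating it gives an odds ratio. This sketch applies that to the two Model 1 coefficients discussed above, copied from the summary output.

```r
# Coefficients copied from the Model 1 summary above
coef_loudness <- 0.2999
coef_energy   <- -1.502

# exp(coefficient) is the multiplicative change in the odds of Top10 = 1
# for a one-unit increase in the variable, holding the others fixed
or_loudness <- exp(coef_loudness)
or_energy   <- exp(coef_energy)

print(round(c(loudness = or_loudness, energy = or_energy), 3))
```

An odds ratio above 1 raises the odds and one below 1 lowers them, which is why the two variables point in opposite directions. Note that energy ranges from about 0 to 1 in this dataset, so a one-unit increase spans its whole range.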
Problem 3.1 - Beware of Multicollinearity Issues!
What is the correlation between the variables “loudness” and “energy” in the training set?
cor(SongsTrain$loudness, SongsTrain$energy)
[1] 0.7399067
Given that these two variables are highly correlated, Model 1 suffers from multicollinearity. To avoid this issue, we will omit one of these two variables and re-run the logistic regression.
In the rest of this problem, we’ll build two variations of our original model: Model 2, in which we keep “energy” and omit “loudness”, and Model 3, in which we keep “loudness” and omit “energy”.
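The effect of multicollinearity can be illustrated with a small simulation. This is a hedged sketch on synthetic data, not on the songs dataset: x1 and x2 are made-up, strongly correlated predictors, and only x1 actually drives the outcome.

```r
set.seed(1)  # fixed seed so the simulation is reproducible
n  <- 1000
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.5)    # x2 is mostly a noisy copy of x1
y  <- rbinom(n, 1, plogis(x1))   # the outcome depends on x1 only

sim_cor <- cor(x1, x2)

# Fit once with both correlated predictors, once with x1 alone
fit_both  <- glm(y ~ x1 + x2, family = binomial)
fit_alone <- glm(y ~ x1, family = binomial)

# Standard error of the x1 coefficient in each fit
se_both  <- summary(fit_both)$coefficients["x1", "Std. Error"]
se_alone <- summary(fit_alone)$coefficients["x1", "Std. Error"]

print(c(correlation = sim_cor, se_both = se_both, se_alone = se_alone))
```

Including the correlated predictor inflates the standard error of the x1 coefficient. This instability is what makes individual coefficients, including their signs, hard to interpret when predictors are highly correlated.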
Problem 3.2 - Beware of Multicollinearity Issues!
Create Model 2, which is Model 1 without the independent variable “loudness”.
SongsLog2 = glm(Top10 ~ . - loudness, data=SongsTrain, family=binomial)
We just subtracted the variable loudness. We couldn’t do this with the variables “songtitle” and “artistname”, because they are not numeric variables, and we might get different values in the test-set that the training set has never seen. But this approach (subtracting the variable from the model formula) will always work when you want to remove numeric variables.
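The effect of subtracting a variable in the formula can be confirmed by comparing coefficient names. The sketch below fits two small models on simulated data; the toy data frame and its values are hypothetical and only borrow column names from the dataset.

```r
set.seed(42)  # fixed seed so the simulated data is reproducible
n <- 200
toy <- data.frame(
  loudness = rnorm(n),
  energy   = rnorm(n),
  pitch    = rnorm(n)
)
toy$Top10 <- rbinom(n, 1, plogis(0.5 * toy$loudness))

# "." expands to every column except the outcome
fit_all <- glm(Top10 ~ ., data = toy, family = binomial)

# ". - loudness" uses the same set of columns with loudness removed
fit_no_loud <- glm(Top10 ~ . - loudness, data = toy, family = binomial)

print(names(coef(fit_all)))
print(names(coef(fit_no_loud)))
```

The data frame itself is left unchanged, so the same training set can be reused for several model variants.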
Look at the summary of SongsLog2, and inspect the coefficient of the variable “energy”. What do you observe?
summary(SongsLog2)
Call:
glm(formula = Top10 ~ . - loudness, family = binomial, data = SongsTrain)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.0983 -0.5607 -0.3602 -0.1902 3.3107
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.241e+00 7.465e-01 -3.002 0.002686 **
timesignature 1.625e-01 8.734e-02 1.860 0.062873 .
timesignature_confidence 6.885e-01 1.924e-01 3.578 0.000346 ***
tempo 5.521e-04 1.665e-03 0.332 0.740226
tempo_confidence 5.497e-01 1.407e-01 3.906 9.40e-05 ***
key 1.740e-02 1.026e-02 1.697 0.089740 .
key_confidence 2.954e-01 1.394e-01 2.118 0.034163 *
energy 1.813e-01 2.608e-01 0.695 0.486991
pitch -5.150e+01 6.857e+00 -7.511 5.87e-14 ***
timbre_0_min 2.479e-02 4.240e-03 5.847 5.01e-09 ***
timbre_0_max -1.007e-01 1.178e-02 -8.551 < 2e-16 ***
timbre_1_min 7.143e-03 7.710e-04 9.265 < 2e-16 ***
timbre_1_max -7.830e-04 7.064e-04 -1.108 0.267650
timbre_2_min -1.579e-03 1.109e-03 -1.424 0.154531
timbre_2_max 3.889e-04 8.964e-04 0.434 0.664427
timbre_3_min 6.500e-04 5.949e-04 1.093 0.274524
timbre_3_max -2.462e-03 5.674e-04 -4.339 1.43e-05 ***
timbre_4_min 9.115e-03 1.952e-03 4.670 3.02e-06 ***
timbre_4_max 6.306e-03 1.532e-03 4.115 3.87e-05 ***
timbre_5_min -5.641e-03 1.255e-03 -4.495 6.95e-06 ***
timbre_5_max 6.937e-04 7.807e-04 0.889 0.374256
timbre_6_min -1.612e-02 2.235e-03 -7.214 5.45e-13 ***
timbre_6_max 3.814e-03 2.157e-03 1.768 0.076982 .
timbre_7_min -5.102e-03 1.755e-03 -2.907 0.003644 **
timbre_7_max -3.158e-03 1.811e-03 -1.744 0.081090 .
timbre_8_min 4.488e-03 2.810e-03 1.597 0.110254
timbre_8_max 6.423e-03 2.950e-03 2.177 0.029497 *
timbre_9_min -4.282e-04 2.955e-03 -0.145 0.884792
timbre_9_max 3.525e-03 2.377e-03 1.483 0.138017
timbre_10_min 2.993e-03 1.804e-03 1.660 0.097004 .
timbre_10_max 7.367e-03 1.731e-03 4.255 2.09e-05 ***
timbre_11_min -2.837e-02 3.630e-03 -7.815 5.48e-15 ***
timbre_11_max 1.829e-02 3.341e-03 5.476 4.34e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 6017.5 on 7200 degrees of freedom
Residual deviance: 4871.8 on 7168 degrees of freedom
AIC: 4937.8
Number of Fisher Scoring iterations: 6
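Model 2 gives energy a positive coefficient (0.1813), which suggests that songs with high energy levels tend to be more popular, although the coefficient is not statistically significant (p = 0.487). This contradicts our observation in Model 1, where the energy coefficient was negative and significant.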
Problem 3.3 - Beware of Multicollinearity Issues!
Now, create Model 3, which should be exactly like Model 1, but without the variable “energy”.
SongsLog3 = glm(Top10 ~ . - energy, data=SongsTrain, family=binomial)
Look at the summary of Model 3 and inspect the coefficient of the variable “loudness”. Remembering that higher loudness and energy both occur in songs with heavier instrumentation, do we make the same observation about the popularity of heavy instrumentation as we did with Model 2?
summary(SongsLog3)
Call:
glm(formula = Top10 ~ . - energy, family = binomial, data = SongsTrain)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.9182 -0.5417 -0.3481 -0.1874 3.4171
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.196e+01 1.714e+00 6.977 3.01e-12 ***
timesignature 1.151e-01 8.726e-02 1.319 0.187183
timesignature_confidence 7.143e-01 1.946e-01 3.670 0.000242 ***
loudness 2.306e-01 2.528e-02 9.120 < 2e-16 ***
tempo -6.460e-04 1.665e-03 -0.388 0.698107
tempo_confidence 3.841e-01 1.398e-01 2.747 0.006019 **
key 1.649e-02 1.035e-02 1.593 0.111056
key_confidence 3.394e-01 1.409e-01 2.409 0.015984 *
pitch -5.328e+01 6.733e+00 -7.914 2.49e-15 ***
timbre_0_min 2.205e-02 4.239e-03 5.200 1.99e-07 ***
timbre_0_max -3.105e-01 2.537e-02 -12.240 < 2e-16 ***
timbre_1_min 5.416e-03 7.643e-04 7.086 1.38e-12 ***
timbre_1_max -5.115e-04 7.110e-04 -0.719 0.471928
timbre_2_min -2.254e-03 1.120e-03 -2.012 0.044190 *
timbre_2_max 4.119e-04 9.020e-04 0.457 0.647915
timbre_3_min 3.179e-04 5.869e-04 0.542 0.588083
timbre_3_max -2.964e-03 5.758e-04 -5.147 2.64e-07 ***
timbre_4_min 1.105e-02 1.978e-03 5.585 2.34e-08 ***
timbre_4_max 6.467e-03 1.541e-03 4.196 2.72e-05 ***
timbre_5_min -5.135e-03 1.269e-03 -4.046 5.21e-05 ***
timbre_5_max 2.979e-04 7.855e-04 0.379 0.704526
timbre_6_min -1.784e-02 2.246e-03 -7.945 1.94e-15 ***
timbre_6_max 3.447e-03 2.182e-03 1.580 0.114203
timbre_7_min -5.128e-03 1.768e-03 -2.900 0.003733 **
timbre_7_max -3.394e-03 1.820e-03 -1.865 0.062208 .
timbre_8_min 3.686e-03 2.833e-03 1.301 0.193229
timbre_8_max 4.658e-03 2.988e-03 1.559 0.119022
timbre_9_min -9.318e-05 2.957e-03 -0.032 0.974859
timbre_9_max 1.342e-03 2.424e-03 0.554 0.579900
timbre_10_min 4.050e-03 1.827e-03 2.217 0.026637 *
timbre_10_max 5.793e-03 1.759e-03 3.294 0.000988 ***
timbre_11_min -2.638e-02 3.683e-03 -7.162 7.96e-13 ***
timbre_11_max 1.984e-02 3.365e-03 5.896 3.74e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 6017.5 on 7200 degrees of freedom
Residual deviance: 4782.7 on 7168 degrees of freedom
AIC: 4848.7
Number of Fisher Scoring iterations: 6
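Yes. The coefficient of loudness in Model 3 is positive (0.2306) and highly significant, so Model 3 also suggests that songs with heavier instrumentation tend to be more popular.
In the remainder of this problem, we’ll just use Model 3.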
Problem 4.1 - Validating Our Model
Make predictions on the test-set using Model 3. What is the accuracy of Model 3 on the test-set, using a threshold of 0.45? (Compute the accuracy as a number between 0 and 1.)
predSongsTest = predict(SongsLog3, type="response", newdata = SongsTest)
table(SongsTest$Top10, predSongsTest > 0.45)
FALSE TRUE
0 309 5
1 40 19
(309 + 19) / nrow(SongsTest)
[1] 0.8793566
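The accuracy calculation generalizes to any confusion matrix: correct predictions sit on the diagonal. This sketch rebuilds the printed table as a matrix (conf_mat is a name introduced here) and computes accuracy from it.

```r
# Confusion matrix copied from the output above
# (rows = actual Top10, columns = predicted probability > 0.45)
conf_mat <- matrix(
  c(309, 40, 5, 19),   # filled column by column
  nrow = 2,
  dimnames = list(actual = c("0", "1"), predicted = c("FALSE", "TRUE"))
)
print(conf_mat)

# Correct predictions are on the diagonal: true negatives and true positives
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
print(accuracy)
```

With the prediction vector at hand, mean((predSongsTest > 0.45) == SongsTest$Top10) should give the same number without building the table.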
Problem 4.2 - Validating Our Model
Let’s check if there’s any incremental benefit in using Model 3 instead of a baseline model. Given the difficulty of guessing which song is going to be a hit, an easier model would be to pick the most frequent outcome (a song is not a Top 10 hit) for all songs.
What would the accuracy of the baseline model be on the test-set?
table(SongsTest$Top10)
0 1
314 59
314/(314 + 59)
[1] 0.8418231
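To put the two accuracies side by side, the comparison can be expressed in songs rather than percentages. The sketch below uses only counts printed above; the variable names are made up for the example.

```r
# Test-set class counts copied from table(SongsTest$Top10)
test_counts <- c("0" = 314, "1" = 59)

# The baseline predicts the majority class for every song
baseline_acc <- max(test_counts) / sum(test_counts)

# Model 3 at threshold 0.45: correct predictions from its confusion matrix
model_correct <- 309 + 19
model_acc     <- model_correct / sum(test_counts)

# How many more songs Model 3 classifies correctly than the baseline
extra_correct <- model_correct - max(test_counts)

print(round(c(baseline = baseline_acc, model3 = model_acc), 4))
print(extra_correct)
```

On this test set the gain amounts to 14 additional songs classified correctly out of 373.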
Problem 4.3 - Validating Our Model
It seems that Model 3 gives us a small improvement over the baseline model. Still, does it create an edge? Let’s view the two models from an investment perspective. A production company is interested in investing in songs that are highly likely to make it to the Top 10. The company’s objective is to minimize its risk of financial losses attributed to investing in songs that end up unpopular.
A competitive edge can therefore be achieved if we can provide the production company a list of songs that are highly likely to end up in the Top 10. We note that the baseline model does not prove useful, as it simply does not label any song as a hit. Let us see what our model has to offer.
How many songs does Model 3 correctly predict as Top 10 hits in 2010 (remember that all songs in 2010 went into our test set), using a threshold of 0.45?
table(SongsTest$Top10, predSongsTest > 0.45)
FALSE TRUE
0 309 5
1 40 19
19
How many non-hit songs does Model 3 predict will be Top 10 hits (again, looking at the test set), using a threshold of 0.45?
5
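From the investment perspective, a useful summary is the share of flagged songs that actually became hits. This quantity is often called precision, a term the text above does not use. A minimal sketch from the counts in the confusion matrix:

```r
# Counts copied from the confusion matrix above (threshold 0.45)
true_hits_flagged <- 19   # actual hits predicted as hits
false_alarms      <- 5    # non-hits predicted as hits

# Share of the model's predicted hits that really reached the Top 10
precision <- true_hits_flagged / (true_hits_flagged + false_alarms)

# For comparison: share of hits among all 2010 songs in the test set
base_rate <- 59 / (314 + 59)

print(round(c(precision = precision, base_rate = base_rate), 3))
```

Comparing the two numbers shows how much more often a song picked by the model is a hit than a song picked at random from the test set.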
Problem 4.4 - Validating Our Model
# what is the sensitivity of Model 3 on the test set, using a threshold of 0.45?
19 / (40 + 19)
[1] 0.3220339
# what is the specificity of Model 3 on the test set, using a threshold of 0.45?
309 / (309 + 5)
[1] 0.9840764
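Sensitivity and specificity both depend on the chosen threshold. The sketch below wraps the two calculations in a helper (classification_rates is a hypothetical function, and the vectors are made-up toy data) and evaluates it at two thresholds.

```r
# Hypothetical helper: sensitivity and specificity at a given threshold
classification_rates <- function(actual, prob, threshold) {
  predicted <- prob > threshold
  tp <- sum(predicted & actual == 1)    # hits predicted as hits
  fn <- sum(!predicted & actual == 1)   # hits that were missed
  tn <- sum(!predicted & actual == 0)   # non-hits predicted as non-hits
  fp <- sum(predicted & actual == 0)    # non-hits predicted as hits
  c(sensitivity = tp / (tp + fn), specificity = tn / (tn + fp))
}

# Toy outcomes and predicted probabilities (not from the songs data)
actual <- c(1, 1, 1, 0, 0, 0, 0, 0)
prob   <- c(0.9, 0.5, 0.3, 0.6, 0.3, 0.1, 0.4, 0.05)

rates_045 <- classification_rates(actual, prob, 0.45)
rates_025 <- classification_rates(actual, prob, 0.25)

print(rbind(threshold_0.45 = rates_045, threshold_0.25 = rates_025))
```

In this toy example, lowering the threshold catches more hits at the cost of more false alarms, which is the trade-off behind the sensitivity and specificity reported for Model 3.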
Conclusions
- Model 3 favors specificity over sensitivity.
- Model 3 provides conservative predictions, and predicts that a song will make it to the Top 10 very rarely. So while it detects only about a third (32%) of the Top 10 songs, we can be quite confident in the songs that it does predict to be Top 10 hits: 19 of its 24 predicted hits in the test set made the Top 10.