Automating Reviews In Medicine

Apr 8, 2019

The medical literature is enormous! PubMed, a database of medical publications maintained by the U.S. National Library of Medicine, has indexed over 23 million medical publications. Further, the rate of medical publication has increased over time, and now there are nearly 1 million new publications in the field each year, or more than one per minute.

The size and fast-changing nature of the medical literature have increased the need for reviews, which search databases like PubMed for papers on a particular topic and then report results from the papers found. Such reviews are often performed manually, with multiple people reviewing each search result, which is tedious and time-consuming. In this analysis, I’ll see how text analytics can be used to automate the process of information retrieval.

The dataset consists of the titles (variable title) and abstracts (variable abstract) of papers retrieved in a PubMed search. Each search result is labeled with whether the paper is a clinical trial testing a drug therapy for cancer (variable trial). These labels were obtained by two people reviewing each search result and accessing the actual paper if necessary, as part of a literature review of clinical trials testing drug therapies for advanced and metastatic breast cancer.

Loading the packages
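
The code for this step is not shown in the post. A minimal sketch, assuming the analysis needs only tm (corpus handling) and SnowballC (the stemmer that tm's stemDocument relies on); the package list is an assumption inferred from the functions used later on:

```r
# Assumed package list, inferred from the functions used later in the post.
# install.packages(c("tm", "SnowballC"))  # run once if the packages are missing
library(tm)         # Corpus, tm_map, DocumentTermMatrix, removeSparseTerms
library(SnowballC)  # stemming backend used by tm's stemDocument
```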

Problem 1.1 - Loading the Data

Load clinical_trial.csv into a dataframe called trials (remembering to add the argument stringsAsFactors=FALSE when working with text analytics, so that the text is read in properly), and investigate the dataframe.

trials <- read.csv("clinical_trial.csv", stringsAsFactors = FALSE)
str(trials)

'data.frame':   1860 obs. of  3 variables:
 $ title   : chr  "Treatment of Hodgkin's disease and other cancers with 1,3-bis(2-chloroethyl)-1-nitrosourea (BCNU; NSC-409962)." "Cell mediated immune status in malignancy--pretherapy and post-therapy assessment." "Neoadjuvant vinorelbine-capecitabine versus docetaxel-doxorubicin-cyclophosphamide in early nonresponsive breas"| __truncated__ "Randomized phase 3 trial of fluorouracil, epirubicin, and cyclophosphamide alone or followed by Paclitaxel for "| __truncated__ ...
 $ abstract: chr  "" "Twenty-eight cases of malignancies of different kinds were studied to assess T-cell activity and population bef"| __truncated__ "BACKGROUND: Among breast cancer patients, nonresponse to initial neoadjuvant chemotherapy is associated with un"| __truncated__ "BACKGROUND: Taxanes are among the most active drugs for the treatment of metastatic breast cancer, and, as a co"| __truncated__ ...
 $ trial   : int  1 0 1 1 1 0 1 0 0 0 ...

summary(trials)

    title             abstract             trial       
 Length:1860        Length:1860        Min.   :0.0000  
 Class :character   Class :character   1st Qu.:0.0000  
 Mode  :character   Mode  :character   Median :0.0000  
                                       Mean   :0.4392  
                                       3rd Qu.:1.0000  
                                       Max.   :1.0000

IMPORTANT NOTE: If you get an error like “invalid multibyte string” when performing certain parts of this analysis, pass the argument fileEncoding = "latin1" to read.csv when reading in the file. This should make those errors go away.

We can use R’s string functions to learn more about the titles and abstracts of the located papers. The nchar() function counts the number of characters in a piece of text.
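
As a sketch of the encoding workaround from the note above, the call could look like the following. The temporary file and the name demo are made up for illustration; with the real data the first argument would be "clinical_trial.csv".

```r
# Toy CSV written to a temporary file so the example is self-contained.
# The content is plain ASCII, which is also valid latin1.
tmp <- tempfile(fileext = ".csv")
writeLines(c("title,abstract,trial",
             "\"A toy title.\",\"A toy abstract.\",1"), tmp)

# Same call as for clinical_trial.csv, with the encoding made explicit.
demo <- read.csv(tmp, stringsAsFactors = FALSE, fileEncoding = "latin1")
str(demo)

unlink(tmp)  # remove the temporary file
```

The fileEncoding argument only tells R how the bytes in the file are encoded; the other arguments stay the same as in the original read.csv call.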

Use the nchar() function on the variables in the dataframe. How many characters are there in the longest abstract? (Longest here is defined as the abstract with the largest number of characters.)

max(nchar(trials$abstract))

[1] 3708

which.max(nchar(trials$abstract))

[1] 664

trials[664, ]

                                                                                                                                                title
664 Five versus more than five years of tamoxifen therapy for breast cancer patients with negative lymph nodes and estrogen receptor-positive tumors.
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        abstract
664 BACKGROUND: In 1982, the National Surgical Adjuvant Breast and Bowel Project initiated a randomized, double-blinded, placebo-controlled trial (B-14) to determine the effectiveness of adjuvant tamoxifen therapy in patients with primary operable breast cancer who had estrogen receptor-positive tumors and no axillary lymph node involvement. The findings indicated that tamoxifen therapy provided substantial benefit to patients with early stage disease. However, questions arose about how long the observed benefit would persist, about the duration of therapy necessary to maintain maximum benefit, and about the nature and severity of adverse effects from prolonged treatment.PURPOSE: We evaluated the outcome of patients in the B-14 trial through 10 years of follow-up. In addition, the effects of 5 years versus more than 5 years of tamoxifen therapy were compared.METHODS: In the trial, patients were initially assigned to receive either tamoxifen at 20 mg/day (n = 1404) or placebo (n = 1414). Tamoxifen-treated patients who remained disease free after 5 years of therapy were then reassigned to receive either another 5 years of tamoxifen (n = 322) or 5 years of placebo (n = 321). After the study began, another group of patients who met the same protocol eligibility requirements as the randomly assigned patients were registered to receive tamoxifen (n = 1211). Registered patients who were disease free after 5 years of treatment were also randomly assigned to another 5 years of tamoxifen (n = 261) or to 5 years of placebo (n = 249). To compare 5 years with more than 5 years of tamoxifen therapy, data relating to all patients reassigned to an additional 5 years of the drug were combined. Patients who were not reassigned to either tamoxifen or placebo continued to be followed in the study. 
Survival, disease-free survival, and distant disease-free survival (relating to failure at distant sites) were estimated by use of the Kaplan-Meier method; differences between the treatment groups were assessed by use of the logrank test. The relative risks of failure (with 95% confidence intervals [CIs]) were determined by use of the Cox proportional hazards model. Reported P values are two-sided.RESULTS: Through 10 years of follow-up, a significant advantage in disease-free survival (69% versus 57%, P < .0001; relative risk = 0.66; 95% CI = 0.58-0.74), distant disease-free survival (76% versus 67%, P < .0001; relative risk = 0.70; 95% CI = 0.61-0.81), and survival (80% versus 76%, P = .02; relative risk = 0.84; 95% CI = 0.71-0.99) was found for patients in the group first assigned to receive tamoxifen. The survival benefit extended to those 49 years of age or younger and to those 50 years of age or older. Tamoxifen therapy was associated with a 37% reduction in the incidence of contralateral (opposite) breast cancer (P = .007). Through 4 years after the reassignment of tamoxifen-treated patients to either continued-therapy or placebo groups, advantages in disease-free survival (92% versus 86%, P = .003) and distant disease-free survival (96% versus 90%, P = .01) were found for those who discontinued tamoxifen treatment. Survival was 96% for those who discontinued tamoxifen compared with 94% for those who continued tamoxifen treatment (P = .08). A higher incidence of thromboembolic events was seen in tamoxifen-treated patients (through 5 years, 1.7% versus 0.4%). Except for endometrial cancer, the incidence of second cancers was not increased with tamoxifen therapy.CONCLUSIONS AND IMPLICATIONS: The benefit from 5 years of tamoxifen therapy persists through 10 years of follow-up. No additional advantage is obtained from continuing tamoxifen therapy for more than 5 years.
    trial
664     1

nchar(trials[664, ]$abstract)

[1] 3708

3708

Problem 1.2 - Loading the Data

How many search results provided no abstract? (HINT: A search result provided no abstract if the number of characters in the abstract field is zero.)

sum(nchar(trials$abstract) == 0)

[1] 112

Problem 1.3 - Loading the Data

Find the observation with the minimum number of characters in the title (the variable “title”) out of all of the observations in this dataset. What is the text of the title of this article?

Include capitalization and punctuation in your response, but don’t include the quotes.

min(nchar(trials$title))

[1] 28

which.min(nchar(trials$title))

[1] 1258

trials[1258,]

                            title
1258 A decade of letrozole: FACE.
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              abstract
1258 Third-generation nonsteroidal aromatase inhibitors (AIs), letrozole and anastrozole, are superior to tamoxifen as initial therapy for early breast cancer but have not been directly compared in a head-to-head adjuvant trial. Cumulative evidence suggests that AIs are not equivalent in terms of potency of estrogen suppression and that there may be differences in clinical efficacy. Thus, with no data from head-to-head comparisons of the AIs as adjuvant therapy yet available, the question of whether there are efficacy differences between the AIs remains. To help answer this question, the Femara versus Anastrozole Clinical Evaluation (FACE) is a phase IIIb open-label, randomized, multicenter trial designed to test whether letrozole or anastrozole has superior efficacy as adjuvant treatment of postmenopausal women with hormone receptor (HR)- and lymph node-positive breast cancer. Eligible patients (target accrual, N=4,000) are randomized to receive either letrozole 2.5 mg or anastrozole 1 mg daily for up to 5 years. The primary objective is to compare disease-free survival at 5 years. Secondary end points include safety, overall survival, time to distant metastases, and time to contralateral breast cancer. The FACE trial will determine whether or not letrozole offers a greater clinical benefit to postmenopausal women with HR+ early breast cancer at increased risk of early recurrence compared with anastrozole.
     trial
1258     0

trials[1258,]$title

[1] "A decade of letrozole: FACE."

Problem 2.1 - Preparing the Corpus

Because we have both title and abstract information for trials, we need to build two corpora instead of one, named corpusTitle and corpusAbstract.

The code below performs the following tasks, in this order (you might need to load the “tm” package first if it isn’t already loaded).

# 1) Convert the title variable to corpusTitle and the abstract variable to corpusAbstract
corpusTitle <- Corpus(VectorSource(trials$title))
corpusTitle[[1]]$content

[1] "Treatment of Hodgkin's disease and other cancers with 1,3-bis(2-chloroethyl)-1-nitrosourea (BCNU; NSC-409962)."

corpusAbstract <- Corpus(VectorSource(trials$abstract))
corpusAbstract[[1]]$content

[1] ""

# 2) Convert corpusTitle and corpusAbstract to lowercase
corpusTitle = tm_map(corpusTitle, content_transformer(tolower))

Warning in tm_map.SimpleCorpus(corpusTitle, content_transformer(tolower)):
transformation drops documents

corpusAbstract = tm_map(corpusAbstract, content_transformer(tolower))

Warning in tm_map.SimpleCorpus(corpusAbstract,
content_transformer(tolower)): transformation drops documents

#corpusTitle = tm_map(corpusTitle, PlainTextDocument)
#corpusAbstract = tm_map(corpusAbstract, PlainTextDocument)

# 3) Remove the punctuation in corpusTitle and corpusAbstract
corpusTitle = tm_map(corpusTitle, removePunctuation)

Warning in tm_map.SimpleCorpus(corpusTitle, removePunctuation):
transformation drops documents

corpusTitle[[2]]$content

[1] "cell mediated immune status in malignancypretherapy and posttherapy assessment"

corpusAbstract = tm_map(corpusAbstract, removePunctuation)

Warning in tm_map.SimpleCorpus(corpusAbstract, removePunctuation):
transformation drops documents

# 4) Remove the English language stop words from corpusTitle and corpusAbstract
corpusTitle <- tm_map(corpusTitle, removeWords, stopwords("english"))

Warning in tm_map.SimpleCorpus(corpusTitle, removeWords,
stopwords("english")): transformation drops documents

corpusTitle[[2]]$content

[1] "cell mediated immune status  malignancypretherapy  posttherapy assessment"

corpusAbstract <- tm_map(corpusAbstract, removeWords, stopwords("english"))

Warning in tm_map.SimpleCorpus(corpusAbstract, removeWords,
stopwords("english")): transformation drops documents

corpusAbstract[[2]]$content

[1] "twentyeight cases  malignancies  different kinds  studied  assess tcell activity  population    institution  therapy fifteen cases  diagnosed  nonmetastasising squamous cell carcinoma  larynx pharynx laryngopharynx hypopharynx  tonsils seven cases  nonmetastasising infiltrating duct carcinoma  breast  6 cases  nonhodgkins lymphoma nhl   observed  3   15 cases 20  squamous cell carcinoma cases  mantoux test mt negative   tcell population  less  40 2   7 cases 286  infiltrating duct carcinoma  breast  mt negative   tcell population  less  40  3   6 cases 50  nhl  mt negative   tcell population  less  40  normal controls consisting  apparently normal healthy adults   tcell population    40    mt positive  patients  showed  negative skin test   tcell population less  40   subjected  assessment  tcell population  activity  appropriate therapy  clinical cure   disease   observed  2   3 cases 6666  squamous cell carcinomas 2   2 cases 100  adenocarcinomas  one   3 cases 3333  nhl showed positive conversion   tcell population    40"

# 5) Stem the words in corpusTitle and corpusAbstract (each stemming might take a few minutes)
corpusTitle = tm_map(corpusTitle, stemDocument)

Warning in tm_map.SimpleCorpus(corpusTitle, stemDocument): transformation
drops documents

corpusTitle[[2]]$content

[1] "cell mediat immun status malignancypretherapi posttherapi assess"

corpusAbstract = tm_map(corpusAbstract, stemDocument)

Warning in tm_map.SimpleCorpus(corpusAbstract, stemDocument):
transformation drops documents

corpusAbstract[[2]]$content

[1] "twentyeight case malign differ kind studi assess tcell activ popul institut therapi fifteen case diagnos nonmetastasis squamous cell carcinoma larynx pharynx laryngopharynx hypopharynx tonsil seven case nonmetastasis infiltr duct carcinoma breast 6 case nonhodgkin lymphoma nhl observ 3 15 case 20 squamous cell carcinoma case mantoux test mt negat tcell popul less 40 2 7 case 286 infiltr duct carcinoma breast mt negat tcell popul less 40 3 6 case 50 nhl mt negat tcell popul less 40 normal control consist appar normal healthi adult tcell popul 40 mt posit patient show negat skin test tcell popul less 40 subject assess tcell popul activ appropri therapi clinic cure diseas observ 2 3 case 6666 squamous cell carcinoma 2 2 case 100 adenocarcinoma one 3 case 3333 nhl show posit convers tcell popul 40"

# 6) Build a document term matrix called dtmTitle from corpusTitle and dtmAbstract from corpusAbstract
dtmTitle = DocumentTermMatrix(corpusTitle)
dtmTitle

<<DocumentTermMatrix (documents: 1860, terms: 2836)>>
Non-/sparse entries: 23416/5251544
Sparsity           : 100%
Maximal term length: 49
Weighting          : term frequency (tf)

dtmAbstract = DocumentTermMatrix(corpusAbstract)
dtmAbstract

<<DocumentTermMatrix (documents: 1860, terms: 12451)>>
Non-/sparse entries: 153290/23005570
Sparsity           : 99%
Maximal term length: 67
Weighting          : term frequency (tf)

# 7) Limit dtmTitle and dtmAbstract to terms with sparseness of at most 95% (aka terms that appear in at least 5% of documents)
dtmTitle <- removeSparseTerms(dtmTitle, 0.95)
dtmTitle

<<DocumentTermMatrix (documents: 1860, terms: 31)>>
Non-/sparse entries: 10683/46977
Sparsity           : 81%
Maximal term length: 15
Weighting          : term frequency (tf)

dtmAbstract <- removeSparseTerms(dtmAbstract, 0.95)
dtmAbstract

<<DocumentTermMatrix (documents: 1860, terms: 335)>>
Non-/sparse entries: 91969/531131
Sparsity           : 85%
Maximal term length: 15
Weighting          : term frequency (tf)

# 8) Convert dtmTitle and dtmAbstract to data frames (keep the names dtmTitle and dtmAbstract)
dtmTitle <- as.data.frame(as.matrix(dtmTitle))
dtmAbstract <- as.data.frame(as.matrix(dtmAbstract))
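
Steps 1 to 8 are applied twice, once per text column, so they could be wrapped in a single function. The sketch below is one possible way to do that, not the code used in this post: build_dtm, max_sparsity and the toy input are hypothetical names introduced here, while the tm calls are the same ones used above.

```r
library(tm)
library(SnowballC)

# Hypothetical helper (not part of the original analysis): runs steps 1-8
# on a character vector and returns the term counts as a data frame.
build_dtm <- function(text, max_sparsity = 0.95) {
  corpus <- Corpus(VectorSource(text))                          # step 1
  corpus <- tm_map(corpus, content_transformer(tolower))        # step 2
  corpus <- tm_map(corpus, removePunctuation)                   # step 3
  corpus <- tm_map(corpus, removeWords, stopwords("english"))   # step 4
  corpus <- tm_map(corpus, stemDocument)                        # step 5
  dtm <- DocumentTermMatrix(corpus)                             # step 6
  dtm <- removeSparseTerms(dtm, max_sparsity)                   # step 7
  as.data.frame(as.matrix(dtm))                                 # step 8
}

# Toy input: three made-up titles. A looser sparsity threshold is used here
# because 95% is not meaningful with only three documents.
toy <- c("Randomized trial of tamoxifen.",
         "Tamoxifen trial in breast cancer.",
         "Cell mediated immune status.")
toy_dtm <- suppressWarnings(build_dtm(toy, max_sparsity = 0.5))
names(toy_dtm)
```

With the real data, the calls would be build_dtm(trials$title) and build_dtm(trials$abstract). suppressWarnings() is used here only to hide the "transformation drops documents" messages seen in the output above.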

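Note: if you are working with a custom stop-word vector named sw (not defined in this post) instead of the tm default, remove stop words with tm_map(corpusTitle, removeWords, sw) and tm_map(corpusAbstract, removeWords, sw) instead of tm_map(corpusTitle, removeWords, stopwords("english")) and tm_map(corpusAbstract, removeWords, stopwords("english")). The check below shows how many words the default English list contains.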

length(stopwords("english"))

[1] 174

How many terms remain in dtmTitle after removing sparse terms (aka how many columns does it have)?

str(dtmTitle)

'data.frame':   1860 obs. of  31 variables:
 $ cancer         : num  1 0 1 1 1 1 0 1 1 2 ...
 $ treatment      : num  1 0 0 0 1 0 0 0 0 1 ...
 $ breast         : num  0 0 1 1 1 1 0 1 1 1 ...
 $ earli          : num  0 0 1 1 0 0 0 1 0 0 ...
 $ iii            : num  0 0 1 0 0 0 0 0 0 1 ...
 $ phase          : num  0 0 1 1 0 0 0 0 0 1 ...
 $ random         : num  0 0 1 1 1 0 0 0 0 1 ...
 $ trial          : num  0 0 1 1 1 0 0 1 1 1 ...
 $ versus         : num  0 0 1 0 0 0 0 1 0 0 ...
 $ cyclophosphamid: num  0 0 0 1 0 0 0 0 0 0 ...
 $ chemotherapi   : num  0 0 0 0 1 1 0 0 0 0 ...
 $ combin         : num  0 0 0 0 1 0 1 0 0 0 ...
 $ effect         : num  0 0 0 0 1 0 0 1 0 1 ...
 $ metastat       : num  0 0 0 0 1 0 0 0 0 0 ...
 $ patient        : num  0 0 0 0 1 0 1 0 1 1 ...
 $ respons        : num  0 0 0 0 0 1 0 0 0 0 ...
 $ advanc         : num  0 0 0 0 0 0 1 0 0 0 ...
 $ postmenopaus   : num  0 0 0 0 0 0 0 1 1 0 ...
 $ randomis       : num  0 0 0 0 0 0 0 1 1 0 ...
 $ studi          : num  0 0 0 0 0 0 0 1 0 0 ...
 $ tamoxifen      : num  0 0 0 0 0 0 0 2 1 0 ...
 $ women          : num  0 0 0 0 0 0 0 1 0 0 ...
 $ adjuv          : num  0 0 0 0 0 0 0 0 1 0 ...
 $ group          : num  0 0 0 0 0 0 0 0 1 1 ...
 $ therapi        : num  0 0 0 0 0 0 0 0 0 0 ...
 $ compar         : num  0 0 0 0 0 0 0 0 0 0 ...
 $ doxorubicin    : num  0 0 0 0 0 0 0 0 0 0 ...
 $ docetaxel      : num  0 0 0 0 0 0 0 0 0 0 ...
 $ result         : num  0 0 0 0 0 0 0 0 0 0 ...
 $ plus           : num  0 0 0 0 0 0 0 0 0 0 ...
 $ clinic         : num  0 0 0 0 0 0 0 0 0 0 ...

31

How many terms remain in dtmAbstract?

str(dtmAbstract)

'data.frame':   1860 obs. of  335 variables:
 $ 100            : num  0 1 0 0 0 0 0 0 0 0 ...
 $ activ          : num  0 2 0 1 0 0 1 0 0 0 ...
 $ assess         : num  0 2 1 2 0 1 0 0 0 3 ...
 $ breast         : num  0 2 3 3 3 4 2 2 2 3 ...
 $ carcinoma      : num  0 5 0 0 0 0 0 0 0 2 ...
 $ case           : num  0 11 0 0 1 0 0 0 0 0 ...
 $ cell           : num  0 3 0 0 0 1 0 0 0 0 ...
 $ clinic         : num  0 1 0 1 0 0 0 0 0 0 ...
 $ consist        : num  0 1 0 0 0 0 0 0 0 0 ...
 $ control        : num  0 1 0 0 0 0 0 0 1 0 ...
 $ differ         : num  0 1 2 1 3 0 0 1 0 1 ...
 $ diseas         : num  0 1 0 1 3 0 0 1 0 0 ...
 $ less           : num  0 4 1 0 0 0 0 0 0 6 ...
 $ negat          : num  0 4 0 0 0 3 0 0 0 0 ...
 $ observ         : num  0 2 1 0 1 0 0 0 0 0 ...
 $ one            : num  0 1 0 0 0 0 0 0 0 0 ...
 $ patient        : num  0 1 9 5 5 6 8 3 2 5 ...
 $ popul          : num  0 8 0 0 0 0 0 0 0 0 ...
 $ posit          : num  0 2 0 1 0 5 0 0 0 0 ...
 $ seven          : num  0 1 0 0 0 0 0 0 0 0 ...
 $ show           : num  0 2 0 0 1 0 1 0 0 3 ...
 $ studi          : num  0 1 1 1 0 1 3 2 0 1 ...
 $ test           : num  0 2 1 1 0 0 0 0 0 0 ...
 $ therapi        : num  0 2 0 1 0 0 0 0 0 0 ...
 $ 500            : num  0 0 1 0 2 0 0 0 0 0 ...
 $ addit          : num  0 0 2 0 0 0 0 0 0 0 ...
 $ among          : num  0 0 2 4 0 0 0 0 1 0 ...
 $ arm            : num  0 0 7 4 2 0 0 1 0 1 ...
 $ assign         : num  0 0 2 1 0 0 0 1 1 1 ...
 $ associ         : num  0 0 1 3 0 2 0 1 2 0 ...
 $ background     : num  0 0 1 1 1 0 0 1 1 0 ...
 $ better         : num  0 0 1 0 1 1 0 0 0 0 ...
 $ cancer         : num  0 0 2 3 3 3 2 2 3 0 ...
 $ chang          : num  0 0 1 0 0 0 0 0 0 0 ...
 $ chemotherapi   : num  0 0 1 2 3 5 2 0 1 0 ...
 $ compar         : num  0 0 1 2 0 0 0 1 1 1 ...
 $ complet        : num  0 0 3 0 1 1 1 2 0 0 ...
 $ conclus        : num  0 0 1 0 1 0 0 0 0 0 ...
 $ confid         : num  0 0 1 1 0 0 0 0 0 0 ...
 $ continu        : num  0 0 2 0 1 0 0 2 0 0 ...
 $ cycl           : num  0 0 6 0 2 0 1 0 0 0 ...
 $ cyclophosphamid: num  0 0 1 1 1 1 3 0 0 0 ...
 $ day            : num  0 0 1 0 0 0 3 0 0 0 ...
 $ decreas        : num  0 0 1 0 0 0 0 0 1 0 ...
 $ defin          : num  0 0 2 0 0 0 0 0 0 0 ...
 $ demonstr       : num  0 0 1 0 0 0 0 0 0 0 ...
 $ docetaxel      : num  0 0 1 0 0 0 3 0 0 0 ...
 $ doxorubicin    : num  0 0 1 0 0 1 0 0 0 0 ...
 $ effect         : num  0 0 2 0 0 1 0 2 0 1 ...
 $ efficaci       : num  0 0 1 0 0 0 0 0 0 0 ...
 $ enrol          : num  0 0 1 0 0 0 1 0 0 1 ...
 $ four           : num  0 0 4 0 0 0 0 0 0 0 ...
 $ hematolog      : num  0 0 1 0 0 0 0 0 0 0 ...
 $ initi          : num  0 0 3 0 0 0 0 1 0 0 ...
 $ interv         : num  0 0 1 1 0 0 0 0 0 0 ...
 $ least          : num  0 0 2 0 0 0 0 0 0 0 ...
 $ lymph          : num  0 0 1 4 0 3 0 0 1 0 ...
 $ method         : num  0 0 1 0 1 0 0 1 1 0 ...
 $ mgm2           : num  0 0 5 0 4 0 9 0 0 0 ...
 $ neoadjuv       : num  0 0 2 0 0 0 0 0 0 0 ...
 $ node           : num  0 0 1 3 0 0 0 0 1 0 ...
 $ number         : num  0 0 1 1 0 2 1 0 0 0 ...
 $ outcom         : num  0 0 2 0 0 0 0 0 0 2 ...
 $ patholog       : num  0 0 3 0 0 1 0 0 0 0 ...
 $ per            : num  0 0 1 0 0 0 0 0 0 0 ...
 $ previous       : num  0 0 1 0 1 0 2 0 0 0 ...
 $ random         : num  0 0 2 1 1 0 0 1 1 2 ...
 $ rate           : num  0 0 1 1 2 0 1 0 1 0 ...
 $ receiv         : num  0 0 2 0 1 1 3 0 0 2 ...
 $ reduct         : num  0 0 1 2 0 0 0 1 0 0 ...
 $ regimen        : num  0 0 2 0 0 0 1 0 0 0 ...
 $ respond        : num  0 0 2 0 0 0 0 0 0 0 ...
 $ respons        : num  0 0 7 0 4 2 2 0 0 0 ...
 $ result         : num  0 0 1 0 1 0 1 0 0 0 ...
 $ similar        : num  0 0 2 0 0 0 0 0 1 0 ...
 $ size           : num  0 0 1 3 0 1 0 0 0 0 ...
 $ statist        : num  0 0 1 3 0 0 0 0 0 0 ...
 $ surgeri        : num  0 0 1 1 0 0 0 0 0 0 ...
 $ toler          : num  0 0 1 0 0 0 1 0 0 0 ...
 $ toxic          : num  0 0 2 0 1 0 1 0 0 5 ...
 $ treatment      : num  0 0 3 6 14 0 0 4 1 1 ...
 $ tumor          : num  0 0 2 4 0 0 2 0 0 0 ...
 $ two            : num  0 0 3 0 1 0 0 1 0 0 ...
 $ 001            : num  0 0 0 1 0 0 0 1 0 0 ...
 $ adjuv          : num  0 0 0 2 0 1 0 2 4 0 ...
 $ age            : num  0 0 0 1 0 0 0 0 0 0 ...
 $ also           : num  0 0 0 1 0 0 0 0 2 0 ...
 $ analysi        : num  0 0 0 2 0 0 0 1 0 0 ...
 $ analyz         : num  0 0 0 1 0 0 0 0 0 0 ...
 $ axillari       : num  0 0 0 1 0 0 0 0 1 0 ...
 $ death          : num  0 0 0 2 0 0 0 0 1 0 ...
 $ dfs            : num  0 0 0 3 0 0 0 0 0 0 ...
 $ diseasefre     : num  0 0 0 1 0 2 0 0 3 0 ...
 $ drug           : num  0 0 0 1 0 0 0 0 0 0 ...
 $ elig           : num  0 0 0 1 0 0 0 0 0 0 ...
 $ endpoint       : num  0 0 0 2 0 0 0 0 1 3 ...
 $ epirubicin     : num  0 0 0 1 1 0 0 0 0 0 ...
 $ estim          : num  0 0 0 1 1 0 0 0 0 0 ...
 $ fluorouracil   : num  0 0 0 1 1 0 0 0 0 0 ...
  [list output truncated]

335
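
str() lists every column, which gets long for wide data frames. When only the number of terms is needed, ncol() or dim() is a shorter route. A small sketch on a made-up data frame (toy_terms is hypothetical and stands in for dtmTitle or dtmAbstract):

```r
# Toy term-count table: 3 documents (rows) by 3 terms (columns).
toy_terms <- data.frame(cancer = c(1, 0, 2),
                        trial  = c(0, 1, 1),
                        breast = c(1, 1, 0))

ncol(toy_terms)  # number of terms (columns)
dim(toy_terms)   # documents x terms
```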

Problem 2.2 - Preparing the Corpus

What is the most likely reason why dtmAbstract has so many more terms than dtmTitle?

Abstracts tend to have many more words than titles.

Problem 2.3 - Preparing the Corpus

What is the most frequent word stem across all the abstracts? Hint: you can use colSums() to compute the frequency of a word across all the abstracts.
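
Scanning the full colSums() output by eye is error-prone with hundreds of terms. One way to pull out the top stem directly is to sort the totals or use which.max(). A sketch on a made-up table (toy_abs and freq are hypothetical names; with the real data the input would be dtmAbstract):

```r
# Toy term-count table standing in for dtmAbstract.
toy_abs <- data.frame(patient = c(3, 5, 1),
                      breast  = c(2, 0, 1),
                      trial   = c(0, 1, 1))

freq <- colSums(toy_abs)       # total count of each stem across documents
sort(freq, decreasing = TRUE)  # stems ordered from most to least frequent
names(which.max(freq))         # name of the most frequent stem
```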

?colSums
colSums(dtmAbstract)

            100           activ          assess          breast 
            225             509             668            3859 
      carcinoma            case            cell          clinic 
            251             233             359             944 
        consist         control          differ          diseas 
            200             621            1176             950 
           less           negat          observ             one 
            351             258             700             570 
        patient           popul           posit           seven 
           8381             162             511             108 
           show           studi            test         therapi 
            516            1965             282            1564 
            500           addit           among             arm 
            169             420             365            1038 
         assign          associ      background          better 
            435             604             397             186 
         cancer           chang    chemotherapi          compar 
           3726             431            2344            1359 
        complet         conclus          confid         continu 
            628             842             241             281 
           cycl cyclophosphamid             day         decreas 
            962             632            1245             350 
          defin        demonstr       docetaxel     doxorubicin 
            123             251             514             486 
         effect        efficaci           enrol            four 
           1340             591             221             369 
      hematolog           initi          interv           least 
            117             275             349             177 
          lymph          method            mgm2        neoadjuv 
            249             892            1093             293 
           node          number          outcom        patholog 
            477             296             335             254 
            per        previous          random            rate 
            218             355            1520            1253 
         receiv          reduct         regimen         respond 
           1908             301             807             200 
        respons          result         similar            size 
           2051            1485             438             177 
        statist         surgeri           toler           toxic 
            384             407             373            1065 
      treatment           tumor             two             001 
           2894            1122             889             162 
          adjuv             age            also         analysi 
           1162             429             364             587 
         analyz        axillari           death             dfs 
            124             292             215             310 
     diseasefre            drug            elig        endpoint 
            364             332             196             213 
     epirubicin           estim    fluorouracil          follow 
            339             139             215             675 
          found          hazard            her2          hormon 
            238             301             314             428 
         includ          involv          marker        metastat 
            529             180             189             755 
          model       multivari       nodeposit            oper 
            180             154             199             193 
         overal      paclitaxel         predict         primari 
            962             397             369             718 
       prognost         proport           ratio        receptor 
            242             125             344             573 
          reduc          relaps         respect            risk 
            400             254             758             635 
          sampl       secondari        signific          status 
            172             158            2043             538 
         surviv            type            valu            week 
           1927             126             256            1074 
          women            year           agent         benefit 
           1484            1335             240             551 
         combin          detect        determin           durat 
            926             148             352             344 
         either           evalu           everi            evid 
            532             926             487             150 
       firstlin           howev            life             may 
            182             339             178             413 
         measur          object         partial         perform 
            411             400             295             342 
           plus         potenti        progress         prolong 
            622             156             622             125 
        qualiti           score           singl           stabl 
            189             254             149             154 
       subgroup            term            time           total 
            192             153             881             397 
            use           vomit         whether            0001 
           1053             174             235             249 
  5fluorouracil             aim          correl         express 
            208             185             203             356 
         factor        followup           grade           group 
            552             494             580            2668 
           high        independ            larg           level 
            378             149             108             743 
         longer             low          median           month 
            193             196            1180            1575 
    premenopaus     progesteron        randomis          remain 
            303             114             264             158 
          trend          tumour          wherea          achiev 
            115             320             173             245 
       administ          advanc           daili            dose 
            322             556             412            1123 
          indic           infus        intraven     neutropenia 
            269             237             192             234 
          phase           prior         support           treat 
            481             305             183             893 
          trial        aromatas            caus             due 
           1417             171             115             154 
          earli             end        estrogen         increas 
            325             221             421             729 
      inhibitor           lower        particip           point 
            182             236             144             202 
   postmenopaus   receptorposit       tamoxifen          versus 
            590             152            1632             570 
         within            alon             can         distant 
            172             472             191             149 
           find           first          higher             iii 
            177             421             415             266 
         improv           limit      mastectomi        metastas 
            562             127             165             352 
          occur          postop    radiotherapi          recurr 
            312             177             244             465 
           site           stage          system          advers 
            183             286             193             256 
        baselin          common     doubleblind           event 
            340             191             149             409 
        greater            mean         placebo          purpos 
            279             310             475             434 
         obtain            oral         present           relat 
            147             422             218             351 
        suggest            bone       administr           cours 
            274             514             218             283 
          given          safeti         schedul            seen 
            374             265             215             199 
       standard         without        although             new 
            305             306             191             171 
         accord    anthracyclin            base           eight 
            182             207             124             124 
       histolog        investig           local             set 
            127             295             300             191 
         analys           sever           three          import 
            177             288             564             138 
          shown             005           avail            data 
            117             124             108             405 
          incid          period          profil          report 
            300             170             158             357 
       endocrin           hundr       multicent          design 
            266             195             126             219 
         growth           human        pretreat            well 
            208             144             175             328 
         examin             six          appear        identifi 
            190             261             164             148 
         nausea          provid         regress            five 
            239             155             120             173 
        conduct        prospect         develop        sequenti 
            177             239             259             168 
          serum      comparison        frequent             cmf 
            315             116             153             586 
         consid            rang          select         possibl 
            131             248             124             130 
         inform           major            need        function 
            124             122             128             188 
         failur         confirm          requir       experienc 
            262             178             168             167 
   characterist       methotrex  progressionfre         general 
            119             265             158             126 
        prevent           andor             mbc            main 
            143             128             276             131 
           side        superior           start           tissu 
            168             161             131             197 
         second          regard           enter 
            138             105             117

max(colSums(dtmAbstract))

[1] 8381

which.max(colSums(dtmAbstract))

patient 
     17
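
The two calls above report different things: max() returns the largest column total (8381), while which.max() returns the name and column position of that term (position 17), not its count. A minimal sketch on a toy data frame makes the difference visible; the counts and the names toy and freq are made up for illustration and are not taken from the real data.

```r
# Toy document-term data frame (hypothetical counts, one row per document)
toy <- data.frame(patient = c(3, 0, 5),
                  trial   = c(1, 1, 0),
                  dose    = c(0, 2, 1))

freq <- colSums(toy)            # total count of each term across documents
which.max(freq)                 # name and column position of the top term
freq[which.max(freq)]           # the top term together with its count
sort(freq, decreasing = TRUE)   # all terms from most to least frequent
```

Indexing freq by which.max(freq) gives the term and its count in one step, which avoids reading the position as a frequency.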

dtmAbstract$patient

   [1]  0  1  9  5  5  6  8  3  2  5  2  4  2  1  1  3  0  8  6  0  3  7  7
  [24]  4  2  5  2  5  0  0  3  2  3  5 12  3  5  4  7  2  0  1  2  3  6  5
  [47]  1  4  7  6  2  5  6  5  5  6  9  5  1  5 10  6  3  3  6  1  4  4  0
  [70]  6  9  5  9  1 11  5  0  3  5  6  3  8  8  2  6  9  3  7  4  1  6 13
  [93]  3 12  0  5  5  3  0  1  8 16  5 13  0  7  2  7  0  7  3  2  5  1  6
 [116] 13  4  5  4  4  4  6  5  6  5  3  1  3  3  0  5  1  3  9  8  2  3  7
 [139]  5  3  4  4 11  2  9  1  1  6  8  1  2  1  1  7  0  2  4  6 16  0  3
 [162]  0  3  8  3  5  2 11  5  0  0  2  8  8  1  4  7 11  6  5  6  1  7  2
 [185] 14 13  4 23  7  3  6  3  3  2  3  4 11  2  4 12  3  6  4  3  6  9  4
 [208]  8  2  4  2  6  1  7  6  5  5 14  5 13  0 11  1  7  6  3  5  4  3  7
 [231]  0  0  7  2  4  5  7  0  4  4 16  0  8  3  0  3  7  6  6  2  2  1  5
 [254]  4  7  8  1  1  5  4  2  3 16  2  1  5  7  3  4  0  6  3  2  1  5  6
 [277]  6  2  9  6  2  6  5  7  9 10  9  5  2  4  5  0  0  3  3  6  4  6  3
 [300]  9  6  3  3  8  5  0  7  9  4  2  7  3  5  7  4  0  2  6  8  1  4  5
 [323]  2  1  2  2  4  6  5 13  2  0 10  4  4  4  9  7  2  5  3  4  4  4  4
 [346]  3  9  2  5  0  9  0  4  5  2  7  5  1  4  5  5  5  4  2  0  2  9  9
 [369]  1  0  6  3  7  8  0  2  6  5  2  1  2  1  8  3  6  6  7 17  3  7  6
 [392]  3  8  9  1  8  5 10  0  7  4  6  8  3  8  0  9  8  3  1  0  0  7  4
 [415]  8  5  4  4  5  5  4  4  0  3  3  5  1  9  3  2  3  3  8  4  9  7  0
 [438]  5  1  0 10  7  6  5  7  4  5  2 13  1  6  7  3  4  5  7  0  5  4  4
 [461]  2  6  4  3  2  2  5  5  5  0  1  0  8  2  0  2  0  2  8 11  0  2  3
 [484]  5  7  6  4  3  6  4  7 10 12  7 11  7  2  8  6  2  4  4  5  5  3  8
 [507]  3  0  0  5  4  5  1  4  0  0  6  3  3  3  6  2  3  2  3  4 11  3  5
 [530]  7  2  6  6  2  4  6  1 13  5  3  7  4  3  3  4  3  8  3  5  1  5  3
 [553]  0 11  5  7  2  1  0  6  4  5  5  4  3  8  4  4  1  4  2  7  9 14 10
 [576]  0  6  2  1  8  0  2  5  2  2  1  5  0 10  6  5  0  1  1  4  4  4  0
 [599]  0  4  8  3  3  8  7  5  7  1  2  3  3  9  5 11  8  8  6  4  3  2  0
 [622]  2  6  9  1  0 12  3  2 12  9  6  3  6  0  4  5  4  1 13  3  0  2  5
 [645]  3  1  8  5  9  5  5  9  5  5  7  6  5  6  6  3  0  6  7 13  7  9  5
 [668]  6  7  5  2  6  1  3  0  3  4  3  3  0 10  0  7  0  6  3  9  4  7 12
 [691]  0  0  7  9  0  7  6  6  3  2  5  2  0  5  5  7  4  5  5  3  6  2  7
 [714]  1  2  1  6  4 11  9  2  0  8 10  2  7  2  6  1  8  0  0  4  3  9  5
 [737]  2  3  1  7  4  0  0  0  6  3 13  3  5  5  0  4  2  3  6  4  5  0  1
 [760]  4  9 12  4  0 10  5  9  4  3 10  6  2  3  5  3  9  3  6  1  5  7  0
 [783]  4  5  3  4  1  0  1  3 10 17  0  7  5  2  0  9  0  1  5  8  0  6  9
 [806]  2  0  9 10  0  3  3  4  6  5  5  1  1 12 10  4  0  5  4  6  3  7  2
 [829]  1  0  0  4  0 10  5  8  9  8  1  3  2  6  1 12  2  0  5  2  9  3  4
 [852]  7  5  3 10  6  6  9  0  4  7  4  1  6  0  6  5  8  3  3  0  4  9  8
 [875]  2  8  3 11  7  1  5  4  7  6  8  3  1  5  6  6  0  5 20  5  4 13  2
 [898]  1  0  2  1  3  4 10  4  5  5  1  9  0  1  0  0 11  3 12  4  6  7  4
 [921]  0  3  3  5  6 11  0  1  4 10  2  5  8  1  0  7  6  2  5  4  0  2  1
 [944]  6  5 10  3  0  0  3 12  3  4 10  0  5  3  3  9  0  1  3 17  1  8 14
 [967]  6  3  3  2  8  0  6  8  1  0 11 10 10 18  0  7  9  2  5  1  7  5  5
 [990]  5  6  4  3  7  8  2 10  5  0 10  5  5  0  5  3  1  4  4  3  2  6  1
[1013]  2  6  3  6 14  2  0  6  5  3  4  9  2  5  8  4  5  3  4  7  5  6  2
[1036]  3  4  7  8 11  8  6  9  4  7  5 14  7  2 12  0  5  3  6  1  4  0  1
[1059]  6  5  4  3  0  2  2  1  6  5  2  1  1  9  6 11  4  1  5  5  0  5 10
[1082]  0  5  2  2  4  5  7  1  4  5  2 10  0  0  5  0  2  3  3  6  2  8  3
[1105]  3  3  1  5  6  0  0  2  4  5  2  8 10  7  9  5 10  4  7 10  6 10  6
[1128]  3  0  0  7  6  8  4  4  4  0  8  7  7  1  6  6 10  8  4  4  5 11  7
[1151]  5  7  4  6  4  4  7  3 12 12  0  3  0  4  2  7  4  4  6  0  6  6  1
[1174]  9  5  6 11  6 15  3  2  3  6  4  5  4  1  3  0  3  0  2  6  0  1  7
[1197]  2  8  0  0  0  4 11  9  2 10  3  6  5  2  1 11  0  6  1  5  4  3  2
[1220]  2  6  7  1 11  6 10  5  4  7 10  4  0  4  4  0  2  8  7  5  4  6  4
[1243]  0  1  7  6  5  0 12  7  7  7  5  7  0 11  5  1  8 11  3  1  8  4  3
[1266]  3  7  2  8  2  4  0  7  5  0  1  0  9  7  1  7  4  7  8  3  1  6  3
[1289]  2  4  3  2  3  9  6  2  3  6  5 11  5  7  8  0  6  4  5  4  1  1  3
[1312]  5  2  0  6  3  3  0  6  6  0  7 11  0  3  3  7  5  2  0  3  5  2  0
[1335]  4  4  3  6  4  3  4  5  0  7  1  6  6  6  5  5 13 15  2  7 12  3  2
[1358]  6  5  5  5  9  7  0  5  7  4  8  2  0  2  2  7  8  2  4  0  7  9  4
[1381]  4  6  2  8  4  0  1 10  4  4  5  3  4  6  4  8  2  6 12  5  5  5  4
[1404]  5 11  4  4  7  5  6  7  2  3  6  3  3  2  5  2  4  1  2  2  7  7  4
[1427] 13  4 12  4  7  7  4  2  0  7  6  7  6  7  4  5  5  8  5  2  5  1  3
[1450]  0 10  2  0  5  3  3 11  1  0  3  1  5  0  2  0  3  2  0  6  0  3  2
[1473]  4 11  0  0  1  4  1  3  2  1  5  0  2  3  2  3 10  3  4  2  0  0  9
[1496]  4  8  4  0  2  5  5  0  0  2  0  8  3  0  9  4  3  2  2  4  1  0  6
[1519]  2  2  2  0  3  0  0  6  0  0  4  2  4 12  2  8  6  2  1  0  6  3  4
[1542]  5  0  6  1  1  2  5  1  3  4 11  9  0  6  2  7  5  3  7  5  0  2  7
[1565] 10  2  3  3  5  3  4  2  3  4  3  1  6  2  1  3  1  3  2  3  5  4  3
[1588]  0  1  0  5  1  3  9  1  6  4  1  1  7  2  0  1  7  3  0  5  2  5  2
[1611]  2  2  0  6  0  5  1  4  0  0  1  7  1  4  6  2  0  1  8  2  3  5  0
[1634]  7  7  7  4  7  7  3  5  6  1  6  4 16  5  8  7  3  1  7  3 10  4  7
[1657] 10  4  9  1 10  4  6  9  0  4  5  2  9  5  1  8  4 12  8  6  3  6 10
[1680]  2  3  1  1 13  4  2  5  7 10  7 10  5  4  3  8  4  3  6  5  2  9  6
[1703]  7  4  4  7  3  6  9  5  3  4  4  6  8  4  6  1  8  7 10  6  7  7  1
[1726]  9  6 15 10 10  9  8  6  5  8  5  6  7  4  8  9  8  8  9 10  6  5  6
[1749]  6  9 11  4  4 11  4  5  6  7  4  6  3  7  4  7  9  8  4  5 10  6  2
[1772]  0  8  5  5  6  3  7  4  0  2  4  1  0  3  6 10  3 14  3 10  0  5  0
[1795]  8  0  5  0  9  0  3  0  0  8  3  5  5  2  0  0  4 16  2  7  3  0  0
[1818]  0  7  6  0  2  5  5  0  4  5 10  9  3  5  9  4  1  4  4  2  0  3  2
[1841]  4  4  5  0  9  8  4  0  0  0  8  2  4  4  0  0  0  0  0  0

Problem 3.1 - Building a Model

We want to combine dtmTitle and dtmAbstract into a single dataframe to make predictions. However, some of the variables in these dataframes have the same names. To fix this issue, run the following code:

colnames(dtmTitle) <- paste0("T", colnames(dtmTitle))
str(dtmTitle)

'data.frame':   1860 obs. of  31 variables:
 $ Tcancer         : num  1 0 1 1 1 1 0 1 1 2 ...
 $ Ttreatment      : num  1 0 0 0 1 0 0 0 0 1 ...
 $ Tbreast         : num  0 0 1 1 1 1 0 1 1 1 ...
 $ Tearli          : num  0 0 1 1 0 0 0 1 0 0 ...
 $ Tiii            : num  0 0 1 0 0 0 0 0 0 1 ...
 $ Tphase          : num  0 0 1 1 0 0 0 0 0 1 ...
 $ Trandom         : num  0 0 1 1 1 0 0 0 0 1 ...
 $ Ttrial          : num  0 0 1 1 1 0 0 1 1 1 ...
 $ Tversus         : num  0 0 1 0 0 0 0 1 0 0 ...
 $ Tcyclophosphamid: num  0 0 0 1 0 0 0 0 0 0 ...
 $ Tchemotherapi   : num  0 0 0 0 1 1 0 0 0 0 ...
 $ Tcombin         : num  0 0 0 0 1 0 1 0 0 0 ...
 $ Teffect         : num  0 0 0 0 1 0 0 1 0 1 ...
 $ Tmetastat       : num  0 0 0 0 1 0 0 0 0 0 ...
 $ Tpatient        : num  0 0 0 0 1 0 1 0 1 1 ...
 $ Trespons        : num  0 0 0 0 0 1 0 0 0 0 ...
 $ Tadvanc         : num  0 0 0 0 0 0 1 0 0 0 ...
 $ Tpostmenopaus   : num  0 0 0 0 0 0 0 1 1 0 ...
 $ Trandomis       : num  0 0 0 0 0 0 0 1 1 0 ...
 $ Tstudi          : num  0 0 0 0 0 0 0 1 0 0 ...
 $ Ttamoxifen      : num  0 0 0 0 0 0 0 2 1 0 ...
 $ Twomen          : num  0 0 0 0 0 0 0 1 0 0 ...
 $ Tadjuv          : num  0 0 0 0 0 0 0 0 1 0 ...
 $ Tgroup          : num  0 0 0 0 0 0 0 0 1 1 ...
 $ Ttherapi        : num  0 0 0 0 0 0 0 0 0 0 ...
 $ Tcompar         : num  0 0 0 0 0 0 0 0 0 0 ...
 $ Tdoxorubicin    : num  0 0 0 0 0 0 0 0 0 0 ...
 $ Tdocetaxel      : num  0 0 0 0 0 0 0 0 0 0 ...
 $ Tresult         : num  0 0 0 0 0 0 0 0 0 0 ...
 $ Tplus           : num  0 0 0 0 0 0 0 0 0 0 ...
 $ Tclinic         : num  0 0 0 0 0 0 0 0 0 0 ...

colnames(dtmAbstract) <- paste0("A", colnames(dtmAbstract))
str(dtmAbstract)

'data.frame':   1860 obs. of  335 variables:
 $ A100            : num  0 1 0 0 0 0 0 0 0 0 ...
 $ Aactiv          : num  0 2 0 1 0 0 1 0 0 0 ...
 $ Aassess         : num  0 2 1 2 0 1 0 0 0 3 ...
 $ Abreast         : num  0 2 3 3 3 4 2 2 2 3 ...
 $ Acarcinoma      : num  0 5 0 0 0 0 0 0 0 2 ...
 $ Acase           : num  0 11 0 0 1 0 0 0 0 0 ...
 $ Acell           : num  0 3 0 0 0 1 0 0 0 0 ...
 $ Aclinic         : num  0 1 0 1 0 0 0 0 0 0 ...
 $ Aconsist        : num  0 1 0 0 0 0 0 0 0 0 ...
 $ Acontrol        : num  0 1 0 0 0 0 0 0 1 0 ...
 $ Adiffer         : num  0 1 2 1 3 0 0 1 0 1 ...
 $ Adiseas         : num  0 1 0 1 3 0 0 1 0 0 ...
 $ Aless           : num  0 4 1 0 0 0 0 0 0 6 ...
 $ Anegat          : num  0 4 0 0 0 3 0 0 0 0 ...
 $ Aobserv         : num  0 2 1 0 1 0 0 0 0 0 ...
 $ Aone            : num  0 1 0 0 0 0 0 0 0 0 ...
 $ Apatient        : num  0 1 9 5 5 6 8 3 2 5 ...
 $ Apopul          : num  0 8 0 0 0 0 0 0 0 0 ...
 $ Aposit          : num  0 2 0 1 0 5 0 0 0 0 ...
 $ Aseven          : num  0 1 0 0 0 0 0 0 0 0 ...
 $ Ashow           : num  0 2 0 0 1 0 1 0 0 3 ...
 $ Astudi          : num  0 1 1 1 0 1 3 2 0 1 ...
 $ Atest           : num  0 2 1 1 0 0 0 0 0 0 ...
 $ Atherapi        : num  0 2 0 1 0 0 0 0 0 0 ...
 $ A500            : num  0 0 1 0 2 0 0 0 0 0 ...
 $ Aaddit          : num  0 0 2 0 0 0 0 0 0 0 ...
 $ Aamong          : num  0 0 2 4 0 0 0 0 1 0 ...
 $ Aarm            : num  0 0 7 4 2 0 0 1 0 1 ...
 $ Aassign         : num  0 0 2 1 0 0 0 1 1 1 ...
 $ Aassoci         : num  0 0 1 3 0 2 0 1 2 0 ...
 $ Abackground     : num  0 0 1 1 1 0 0 1 1 0 ...
 $ Abetter         : num  0 0 1 0 1 1 0 0 0 0 ...
 $ Acancer         : num  0 0 2 3 3 3 2 2 3 0 ...
 $ Achang          : num  0 0 1 0 0 0 0 0 0 0 ...
 $ Achemotherapi   : num  0 0 1 2 3 5 2 0 1 0 ...
 $ Acompar         : num  0 0 1 2 0 0 0 1 1 1 ...
 $ Acomplet        : num  0 0 3 0 1 1 1 2 0 0 ...
 $ Aconclus        : num  0 0 1 0 1 0 0 0 0 0 ...
 $ Aconfid         : num  0 0 1 1 0 0 0 0 0 0 ...
 $ Acontinu        : num  0 0 2 0 1 0 0 2 0 0 ...
 $ Acycl           : num  0 0 6 0 2 0 1 0 0 0 ...
 $ Acyclophosphamid: num  0 0 1 1 1 1 3 0 0 0 ...
 $ Aday            : num  0 0 1 0 0 0 3 0 0 0 ...
 $ Adecreas        : num  0 0 1 0 0 0 0 0 1 0 ...
 $ Adefin          : num  0 0 2 0 0 0 0 0 0 0 ...
 $ Ademonstr       : num  0 0 1 0 0 0 0 0 0 0 ...
 $ Adocetaxel      : num  0 0 1 0 0 0 3 0 0 0 ...
 $ Adoxorubicin    : num  0 0 1 0 0 1 0 0 0 0 ...
 $ Aeffect         : num  0 0 2 0 0 1 0 2 0 1 ...
 $ Aefficaci       : num  0 0 1 0 0 0 0 0 0 0 ...
 $ Aenrol          : num  0 0 1 0 0 0 1 0 0 1 ...
 $ Afour           : num  0 0 4 0 0 0 0 0 0 0 ...
 $ Ahematolog      : num  0 0 1 0 0 0 0 0 0 0 ...
 $ Ainiti          : num  0 0 3 0 0 0 0 1 0 0 ...
 $ Ainterv         : num  0 0 1 1 0 0 0 0 0 0 ...
 $ Aleast          : num  0 0 2 0 0 0 0 0 0 0 ...
 $ Alymph          : num  0 0 1 4 0 3 0 0 1 0 ...
 $ Amethod         : num  0 0 1 0 1 0 0 1 1 0 ...
 $ Amgm2           : num  0 0 5 0 4 0 9 0 0 0 ...
 $ Aneoadjuv       : num  0 0 2 0 0 0 0 0 0 0 ...
 $ Anode           : num  0 0 1 3 0 0 0 0 1 0 ...
 $ Anumber         : num  0 0 1 1 0 2 1 0 0 0 ...
 $ Aoutcom         : num  0 0 2 0 0 0 0 0 0 2 ...
 $ Apatholog       : num  0 0 3 0 0 1 0 0 0 0 ...
 $ Aper            : num  0 0 1 0 0 0 0 0 0 0 ...
 $ Aprevious       : num  0 0 1 0 1 0 2 0 0 0 ...
 $ Arandom         : num  0 0 2 1 1 0 0 1 1 2 ...
 $ Arate           : num  0 0 1 1 2 0 1 0 1 0 ...
 $ Areceiv         : num  0 0 2 0 1 1 3 0 0 2 ...
 $ Areduct         : num  0 0 1 2 0 0 0 1 0 0 ...
 $ Aregimen        : num  0 0 2 0 0 0 1 0 0 0 ...
 $ Arespond        : num  0 0 2 0 0 0 0 0 0 0 ...
 $ Arespons        : num  0 0 7 0 4 2 2 0 0 0 ...
 $ Aresult         : num  0 0 1 0 1 0 1 0 0 0 ...
 $ Asimilar        : num  0 0 2 0 0 0 0 0 1 0 ...
 $ Asize           : num  0 0 1 3 0 1 0 0 0 0 ...
 $ Astatist        : num  0 0 1 3 0 0 0 0 0 0 ...
 $ Asurgeri        : num  0 0 1 1 0 0 0 0 0 0 ...
 $ Atoler          : num  0 0 1 0 0 0 1 0 0 0 ...
 $ Atoxic          : num  0 0 2 0 1 0 1 0 0 5 ...
 $ Atreatment      : num  0 0 3 6 14 0 0 4 1 1 ...
 $ Atumor          : num  0 0 2 4 0 0 2 0 0 0 ...
 $ Atwo            : num  0 0 3 0 1 0 0 1 0 0 ...
 $ A001            : num  0 0 0 1 0 0 0 1 0 0 ...
 $ Aadjuv          : num  0 0 0 2 0 1 0 2 4 0 ...
 $ Aage            : num  0 0 0 1 0 0 0 0 0 0 ...
 $ Aalso           : num  0 0 0 1 0 0 0 0 2 0 ...
 $ Aanalysi        : num  0 0 0 2 0 0 0 1 0 0 ...
 $ Aanalyz         : num  0 0 0 1 0 0 0 0 0 0 ...
 $ Aaxillari       : num  0 0 0 1 0 0 0 0 1 0 ...
 $ Adeath          : num  0 0 0 2 0 0 0 0 1 0 ...
 $ Adfs            : num  0 0 0 3 0 0 0 0 0 0 ...
 $ Adiseasefre     : num  0 0 0 1 0 2 0 0 3 0 ...
 $ Adrug           : num  0 0 0 1 0 0 0 0 0 0 ...
 $ Aelig           : num  0 0 0 1 0 0 0 0 0 0 ...
 $ Aendpoint       : num  0 0 0 2 0 0 0 0 1 3 ...
 $ Aepirubicin     : num  0 0 0 1 1 0 0 0 0 0 ...
 $ Aestim          : num  0 0 0 1 1 0 0 0 0 0 ...
 $ Afluorouracil   : num  0 0 0 1 1 0 0 0 0 0 ...
  [list output truncated]

What was the effect of these functions?

Adding the letter T in front of all the title variable names and adding the letter A in front of all the abstract variable names.
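
To see why the prefixes matter, here is a small sketch with two toy data frames that share a column name. The names title_terms, abstract_terms and combined, and all the counts, are hypothetical.

```r
# Two toy term tables that both contain a "cancer" column (made-up counts)
title_terms    <- data.frame(cancer = c(1, 0), trial = c(0, 1))
abstract_terms <- data.frame(cancer = c(2, 3), dose  = c(1, 0))

# Without prefixes, cbind() keeps both "cancer" columns under the same name
names(cbind(title_terms, abstract_terms))

# Prefixing makes every column name unique and records where it came from
colnames(title_terms)    <- paste0("T", colnames(title_terms))
colnames(abstract_terms) <- paste0("A", colnames(abstract_terms))
combined <- cbind(title_terms, abstract_terms)
names(combined)
```

The prefix also keeps a title term such as trial (stored as Ttrial) from colliding with an outcome column named trial that is added later.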

Problem 3.2 - Building a Model

Using cbind(), combine dtmTitle and dtmAbstract into a single dataframe called dtm:

dtm <- cbind(dtmTitle, dtmAbstract)
str(dtm)

'data.frame':   1860 obs. of  366 variables:
 $ Tcancer         : num  1 0 1 1 1 1 0 1 1 2 ...
 $ Ttreatment      : num  1 0 0 0 1 0 0 0 0 1 ...
 $ Tbreast         : num  0 0 1 1 1 1 0 1 1 1 ...
 $ Tearli          : num  0 0 1 1 0 0 0 1 0 0 ...
 $ Tiii            : num  0 0 1 0 0 0 0 0 0 1 ...
 $ Tphase          : num  0 0 1 1 0 0 0 0 0 1 ...
 $ Trandom         : num  0 0 1 1 1 0 0 0 0 1 ...
 $ Ttrial          : num  0 0 1 1 1 0 0 1 1 1 ...
 $ Tversus         : num  0 0 1 0 0 0 0 1 0 0 ...
 $ Tcyclophosphamid: num  0 0 0 1 0 0 0 0 0 0 ...
 $ Tchemotherapi   : num  0 0 0 0 1 1 0 0 0 0 ...
 $ Tcombin         : num  0 0 0 0 1 0 1 0 0 0 ...
 $ Teffect         : num  0 0 0 0 1 0 0 1 0 1 ...
 $ Tmetastat       : num  0 0 0 0 1 0 0 0 0 0 ...
 $ Tpatient        : num  0 0 0 0 1 0 1 0 1 1 ...
 $ Trespons        : num  0 0 0 0 0 1 0 0 0 0 ...
 $ Tadvanc         : num  0 0 0 0 0 0 1 0 0 0 ...
 $ Tpostmenopaus   : num  0 0 0 0 0 0 0 1 1 0 ...
 $ Trandomis       : num  0 0 0 0 0 0 0 1 1 0 ...
 $ Tstudi          : num  0 0 0 0 0 0 0 1 0 0 ...
 $ Ttamoxifen      : num  0 0 0 0 0 0 0 2 1 0 ...
 $ Twomen          : num  0 0 0 0 0 0 0 1 0 0 ...
 $ Tadjuv          : num  0 0 0 0 0 0 0 0 1 0 ...
 $ Tgroup          : num  0 0 0 0 0 0 0 0 1 1 ...
 $ Ttherapi        : num  0 0 0 0 0 0 0 0 0 0 ...
 $ Tcompar         : num  0 0 0 0 0 0 0 0 0 0 ...
 $ Tdoxorubicin    : num  0 0 0 0 0 0 0 0 0 0 ...
 $ Tdocetaxel      : num  0 0 0 0 0 0 0 0 0 0 ...
 $ Tresult         : num  0 0 0 0 0 0 0 0 0 0 ...
 $ Tplus           : num  0 0 0 0 0 0 0 0 0 0 ...
 $ Tclinic         : num  0 0 0 0 0 0 0 0 0 0 ...
 $ A100            : num  0 1 0 0 0 0 0 0 0 0 ...
 $ Aactiv          : num  0 2 0 1 0 0 1 0 0 0 ...
 $ Aassess         : num  0 2 1 2 0 1 0 0 0 3 ...
 $ Abreast         : num  0 2 3 3 3 4 2 2 2 3 ...
 $ Acarcinoma      : num  0 5 0 0 0 0 0 0 0 2 ...
 $ Acase           : num  0 11 0 0 1 0 0 0 0 0 ...
 $ Acell           : num  0 3 0 0 0 1 0 0 0 0 ...
 $ Aclinic         : num  0 1 0 1 0 0 0 0 0 0 ...
 $ Aconsist        : num  0 1 0 0 0 0 0 0 0 0 ...
 $ Acontrol        : num  0 1 0 0 0 0 0 0 1 0 ...
 $ Adiffer         : num  0 1 2 1 3 0 0 1 0 1 ...
 $ Adiseas         : num  0 1 0 1 3 0 0 1 0 0 ...
 $ Aless           : num  0 4 1 0 0 0 0 0 0 6 ...
 $ Anegat          : num  0 4 0 0 0 3 0 0 0 0 ...
 $ Aobserv         : num  0 2 1 0 1 0 0 0 0 0 ...
 $ Aone            : num  0 1 0 0 0 0 0 0 0 0 ...
 $ Apatient        : num  0 1 9 5 5 6 8 3 2 5 ...
 $ Apopul          : num  0 8 0 0 0 0 0 0 0 0 ...
 $ Aposit          : num  0 2 0 1 0 5 0 0 0 0 ...
 $ Aseven          : num  0 1 0 0 0 0 0 0 0 0 ...
 $ Ashow           : num  0 2 0 0 1 0 1 0 0 3 ...
 $ Astudi          : num  0 1 1 1 0 1 3 2 0 1 ...
 $ Atest           : num  0 2 1 1 0 0 0 0 0 0 ...
 $ Atherapi        : num  0 2 0 1 0 0 0 0 0 0 ...
 $ A500            : num  0 0 1 0 2 0 0 0 0 0 ...
 $ Aaddit          : num  0 0 2 0 0 0 0 0 0 0 ...
 $ Aamong          : num  0 0 2 4 0 0 0 0 1 0 ...
 $ Aarm            : num  0 0 7 4 2 0 0 1 0 1 ...
 $ Aassign         : num  0 0 2 1 0 0 0 1 1 1 ...
 $ Aassoci         : num  0 0 1 3 0 2 0 1 2 0 ...
 $ Abackground     : num  0 0 1 1 1 0 0 1 1 0 ...
 $ Abetter         : num  0 0 1 0 1 1 0 0 0 0 ...
 $ Acancer         : num  0 0 2 3 3 3 2 2 3 0 ...
 $ Achang          : num  0 0 1 0 0 0 0 0 0 0 ...
 $ Achemotherapi   : num  0 0 1 2 3 5 2 0 1 0 ...
 $ Acompar         : num  0 0 1 2 0 0 0 1 1 1 ...
 $ Acomplet        : num  0 0 3 0 1 1 1 2 0 0 ...
 $ Aconclus        : num  0 0 1 0 1 0 0 0 0 0 ...
 $ Aconfid         : num  0 0 1 1 0 0 0 0 0 0 ...
 $ Acontinu        : num  0 0 2 0 1 0 0 2 0 0 ...
 $ Acycl           : num  0 0 6 0 2 0 1 0 0 0 ...
 $ Acyclophosphamid: num  0 0 1 1 1 1 3 0 0 0 ...
 $ Aday            : num  0 0 1 0 0 0 3 0 0 0 ...
 $ Adecreas        : num  0 0 1 0 0 0 0 0 1 0 ...
 $ Adefin          : num  0 0 2 0 0 0 0 0 0 0 ...
 $ Ademonstr       : num  0 0 1 0 0 0 0 0 0 0 ...
 $ Adocetaxel      : num  0 0 1 0 0 0 3 0 0 0 ...
 $ Adoxorubicin    : num  0 0 1 0 0 1 0 0 0 0 ...
 $ Aeffect         : num  0 0 2 0 0 1 0 2 0 1 ...
 $ Aefficaci       : num  0 0 1 0 0 0 0 0 0 0 ...
 $ Aenrol          : num  0 0 1 0 0 0 1 0 0 1 ...
 $ Afour           : num  0 0 4 0 0 0 0 0 0 0 ...
 $ Ahematolog      : num  0 0 1 0 0 0 0 0 0 0 ...
 $ Ainiti          : num  0 0 3 0 0 0 0 1 0 0 ...
 $ Ainterv         : num  0 0 1 1 0 0 0 0 0 0 ...
 $ Aleast          : num  0 0 2 0 0 0 0 0 0 0 ...
 $ Alymph          : num  0 0 1 4 0 3 0 0 1 0 ...
 $ Amethod         : num  0 0 1 0 1 0 0 1 1 0 ...
 $ Amgm2           : num  0 0 5 0 4 0 9 0 0 0 ...
 $ Aneoadjuv       : num  0 0 2 0 0 0 0 0 0 0 ...
 $ Anode           : num  0 0 1 3 0 0 0 0 1 0 ...
 $ Anumber         : num  0 0 1 1 0 2 1 0 0 0 ...
 $ Aoutcom         : num  0 0 2 0 0 0 0 0 0 2 ...
 $ Apatholog       : num  0 0 3 0 0 1 0 0 0 0 ...
 $ Aper            : num  0 0 1 0 0 0 0 0 0 0 ...
 $ Aprevious       : num  0 0 1 0 1 0 2 0 0 0 ...
 $ Arandom         : num  0 0 2 1 1 0 0 1 1 2 ...
 $ Arate           : num  0 0 1 1 2 0 1 0 1 0 ...
  [list output truncated]

Now, add the dependent variable “trial” to dtm, copying it from the original dataframe called trials.

How many columns are in this combined dataframe?

dtm$trial <- trials$trial
str(dtm)

'data.frame':   1860 obs. of  367 variables:
 $ Tcancer         : num  1 0 1 1 1 1 0 1 1 2 ...
 $ Ttreatment      : num  1 0 0 0 1 0 0 0 0 1 ...
 $ Tbreast         : num  0 0 1 1 1 1 0 1 1 1 ...
 $ Tearli          : num  0 0 1 1 0 0 0 1 0 0 ...
 $ Tiii            : num  0 0 1 0 0 0 0 0 0 1 ...
 $ Tphase          : num  0 0 1 1 0 0 0 0 0 1 ...
 $ Trandom         : num  0 0 1 1 1 0 0 0 0 1 ...
 $ Ttrial          : num  0 0 1 1 1 0 0 1 1 1 ...
 $ Tversus         : num  0 0 1 0 0 0 0 1 0 0 ...
 $ Tcyclophosphamid: num  0 0 0 1 0 0 0 0 0 0 ...
 $ Tchemotherapi   : num  0 0 0 0 1 1 0 0 0 0 ...
 $ Tcombin         : num  0 0 0 0 1 0 1 0 0 0 ...
 $ Teffect         : num  0 0 0 0 1 0 0 1 0 1 ...
 $ Tmetastat       : num  0 0 0 0 1 0 0 0 0 0 ...
 $ Tpatient        : num  0 0 0 0 1 0 1 0 1 1 ...
 $ Trespons        : num  0 0 0 0 0 1 0 0 0 0 ...
 $ Tadvanc         : num  0 0 0 0 0 0 1 0 0 0 ...
 $ Tpostmenopaus   : num  0 0 0 0 0 0 0 1 1 0 ...
 $ Trandomis       : num  0 0 0 0 0 0 0 1 1 0 ...
 $ Tstudi          : num  0 0 0 0 0 0 0 1 0 0 ...
 $ Ttamoxifen      : num  0 0 0 0 0 0 0 2 1 0 ...
 $ Twomen          : num  0 0 0 0 0 0 0 1 0 0 ...
 $ Tadjuv          : num  0 0 0 0 0 0 0 0 1 0 ...
 $ Tgroup          : num  0 0 0 0 0 0 0 0 1 1 ...
 $ Ttherapi        : num  0 0 0 0 0 0 0 0 0 0 ...
 $ Tcompar         : num  0 0 0 0 0 0 0 0 0 0 ...
 $ Tdoxorubicin    : num  0 0 0 0 0 0 0 0 0 0 ...
 $ Tdocetaxel      : num  0 0 0 0 0 0 0 0 0 0 ...
 $ Tresult         : num  0 0 0 0 0 0 0 0 0 0 ...
 $ Tplus           : num  0 0 0 0 0 0 0 0 0 0 ...
 $ Tclinic         : num  0 0 0 0 0 0 0 0 0 0 ...
 $ A100            : num  0 1 0 0 0 0 0 0 0 0 ...
 $ Aactiv          : num  0 2 0 1 0 0 1 0 0 0 ...
 $ Aassess         : num  0 2 1 2 0 1 0 0 0 3 ...
 $ Abreast         : num  0 2 3 3 3 4 2 2 2 3 ...
 $ Acarcinoma      : num  0 5 0 0 0 0 0 0 0 2 ...
 $ Acase           : num  0 11 0 0 1 0 0 0 0 0 ...
 $ Acell           : num  0 3 0 0 0 1 0 0 0 0 ...
 $ Aclinic         : num  0 1 0 1 0 0 0 0 0 0 ...
 $ Aconsist        : num  0 1 0 0 0 0 0 0 0 0 ...
 $ Acontrol        : num  0 1 0 0 0 0 0 0 1 0 ...
 $ Adiffer         : num  0 1 2 1 3 0 0 1 0 1 ...
 $ Adiseas         : num  0 1 0 1 3 0 0 1 0 0 ...
 $ Aless           : num  0 4 1 0 0 0 0 0 0 6 ...
 $ Anegat          : num  0 4 0 0 0 3 0 0 0 0 ...
 $ Aobserv         : num  0 2 1 0 1 0 0 0 0 0 ...
 $ Aone            : num  0 1 0 0 0 0 0 0 0 0 ...
 $ Apatient        : num  0 1 9 5 5 6 8 3 2 5 ...
 $ Apopul          : num  0 8 0 0 0 0 0 0 0 0 ...
 $ Aposit          : num  0 2 0 1 0 5 0 0 0 0 ...
 $ Aseven          : num  0 1 0 0 0 0 0 0 0 0 ...
 $ Ashow           : num  0 2 0 0 1 0 1 0 0 3 ...
 $ Astudi          : num  0 1 1 1 0 1 3 2 0 1 ...
 $ Atest           : num  0 2 1 1 0 0 0 0 0 0 ...
 $ Atherapi        : num  0 2 0 1 0 0 0 0 0 0 ...
 $ A500            : num  0 0 1 0 2 0 0 0 0 0 ...
 $ Aaddit          : num  0 0 2 0 0 0 0 0 0 0 ...
 $ Aamong          : num  0 0 2 4 0 0 0 0 1 0 ...
 $ Aarm            : num  0 0 7 4 2 0 0 1 0 1 ...
 $ Aassign         : num  0 0 2 1 0 0 0 1 1 1 ...
 $ Aassoci         : num  0 0 1 3 0 2 0 1 2 0 ...
 $ Abackground     : num  0 0 1 1 1 0 0 1 1 0 ...
 $ Abetter         : num  0 0 1 0 1 1 0 0 0 0 ...
 $ Acancer         : num  0 0 2 3 3 3 2 2 3 0 ...
 $ Achang          : num  0 0 1 0 0 0 0 0 0 0 ...
 $ Achemotherapi   : num  0 0 1 2 3 5 2 0 1 0 ...
 $ Acompar         : num  0 0 1 2 0 0 0 1 1 1 ...
 $ Acomplet        : num  0 0 3 0 1 1 1 2 0 0 ...
 $ Aconclus        : num  0 0 1 0 1 0 0 0 0 0 ...
 $ Aconfid         : num  0 0 1 1 0 0 0 0 0 0 ...
 $ Acontinu        : num  0 0 2 0 1 0 0 2 0 0 ...
 $ Acycl           : num  0 0 6 0 2 0 1 0 0 0 ...
 $ Acyclophosphamid: num  0 0 1 1 1 1 3 0 0 0 ...
 $ Aday            : num  0 0 1 0 0 0 3 0 0 0 ...
 $ Adecreas        : num  0 0 1 0 0 0 0 0 1 0 ...
 $ Adefin          : num  0 0 2 0 0 0 0 0 0 0 ...
 $ Ademonstr       : num  0 0 1 0 0 0 0 0 0 0 ...
 $ Adocetaxel      : num  0 0 1 0 0 0 3 0 0 0 ...
 $ Adoxorubicin    : num  0 0 1 0 0 1 0 0 0 0 ...
 $ Aeffect         : num  0 0 2 0 0 1 0 2 0 1 ...
 $ Aefficaci       : num  0 0 1 0 0 0 0 0 0 0 ...
 $ Aenrol          : num  0 0 1 0 0 0 1 0 0 1 ...
 $ Afour           : num  0 0 4 0 0 0 0 0 0 0 ...
 $ Ahematolog      : num  0 0 1 0 0 0 0 0 0 0 ...
 $ Ainiti          : num  0 0 3 0 0 0 0 1 0 0 ...
 $ Ainterv         : num  0 0 1 1 0 0 0 0 0 0 ...
 $ Aleast          : num  0 0 2 0 0 0 0 0 0 0 ...
 $ Alymph          : num  0 0 1 4 0 3 0 0 1 0 ...
 $ Amethod         : num  0 0 1 0 1 0 0 1 1 0 ...
 $ Amgm2           : num  0 0 5 0 4 0 9 0 0 0 ...
 $ Aneoadjuv       : num  0 0 2 0 0 0 0 0 0 0 ...
 $ Anode           : num  0 0 1 3 0 0 0 0 1 0 ...
 $ Anumber         : num  0 0 1 1 0 2 1 0 0 0 ...
 $ Aoutcom         : num  0 0 2 0 0 0 0 0 0 2 ...
 $ Apatholog       : num  0 0 3 0 0 1 0 0 0 0 ...
 $ Aper            : num  0 0 1 0 0 0 0 0 0 0 ...
 $ Aprevious       : num  0 0 1 0 1 0 2 0 0 0 ...
 $ Arandom         : num  0 0 2 1 1 0 0 1 1 2 ...
 $ Arate           : num  0 0 1 1 2 0 1 0 1 0 ...
  [list output truncated]

367
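
The count follows from the pieces: 31 title columns plus 335 abstract columns plus 1 outcome column gives 367. When only the number is needed, ncol() or dim() is shorter than reading the first line of str(). A sketch on toy data frames (the names a, b and d are hypothetical):

```r
a <- data.frame(T1 = 1:3, T2 = 4:6)   # stands in for the title terms
b <- data.frame(A1 = 7:9)             # stands in for the abstract terms
d <- cbind(a, b)
d$trial <- c(0, 1, 0)                 # outcome column added last

ncol(d)   # 2 + 1 + 1 columns
dim(d)    # rows, then columns
```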

Problem 3.3 - Building a Model

Now that we have prepared our dataframe, it’s time to split it into training and testing sets and to build models. Set the random seed to 144 and use the sample.split function from the caTools package to split dtm into dataframes named “train” and “test”, putting 70% of the data in the training set.

set.seed(144)
trialSplit <- sample.split(dtm$trial, 0.7)
train <- subset(dtm, trialSplit == TRUE)
test <- subset(dtm, trialSplit == FALSE)
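
sample.split() is documented to preserve the relative ratio of the outcome labels in both parts. The sketch below illustrates that idea in base R on hypothetical labels. It is not the caTools implementation and will not reproduce the same rows that sample.split() picks; the names y and in_train are made up.

```r
set.seed(144)
y <- c(rep(0, 10), rep(1, 10))   # hypothetical outcome labels

# Draw 70% of the row indices within each class separately
in_train <- logical(length(y))
for (cls in unique(y)) {
  idx <- which(y == cls)
  in_train[sample(idx, size = round(0.7 * length(idx)))] <- TRUE
}

table(y, in_train)   # each class contributes 7 training rows and 3 test rows
```

Splitting within each class keeps the share of trials roughly equal in the training and testing sets, so the baseline accuracy is comparable between them.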

What is the accuracy of the baseline model on the training set? (Remember that the baseline model predicts the most frequent outcome in the training set for all observations.)

table(train$trial)


  0   1 
730 572

730 / (730 + 572)

[1] 0.5606759
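The same baseline figure can be computed without reading the counts off the table by hand. This is a minimal sketch: classCounts is an illustrative name that holds the two counts printed above, and in the analysis itself table(train$trial) would take its place.

```r
# Class counts copied from the table(train$trial) output above;
# classCounts is an illustrative name, not part of the original analysis.
classCounts <- c("0" = 730, "1" = 572)

# The baseline always predicts the most frequent class,
# so its accuracy is simply the share of that class.
baselineClass <- names(which.max(classCounts))
baselineAccuracy <- max(classCounts) / sum(classCounts)

baselineClass
baselineAccuracy
```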

Problem 3.4 - Building a Model

Build a CART model called trialCART, using all the independent variables in the training set to train the model, and then plot the CART model. Just use the default parameters to build the model (don’t add a minbucket or cp value). Remember to add the method="class" argument, since this is a classification problem.

trialCART <- rpart(trial ~ ., data = train, method = "class")

What is the name of the first variable the model split on?

prp(trialCART)

Tphase

Problem 3.5 - Building a Model

Obtain the training set predictions for the model (do not yet predict on the testing set). Extract the predicted probability of a result being a trial (recall that this involves not setting a type argument, and keeping only the second column of the predict output).

What is the maximum predicted probability for any result?

predTrain <- predict(trialCART)
predTrain[1:10,]

           0          1
1  0.8636364 0.13636364
2  0.8636364 0.13636364
3  0.1281139 0.87188612
5  0.2176871 0.78231293
6  0.9454545 0.05454545
7  0.2176871 0.78231293
10 0.1281139 0.87188612
12 0.7125000 0.28750000
13 0.1281139 0.87188612
14 0.7125000 0.28750000

predTrainProb <- predTrain[, 2]
max(predTrainProb)

[1] 0.8718861

summary(predTrainProb)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.05455 0.13636 0.28750 0.43932 0.78231 0.87189

Problem 3.6 - Building a Model

Without running the analysis, how do you expect the maximum predicted probability to differ in the testing set?

The maximum predicted probability will likely be exactly the same in the testing set. Because the CART tree assigns the same predicted probability to each leaf node and there are a small number of leaf nodes compared to data points, we expect exactly the same maximum predicted probability.
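The leaf-node argument can be checked on any fitted tree. The sketch below is an illustration only: it assumes the rpart package is installed and uses the kyphosis dataset that ships with it as a stand-in for the trials data, and the names fit, nLeaves, probs and newProbs are made up for this example.

```r
library(rpart)

# kyphosis ships with rpart; it stands in for the trials data here.
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")

# Each leaf stores one class-probability vector, so predictions can take
# at most as many distinct values as there are leaves.
nLeaves <- sum(fit$frame$var == "<leaf>")
probs <- predict(fit)[, 2]

nLeaves
length(unique(probs))

# Predictions for any other rows reuse the same leaf values.
newProbs <- predict(fit, newdata = kyphosis[1:20, ])[, 2]
all(newProbs %in% probs)
```

The training set predictions printed earlier in this post show the same pattern: only a handful of distinct probabilities appear, each corresponding to one leaf of trialCART.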

Problem 3.7 - Building a Model

For these questions, use a threshold probability of 0.5 to predict that an observation is a clinical trial.

table(train$trial, predTrainProb >= 0.5)

   
    FALSE TRUE
  0   631   99
  1   131  441

What is the training set accuracy of the CART model?

(631 + 441) / nrow(train)

[1] 0.8233487

What is the training set sensitivity of the CART model?

441 / (441 + 131)

[1] 0.770979

What is the training set specificity of the CART model?

631 / (631 + 99)

[1] 0.8643836
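The three metrics above can also be derived from a single confusion matrix object rather than typed in cell by cell. This is a minimal sketch: the matrix is rebuilt by hand from the counts printed above, and the name confusion is illustrative. In the analysis, the result of table(train$trial, predTrainProb >= 0.5) could be used in its place.

```r
# Confusion matrix rebuilt from the training set table above;
# rows are actual classes, columns are predictions at threshold 0.5.
confusion <- matrix(c(631, 131, 99, 441), nrow = 2,
                    dimnames = list(actual = c("0", "1"),
                                    predicted = c("FALSE", "TRUE")))

# Correct predictions sit on the diagonal.
accuracy <- sum(diag(confusion)) / sum(confusion)

# Sensitivity: share of actual trials that were predicted as trials.
sensitivity <- confusion["1", "TRUE"] / sum(confusion["1", ])

# Specificity: share of actual non-trials that were predicted as non-trials.
specificity <- confusion["0", "FALSE"] / sum(confusion["0", ])

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity), 4)
```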

Problem 4.1 - Evaluating the Model on the Testing Set

Evaluate the CART model on the testing set using the predict function and create a vector of predicted probabilities predTest.

pred <- predict(trialCART, newdata = test)
pred[1:10,]

           0         1
4  0.1281139 0.8718861
8  0.8636364 0.1363636
9  0.7125000 0.2875000
11 0.7125000 0.2875000
19 0.7125000 0.2875000
31 0.8636364 0.1363636
40 0.8636364 0.1363636
42 0.8636364 0.1363636
43 0.8636364 0.1363636
48 0.7125000 0.2875000

predTest <- pred[, 2]

What is the testing set accuracy, assuming a probability threshold of 0.5 for predicting that a result is a clinical trial?

table(test$trial, predTest >= 0.5)

   
    FALSE TRUE
  0   261   52
  1    83  162

(261 + 162) / nrow(test)

[1] 0.7580645

Problem 4.2 - Evaluating the Model on the Testing Set

Using the ROCR package, what is the testing set AUC of the prediction model?

predROCR <- prediction(predTest, test$trial)
performance(predROCR, "auc")@y.values

[[1]]
[1] 0.8371063
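For intuition about what the AUC measures, it can be read as the probability that a randomly chosen trial receives a higher predicted probability than a randomly chosen non-trial, with ties counted as half. The sketch below computes that quantity from ranks in base R. The function aucByRank and the toy vectors are made up for illustration and are not part of ROCR or of the original analysis.

```r
# Rank-based AUC: the probability that a randomly chosen positive
# scores higher than a randomly chosen negative (ties count half).
aucByRank <- function(scores, labels) {
  r <- rank(scores)              # average ranks, so tied scores are handled
  nPos <- sum(labels == 1)
  nNeg <- sum(labels == 0)
  (sum(r[labels == 1]) - nPos * (nPos + 1) / 2) / (nPos * nNeg)
}

# Made-up toy data, not taken from the trials dataset.
toyScores <- c(0.1, 0.4, 0.35, 0.8)
toyLabels <- c(0, 0, 1, 1)

aucByRank(toyScores, toyLabels)
```

Applied to the analysis, aucByRank(predTest, test$trial) would be expected to agree closely with the ROCR value above, although that comparison was not run here.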

Problem 5.1 - Decision-Maker Tradeoffs

What is the cost associated with the model in Step 1 making a false negative prediction?

A paper that should have been included in Set A will be missed, affecting the quality of the results of Step 3.

Problem 5.2 - Decision-Maker Tradeoffs

What is the cost associated with the model in Step 1 making a false positive prediction?

A paper will be mistakenly added to Set A, yielding additional work in Step 2 of the process but not affecting the quality of the results of Step 3.

Problem 5.3 - Decision-Maker Tradeoffs

Given the costs associated with false positives and false negatives, which of the following is most accurate?

A false negative is more costly than a false positive; the decision maker should use a probability threshold less than 0.5 for the machine learning model.
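The effect of lowering the threshold can be sketched on a small example. Everything here is hypothetical: toyProb and toyLabel are made-up values chosen to resemble leaf probabilities from a tree, and errorsAt is an illustrative helper rather than a function from any package.

```r
# Made-up predicted probabilities and true labels, for illustration only.
toyProb  <- c(0.05, 0.14, 0.29, 0.29, 0.78, 0.87, 0.29, 0.14)
toyLabel <- c(0,    0,    1,    0,    1,    1,    1,    0)

# Count false negatives and false positives at a given threshold.
errorsAt <- function(threshold, prob, label) {
  predicted <- prob >= threshold
  c(falseNeg = sum(!predicted & label == 1),
    falsePos = sum(predicted & label == 0))
}

errorsAt(0.5, toyProb, toyLabel)   # stricter threshold
errorsAt(0.2, toyProb, toyLabel)   # more permissive threshold
```

In this toy data the lower threshold removes the false negatives at the price of one extra false positive, which is the trade the decision maker prefers when a missed paper costs more than an extra manual review.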

Rihad Variawa

Data Scientist