The medical literature is enormous! PubMed, a database of medical publications maintained by the U.S. National Library of Medicine, has indexed over 23 million medical publications. Further, the rate of medical publication has increased over time, and now there are nearly 1 million new publications in the field each year, or more than one per minute.
The large size and fast-changing nature of the medical literature have increased the need for reviews, which search databases like PubMed for papers on a particular topic and then report results from the papers found. While such reviews are often performed manually, with multiple people reviewing each search result, this is tedious and time-consuming. In this analysis, I’ll see how text analytics can be used to automate the process of information retrieval.
The dataset consists of the titles (variable title) and abstracts (variable abstract) of papers retrieved in a PubMed search. Each search result is labeled with whether the paper is a clinical trial testing a drug therapy for cancer (variable trial). These labels were obtained by two people reviewing each search result and accessing the actual paper if necessary, as part of a literature review of clinical trials testing drug therapies for advanced and metastatic breast cancer.
Loading the packages
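The original shows no code for this step. A minimal sketch, assuming the analysis relies on the tm package (for Corpus, tm_map, and DocumentTermMatrix, all used below) and on SnowballC for stemming. The author's exact package set is not shown, so treat the list as an assumption:

```r
# Assumed package set: tm for corpus handling, SnowballC for stemming support.
pkgs <- c("tm", "SnowballC")

# Check which packages are installed without attaching them yet.
available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

if (all(available)) {
  library(tm)
  library(SnowballC)
} else {
  message("Install first: ", paste(pkgs[!available], collapse = ", "))
}
```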
Problem 1.1 - Loading the Data
Load clinical_trial.csv into a dataframe called trials (remembering to add the argument stringsAsFactors=FALSE when working with text analytics, so that the text is read in properly), and investigate the dataframe.
trials <- read.csv("clinical_trial.csv", stringsAsFactors = FALSE)
str(trials)
'data.frame': 1860 obs. of 3 variables:
$ title : chr "Treatment of Hodgkin's disease and other cancers with 1,3-bis(2-chloroethyl)-1-nitrosourea (BCNU; NSC-409962)." "Cell mediated immune status in malignancy--pretherapy and post-therapy assessment." "Neoadjuvant vinorelbine-capecitabine versus docetaxel-doxorubicin-cyclophosphamide in early nonresponsive breas"| __truncated__ "Randomized phase 3 trial of fluorouracil, epirubicin, and cyclophosphamide alone or followed by Paclitaxel for "| __truncated__ ...
$ abstract: chr "" "Twenty-eight cases of malignancies of different kinds were studied to assess T-cell activity and population bef"| __truncated__ "BACKGROUND: Among breast cancer patients, nonresponse to initial neoadjuvant chemotherapy is associated with un"| __truncated__ "BACKGROUND: Taxanes are among the most active drugs for the treatment of metastatic breast cancer, and, as a co"| __truncated__ ...
$ trial : int 1 0 1 1 1 0 1 0 0 0 ...
summary(trials)
title abstract trial
Length:1860 Length:1860 Min. :0.0000
Class :character Class :character 1st Qu.:0.0000
Mode :character Mode :character Median :0.0000
Mean :0.4392
3rd Qu.:1.0000
Max. :1.0000
IMPORTANT NOTE: Should you get an error like “invalid multibyte string” when performing certain parts of this analysis, use the argument fileEncoding="latin1" when reading in the file with read.csv. This should cause those errors to go away.
We can use R’s string functions to learn more about the titles and abstracts of the located papers. The nchar() function counts the number of characters in a piece of text.
Use the nchar() function on the variables in the dataframe. How many characters are there in the longest abstract? (Longest here is defined as the abstract with the largest number of characters.)
max(nchar(trials$abstract))
[1] 3708
which.max(nchar(trials$abstract))
[1] 664
trials[664, ]
title
664 Five versus more than five years of tamoxifen therapy for breast cancer patients with negative lymph nodes and estrogen receptor-positive tumors.
abstract
664 BACKGROUND: In 1982, the National Surgical Adjuvant Breast and Bowel Project initiated a randomized, double-blinded, placebo-controlled trial (B-14) to determine the effectiveness of adjuvant tamoxifen therapy in patients with primary operable breast cancer who had estrogen receptor-positive tumors and no axillary lymph node involvement. The findings indicated that tamoxifen therapy provided substantial benefit to patients with early stage disease. However, questions arose about how long the observed benefit would persist, about the duration of therapy necessary to maintain maximum benefit, and about the nature and severity of adverse effects from prolonged treatment.PURPOSE: We evaluated the outcome of patients in the B-14 trial through 10 years of follow-up. In addition, the effects of 5 years versus more than 5 years of tamoxifen therapy were compared.METHODS: In the trial, patients were initially assigned to receive either tamoxifen at 20 mg/day (n = 1404) or placebo (n = 1414). Tamoxifen-treated patients who remained disease free after 5 years of therapy were then reassigned to receive either another 5 years of tamoxifen (n = 322) or 5 years of placebo (n = 321). After the study began, another group of patients who met the same protocol eligibility requirements as the randomly assigned patients were registered to receive tamoxifen (n = 1211). Registered patients who were disease free after 5 years of treatment were also randomly assigned to another 5 years of tamoxifen (n = 261) or to 5 years of placebo (n = 249). To compare 5 years with more than 5 years of tamoxifen therapy, data relating to all patients reassigned to an additional 5 years of the drug were combined. Patients who were not reassigned to either tamoxifen or placebo continued to be followed in the study. 
Survival, disease-free survival, and distant disease-free survival (relating to failure at distant sites) were estimated by use of the Kaplan-Meier method; differences between the treatment groups were assessed by use of the logrank test. The relative risks of failure (with 95% confidence intervals [CIs]) were determined by use of the Cox proportional hazards model. Reported P values are two-sided.RESULTS: Through 10 years of follow-up, a significant advantage in disease-free survival (69% versus 57%, P < .0001; relative risk = 0.66; 95% CI = 0.58-0.74), distant disease-free survival (76% versus 67%, P < .0001; relative risk = 0.70; 95% CI = 0.61-0.81), and survival (80% versus 76%, P = .02; relative risk = 0.84; 95% CI = 0.71-0.99) was found for patients in the group first assigned to receive tamoxifen. The survival benefit extended to those 49 years of age or younger and to those 50 years of age or older. Tamoxifen therapy was associated with a 37% reduction in the incidence of contralateral (opposite) breast cancer (P = .007). Through 4 years after the reassignment of tamoxifen-treated patients to either continued-therapy or placebo groups, advantages in disease-free survival (92% versus 86%, P = .003) and distant disease-free survival (96% versus 90%, P = .01) were found for those who discontinued tamoxifen treatment. Survival was 96% for those who discontinued tamoxifen compared with 94% for those who continued tamoxifen treatment (P = .08). A higher incidence of thromboembolic events was seen in tamoxifen-treated patients (through 5 years, 1.7% versus 0.4%). Except for endometrial cancer, the incidence of second cancers was not increased with tamoxifen therapy.CONCLUSIONS AND IMPLICATIONS: The benefit from 5 years of tamoxifen therapy persists through 10 years of follow-up. No additional advantage is obtained from continuing tamoxifen therapy for more than 5 years.
trial
664 1
nchar(trials[664, ]$abstract)
[1] 3708
3708
Problem 1.2 - Loading the Data
How many search results provided no abstract? (HINT: A search result provided no abstract if the number of characters in the abstract field is zero.)
sum(nchar(trials$abstract) == 0)
[1] 112
Problem 1.3 - Loading the Data
Find the observation with the minimum number of characters in the title (the variable “title”) out of all of the observations in this dataset. What is the text of the title of this article?
Include capitalization and punctuation in your response, but don’t include the quotes.
min(nchar(trials$title))
[1] 28
which.min(nchar(trials$title))
[1] 1258
trials[1258,]
title
1258 A decade of letrozole: FACE.
abstract
1258 Third-generation nonsteroidal aromatase inhibitors (AIs), letrozole and anastrozole, are superior to tamoxifen as initial therapy for early breast cancer but have not been directly compared in a head-to-head adjuvant trial. Cumulative evidence suggests that AIs are not equivalent in terms of potency of estrogen suppression and that there may be differences in clinical efficacy. Thus, with no data from head-to-head comparisons of the AIs as adjuvant therapy yet available, the question of whether there are efficacy differences between the AIs remains. To help answer this question, the Femara versus Anastrozole Clinical Evaluation (FACE) is a phase IIIb open-label, randomized, multicenter trial designed to test whether letrozole or anastrozole has superior efficacy as adjuvant treatment of postmenopausal women with hormone receptor (HR)- and lymph node-positive breast cancer. Eligible patients (target accrual, N=4,000) are randomized to receive either letrozole 2.5 mg or anastrozole 1 mg daily for up to 5 years. The primary objective is to compare disease-free survival at 5 years. Secondary end points include safety, overall survival, time to distant metastases, and time to contralateral breast cancer. The FACE trial will determine whether or not letrozole offers a greater clinical benefit to postmenopausal women with HR+ early breast cancer at increased risk of early recurrence compared with anastrozole.
trial
1258 0
trials[1258,]$title
[1] "A decade of letrozole: FACE."
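The checks in Problems 1.1 to 1.3 share one pattern: count characters with nchar(), then locate or count with max(), which.min(), or a logical sum. A small sketch on a made-up data frame (toy and its contents are illustrative, not part of the dataset):

```r
# Made-up records standing in for the trials data frame.
toy <- data.frame(
  title = c("A long-ish title here.", "Short one.", "Mid-length title."),
  abstract = c("", "Some abstract text.", ""),
  stringsAsFactors = FALSE
)

title_len <- nchar(toy$title)
abstract_len <- nchar(toy$abstract)

# Length of the longest abstract.
longest_abstract <- max(abstract_len)

# Number of records with an empty abstract.
n_empty <- sum(abstract_len == 0)

# Text of the shortest title, pulled directly from the column.
shortest_title <- toy$title[which.min(title_len)]

print(longest_abstract)
print(n_empty)
print(shortest_title)
```

Indexing the column directly, as in toy$title[which.min(title_len)], avoids printing the whole row. Note that which.min() and which.max() return only the first position when there are ties.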
Problem 2.1 - Preparing the Corpus
Because we have both title and abstract information for trials, we need to build two corpora instead of one. Name them corpusTitle and corpusAbstract.
The code performs the following tasks (you might need to load the “tm” package first if it isn’t already loaded). Make sure to perform them in this order.
# 1) Convert the title variable to corpusTitle and the abstract variable to corpusAbstract
corpusTitle <- Corpus(VectorSource(trials$title))
corpusTitle[[1]]$content
[1] "Treatment of Hodgkin's disease and other cancers with 1,3-bis(2-chloroethyl)-1-nitrosourea (BCNU; NSC-409962)."
corpusAbstract <- Corpus(VectorSource(trials$abstract))
corpusAbstract[[1]]$content
[1] ""
# 2) Convert corpusTitle and corpusAbstract to lowercase
corpusTitle = tm_map(corpusTitle, content_transformer(tolower))
Warning in tm_map.SimpleCorpus(corpusTitle, content_transformer(tolower)):
transformation drops documents
corpusAbstract = tm_map(corpusAbstract, content_transformer(tolower))
Warning in tm_map.SimpleCorpus(corpusAbstract,
content_transformer(tolower)): transformation drops documents
#corpusTitle = tm_map(corpusTitle, PlainTextDocument)
#corpusAbstract = tm_map(corpusAbstract, PlainTextDocument)
# 3) Remove the punctuation in corpusTitle and corpusAbstract
corpusTitle = tm_map(corpusTitle, removePunctuation)
Warning in tm_map.SimpleCorpus(corpusTitle, removePunctuation):
transformation drops documents
corpusTitle[[2]]$content
[1] "cell mediated immune status in malignancypretherapy and posttherapy assessment"
corpusAbstract = tm_map(corpusAbstract, removePunctuation)
Warning in tm_map.SimpleCorpus(corpusAbstract, removePunctuation):
transformation drops documents
# 4) Remove the English language stop words from corpusTitle and corpusAbstract
corpusTitle <- tm_map(corpusTitle, removeWords, stopwords("english"))
Warning in tm_map.SimpleCorpus(corpusTitle, removeWords,
stopwords("english")): transformation drops documents
corpusTitle[[2]]$content
[1] "cell mediated immune status malignancypretherapy posttherapy assessment"
corpusAbstract <- tm_map(corpusAbstract, removeWords, stopwords("english"))
Warning in tm_map.SimpleCorpus(corpusAbstract, removeWords,
stopwords("english")): transformation drops documents
corpusAbstract[[2]]$content
[1] "twentyeight cases malignancies different kinds studied assess tcell activity population institution therapy fifteen cases diagnosed nonmetastasising squamous cell carcinoma larynx pharynx laryngopharynx hypopharynx tonsils seven cases nonmetastasising infiltrating duct carcinoma breast 6 cases nonhodgkins lymphoma nhl observed 3 15 cases 20 squamous cell carcinoma cases mantoux test mt negative tcell population less 40 2 7 cases 286 infiltrating duct carcinoma breast mt negative tcell population less 40 3 6 cases 50 nhl mt negative tcell population less 40 normal controls consisting apparently normal healthy adults tcell population 40 mt positive patients showed negative skin test tcell population less 40 subjected assessment tcell population activity appropriate therapy clinical cure disease observed 2 3 cases 6666 squamous cell carcinomas 2 2 cases 100 adenocarcinomas one 3 cases 3333 nhl showed positive conversion tcell population 40"
# 5) Stem the words in corpusTitle and corpusAbstract (each stemming might take a few minutes)
corpusTitle = tm_map(corpusTitle, stemDocument)
Warning in tm_map.SimpleCorpus(corpusTitle, stemDocument): transformation
drops documents
corpusTitle[[2]]$content
[1] "cell mediat immun status malignancypretherapi posttherapi assess"
corpusAbstract = tm_map(corpusAbstract, stemDocument)
Warning in tm_map.SimpleCorpus(corpusAbstract, stemDocument):
transformation drops documents
corpusAbstract[[2]]$content
[1] "twentyeight case malign differ kind studi assess tcell activ popul institut therapi fifteen case diagnos nonmetastasis squamous cell carcinoma larynx pharynx laryngopharynx hypopharynx tonsil seven case nonmetastasis infiltr duct carcinoma breast 6 case nonhodgkin lymphoma nhl observ 3 15 case 20 squamous cell carcinoma case mantoux test mt negat tcell popul less 40 2 7 case 286 infiltr duct carcinoma breast mt negat tcell popul less 40 3 6 case 50 nhl mt negat tcell popul less 40 normal control consist appar normal healthi adult tcell popul 40 mt posit patient show negat skin test tcell popul less 40 subject assess tcell popul activ appropri therapi clinic cure diseas observ 2 3 case 6666 squamous cell carcinoma 2 2 case 100 adenocarcinoma one 3 case 3333 nhl show posit convers tcell popul 40"
# 6) Build a document term matrix called dtmTitle from corpusTitle and dtmAbstract from corpusAbstract
dtmTitle = DocumentTermMatrix(corpusTitle)
dtmTitle
<<DocumentTermMatrix (documents: 1860, terms: 2836)>>
Non-/sparse entries: 23416/5251544
Sparsity : 100%
Maximal term length: 49
Weighting : term frequency (tf)
dtmAbstract = DocumentTermMatrix(corpusAbstract)
dtmAbstract
<<DocumentTermMatrix (documents: 1860, terms: 12451)>>
Non-/sparse entries: 153290/23005570
Sparsity : 99%
Maximal term length: 67
Weighting : term frequency (tf)
# 7) Limit dtmTitle and dtmAbstract to terms with sparseness of at most 95% (aka terms that appear in at least 5% of documents)
dtmTitle <- removeSparseTerms(dtmTitle, 0.95)
dtmTitle
<<DocumentTermMatrix (documents: 1860, terms: 31)>>
Non-/sparse entries: 10683/46977
Sparsity : 81%
Maximal term length: 15
Weighting : term frequency (tf)
dtmAbstract <- removeSparseTerms(dtmAbstract, 0.95)
dtmAbstract
<<DocumentTermMatrix (documents: 1860, terms: 335)>>
Non-/sparse entries: 91969/531131
Sparsity : 85%
Maximal term length: 15
Weighting : term frequency (tf)
# 8) Convert dtmTitle and dtmAbstract to data frames (keep the names dtmTitle and dtmAbstract)
dtmTitle <- as.data.frame(as.matrix(dtmTitle))
dtmAbstract <- as.data.frame(as.matrix(dtmAbstract))
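One side effect of step 3 is visible in the output above: "malignancy--pretherapy" becomes the single token "malignancypretherapy", because the punctuation is deleted rather than replaced with a space. A base-R sketch of the difference, using gsub() as a stand-in rather than tm's own implementation:

```r
x <- "cell mediated immune status in malignancy--pretherapy and post-therapy assessment"

# Deleting punctuation fuses the words on either side of it.
deleted <- gsub("[[:punct:]]", "", x)

# Replacing each run of punctuation with a space keeps the words separate.
spaced <- gsub("[[:punct:]]+", " ", x)

print(deleted)
print(spaced)
```

Whether fused tokens matter depends on the task. Here most of them are likely rare enough to be dropped by the sparsity filter in step 7.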
When removing stop words, use tm_map(corpusTitle, removeWords, sw) and tm_map(corpusAbstract, removeWords, sw) instead of tm_map(corpusTitle, removeWords, stopwords("english")) and tm_map(corpusAbstract, removeWords, stopwords("english")).
length(stopwords("english"))
[1] 174
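The eight steps above can be rehearsed end to end on a few made-up sentences before running them on all 1,860 documents. The sketch below assumes tm and SnowballC are installed; docs and toy_df are hypothetical names, and the sentences are illustrative:

```r
library(tm)         # Corpus, tm_map, DocumentTermMatrix
library(SnowballC)  # stemming support for stemDocument

# Made-up documents standing in for trials$title.
docs <- c(
  "Randomized trial of the drug in patients.",
  "The patients were treated with a placebo!",
  "A phase III trial: results for treated patients"
)

corpus <- Corpus(VectorSource(docs))                          # 1) build corpus
corpus <- tm_map(corpus, content_transformer(tolower))        # 2) lowercase
corpus <- tm_map(corpus, removePunctuation)                   # 3) punctuation
corpus <- tm_map(corpus, removeWords, stopwords("english"))   # 4) stop words
corpus <- tm_map(corpus, stemDocument)                        # 5) stemming

dtm <- DocumentTermMatrix(corpus)                             # 6) term counts
toy_df <- as.data.frame(as.matrix(dtm))                       # 8) data frame

print(colnames(toy_df))
```

Step 7 (removeSparseTerms) is left out because three documents are too few for a sparsity filter to be meaningful.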
How many terms remain in dtmTitle after removing sparse terms (aka how many columns does it have)?
str(dtmTitle)
'data.frame': 1860 obs. of 31 variables:
$ cancer : num 1 0 1 1 1 1 0 1 1 2 ...
$ treatment : num 1 0 0 0 1 0 0 0 0 1 ...
$ breast : num 0 0 1 1 1 1 0 1 1 1 ...
$ earli : num 0 0 1 1 0 0 0 1 0 0 ...
$ iii : num 0 0 1 0 0 0 0 0 0 1 ...
$ phase : num 0 0 1 1 0 0 0 0 0 1 ...
$ random : num 0 0 1 1 1 0 0 0 0 1 ...
$ trial : num 0 0 1 1 1 0 0 1 1 1 ...
$ versus : num 0 0 1 0 0 0 0 1 0 0 ...
$ cyclophosphamid: num 0 0 0 1 0 0 0 0 0 0 ...
$ chemotherapi : num 0 0 0 0 1 1 0 0 0 0 ...
$ combin : num 0 0 0 0 1 0 1 0 0 0 ...
$ effect : num 0 0 0 0 1 0 0 1 0 1 ...
$ metastat : num 0 0 0 0 1 0 0 0 0 0 ...
$ patient : num 0 0 0 0 1 0 1 0 1 1 ...
$ respons : num 0 0 0 0 0 1 0 0 0 0 ...
$ advanc : num 0 0 0 0 0 0 1 0 0 0 ...
$ postmenopaus : num 0 0 0 0 0 0 0 1 1 0 ...
$ randomis : num 0 0 0 0 0 0 0 1 1 0 ...
$ studi : num 0 0 0 0 0 0 0 1 0 0 ...
$ tamoxifen : num 0 0 0 0 0 0 0 2 1 0 ...
$ women : num 0 0 0 0 0 0 0 1 0 0 ...
$ adjuv : num 0 0 0 0 0 0 0 0 1 0 ...
$ group : num 0 0 0 0 0 0 0 0 1 1 ...
$ therapi : num 0 0 0 0 0 0 0 0 0 0 ...
$ compar : num 0 0 0 0 0 0 0 0 0 0 ...
$ doxorubicin : num 0 0 0 0 0 0 0 0 0 0 ...
$ docetaxel : num 0 0 0 0 0 0 0 0 0 0 ...
$ result : num 0 0 0 0 0 0 0 0 0 0 ...
$ plus : num 0 0 0 0 0 0 0 0 0 0 ...
$ clinic : num 0 0 0 0 0 0 0 0 0 0 ...
31
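To make the 0.95 threshold in step 7 concrete: a term's sparsity is the share of documents in which it does not appear, so a cutoff of 0.95 keeps terms found in roughly 5% of documents or more (about 93 of the 1,860 documents here). The sketch below reproduces that idea in base R on a made-up count matrix. It is not tm's implementation, and the strict inequality at the boundary is an assumption:

```r
n_docs <- 20

# Made-up term counts: rows are documents, columns are terms.
counts <- cbind(
  cancer = c(rep(1, 10), rep(0, 10)),  # appears in 10 of 20 documents
  trial  = c(rep(1, 2),  rep(0, 18)),  # appears in 2 of 20 documents
  rare   = c(1, rep(0, 19))            # appears in 1 of 20 documents
)

# Number of documents containing each term, and the resulting sparsity.
doc_freq <- colSums(counts > 0)
sparsity <- 1 - doc_freq / n_docs

# Keep terms found in more than 5% of documents (strict inequality assumed).
keep <- doc_freq > n_docs * (1 - 0.95)
filtered <- counts[, keep, drop = FALSE]

print(sparsity)
print(colnames(filtered))
```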
How many terms remain in dtmAbstract?
str(dtmAbstract)
'data.frame': 1860 obs. of 335 variables:
$ 100 : num 0 1 0 0 0 0 0 0 0 0 ...
$ activ : num 0 2 0 1 0 0 1 0 0 0 ...
$ assess : num 0 2 1 2 0 1 0 0 0 3 ...
$ breast : num 0 2 3 3 3 4 2 2 2 3 ...
$ carcinoma : num 0 5 0 0 0 0 0 0 0 2 ...
$ case : num 0 11 0 0 1 0 0 0 0 0 ...
$ cell : num 0 3 0 0 0 1 0 0 0 0 ...
$ clinic : num 0 1 0 1 0 0 0 0 0 0 ...
$ consist : num 0 1 0 0 0 0 0 0 0 0 ...
$ control : num 0 1 0 0 0 0 0 0 1 0 ...
$ differ : num 0 1 2 1 3 0 0 1 0 1 ...
$ diseas : num 0 1 0 1 3 0 0 1 0 0 ...
$ less : num 0 4 1 0 0 0 0 0 0 6 ...
$ negat : num 0 4 0 0 0 3 0 0 0 0 ...
$ observ : num 0 2 1 0 1 0 0 0 0 0 ...
$ one : num 0 1 0 0 0 0 0 0 0 0 ...
$ patient : num 0 1 9 5 5 6 8 3 2 5 ...
$ popul : num 0 8 0 0 0 0 0 0 0 0 ...
$ posit : num 0 2 0 1 0 5 0 0 0 0 ...
$ seven : num 0 1 0 0 0 0 0 0 0 0 ...
$ show : num 0 2 0 0 1 0 1 0 0 3 ...
$ studi : num 0 1 1 1 0 1 3 2 0 1 ...
$ test : num 0 2 1 1 0 0 0 0 0 0 ...
$ therapi : num 0 2 0 1 0 0 0 0 0 0 ...
$ 500 : num 0 0 1 0 2 0 0 0 0 0 ...
$ addit : num 0 0 2 0 0 0 0 0 0 0 ...
$ among : num 0 0 2 4 0 0 0 0 1 0 ...
$ arm : num 0 0 7 4 2 0 0 1 0 1 ...
$ assign : num 0 0 2 1 0 0 0 1 1 1 ...
$ associ : num 0 0 1 3 0 2 0 1 2 0 ...
$ background : num 0 0 1 1 1 0 0 1 1 0 ...
$ better : num 0 0 1 0 1 1 0 0 0 0 ...
$ cancer : num 0 0 2 3 3 3 2 2 3 0 ...
$ chang : num 0 0 1 0 0 0 0 0 0 0 ...
$ chemotherapi : num 0 0 1 2 3 5 2 0 1 0 ...
$ compar : num 0 0 1 2 0 0 0 1 1 1 ...
$ complet : num 0 0 3 0 1 1 1 2 0 0 ...
$ conclus : num 0 0 1 0 1 0 0 0 0 0 ...
$ confid : num 0 0 1 1 0 0 0 0 0 0 ...
$ continu : num 0 0 2 0 1 0 0 2 0 0 ...
$ cycl : num 0 0 6 0 2 0 1 0 0 0 ...
$ cyclophosphamid: num 0 0 1 1 1 1 3 0 0 0 ...
$ day : num 0 0 1 0 0 0 3 0 0 0 ...
$ decreas : num 0 0 1 0 0 0 0 0 1 0 ...
$ defin : num 0 0 2 0 0 0 0 0 0 0 ...
$ demonstr : num 0 0 1 0 0 0 0 0 0 0 ...
$ docetaxel : num 0 0 1 0 0 0 3 0 0 0 ...
$ doxorubicin : num 0 0 1 0 0 1 0 0 0 0 ...
$ effect : num 0 0 2 0 0 1 0 2 0 1 ...
$ efficaci : num 0 0 1 0 0 0 0 0 0 0 ...
$ enrol : num 0 0 1 0 0 0 1 0 0 1 ...
$ four : num 0 0 4 0 0 0 0 0 0 0 ...
$ hematolog : num 0 0 1 0 0 0 0 0 0 0 ...
$ initi : num 0 0 3 0 0 0 0 1 0 0 ...
$ interv : num 0 0 1 1 0 0 0 0 0 0 ...
$ least : num 0 0 2 0 0 0 0 0 0 0 ...
$ lymph : num 0 0 1 4 0 3 0 0 1 0 ...
$ method : num 0 0 1 0 1 0 0 1 1 0 ...
$ mgm2 : num 0 0 5 0 4 0 9 0 0 0 ...
$ neoadjuv : num 0 0 2 0 0 0 0 0 0 0 ...
$ node : num 0 0 1 3 0 0 0 0 1 0 ...
$ number : num 0 0 1 1 0 2 1 0 0 0 ...
$ outcom : num 0 0 2 0 0 0 0 0 0 2 ...
$ patholog : num 0 0 3 0 0 1 0 0 0 0 ...
$ per : num 0 0 1 0 0 0 0 0 0 0 ...
$ previous : num 0 0 1 0 1 0 2 0 0 0 ...
$ random : num 0 0 2 1 1 0 0 1 1 2 ...
$ rate : num 0 0 1 1 2 0 1 0 1 0 ...
$ receiv : num 0 0 2 0 1 1 3 0 0 2 ...
$ reduct : num 0 0 1 2 0 0 0 1 0 0 ...
$ regimen : num 0 0 2 0 0 0 1 0 0 0 ...
$ respond : num 0 0 2 0 0 0 0 0 0 0 ...
$ respons : num 0 0 7 0 4 2 2 0 0 0 ...
$ result : num 0 0 1 0 1 0 1 0 0 0 ...
$ similar : num 0 0 2 0 0 0 0 0 1 0 ...
$ size : num 0 0 1 3 0 1 0 0 0 0 ...
$ statist : num 0 0 1 3 0 0 0 0 0 0 ...
$ surgeri : num 0 0 1 1 0 0 0 0 0 0 ...
$ toler : num 0 0 1 0 0 0 1 0 0 0 ...
$ toxic : num 0 0 2 0 1 0 1 0 0 5 ...
$ treatment : num 0 0 3 6 14 0 0 4 1 1 ...
$ tumor : num 0 0 2 4 0 0 2 0 0 0 ...
$ two : num 0 0 3 0 1 0 0 1 0 0 ...
$ 001 : num 0 0 0 1 0 0 0 1 0 0 ...
$ adjuv : num 0 0 0 2 0 1 0 2 4 0 ...
$ age : num 0 0 0 1 0 0 0 0 0 0 ...
$ also : num 0 0 0 1 0 0 0 0 2 0 ...
$ analysi : num 0 0 0 2 0 0 0 1 0 0 ...
$ analyz : num 0 0 0 1 0 0 0 0 0 0 ...
$ axillari : num 0 0 0 1 0 0 0 0 1 0 ...
$ death : num 0 0 0 2 0 0 0 0 1 0 ...
$ dfs : num 0 0 0 3 0 0 0 0 0 0 ...
$ diseasefre : num 0 0 0 1 0 2 0 0 3 0 ...
$ drug : num 0 0 0 1 0 0 0 0 0 0 ...
$ elig : num 0 0 0 1 0 0 0 0 0 0 ...
$ endpoint : num 0 0 0 2 0 0 0 0 1 3 ...
$ epirubicin : num 0 0 0 1 1 0 0 0 0 0 ...
$ estim : num 0 0 0 1 1 0 0 0 0 0 ...
$ fluorouracil : num 0 0 0 1 1 0 0 0 0 0 ...
[list output truncated]
335
Problem 2.2 - Preparing the Corpus
What is the most likely reason why dtmAbstract has so many more terms than dtmTitle?
Abstracts tend to have many more words than titles.
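A rough way to check this explanation is to compare word counts per field. The sketch below does so on two made-up records (the text is illustrative, not taken from the dataset), splitting on whitespace as a crude tokenizer. count_words is a hypothetical helper:

```r
# Made-up titles and abstracts.
toy <- data.frame(
  title = c("Tamoxifen for early breast cancer.", "A phase III trial."),
  abstract = c(
    "BACKGROUND: We compared five years of tamoxifen with placebo in women with early breast cancer.",
    "METHODS: Patients were randomly assigned to one of two treatment arms and followed for ten years."
  ),
  stringsAsFactors = FALSE
)

# Count whitespace-separated tokens in each string.
count_words <- function(x) lengths(strsplit(x, "\\s+"))

title_words <- count_words(toy$title)
abstract_words <- count_words(toy$abstract)

print(mean(title_words))
print(mean(abstract_words))
```

On the real data, the same helper could be applied to trials$title and trials$abstract.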
Problem 2.3 - Preparing the Corpus
What is the most frequent word stem across all the abstracts? Hint: you can use colSums() to compute the frequency of a word across all the abstracts.
?colSums
colSums(dtmAbstract)
100 activ assess breast
225 509 668 3859
carcinoma case cell clinic
251 233 359 944
consist control differ diseas
200 621 1176 950
less negat observ one
351 258 700 570
patient popul posit seven
8381 162 511 108
show studi test therapi
516 1965 282 1564
500 addit among arm
169 420 365 1038
assign associ background better
435 604 397 186
cancer chang chemotherapi compar
3726 431 2344 1359
complet conclus confid continu
628 842 241 281
cycl cyclophosphamid day decreas
962 632 1245 350
defin demonstr docetaxel doxorubicin
123 251 514 486
effect efficaci enrol four
1340 591 221 369
hematolog initi interv least
117 275 349 177
lymph method mgm2 neoadjuv
249 892 1093 293
node number outcom patholog
477 296 335 254
per previous random rate
218 355 1520 1253
receiv reduct regimen respond
1908 301 807 200
respons result similar size
2051 1485 438 177
statist surgeri toler toxic
384 407 373 1065
treatment tumor two 001
2894 1122 889 162
adjuv age also analysi
1162 429 364 587
analyz axillari death dfs
124 292 215 310
diseasefre drug elig endpoint
364 332 196 213
epirubicin estim fluorouracil follow
339 139 215 675
found hazard her2 hormon
238 301 314 428
includ involv marker metastat
529 180 189 755
model multivari nodeposit oper
180 154 199 193
overal paclitaxel predict primari
962 397 369 718
prognost proport ratio receptor
242 125 344 573
reduc relaps respect risk
400 254 758 635
sampl secondari signific status
172 158 2043 538
surviv type valu week
1927 126 256 1074
women year agent benefit
1484 1335 240 551
combin detect determin durat
926 148 352 344
either evalu everi evid
532 926 487 150
firstlin howev life may
182 339 178 413
measur object partial perform
411 400 295 342
plus potenti progress prolong
622 156 622 125
qualiti score singl stabl
189 254 149 154
subgroup term time total
192 153 881 397
use vomit whether 0001
1053 174 235 249
5fluorouracil aim correl express
208 185 203 356
factor followup grade group
552 494 580 2668
high independ larg level
378 149 108 743
longer low median month
193 196 1180 1575
premenopaus progesteron randomis remain
303 114 264 158
trend tumour wherea achiev
115 320 173 245
administ advanc daili dose
322 556 412 1123
indic infus intraven neutropenia
269 237 192 234
phase prior support treat
481 305 183 893
trial aromatas caus due
1417 171 115 154
earli end estrogen increas
325 221 421 729
inhibitor lower particip point
182 236 144 202
postmenopaus receptorposit tamoxifen versus
590 152 1632 570
within alon can distant
172 472 191 149
find first higher iii
177 421 415 266
improv limit mastectomi metastas
562 127 165 352
occur postop radiotherapi recurr
312 177 244 465
site stage system advers
183 286 193 256
baselin common doubleblind event
340 191 149 409
greater mean placebo purpos
279 310 475 434
obtain oral present relat
147 422 218 351
suggest bone administr cours
274 514 218 283
given safeti schedul seen
374 265 215 199
standard without although new
305 306 191 171
accord anthracyclin base eight
182 207 124 124
histolog investig local set
127 295 300 191
analys sever three import
177 288 564 138
shown 005 avail data
117 124 108 405
incid period profil report
300 170 158 357
endocrin hundr multicent design
266 195 126 219
growth human pretreat well
208 144 175 328
examin six appear identifi
190 261 164 148
nausea provid regress five
239 155 120 173
conduct prospect develop sequenti
177 239 259 168
serum comparison frequent cmf
315 116 153 586
consid rang select possibl
131 248 124 130
inform major need function
124 122 128 188
failur confirm requir experienc
262 178 168 167
characterist methotrex progressionfre general
119 265 158 126
prevent andor mbc main
143 128 276 131
side superior start tissu
168 161 131 197
second regard enter
138 105 117
max(colSums(dtmAbstract))
[1] 8381
which.max(colSums(dtmAbstract))
patient
17
dtmAbstract$patient
[1] 0 1 9 5 5 6 8 3 2 5 2 4 2 1 1 3 0 8 6 0 3 7 7
[24] 4 2 5 2 5 0 0 3 2 3 5 12 3 5 4 7 2 0 1 2 3 6 5
[47] 1 4 7 6 2 5 6 5 5 6 9 5 1 5 10 6 3 3 6 1 4 4 0
[70] 6 9 5 9 1 11 5 0 3 5 6 3 8 8 2 6 9 3 7 4 1 6 13
[93] 3 12 0 5 5 3 0 1 8 16 5 13 0 7 2 7 0 7 3 2 5 1 6
[116] 13 4 5 4 4 4 6 5 6 5 3 1 3 3 0 5 1 3 9 8 2 3 7
[139] 5 3 4 4 11 2 9 1 1 6 8 1 2 1 1 7 0 2 4 6 16 0 3
[162] 0 3 8 3 5 2 11 5 0 0 2 8 8 1 4 7 11 6 5 6 1 7 2
[185] 14 13 4 23 7 3 6 3 3 2 3 4 11 2 4 12 3 6 4 3 6 9 4
[208] 8 2 4 2 6 1 7 6 5 5 14 5 13 0 11 1 7 6 3 5 4 3 7
[231] 0 0 7 2 4 5 7 0 4 4 16 0 8 3 0 3 7 6 6 2 2 1 5
[254] 4 7 8 1 1 5 4 2 3 16 2 1 5 7 3 4 0 6 3 2 1 5 6
[277] 6 2 9 6 2 6 5 7 9 10 9 5 2 4 5 0 0 3 3 6 4 6 3
[300] 9 6 3 3 8 5 0 7 9 4 2 7 3 5 7 4 0 2 6 8 1 4 5
[323] 2 1 2 2 4 6 5 13 2 0 10 4 4 4 9 7 2 5 3 4 4 4 4
[346] 3 9 2 5 0 9 0 4 5 2 7 5 1 4 5 5 5 4 2 0 2 9 9
[369] 1 0 6 3 7 8 0 2 6 5 2 1 2 1 8 3 6 6 7 17 3 7 6
[392] 3 8 9 1 8 5 10 0 7 4 6 8 3 8 0 9 8 3 1 0 0 7 4
[415] 8 5 4 4 5 5 4 4 0 3 3 5 1 9 3 2 3 3 8 4 9 7 0
[438] 5 1 0 10 7 6 5 7 4 5 2 13 1 6 7 3 4 5 7 0 5 4 4
[461] 2 6 4 3 2 2 5 5 5 0 1 0 8 2 0 2 0 2 8 11 0 2 3
[484] 5 7 6 4 3 6 4 7 10 12 7 11 7 2 8 6 2 4 4 5 5 3 8
[507] 3 0 0 5 4 5 1 4 0 0 6 3 3 3 6 2 3 2 3 4 11 3 5
[530] 7 2 6 6 2 4 6 1 13 5 3 7 4 3 3 4 3 8 3 5 1 5 3
[553] 0 11 5 7 2 1 0 6 4 5 5 4 3 8 4 4 1 4 2 7 9 14 10
[576] 0 6 2 1 8 0 2 5 2 2 1 5 0 10 6 5 0 1 1 4 4 4 0
[599] 0 4 8 3 3 8 7 5 7 1 2 3 3 9 5 11 8 8 6 4 3 2 0
[622] 2 6 9 1 0 12 3 2 12 9 6 3 6 0 4 5 4 1 13 3 0 2 5
[645] 3 1 8 5 9 5 5 9 5 5 7 6 5 6 6 3 0 6 7 13 7 9 5
[668] 6 7 5 2 6 1 3 0 3 4 3 3 0 10 0 7 0 6 3 9 4 7 12
[691] 0 0 7 9 0 7 6 6 3 2 5 2 0 5 5 7 4 5 5 3 6 2 7
[714] 1 2 1 6 4 11 9 2 0 8 10 2 7 2 6 1 8 0 0 4 3 9 5
[737] 2 3 1 7 4 0 0 0 6 3 13 3 5 5 0 4 2 3 6 4 5 0 1
[760] 4 9 12 4 0 10 5 9 4 3 10 6 2 3 5 3 9 3 6 1 5 7 0
[783] 4 5 3 4 1 0 1 3 10 17 0 7 5 2 0 9 0 1 5 8 0 6 9
[806] 2 0 9 10 0 3 3 4 6 5 5 1 1 12 10 4 0 5 4 6 3 7 2
[829] 1 0 0 4 0 10 5 8 9 8 1 3 2 6 1 12 2 0 5 2 9 3 4
[852] 7 5 3 10 6 6 9 0 4 7 4 1 6 0 6 5 8 3 3 0 4 9 8
[875] 2 8 3 11 7 1 5 4 7 6 8 3 1 5 6 6 0 5 20 5 4 13 2
[898] 1 0 2 1 3 4 10 4 5 5 1 9 0 1 0 0 11 3 12 4 6 7 4
[921] 0 3 3 5 6 11 0 1 4 10 2 5 8 1 0 7 6 2 5 4 0 2 1
[944] 6 5 10 3 0 0 3 12 3 4 10 0 5 3 3 9 0 1 3 17 1 8 14
[967] 6 3 3 2 8 0 6 8 1 0 11 10 10 18 0 7 9 2 5 1 7 5 5
[990] 5 6 4 3 7 8 2 10 5 0 10 5 5 0 5 3 1 4 4 3 2 6 1
[1013] 2 6 3 6 14 2 0 6 5 3 4 9 2 5 8 4 5 3 4 7 5 6 2
[1036] 3 4 7 8 11 8 6 9 4 7 5 14 7 2 12 0 5 3 6 1 4 0 1
[1059] 6 5 4 3 0 2 2 1 6 5 2 1 1 9 6 11 4 1 5 5 0 5 10
[1082] 0 5 2 2 4 5 7 1 4 5 2 10 0 0 5 0 2 3 3 6 2 8 3
[1105] 3 3 1 5 6 0 0 2 4 5 2 8 10 7 9 5 10 4 7 10 6 10 6
[1128] 3 0 0 7 6 8 4 4 4 0 8 7 7 1 6 6 10 8 4 4 5 11 7
[1151] 5 7 4 6 4 4 7 3 12 12 0 3 0 4 2 7 4 4 6 0 6 6 1
[1174] 9 5 6 11 6 15 3 2 3 6 4 5 4 1 3 0 3 0 2 6 0 1 7
[1197] 2 8 0 0 0 4 11 9 2 10 3 6 5 2 1 11 0 6 1 5 4 3 2
[1220] 2 6 7 1 11 6 10 5 4 7 10 4 0 4 4 0 2 8 7 5 4 6 4
[1243] 0 1 7 6 5 0 12 7 7 7 5 7 0 11 5 1 8 11 3 1 8 4 3
[1266] 3 7 2 8 2 4 0 7 5 0 1 0 9 7 1 7 4 7 8 3 1 6 3
[1289] 2 4 3 2 3 9 6 2 3 6 5 11 5 7 8 0 6 4 5 4 1 1 3
[1312] 5 2 0 6 3 3 0 6 6 0 7 11 0 3 3 7 5 2 0 3 5 2 0
[1335] 4 4 3 6 4 3 4 5 0 7 1 6 6 6 5 5 13 15 2 7 12 3 2
[1358] 6 5 5 5 9 7 0 5 7 4 8 2 0 2 2 7 8 2 4 0 7 9 4
[1381] 4 6 2 8 4 0 1 10 4 4 5 3 4 6 4 8 2 6 12 5 5 5 4
[1404] 5 11 4 4 7 5 6 7 2 3 6 3 3 2 5 2 4 1 2 2 7 7 4
[1427] 13 4 12 4 7 7 4 2 0 7 6 7 6 7 4 5 5 8 5 2 5 1 3
[1450] 0 10 2 0 5 3 3 11 1 0 3 1 5 0 2 0 3 2 0 6 0 3 2
[1473] 4 11 0 0 1 4 1 3 2 1 5 0 2 3 2 3 10 3 4 2 0 0 9
[1496] 4 8 4 0 2 5 5 0 0 2 0 8 3 0 9 4 3 2 2 4 1 0 6
[1519] 2 2 2 0 3 0 0 6 0 0 4 2 4 12 2 8 6 2 1 0 6 3 4
[1542] 5 0 6 1 1 2 5 1 3 4 11 9 0 6 2 7 5 3 7 5 0 2 7
[1565] 10 2 3 3 5 3 4 2 3 4 3 1 6 2 1 3 1 3 2 3 5 4 3
[1588] 0 1 0 5 1 3 9 1 6 4 1 1 7 2 0 1 7 3 0 5 2 5 2
[1611] 2 2 0 6 0 5 1 4 0 0 1 7 1 4 6 2 0 1 8 2 3 5 0
[1634] 7 7 7 4 7 7 3 5 6 1 6 4 16 5 8 7 3 1 7 3 10 4 7
[1657] 10 4 9 1 10 4 6 9 0 4 5 2 9 5 1 8 4 12 8 6 3 6 10
[1680] 2 3 1 1 13 4 2 5 7 10 7 10 5 4 3 8 4 3 6 5 2 9 6
[1703] 7 4 4 7 3 6 9 5 3 4 4 6 8 4 6 1 8 7 10 6 7 7 1
[1726] 9 6 15 10 10 9 8 6 5 8 5 6 7 4 8 9 8 8 9 10 6 5 6
[1749] 6 9 11 4 4 11 4 5 6 7 4 6 3 7 4 7 9 8 4 5 10 6 2
[1772] 0 8 5 5 6 3 7 4 0 2 4 1 0 3 6 10 3 14 3 10 0 5 0
[1795] 8 0 5 0 9 0 3 0 0 8 3 5 5 2 0 0 4 16 2 7 3 0 0
[1818] 0 7 6 0 2 5 5 0 4 5 10 9 3 5 9 4 1 4 4 2 0 3 2
[1841] 4 4 5 0 9 8 4 0 0 0 8 2 4 4 0 0 0 0 0 0
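Printing all 335 column sums, or every value of one column, makes the answer hard to spot. A more compact sketch, shown on a small illustrative data frame (toy_dtm is a hypothetical stand-in for dtmAbstract), sorts the totals and keeps only the top few:

```r
# Small stand-in for dtmAbstract: rows are documents, columns are stems.
toy_dtm <- data.frame(
  patient = c(0, 1, 9, 5),
  breast  = c(0, 2, 3, 3),
  cancer  = c(0, 0, 2, 3),
  studi   = c(0, 1, 1, 1)
)

# Total frequency of each stem across all documents.
term_totals <- colSums(toy_dtm)

# The three most frequent stems, largest first.
top_terms <- head(sort(term_totals, decreasing = TRUE), 3)
print(top_terms)

# Name of the single most frequent stem.
most_frequent <- names(which.max(term_totals))
print(most_frequent)
```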
Problem 3.1 - Building a model
We want to combine dtmTitle and dtmAbstract into a single dataframe to make predictions. However, some of the variables in these dataframes have the same names. To fix this issue, run the following code:
colnames(dtmTitle) <- paste0("T", colnames(dtmTitle))
str(dtmTitle)
'data.frame': 1860 obs. of 31 variables:
$ Tcancer : num 1 0 1 1 1 1 0 1 1 2 ...
$ Ttreatment : num 1 0 0 0 1 0 0 0 0 1 ...
$ Tbreast : num 0 0 1 1 1 1 0 1 1 1 ...
$ Tearli : num 0 0 1 1 0 0 0 1 0 0 ...
$ Tiii : num 0 0 1 0 0 0 0 0 0 1 ...
$ Tphase : num 0 0 1 1 0 0 0 0 0 1 ...
$ Trandom : num 0 0 1 1 1 0 0 0 0 1 ...
$ Ttrial : num 0 0 1 1 1 0 0 1 1 1 ...
$ Tversus : num 0 0 1 0 0 0 0 1 0 0 ...
$ Tcyclophosphamid: num 0 0 0 1 0 0 0 0 0 0 ...
$ Tchemotherapi : num 0 0 0 0 1 1 0 0 0 0 ...
$ Tcombin : num 0 0 0 0 1 0 1 0 0 0 ...
$ Teffect : num 0 0 0 0 1 0 0 1 0 1 ...
$ Tmetastat : num 0 0 0 0 1 0 0 0 0 0 ...
$ Tpatient : num 0 0 0 0 1 0 1 0 1 1 ...
$ Trespons : num 0 0 0 0 0 1 0 0 0 0 ...
$ Tadvanc : num 0 0 0 0 0 0 1 0 0 0 ...
$ Tpostmenopaus : num 0 0 0 0 0 0 0 1 1 0 ...
$ Trandomis : num 0 0 0 0 0 0 0 1 1 0 ...
$ Tstudi : num 0 0 0 0 0 0 0 1 0 0 ...
$ Ttamoxifen : num 0 0 0 0 0 0 0 2 1 0 ...
$ Twomen : num 0 0 0 0 0 0 0 1 0 0 ...
$ Tadjuv : num 0 0 0 0 0 0 0 0 1 0 ...
$ Tgroup : num 0 0 0 0 0 0 0 0 1 1 ...
$ Ttherapi : num 0 0 0 0 0 0 0 0 0 0 ...
$ Tcompar : num 0 0 0 0 0 0 0 0 0 0 ...
$ Tdoxorubicin : num 0 0 0 0 0 0 0 0 0 0 ...
$ Tdocetaxel : num 0 0 0 0 0 0 0 0 0 0 ...
$ Tresult : num 0 0 0 0 0 0 0 0 0 0 ...
$ Tplus : num 0 0 0 0 0 0 0 0 0 0 ...
$ Tclinic : num 0 0 0 0 0 0 0 0 0 0 ...
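The reason for the prefix becomes clear once the two data frames are bound together: without it, shared stems such as cancer would produce duplicate column names. A sketch on made-up two-column frames (toy_title and toy_abstract are hypothetical stand-ins), including one possible way to combine them with cbind(), which the text has not shown at this point:

```r
# Made-up stand-ins for dtmTitle and dtmAbstract.
toy_title <- data.frame(cancer = c(1, 0), trial = c(0, 1))
toy_abstract <- data.frame(cancer = c(2, 3), patient = c(9, 5))

# Stems present in both frames would clash after combining.
shared <- intersect(colnames(toy_title), colnames(toy_abstract))
print(shared)

# Prefix title columns with "T" and abstract columns with "A".
colnames(toy_title) <- paste0("T", colnames(toy_title))
colnames(toy_abstract) <- paste0("A", colnames(toy_abstract))

# Column-bind the two frames; every name is now unique.
combined <- cbind(toy_title, toy_abstract)
print(colnames(combined))
```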
colnames(dtmAbstract) <- paste0("A", colnames(dtmAbstract))
str(dtmAbstract)
'data.frame': 1860 obs. of 335 variables:
$ A100 : num 0 1 0 0 0 0 0 0 0 0 ...
$ Aactiv : num 0 2 0 1 0 0 1 0 0 0 ...
$ Aassess : num 0 2 1 2 0 1 0 0 0 3 ...
$ Abreast : num 0 2 3 3 3 4 2 2 2 3 ...
$ Acarcinoma : num 0 5 0 0 0 0 0 0 0 2 ...
$ Acase : num 0 11 0 0 1 0 0 0 0 0 ...
$ Acell : num 0 3 0 0 0 1 0 0 0 0 ...
$ Aclinic : num 0 1 0 1 0 0 0 0 0 0 ...
$ Aconsist : num 0 1 0 0 0 0 0 0 0 0 ...
$ Acontrol : num 0 1 0 0 0 0 0 0 1 0 ...
$ Adiffer : num 0 1 2 1 3 0 0 1 0 1 ...
$ Adiseas : num 0 1 0 1 3 0 0 1 0 0 ...
$ Aless : num 0 4 1 0 0 0 0 0 0 6 ...
$ Anegat : num 0 4 0 0 0 3 0 0 0 0 ...
$ Aobserv : num 0 2 1 0 1 0 0 0 0 0 ...
$ Aone : num 0 1 0 0 0 0 0 0 0 0 ...
$ Apatient : num 0 1 9 5 5 6 8 3 2 5 ...
$ Apopul : num 0 8 0 0 0 0 0 0 0 0 ...
$ Aposit : num 0 2 0 1 0 5 0 0 0 0 ...
$ Aseven : num 0 1 0 0 0 0 0 0 0 0 ...
$ Ashow : num 0 2 0 0 1 0 1 0 0 3 ...
$ Astudi : num 0 1 1 1 0 1 3 2 0 1 ...
$ Atest : num 0 2 1 1 0 0 0 0 0 0 ...
$ Atherapi : num 0 2 0 1 0 0 0 0 0 0 ...
$ A500 : num 0 0 1 0 2 0 0 0 0 0 ...
$ Aaddit : num 0 0 2 0 0 0 0 0 0 0 ...
$ Aamong : num 0 0 2 4 0 0 0 0 1 0 ...
$ Aarm : num 0 0 7 4 2 0 0 1 0 1 ...
$ Aassign : num 0 0 2 1 0 0 0 1 1 1 ...
$ Aassoci : num 0 0 1 3 0 2 0 1 2 0 ...
$ Abackground : num 0 0 1 1 1 0 0 1 1 0 ...
$ Abetter : num 0 0 1 0 1 1 0 0 0 0 ...
$ Acancer : num 0 0 2 3 3 3 2 2 3 0 ...
$ Achang : num 0 0 1 0 0 0 0 0 0 0 ...
$ Achemotherapi : num 0 0 1 2 3 5 2 0 1 0 ...
$ Acompar : num 0 0 1 2 0 0 0 1 1 1 ...
$ Acomplet : num 0 0 3 0 1 1 1 2 0 0 ...
$ Aconclus : num 0 0 1 0 1 0 0 0 0 0 ...
$ Aconfid : num 0 0 1 1 0 0 0 0 0 0 ...
$ Acontinu : num 0 0 2 0 1 0 0 2 0 0 ...
$ Acycl : num 0 0 6 0 2 0 1 0 0 0 ...
$ Acyclophosphamid: num 0 0 1 1 1 1 3 0 0 0 ...
$ Aday : num 0 0 1 0 0 0 3 0 0 0 ...
$ Adecreas : num 0 0 1 0 0 0 0 0 1 0 ...
$ Adefin : num 0 0 2 0 0 0 0 0 0 0 ...
$ Ademonstr : num 0 0 1 0 0 0 0 0 0 0 ...
$ Adocetaxel : num 0 0 1 0 0 0 3 0 0 0 ...
$ Adoxorubicin : num 0 0 1 0 0 1 0 0 0 0 ...
$ Aeffect : num 0 0 2 0 0 1 0 2 0 1 ...
$ Aefficaci : num 0 0 1 0 0 0 0 0 0 0 ...
$ Aenrol : num 0 0 1 0 0 0 1 0 0 1 ...
$ Afour : num 0 0 4 0 0 0 0 0 0 0 ...
$ Ahematolog : num 0 0 1 0 0 0 0 0 0 0 ...
$ Ainiti : num 0 0 3 0 0 0 0 1 0 0 ...
$ Ainterv : num 0 0 1 1 0 0 0 0 0 0 ...
$ Aleast : num 0 0 2 0 0 0 0 0 0 0 ...
$ Alymph : num 0 0 1 4 0 3 0 0 1 0 ...
$ Amethod : num 0 0 1 0 1 0 0 1 1 0 ...
$ Amgm2 : num 0 0 5 0 4 0 9 0 0 0 ...
$ Aneoadjuv : num 0 0 2 0 0 0 0 0 0 0 ...
$ Anode : num 0 0 1 3 0 0 0 0 1 0 ...
$ Anumber : num 0 0 1 1 0 2 1 0 0 0 ...
$ Aoutcom : num 0 0 2 0 0 0 0 0 0 2 ...
$ Apatholog : num 0 0 3 0 0 1 0 0 0 0 ...
$ Aper : num 0 0 1 0 0 0 0 0 0 0 ...
$ Aprevious : num 0 0 1 0 1 0 2 0 0 0 ...
$ Arandom : num 0 0 2 1 1 0 0 1 1 2 ...
$ Arate : num 0 0 1 1 2 0 1 0 1 0 ...
$ Areceiv : num 0 0 2 0 1 1 3 0 0 2 ...
$ Areduct : num 0 0 1 2 0 0 0 1 0 0 ...
$ Aregimen : num 0 0 2 0 0 0 1 0 0 0 ...
$ Arespond : num 0 0 2 0 0 0 0 0 0 0 ...
$ Arespons : num 0 0 7 0 4 2 2 0 0 0 ...
$ Aresult : num 0 0 1 0 1 0 1 0 0 0 ...
$ Asimilar : num 0 0 2 0 0 0 0 0 1 0 ...
$ Asize : num 0 0 1 3 0 1 0 0 0 0 ...
$ Astatist : num 0 0 1 3 0 0 0 0 0 0 ...
$ Asurgeri : num 0 0 1 1 0 0 0 0 0 0 ...
$ Atoler : num 0 0 1 0 0 0 1 0 0 0 ...
$ Atoxic : num 0 0 2 0 1 0 1 0 0 5 ...
$ Atreatment : num 0 0 3 6 14 0 0 4 1 1 ...
$ Atumor : num 0 0 2 4 0 0 2 0 0 0 ...
$ Atwo : num 0 0 3 0 1 0 0 1 0 0 ...
$ A001 : num 0 0 0 1 0 0 0 1 0 0 ...
$ Aadjuv : num 0 0 0 2 0 1 0 2 4 0 ...
$ Aage : num 0 0 0 1 0 0 0 0 0 0 ...
$ Aalso : num 0 0 0 1 0 0 0 0 2 0 ...
$ Aanalysi : num 0 0 0 2 0 0 0 1 0 0 ...
$ Aanalyz : num 0 0 0 1 0 0 0 0 0 0 ...
$ Aaxillari : num 0 0 0 1 0 0 0 0 1 0 ...
$ Adeath : num 0 0 0 2 0 0 0 0 1 0 ...
$ Adfs : num 0 0 0 3 0 0 0 0 0 0 ...
$ Adiseasefre : num 0 0 0 1 0 2 0 0 3 0 ...
$ Adrug : num 0 0 0 1 0 0 0 0 0 0 ...
$ Aelig : num 0 0 0 1 0 0 0 0 0 0 ...
$ Aendpoint : num 0 0 0 2 0 0 0 0 1 3 ...
$ Aepirubicin : num 0 0 0 1 1 0 0 0 0 0 ...
$ Aestim : num 0 0 0 1 1 0 0 0 0 0 ...
$ Afluorouracil : num 0 0 0 1 1 0 0 0 0 0 ...
[list output truncated]
What was the effect of these functions?
Adding the letter T in front of all the title variable names and adding the letter A in front of all the abstract variable names.
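To see why the prefixes matter, here is a minimal sketch on two made-up document-term dataframes (toyTitle and toyAbstract are hypothetical names and values, not part of the dataset). Terms such as "cancer" appear in both titles and abstracts, and the title term "trial" would otherwise share a name with the dependent variable added later.

```r
# Toy document-term dataframes that share the term "cancer" (made-up values).
toyTitle <- data.frame(cancer = c(1, 0), trial = c(0, 1))
toyAbstract <- data.frame(cancer = c(3, 2), patient = c(5, 1))

# Without prefixes, cbind() keeps both "cancer" columns under the same name.
clash <- cbind(toyTitle, toyAbstract)
sum(names(clash) == "cancer")

# Prefixing first keeps every column name unique.
colnames(toyTitle) <- paste0("T", colnames(toyTitle))
colnames(toyAbstract) <- paste0("A", colnames(toyAbstract))
toyDtm <- cbind(toyTitle, toyAbstract)
names(toyDtm)
```

With the prefixes in place, the title term becomes Ttrial, so adding a column named trial afterwards does not overwrite it.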
Problem 3.2 - Building a Model
Using cbind(), combine dtmTitle and dtmAbstract into a single dataframe called dtm:
dtm <- cbind(dtmTitle, dtmAbstract)
str(dtm)
'data.frame': 1860 obs. of 366 variables:
$ Tcancer : num 1 0 1 1 1 1 0 1 1 2 ...
$ Ttreatment : num 1 0 0 0 1 0 0 0 0 1 ...
$ Tbreast : num 0 0 1 1 1 1 0 1 1 1 ...
$ Tearli : num 0 0 1 1 0 0 0 1 0 0 ...
$ Tiii : num 0 0 1 0 0 0 0 0 0 1 ...
$ Tphase : num 0 0 1 1 0 0 0 0 0 1 ...
$ Trandom : num 0 0 1 1 1 0 0 0 0 1 ...
$ Ttrial : num 0 0 1 1 1 0 0 1 1 1 ...
$ Tversus : num 0 0 1 0 0 0 0 1 0 0 ...
$ Tcyclophosphamid: num 0 0 0 1 0 0 0 0 0 0 ...
$ Tchemotherapi : num 0 0 0 0 1 1 0 0 0 0 ...
$ Tcombin : num 0 0 0 0 1 0 1 0 0 0 ...
$ Teffect : num 0 0 0 0 1 0 0 1 0 1 ...
$ Tmetastat : num 0 0 0 0 1 0 0 0 0 0 ...
$ Tpatient : num 0 0 0 0 1 0 1 0 1 1 ...
$ Trespons : num 0 0 0 0 0 1 0 0 0 0 ...
$ Tadvanc : num 0 0 0 0 0 0 1 0 0 0 ...
$ Tpostmenopaus : num 0 0 0 0 0 0 0 1 1 0 ...
$ Trandomis : num 0 0 0 0 0 0 0 1 1 0 ...
$ Tstudi : num 0 0 0 0 0 0 0 1 0 0 ...
$ Ttamoxifen : num 0 0 0 0 0 0 0 2 1 0 ...
$ Twomen : num 0 0 0 0 0 0 0 1 0 0 ...
$ Tadjuv : num 0 0 0 0 0 0 0 0 1 0 ...
$ Tgroup : num 0 0 0 0 0 0 0 0 1 1 ...
$ Ttherapi : num 0 0 0 0 0 0 0 0 0 0 ...
$ Tcompar : num 0 0 0 0 0 0 0 0 0 0 ...
$ Tdoxorubicin : num 0 0 0 0 0 0 0 0 0 0 ...
$ Tdocetaxel : num 0 0 0 0 0 0 0 0 0 0 ...
$ Tresult : num 0 0 0 0 0 0 0 0 0 0 ...
$ Tplus : num 0 0 0 0 0 0 0 0 0 0 ...
$ Tclinic : num 0 0 0 0 0 0 0 0 0 0 ...
$ A100 : num 0 1 0 0 0 0 0 0 0 0 ...
$ Aactiv : num 0 2 0 1 0 0 1 0 0 0 ...
$ Aassess : num 0 2 1 2 0 1 0 0 0 3 ...
$ Abreast : num 0 2 3 3 3 4 2 2 2 3 ...
$ Acarcinoma : num 0 5 0 0 0 0 0 0 0 2 ...
$ Acase : num 0 11 0 0 1 0 0 0 0 0 ...
$ Acell : num 0 3 0 0 0 1 0 0 0 0 ...
$ Aclinic : num 0 1 0 1 0 0 0 0 0 0 ...
$ Aconsist : num 0 1 0 0 0 0 0 0 0 0 ...
$ Acontrol : num 0 1 0 0 0 0 0 0 1 0 ...
$ Adiffer : num 0 1 2 1 3 0 0 1 0 1 ...
$ Adiseas : num 0 1 0 1 3 0 0 1 0 0 ...
$ Aless : num 0 4 1 0 0 0 0 0 0 6 ...
$ Anegat : num 0 4 0 0 0 3 0 0 0 0 ...
$ Aobserv : num 0 2 1 0 1 0 0 0 0 0 ...
$ Aone : num 0 1 0 0 0 0 0 0 0 0 ...
$ Apatient : num 0 1 9 5 5 6 8 3 2 5 ...
$ Apopul : num 0 8 0 0 0 0 0 0 0 0 ...
$ Aposit : num 0 2 0 1 0 5 0 0 0 0 ...
$ Aseven : num 0 1 0 0 0 0 0 0 0 0 ...
$ Ashow : num 0 2 0 0 1 0 1 0 0 3 ...
$ Astudi : num 0 1 1 1 0 1 3 2 0 1 ...
$ Atest : num 0 2 1 1 0 0 0 0 0 0 ...
$ Atherapi : num 0 2 0 1 0 0 0 0 0 0 ...
$ A500 : num 0 0 1 0 2 0 0 0 0 0 ...
$ Aaddit : num 0 0 2 0 0 0 0 0 0 0 ...
$ Aamong : num 0 0 2 4 0 0 0 0 1 0 ...
$ Aarm : num 0 0 7 4 2 0 0 1 0 1 ...
$ Aassign : num 0 0 2 1 0 0 0 1 1 1 ...
$ Aassoci : num 0 0 1 3 0 2 0 1 2 0 ...
$ Abackground : num 0 0 1 1 1 0 0 1 1 0 ...
$ Abetter : num 0 0 1 0 1 1 0 0 0 0 ...
$ Acancer : num 0 0 2 3 3 3 2 2 3 0 ...
$ Achang : num 0 0 1 0 0 0 0 0 0 0 ...
$ Achemotherapi : num 0 0 1 2 3 5 2 0 1 0 ...
$ Acompar : num 0 0 1 2 0 0 0 1 1 1 ...
$ Acomplet : num 0 0 3 0 1 1 1 2 0 0 ...
$ Aconclus : num 0 0 1 0 1 0 0 0 0 0 ...
$ Aconfid : num 0 0 1 1 0 0 0 0 0 0 ...
$ Acontinu : num 0 0 2 0 1 0 0 2 0 0 ...
$ Acycl : num 0 0 6 0 2 0 1 0 0 0 ...
$ Acyclophosphamid: num 0 0 1 1 1 1 3 0 0 0 ...
$ Aday : num 0 0 1 0 0 0 3 0 0 0 ...
$ Adecreas : num 0 0 1 0 0 0 0 0 1 0 ...
$ Adefin : num 0 0 2 0 0 0 0 0 0 0 ...
$ Ademonstr : num 0 0 1 0 0 0 0 0 0 0 ...
$ Adocetaxel : num 0 0 1 0 0 0 3 0 0 0 ...
$ Adoxorubicin : num 0 0 1 0 0 1 0 0 0 0 ...
$ Aeffect : num 0 0 2 0 0 1 0 2 0 1 ...
$ Aefficaci : num 0 0 1 0 0 0 0 0 0 0 ...
$ Aenrol : num 0 0 1 0 0 0 1 0 0 1 ...
$ Afour : num 0 0 4 0 0 0 0 0 0 0 ...
$ Ahematolog : num 0 0 1 0 0 0 0 0 0 0 ...
$ Ainiti : num 0 0 3 0 0 0 0 1 0 0 ...
$ Ainterv : num 0 0 1 1 0 0 0 0 0 0 ...
$ Aleast : num 0 0 2 0 0 0 0 0 0 0 ...
$ Alymph : num 0 0 1 4 0 3 0 0 1 0 ...
$ Amethod : num 0 0 1 0 1 0 0 1 1 0 ...
$ Amgm2 : num 0 0 5 0 4 0 9 0 0 0 ...
$ Aneoadjuv : num 0 0 2 0 0 0 0 0 0 0 ...
$ Anode : num 0 0 1 3 0 0 0 0 1 0 ...
$ Anumber : num 0 0 1 1 0 2 1 0 0 0 ...
$ Aoutcom : num 0 0 2 0 0 0 0 0 0 2 ...
$ Apatholog : num 0 0 3 0 0 1 0 0 0 0 ...
$ Aper : num 0 0 1 0 0 0 0 0 0 0 ...
$ Aprevious : num 0 0 1 0 1 0 2 0 0 0 ...
$ Arandom : num 0 0 2 1 1 0 0 1 1 2 ...
$ Arate : num 0 0 1 1 2 0 1 0 1 0 ...
[list output truncated]
Now, add the dependent variable “trial” to dtm, copying it from the original dataframe called trials.
How many columns are in this combined dataframe?
dtm$trial <- trials$trial
str(dtm)
'data.frame': 1860 obs. of 367 variables:
$ Tcancer : num 1 0 1 1 1 1 0 1 1 2 ...
$ Ttreatment : num 1 0 0 0 1 0 0 0 0 1 ...
$ Tbreast : num 0 0 1 1 1 1 0 1 1 1 ...
$ Tearli : num 0 0 1 1 0 0 0 1 0 0 ...
$ Tiii : num 0 0 1 0 0 0 0 0 0 1 ...
$ Tphase : num 0 0 1 1 0 0 0 0 0 1 ...
$ Trandom : num 0 0 1 1 1 0 0 0 0 1 ...
$ Ttrial : num 0 0 1 1 1 0 0 1 1 1 ...
$ Tversus : num 0 0 1 0 0 0 0 1 0 0 ...
$ Tcyclophosphamid: num 0 0 0 1 0 0 0 0 0 0 ...
$ Tchemotherapi : num 0 0 0 0 1 1 0 0 0 0 ...
$ Tcombin : num 0 0 0 0 1 0 1 0 0 0 ...
$ Teffect : num 0 0 0 0 1 0 0 1 0 1 ...
$ Tmetastat : num 0 0 0 0 1 0 0 0 0 0 ...
$ Tpatient : num 0 0 0 0 1 0 1 0 1 1 ...
$ Trespons : num 0 0 0 0 0 1 0 0 0 0 ...
$ Tadvanc : num 0 0 0 0 0 0 1 0 0 0 ...
$ Tpostmenopaus : num 0 0 0 0 0 0 0 1 1 0 ...
$ Trandomis : num 0 0 0 0 0 0 0 1 1 0 ...
$ Tstudi : num 0 0 0 0 0 0 0 1 0 0 ...
$ Ttamoxifen : num 0 0 0 0 0 0 0 2 1 0 ...
$ Twomen : num 0 0 0 0 0 0 0 1 0 0 ...
$ Tadjuv : num 0 0 0 0 0 0 0 0 1 0 ...
$ Tgroup : num 0 0 0 0 0 0 0 0 1 1 ...
$ Ttherapi : num 0 0 0 0 0 0 0 0 0 0 ...
$ Tcompar : num 0 0 0 0 0 0 0 0 0 0 ...
$ Tdoxorubicin : num 0 0 0 0 0 0 0 0 0 0 ...
$ Tdocetaxel : num 0 0 0 0 0 0 0 0 0 0 ...
$ Tresult : num 0 0 0 0 0 0 0 0 0 0 ...
$ Tplus : num 0 0 0 0 0 0 0 0 0 0 ...
$ Tclinic : num 0 0 0 0 0 0 0 0 0 0 ...
$ A100 : num 0 1 0 0 0 0 0 0 0 0 ...
$ Aactiv : num 0 2 0 1 0 0 1 0 0 0 ...
$ Aassess : num 0 2 1 2 0 1 0 0 0 3 ...
$ Abreast : num 0 2 3 3 3 4 2 2 2 3 ...
$ Acarcinoma : num 0 5 0 0 0 0 0 0 0 2 ...
$ Acase : num 0 11 0 0 1 0 0 0 0 0 ...
$ Acell : num 0 3 0 0 0 1 0 0 0 0 ...
$ Aclinic : num 0 1 0 1 0 0 0 0 0 0 ...
$ Aconsist : num 0 1 0 0 0 0 0 0 0 0 ...
$ Acontrol : num 0 1 0 0 0 0 0 0 1 0 ...
$ Adiffer : num 0 1 2 1 3 0 0 1 0 1 ...
$ Adiseas : num 0 1 0 1 3 0 0 1 0 0 ...
$ Aless : num 0 4 1 0 0 0 0 0 0 6 ...
$ Anegat : num 0 4 0 0 0 3 0 0 0 0 ...
$ Aobserv : num 0 2 1 0 1 0 0 0 0 0 ...
$ Aone : num 0 1 0 0 0 0 0 0 0 0 ...
$ Apatient : num 0 1 9 5 5 6 8 3 2 5 ...
$ Apopul : num 0 8 0 0 0 0 0 0 0 0 ...
$ Aposit : num 0 2 0 1 0 5 0 0 0 0 ...
$ Aseven : num 0 1 0 0 0 0 0 0 0 0 ...
$ Ashow : num 0 2 0 0 1 0 1 0 0 3 ...
$ Astudi : num 0 1 1 1 0 1 3 2 0 1 ...
$ Atest : num 0 2 1 1 0 0 0 0 0 0 ...
$ Atherapi : num 0 2 0 1 0 0 0 0 0 0 ...
$ A500 : num 0 0 1 0 2 0 0 0 0 0 ...
$ Aaddit : num 0 0 2 0 0 0 0 0 0 0 ...
$ Aamong : num 0 0 2 4 0 0 0 0 1 0 ...
$ Aarm : num 0 0 7 4 2 0 0 1 0 1 ...
$ Aassign : num 0 0 2 1 0 0 0 1 1 1 ...
$ Aassoci : num 0 0 1 3 0 2 0 1 2 0 ...
$ Abackground : num 0 0 1 1 1 0 0 1 1 0 ...
$ Abetter : num 0 0 1 0 1 1 0 0 0 0 ...
$ Acancer : num 0 0 2 3 3 3 2 2 3 0 ...
$ Achang : num 0 0 1 0 0 0 0 0 0 0 ...
$ Achemotherapi : num 0 0 1 2 3 5 2 0 1 0 ...
$ Acompar : num 0 0 1 2 0 0 0 1 1 1 ...
$ Acomplet : num 0 0 3 0 1 1 1 2 0 0 ...
$ Aconclus : num 0 0 1 0 1 0 0 0 0 0 ...
$ Aconfid : num 0 0 1 1 0 0 0 0 0 0 ...
$ Acontinu : num 0 0 2 0 1 0 0 2 0 0 ...
$ Acycl : num 0 0 6 0 2 0 1 0 0 0 ...
$ Acyclophosphamid: num 0 0 1 1 1 1 3 0 0 0 ...
$ Aday : num 0 0 1 0 0 0 3 0 0 0 ...
$ Adecreas : num 0 0 1 0 0 0 0 0 1 0 ...
$ Adefin : num 0 0 2 0 0 0 0 0 0 0 ...
$ Ademonstr : num 0 0 1 0 0 0 0 0 0 0 ...
$ Adocetaxel : num 0 0 1 0 0 0 3 0 0 0 ...
$ Adoxorubicin : num 0 0 1 0 0 1 0 0 0 0 ...
$ Aeffect : num 0 0 2 0 0 1 0 2 0 1 ...
$ Aefficaci : num 0 0 1 0 0 0 0 0 0 0 ...
$ Aenrol : num 0 0 1 0 0 0 1 0 0 1 ...
$ Afour : num 0 0 4 0 0 0 0 0 0 0 ...
$ Ahematolog : num 0 0 1 0 0 0 0 0 0 0 ...
$ Ainiti : num 0 0 3 0 0 0 0 1 0 0 ...
$ Ainterv : num 0 0 1 1 0 0 0 0 0 0 ...
$ Aleast : num 0 0 2 0 0 0 0 0 0 0 ...
$ Alymph : num 0 0 1 4 0 3 0 0 1 0 ...
$ Amethod : num 0 0 1 0 1 0 0 1 1 0 ...
$ Amgm2 : num 0 0 5 0 4 0 9 0 0 0 ...
$ Aneoadjuv : num 0 0 2 0 0 0 0 0 0 0 ...
$ Anode : num 0 0 1 3 0 0 0 0 1 0 ...
$ Anumber : num 0 0 1 1 0 2 1 0 0 0 ...
$ Aoutcom : num 0 0 2 0 0 0 0 0 0 2 ...
$ Apatholog : num 0 0 3 0 0 1 0 0 0 0 ...
$ Aper : num 0 0 1 0 0 0 0 0 0 0 ...
$ Aprevious : num 0 0 1 0 1 0 2 0 0 0 ...
$ Arandom : num 0 0 2 1 1 0 0 1 1 2 ...
$ Arate : num 0 0 1 1 2 0 1 0 1 0 ...
[list output truncated]
367
Problem 3.3 - Building a Model
Now that we have prepared our dataframe, it’s time to split it into a training and testing set and to build regression models. Set the random seed to 144 and use the sample.split function from the caTools package to split dtm into dataframes named “train” and “test”, putting 70% of the data in the training set.
set.seed(144)
trialSplit <- sample.split(dtm$trial, 0.7)
train <- subset(dtm, trialSplit == TRUE)
test <- subset(dtm, trialSplit == FALSE)
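sample.split keeps the proportion of each outcome roughly the same in both sets. The sketch below illustrates that idea in base R on a made-up outcome vector; stratifiedSplit is a hypothetical helper written for illustration and is not how caTools implements it.

```r
# Hypothetical helper: returns a logical vector, TRUE for training rows,
# sampling the same fraction from each outcome class.
stratifiedSplit <- function(y, ratio) {
  inTrain <- logical(length(y))
  for (cls in unique(y)) {
    idx <- which(y == cls)
    nTrain <- round(ratio * length(idx))
    # Sample positions within idx, then map them back to row numbers.
    inTrain[idx[sample.int(length(idx), nTrain)]] <- TRUE
  }
  inTrain
}

set.seed(144)
toyOutcome <- rep(c(0, 1), times = c(10, 6))  # made-up labels
toySplit <- stratifiedSplit(toyOutcome, 0.7)
table(toyOutcome, toySplit)
```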
What is the accuracy of the baseline model on the training set? (Remember that the baseline model predicts the most frequent outcome in the training set for all observations.)
table(train$trial)
0 1
730 572
730 / (730 + 572)
[1] 0.5606759
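The same calculation can be wrapped in a small function so the counts do not have to be typed by hand. This is a sketch only: baselineAccuracy is a hypothetical helper, and toyTrial simply rebuilds an outcome vector with the training-set counts shown above.

```r
# Hypothetical helper: accuracy of always predicting the most frequent outcome.
baselineAccuracy <- function(outcome) {
  counts <- table(outcome)
  max(counts) / sum(counts)
}

# Outcome vector with the training-set counts reported above (730 zeros, 572 ones).
toyTrial <- rep(c(0, 1), times = c(730, 572))
baselineAccuracy(toyTrial)
```

On the real data, the equivalent call would be baselineAccuracy(train$trial).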
Problem 3.4 - Building a Model
Build a CART model called trialCART, using all the independent variables in the training set to train the model, and then plot the CART model. Just use the default parameters to build the model (don’t add a minbucket or cp value). Remember to add the method="class" argument, since this is a classification problem.
trialCART <- rpart(trial ~ ., data = train, method = "class")
What is the name of the first variable the model split on?
prp(trialCART)
Tphase
Problem 3.5 - Building a Model
Obtain the training set predictions for the model (do not yet predict on the test set). Extract the predicted probability of a result being a trial (recall that this involves not setting a type argument, and keeping only the second column of the predict output).
What is the maximum predicted probability for any result?
predTrain <- predict(trialCART)
predTrain[1:10,]
0 1
1 0.8636364 0.13636364
2 0.8636364 0.13636364
3 0.1281139 0.87188612
5 0.2176871 0.78231293
6 0.9454545 0.05454545
7 0.2176871 0.78231293
10 0.1281139 0.87188612
12 0.7125000 0.28750000
13 0.1281139 0.87188612
14 0.7125000 0.28750000
predTrainProb <- predTrain[, 2]
max(predTrainProb)
[1] 0.8718861
summary(predTrainProb)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.05455 0.13636 0.28750 0.43932 0.78231 0.87189
Problem 3.6 - Building a Model
Without running the analysis, how do you expect the maximum predicted probability to differ in the testing set?
The maximum predicted probability will likely be exactly the same in the testing set. A CART tree assigns one predicted probability to every observation that falls in a given leaf node, and there are only a few leaf nodes compared to data points, so the testing set will almost certainly contain at least one observation in the leaf with the highest probability.
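A small sketch of this behaviour, using the built-in iris data rather than the clinical-trial data (the toy variable names and the seed are arbitrary choices for illustration). Every probability predicted for a held-out row is one of the leaf probabilities already seen on the training rows.

```r
library(rpart)

# Toy binary problem from the built-in iris data (not the clinical-trial data).
toy <- iris
toy$isVirginica <- factor(toy$Species == "virginica")
toy$Species <- NULL

set.seed(1)
trainRows <- sample(nrow(toy), 105)
toyTrain <- toy[trainRows, ]
toyTest <- toy[-trainRows, ]

toyCART <- rpart(isVirginica ~ ., data = toyTrain, method = "class")

# Second column holds the predicted probability of the "TRUE" class.
trainProb <- predict(toyCART)[, 2]
testProb <- predict(toyCART, newdata = toyTest)[, 2]

# Each distinct value corresponds to a leaf of the tree.
sort(unique(trainProb))
sort(unique(testProb))
```

The test-set maximum matches the training-set maximum whenever at least one test row lands in the highest-probability leaf, which is why the answer says "likely" rather than "always".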
Problem 3.7 - Building a Model
For these questions, use a threshold probability of 0.5 to predict that an observation is a clinical trial.
table(train$trial, predTrainProb >= 0.5)
FALSE TRUE
0 631 99
1 131 441
What is the training set accuracy of the CART model?
(631 + 441) / nrow(train)
[1] 0.8233487
What is the training set sensitivity of the CART model?
441 / (441 + 131)
[1] 0.770979
What is the training set specificity of the CART model?
631 / (631 + 99)
[1] 0.8643836
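The three calculations above follow directly from the four cells of the confusion matrix. As a sketch, they can be collected in one function; confusionMetrics is a hypothetical helper, and the counts passed to it are the training-set values from the table above.

```r
# Hypothetical helper: summary metrics from the four confusion-matrix counts.
confusionMetrics <- function(tn, fp, fn, tp) {
  c(accuracy = (tn + tp) / (tn + fp + fn + tp),
    sensitivity = tp / (tp + fn),   # share of actual trials that were found
    specificity = tn / (tn + fp))   # share of non-trials correctly rejected
}

# Training-set counts from the table above.
trainMetrics <- confusionMetrics(tn = 631, fp = 99, fn = 131, tp = 441)
round(trainMetrics, 4)
```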
Problem 4.1 - Evaluating the Model on the Testing Set
Evaluate the CART model on the testing set using the predict function, and create a vector of predicted probabilities called predTest.
pred <- predict(trialCART, newdata = test)
pred[1:10,]
0 1
4 0.1281139 0.8718861
8 0.8636364 0.1363636
9 0.7125000 0.2875000
11 0.7125000 0.2875000
19 0.7125000 0.2875000
31 0.8636364 0.1363636
40 0.8636364 0.1363636
42 0.8636364 0.1363636
43 0.8636364 0.1363636
48 0.7125000 0.2875000
predTest <- pred[, 2]
What is the testing set accuracy, assuming a probability threshold of 0.5 for predicting that a result is a clinical trial?
table(test$trial, predTest >= 0.5)
FALSE TRUE
0 261 52
1 83 162
(261 + 162) / nrow(test)
[1] 0.7580645
Problem 4.2 - Evaluating the Model on the Testing Set
Using the ROCR package, what is the testing set AUC of the prediction model?
predROCR <- prediction(predTest, test$trial)
performance(predROCR, "auc")@y.values
[[1]]
[1] 0.8371063
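The AUC can be read as the probability that a randomly chosen clinical trial receives a higher predicted probability than a randomly chosen non-trial, with ties counted as one half. The sketch below computes it from ranks in base R on made-up scores; rankAUC is a hypothetical helper and is not part of ROCR.

```r
# Hypothetical helper: AUC computed from ranks (Mann-Whitney form).
# labels must be 0/1; tied scores receive average ranks.
rankAUC <- function(scores, labels) {
  r <- rank(scores)
  nPos <- sum(labels == 1)
  nNeg <- sum(labels == 0)
  (sum(r[labels == 1]) - nPos * (nPos + 1) / 2) / (nPos * nNeg)
}

# Made-up scores and labels: 3 of the 4 positive/negative pairs are ordered correctly.
toyScores <- c(0.1, 0.4, 0.35, 0.8)
toyLabels <- c(0, 0, 1, 1)
rankAUC(toyScores, toyLabels)  # 0.75
```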
Problem 5.1 - Decision-Maker Tradeoffs
What is the cost associated with the model in Step 1 making a false negative prediction?
A paper that should have been included in Set A will be missed, affecting the quality of the results of Step 3.
Problem 5.2 - Decision-Maker Tradeoffs
What is the cost associated with the model in Step 1 making a false positive prediction?
A paper will be mistakenly added to Set A, yielding additional work in Step 2 of the process but not affecting the quality of the results of Step 3.
Problem 5.3 - Decision-Maker Tradeoffs
Given the costs associated with false positives and false negatives, which of the following is most accurate?
A false negative is more costly than a false positive; the decision maker should use a probability threshold less than 0.5 for the machine learning model.
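A minimal sketch of the tradeoff, using made-up probabilities and labels rather than the model's actual predictions (errorCounts, toyProb, and toyActual are hypothetical names). Lowering the threshold turns a missed trial into a correct prediction at the price of one extra paper to review.

```r
# Hypothetical helper: error counts at a given probability threshold.
errorCounts <- function(prob, actual, threshold) {
  predicted <- prob >= threshold
  c(falseNegatives = sum(!predicted & actual == 1),
    falsePositives = sum(predicted & actual == 0))
}

# Toy probabilities and labels (made up for illustration).
toyProb <- c(0.05, 0.14, 0.29, 0.29, 0.78, 0.87)
toyActual <- c(0, 0, 1, 0, 1, 1)

errorCounts(toyProb, toyActual, 0.5)  # 1 false negative, 0 false positives
errorCounts(toyProb, toyActual, 0.2)  # 0 false negatives, 1 false positive
```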