Download and Analyze PubMed Articles in R (Example)

In this article, we will use R to:

  1. Get PMID numbers of relevant articles
  2. Download the articles from PubMed
  3. Extract abstracts from the downloaded articles
  4. Plot the most common words in these abstracts

1. Get PMID numbers of relevant articles

Let’s say we are interested in analyzing articles about lung cancer, published in PLOS Medicine, in the year 2022.

Here’s the PubMed link that contains PMID (PubMed ID) numbers for these articles:

PubMed link containing PMID numbers of the articles in which we are interested:

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=plos+med[journal]+AND+lung+cancer+AND+2022[pdat]

If you visit this URL, you see an XML file that contains the following (among other things):

<IdList>
  <Id>36264841</Id>
  <Id>35727802</Id>
  <Id>35613184</Id>
  <Id>35452448</Id>
  <Id>35113855</Id>
  <Id>35025879</Id>
</IdList>

These are the PMID numbers that we need to download the articles.

So we are going to write an R script that automatically extracts these numbers:

# link to the PubMed page that contains the PMIDs of our articles
path <- 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=plos+med[journal]+AND+lung+cancer+AND+2022[pdat]'

# download the page
download.file(path, 'page.xml', mode="wb")


# read the downloaded file and extract PMIDs
library(XML)

xml_page <- xmlParse('page.xml')
xml_root <- xmlRoot(xml_page)
xml_values <- xmlSApply(xml_root, function(x) xmlSApply(x, xmlValue))
PMIDs <- xml_values$IdList
PMIDs

#        Id         Id         Id         Id         Id         Id 
#"36264841" "35727802" "35613184" "35452448" "35113855" "35025879" 

2. Download the articles from PubMed

Next, we will use the rentrez package to download the articles using their PMID numbers:


# download articles from PubMed
library(rentrez)

fetch_pubmed <- entrez_fetch(db = "pubmed",
                             id = PMIDs,
                             rettype = "xml",
                             parsed = TRUE)
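If you only need plain-text abstracts and not the full XML records, entrez_fetch can also return them directly via rettype = "abstract". A sketch (the two PMIDs here are the first two extracted in step 1):

```r
library(rentrez)

# fetch human-readable abstracts instead of parsed XML
abstract_text <- entrez_fetch(db = "pubmed",
                              id = c("36264841", "35727802"),
                              rettype = "abstract")
cat(abstract_text)
```

We use rettype = "xml" with parsed = TRUE in this article because the XML form lets us pull out specific fields programmatically.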

3. Extract abstracts from the downloaded articles

# extract abstracts
abstracts <- xpathApply(fetch_pubmed,
                        '//PubmedArticle//Article',
                        function(x) xmlValue(xmlChildren(x)$Abstract))

abstracts <-, abstracts))
names(abstracts) <- 'text'

#  text
#1 Longer time intervals to diagnosis and treatment ar...
#2 Real-world evaluation of the safety profile of vacc...
#3 Direct oral anticoagulants (DOACs) have comparable ...
#4 Taller adult height is associated with lower risks ...
#5 Epidemiological studies have reported conflicting f...
#6 Evidence suggests that chronic obstructive pulmonar...
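Equivalently, since each abstract's text lives inside AbstractText nodes, xpathSApply from the XML package can pull the values in a single call. A sketch (note that structured abstracts split into multiple sections will yield one element per section rather than one per article):

```r
library(XML)

# extract the text of every AbstractText node from the parsed records
# (fetch_pubmed is the parsed XML object from step 2)
abstract_values <- xpathSApply(fetch_pubmed, "//AbstractText", xmlValue)
```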

Now that we have a data frame that contains the abstracts, let’s analyze them.

4. Plot the most common words in these abstracts

We will start by looking at the most common words in these abstracts:


# word frequency in abstracts
library(dplyr)
library(tidytext)

word_count <- abstracts |> 
  unnest_tokens(word, text) |> 
  count(word, sort = TRUE)

word_count |> head(10)
#     word   n
#1      of 106
#2     and  91
#3     the  86
#4      to  61
#5      in  50
#6    with  49
#7    were  39
#8     for  34
#9  cancer  31
#10      a  26

We notice that most of these are “stop words” (such as: of, and, the, etc.), so we need to remove them:

# removing stop words and keeping the 10 most frequent words
top10_words <- word_count |> 
  filter(!word %in% stop_words$word) |> 
  head(10)
# plotting the top 10 words in the abstracts
library(ggplot2)

ggplot(data = top10_words,
       aes(y = reorder(word, n),
           x = n)) +
  geom_col()

[Bar plot of the top 10 words in the abstracts of our PubMed articles]

It is no surprise that the most common word in our abstracts is “cancer”, but the words “risk” and “stroke” are interesting since lung cancer increases the risk of stroke.

Next we have “95” and “ci” (which come from 95% confidence intervals). These are neither real words nor meaningful on their own, so they could be removed by telling R to filter out numbers and acronyms from our data. But I will leave them in for now.
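For instance, one quick way to drop purely numeric tokens (and specific abbreviations like “ci”) before plotting would be a filter like the one below. The regular expression and the toy data are my own sketch, not part of the original analysis:

```r
library(dplyr)

# toy word counts mimicking the structure of word_count above
word_count <- data.frame(word = c("cancer", "95", "ci", "risk", "0.5"),
                         n    = c(31, 20, 18, 15, 5))

# keep tokens that are not purely numeric and not in a list of known abbreviations
cleaned <- word_count |>
  filter(!grepl("^[0-9.]+$", word),   # drop numbers like "95" or "0.5"
         !word %in% c("ci"))          # drop specific abbreviations

cleaned$word
# → "cancer" "risk"
```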

Further reading

For more information about the PubMed API and its usage guidelines, see the NCBI E-utilities documentation.