Download and Analyze PubMed Articles in R (Example)

In this article, we will use R to:

  1. Get PMID numbers of relevant articles
  2. Download the articles from PubMed
  3. Extract abstracts from the downloaded articles
  4. Plot the most common words in these abstracts

1. Get PMID numbers of relevant articles

Let’s say we are interested in analyzing articles about lung cancer, published in PLOS Medicine, in the year 2022.

Here’s the PubMed link that contains PMID (PubMed ID) numbers for these articles:

PubMed link containing PMID numbers of the articles in which we are interested:

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=plos+med[journal]+AND+lung+cancer+AND+2022[pdat]

If you visit this URL, you see an XML file that contains the following (among other things):

<IdList>
  <Id>36264841</Id>
  <Id>35727802</Id>
  <Id>35613184</Id>
  <Id>35452448</Id>
  <Id>35113855</Id>
  <Id>35025879</Id>
</IdList>

These are the PMID numbers that we need to download the articles.

So we are going to write an R script that automatically extracts these numbers:

# link to the PubMed page that contains the PMIDs of our articles
path <- 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=plos+med[journal]+AND+lung+cancer+AND+2022[pdat]'

# download the page
download.file(path, 'page.xml', mode="wb")


# read the downloaded file and extract PMIDs
library(XML)

xml_page <- xmlParse('page.xml')
xml_root <- xmlRoot(xml_page)
xml_values <- xmlSApply(xml_root, function(x) xmlSApply(x, xmlValue))
PMIDs <- xml_values$IdList
PMIDs

#        Id         Id         Id         Id         Id         Id 
#"36264841" "35727802" "35613184" "35452448" "35113855" "35025879" 

2. Download the articles from PubMed

Next, we will use the rentrez package to download the articles using their PMID numbers:


# download articles from PubMed
library(rentrez)

fetch_pubmed <- entrez_fetch(db = "pubmed",
                             id = PMIDs,
                             rettype = "xml",
                             parsed = TRUE)
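If you only need plain-text abstracts and not the full XML records, entrez_fetch can also return them directly via rettype = "abstract". A sketch (the two PMIDs here are the first two extracted in step 1):

```r
library(rentrez)

# fetch human-readable abstracts instead of parsed XML
abstract_text <- entrez_fetch(db = "pubmed",
                              id = c("36264841", "35727802"),
                              rettype = "abstract")
cat(abstract_text)
```

We use rettype = "xml" with parsed = TRUE in this article because the XML form lets us pull out specific fields programmatically.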

3. Extract abstracts from the downloaded articles

# extract abstracts
abstracts <- xpathApply(fetch_pubmed,
                        '//PubmedArticle//Article',
                        function(x) xmlValue(xmlChildren(x)$Abstract))

abstracts <-, abstracts))
names(abstracts) <- 'text'

#  text
#1 Longer time intervals to diagnosis and treatment ar...
#2 Real-world evaluation of the safety profile of vacc...
#3 Direct oral anticoagulants (DOACs) have comparable ...
#4 Taller adult height is associated with lower risks ...
#5 Epidemiological studies have reported conflicting f...
#6 Evidence suggests that chronic obstructive pulmonar...
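Equivalently, since each abstract's text lives inside AbstractText nodes, xpathSApply from the XML package can pull the values in a single call. A sketch (note that structured abstracts split into multiple sections will yield one element per section rather than one per article):

```r
library(XML)

# extract the text of every AbstractText node from the parsed records
# (fetch_pubmed is the parsed XML object from step 2)
abstract_values <- xpathSApply(fetch_pubmed, "//AbstractText", xmlValue)
```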

Now that we have a data frame that contains the abstracts, let’s analyze them.

4. Plot the most common words in these abstracts

We will start by looking at the most common words in these abstracts:


# word frequency in abstracts
library(dplyr)
library(tidytext)

word_count <- abstracts |> 
  unnest_tokens(word, text) |> 
  count(word, sort = TRUE)

word_count |> head(10)
#     word   n
#1      of 106
#2     and  91
#3     the  86
#4      to  61
#5      in  50
#6    with  49
#7    were  39
#8     for  34
#9  cancer  31
#10      a  26

We notice that most of these are “stop words” (such as: of, and, the, etc.), so we need to remove them:

# removing stop words and keeping the 10 most frequent words
top10_words <- word_count |> 
  filter(!word %in% stop_words$word) |> 
  head(10)
# plotting the top 10 words in the abstracts
library(ggplot2)

ggplot(data = top10_words,
       aes(y = reorder(word, n),
           x = n)) +
  geom_col()

[Bar plot of the top 10 words in the abstracts of our PubMed articles]

It is no surprise that the most common word in our abstracts is “cancer”, but the words “risk” and “stroke” are interesting since lung cancer increases the risk of stroke.

Next we have “95” and “ci” (which come from 95% confidence intervals). These are neither real words nor meaningful on their own, so they could be removed by telling R to filter out numbers and acronyms from our data. But I will leave them in for now.
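For instance, one quick way to drop purely numeric tokens (and specific abbreviations like “ci”) before plotting would be a filter like the one below. The regular expression and the toy data are my own sketch, not part of the original analysis:

```r
library(dplyr)

# toy word counts mimicking the structure of word_count above
word_count <- data.frame(word = c("cancer", "95", "ci", "risk", "0.5"),
                         n    = c(31, 20, 18, 15, 5))

# keep tokens that are not purely numeric and not in a list of known abbreviations
cleaned <- word_count |>
  filter(!grepl("^[0-9.]+$", word),   # drop numbers like "95" or "0.5"
         !word %in% c("ci"))          # drop specific abbreviations

cleaned$word
# → "cancer" "risk"
```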

Further reading

For more information about the PubMed API and its usage guidelines, see the NCBI E-utilities documentation.