Extract Multiple Occurrences of a Pattern in a String in R

Our goal is to extract all p-values reported in these abstracts:

library(tidyverse)

df <- tibble(
  abstracts = c("a p-value of 0.05",
                "we found p = 0.023.",
                "the p-value and the 95% CI",
                "0.3 (p > 0.05) and 2.1 (p<0.01)",
                "r = 0.8 indicates high correlation",
                "this part is about something else"))

Notice that most reported p-values follow this 3-part pattern:

  1. The mention of one of the following: p-value, p value, or p
  2. The sign: equals, =, <, >, or something similar (e.g. of)
  3. The value: a zero, a decimal point, then one or more digits (e.g. 0.1 or 0.05)

In regular expressions, this pattern can be written as:

pattern <- "(p.value|p).{1,7}0\\.\\d+"
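
As a rough sketch of how the three parts map onto the regex, the pattern can be tried on a few strings before running it on the whole column. The examples vector below is made up for illustration and is not part of the abstracts:

```r
library(stringr)

# Same pattern as above, read left to right:
#   (p.value|p)  "p", optionally followed by any one character and "value"
#   .{1,7}       1 to 7 arbitrary characters (the sign and surrounding spaces)
#   0\\.\\d+     a zero, a literal dot, then one or more digits
pattern <- "(p.value|p).{1,7}0\\.\\d+"

# Hypothetical test strings, not taken from the abstracts
examples <- c("p value = 0.04", "p<0.001", "mean = 0.5", "p equals 0.2")

matches <- str_extract(examples, pattern)
matches
# "p value = 0.04" "p<0.001" NA NA
```

The last string returns NA because " equals " is eight characters long, one more than the .{1,7} gap allows. Widening the quantifier would catch it, at the cost of more false positives.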

Then we can call str_extract_all() to extract every substring that matches this pattern:

df <- df |>
  mutate(stat_sig = str_extract_all(abstracts, pattern))

df
## A tibble: 6 x 2
#  abstracts                          stat_sig 
#  <chr>                              <list>   
#1 a p-value of 0.05                  <chr [1]>
#2 we found p = 0.023.                <chr [1]>
#3 the p-value and the 95% CI         <chr [0]>
#4 0.3 (p > 0.05) and 2.1 (p<0.01)    <chr [2]>
#5 r = 0.8 indicates high correlation <chr [0]>
#6 this part is about something else  <chr [0]>
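
The tibble printout only shows the length of each list element. To look inside, one option (sketched here on a plain character vector holding three of the abstracts) is to call str_extract_all() directly and inspect the result with lengths() and [[ ]]:

```r
library(stringr)

pattern <- "(p.value|p).{1,7}0\\.\\d+"
abstracts <- c("we found p = 0.023.",
               "0.3 (p > 0.05) and 2.1 (p<0.01)",
               "this part is about something else")

# str_extract_all() returns a list with one character vector per input string
found <- str_extract_all(abstracts, pattern)

# Number of matches per abstract
n_matches <- lengths(found)
n_matches
# 1 2 0

# The matches from the second abstract
found[[2]]
# "p > 0.05" "p<0.01"
```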

The new column is a list-column, which we can flatten with unnest(). Abstracts with no match are dropped in the process:

df <- unnest(df, stat_sig)

df
## A tibble: 4 x 2
#  abstracts                       stat_sig       
#  <chr>                           <chr>          
#1 a p-value of 0.05               p-value of 0.05
#2 we found p = 0.023.             p = 0.023      
#3 0.3 (p > 0.05) and 2.1 (p<0.01) p > 0.05       
#4 0.3 (p > 0.05) and 2.1 (p<0.01) p<0.01   
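
If the abstracts without a p-value should stay in the data, unnest() has a keep_empty argument for that. A minimal sketch on a two-row tibble (the name small is only for illustration):

```r
library(tibble)
library(tidyr)
library(stringr)

pattern <- "(p.value|p).{1,7}0\\.\\d+"

small <- tibble(abstracts = c("we found p = 0.023.",
                              "the p-value and the 95% CI"))
small$stat_sig <- str_extract_all(small$abstracts, pattern)

# keep_empty = TRUE keeps abstracts with zero matches and fills stat_sig with NA
kept <- unnest(small, stat_sig, keep_empty = TRUE)
kept
```

Both rows survive here, and the second one carries NA in stat_sig instead of being dropped.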

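The next step is to extract the sign and the p-value from the stat_sig column using str_extract(). The character class [<>] captures only inequality signs, so values reported with = or with a word get NA in the sign column: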

df <- df |> 
  mutate(sign = str_extract(stat_sig, "[<>]"),
         p_value = str_extract(stat_sig, "0\\.\\d+"),
         p_value = as.numeric(p_value)) |> 
  select(-stat_sig)

df
## A tibble: 4 x 3
#  abstracts                       sign  p_value
#  <chr>                           <chr>   <dbl>
#1 a p-value of 0.05               NA      0.05 
#2 we found p = 0.023.             NA      0.023
#3 0.3 (p > 0.05) and 2.1 (p<0.01) >       0.05 
#4 0.3 (p > 0.05) and 2.1 (p<0.01) <       0.01 
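
An alternative, shown only as a sketch, is to pull the sign and the value in one pass with str_match() and capture groups. The regex below is an assumption of this example rather than part of the workflow above. It also treats = as a sign:

```r
library(stringr)

stat_sig <- c("p-value of 0.05", "p = 0.023", "p > 0.05", "p<0.01")

# Group 1: an optional comparison symbol; group 2: the numeric value
parts <- str_match(stat_sig, "([=<>])?\\s*(0\\.\\d+)")

sign <- parts[, 2]                  # NA when no symbol precedes the value
p_value <- as.numeric(parts[, 3])
data.frame(stat_sig, sign, p_value)
```

str_match() returns a character matrix whose first column is the full match and whose remaining columns are the capture groups, so the sign and the value come from columns 2 and 3.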

Further reading