Extract p-values from Text in R: Using separate_wider_regex

Our goal is to extract p-values from article abstracts and create 2 new variables: sign and p-value.

We will break the problem into 2 simple steps:

  • Step 1: determine which abstracts report a p-value
  • Step 2: extract these p-values

But first, let’s load the tidyverse and create some example data:

library(tidyverse)

df <- tibble(
  abstracts = c("a p-value of 0.05",
                "we found p = 0.023.",
                "the p-value and the 95% CI",
                "x affected y (p<0.05)",
                "associated with a p value > 0.1",
                "r = 0.8 indicates high correlation",
                "this text is about something else"))

Step 1: determine which abstracts report a p-value

Here, we can use the function str_detect(), which takes a string and a pattern and returns TRUE if the pattern is found in the string and FALSE otherwise:

df |> 
  mutate(
    p_value = str_detect(string=abstracts,
                         pattern="(?:p.value)|(?:\\Wp\\W)")
    )

## A tibble: 7 x 2
#  abstracts                          p_value
#  <chr>                              <lgl>            
#1 a p-value of 0.05                  TRUE             
#2 we found p = 0.023.                TRUE             
#3 the p-value and the 95% CI         TRUE             
#4 x affected y (p<0.05)              TRUE             
#5 associated with a p value > 0.1    TRUE             
#6 r = 0.8 indicates high correlation FALSE            
#7 this text is about something else  FALSE 

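The pattern is a regular expression made of two alternatives separated by |:

  • (?:p.value): the letter p, then any single character, then the word value. This covers both p-value and p value.
  • (?:\\Wp\\W): the letter p with a non-word character (a space, a parenthesis, an operator, etc.) on each side. This covers a standalone p, as in p = 0.023 or (p<0.05), but not a p that is part of a longer word.

The (?: ) parentheses only group each alternative; they do not capture anything.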

If you are new to regular expressions, I recommend the chapter dedicated to them in the free e-book R for Data Science (2nd edition).
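A quick way to get a feel for the pattern is to try it on single strings. The sketch below is only illustrative: the name detect_pattern and the test strings other than the first are made up, and the \\b variant at the end is an alternative I am suggesting, not part of the code above.

```r
library(stringr)

detect_pattern <- "(?:p.value)|(?:\\Wp\\W)"

# TRUE: " p " is a standalone p with a non-word character on each side
str_detect("we found p = 0.023.", detect_pattern)

# FALSE: the only p sits inside the word "part"
str_detect("this part is about something else", detect_pattern)

# FALSE: \\W needs an actual character, and nothing precedes a p at the start
str_detect("p = 0.03 was observed", detect_pattern)

# Hypothetical variant: \\b is a word boundary, which also matches at the
# start or end of the string
str_detect("p = 0.03 was observed", "(?:p.value)|(?:\\bp\\b)")
```

The third call shows a limitation worth knowing about: an abstract that begins with "p = ..." is missed by \\Wp\\W, while the word-boundary version catches it.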

Step 2: extract p-values

The first thing to notice is that most reported p-values follow this 3-part pattern:

  1. The mention of one of the following: p-value, p value, or p
  2. The sign: equals, =, <, >, or something similar (e.g. of)
  3. The value, which is: a number, a decimal point, then another number or numbers (e.g. 0.1 or 0.05)
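To make the 3 parts concrete, here is a minimal sketch that writes them as three capture groups and applies them with str_match(). The regular expression three_parts and the column labels are my own illustration (not the pattern used later in this post), and the lazy .*? in the middle is just one simple way to grab whatever sits between the mention and the value.

```r
library(stringr)

# One capture group per part: mention, sign, value
three_parts <- "(p.value|p)(.*?)(0\\.\\d+)"

parts <- str_match(
  c("we found p = 0.023.",
    "associated with a p value > 0.1",
    "the p-value and the 95% CI"),
  three_parts
)
colnames(parts) <- c("match", "mention", "sign", "value")

parts
```

The first two rows split into a mention, a sign, and a value. The third row is all NA because the text mentions a p-value but never reports a number of the form 0.xxx.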

In order to extract the sign and p-value from abstracts, we will use the function separate_wider_regex() which takes 2 important arguments:

  • cols: the column(s) that we want to separate into new columns. In our case, it is abstracts.
  • patterns: a named character vector where the names become column names and the values are regular expressions that, taken together in order, should match the entire text of each abstract. Unnamed components are still matched, but they are not included in the output as new variables. In our case, we will only name the sign and the p-value patterns.

df <- df |> 
  separate_wider_regex(cols = abstracts,
                       patterns = c(".*",
                                    "(?:p.value)|p",
                                    sign = ".?[<>]?.{0,7}",
                                    p_value = "0\\.\\d+",
                                    ".*"),
                       too_few = "align_start")

df
## A tibble: 7 x 2
#  sign       p_value
#  <chr>      <chr>  
#1 " of "     0.05   
#2 " = "      0.023  
#3 " and the" NA     
#4 "<"        0.05   
#5 " > "      0.1    
#6 ""         NA     
#7 "art is a" NA  

Here’s an explanation of the patterns used (keep in mind that, when grouped, they should match all the text in abstracts):

  • “.*”: match zero or more of any character (whatever comes before the mention of the p-value)
  • “(?:p.value)|p”: followed by p and value separated by any single character, or simply by the letter p
  • sign = “.?[<>]?.{0,7}”: followed by at most 1 character of any kind, then < or > if present, then up to 7 more characters (and save this pattern in a variable called sign)
  • p_value = “0\\.\\d+”: followed by a zero, then a decimal point, then one or more digits (and save this pattern in a variable called p_value)
  • “.*”: followed by zero or more of any character
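To see how the five pieces fit together, here is a minimal sketch that glues them into a single regular expression by hand and applies it with str_match(). It assumes that separate_wider_regex() joins the patterns in order and anchors them to the start and end of the text, which is what "should match all the text" implies. The names full_regex, before, mention, and after are made up for illustration, and every piece is wrapped in a capturing group here only so that each part can be inspected.

```r
library(stringr)

patterns <- c(".*",
              "(?:p.value)|p",
              sign = ".?[<>]?.{0,7}",
              p_value = "0\\.\\d+",
              ".*")

# Wrap each piece in parentheses, join them in order, and anchor both ends
full_regex <- str_c("^", str_c("(", patterns, ")", collapse = ""), "$")

m <- str_match("a p-value of 0.05", full_regex)
colnames(m) <- c("whole", "before", "mention", "sign", "p_value", "after")

m
```

Here the sign column holds " of " and p_value holds "0.05", which agrees with the first row of the output above. Trying the same call on a text without a number of the form 0.xxx returns NA everywhere, which is why too_few = "align_start" is needed when the whole column is processed.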

Notice that the column sign in the output above contains a lot of text that is not useful. Since we only want to know the sign before the p-value (< or >), we can use str_extract() to clean this column further:

df <- df |> 
  mutate(sign = str_extract(sign, '[<>]'),
         p_value = as.numeric(p_value))

df
## A tibble: 7 x 2
#  sign  p_value
#  <chr>   <dbl>
#1 NA      0.05 
#2 NA      0.023
#3 NA     NA    
#4 <       0.05 
#5 >       0.1  
#6 NA     NA    
#7 NA     NA  
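Note that '[<>]' drops the equals sign, so p = 0.023 ends up with a missing sign. If we wanted to keep it, one possible tweak (a suggestion of mine, not part of the code above) is to add = to the character class. The vector raw_sign below is a hypothetical stand-in for the uncleaned sign column.

```r
library(stringr)

# Stand-in for the raw sign column before cleaning
raw_sign <- c(" of ", " = ", " and the", "<", " > ")

# str_extract() returns the first match in each string, or NA if there is none
clean_sign <- str_extract(raw_sign, "[<>=]")

clean_sign
```

With this version, " = " becomes "=", while " of " and " and the" still become NA.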

Of course, the real world will be more complex than this. For instance:

  • Multiple p-values may be reported in the same abstract. The code above extracts just one. (Here’s a tutorial on extracting multiple occurrences of a pattern).
  • Sometimes the results are reported as “statistically significant” without mentioning the words “p-value” or “p”.
  • In some cases, p = 0.5 refers to a quantity p other than a p-value.
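For the first point, here is a rough sketch of one way to pull out every p-value in a text with str_match_all(). The example abstract and the regular expression are made up for illustration, and the pattern is deliberately stricter than the one used above: it requires an explicit =, <, or > between the mention and the value.

```r
library(stringr)

abstract <- "group A improved (p = 0.03) but group B did not (p > 0.2)"

# \\bp keeps the p inside words such as "group" from matching.
# Group 1 captures the sign, group 2 the value (the leading zero is optional).
hits <- str_match_all(
  abstract,
  "\\bp(?:.value)?\\s*([=<>])\\s*(0?\\.\\d+)"
)[[1]]
colnames(hits) <- c("match", "sign", "p_value")

hits
```

str_match_all() returns a list with one matrix per input string, so [[1]] picks the matrix for our single abstract. It has one row per p-value found, here one for p = 0.03 and one for p > 0.2.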

Further reading