Our goal is to extract p-values from article *abstracts* to create 2 new variables: *sign* and *p-value*.

We will break the problem into 2 simple steps:

- Step 1: determine which abstracts report a p-value
- Step 2: extract these p-values

But first, let’s load the data:

    library(tidyverse)

    df <- tibble(
      abstracts = c(
        "a p-value of 0.05",
        "we found p = 0.023.",
        "the p-value and the 95% CI",
        "x affected y (p<0.05)",
        "associated with a p value > 0.1",
        "r = 0.8 indicates high correlation",
        "this text is about something else"
      )
    )

## Step 1: determine which abstracts report a p-value

Here, we can use the function `str_detect()`, which takes a *string* and a *pattern* and returns TRUE if the pattern is found in the string and FALSE otherwise:

    df |>
      mutate(
        p_value = str_detect(string = abstracts,
                             pattern = "(?:p.value)|(?:\\Wp\\W)")
      )

    ## A tibble: 7 x 2
    #  abstracts                          p_value
    #  <chr>                              <lgl>
    #1 a p-value of 0.05                  TRUE
    #2 we found p = 0.023.                TRUE
    #3 the p-value and the 95% CI         TRUE
    #4 x affected y (p<0.05)              TRUE
    #5 associated with a p value > 0.1    TRUE
    #6 r = 0.8 indicates high correlation FALSE
    #7 this text is about something else  FALSE

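The pattern is a regular expression made of two alternatives separated by `|`:

- `(?:p.value)`: the letter p, then any single character (such as a hyphen or a space), then the word value
- `(?:\\Wp\\W)`: the letter p with a non-word character (such as a space, a parenthesis, or a symbol) on each side

The `(?:...)` wrappers are non-capturing groups; they only keep the two alternatives apart.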
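To see what each half of the pattern does on its own, here is a minimal sketch. The test strings in `x` are made up for illustration and are not part of the data above.

```r
library(stringr)

# Hypothetical test strings, not taken from the data above
x <- c("p-value", "p value", "(p<0.05)", "sample", "p = 0.01")

# First alternative: "p", any single character, then "value"
has_p_value <- str_detect(x, "p.value")
has_p_value

# Second alternative: a lone "p" with a non-word character on each side
has_lone_p <- str_detect(x, "\\Wp\\W")
has_lone_p
```

Note that a `p` at the very start of a string (as in the last test string) has no character before it, so `\\Wp\\W` does not match it.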

If you are new to the subject of regular expressions, I recommend the dedicated chapter in the free e-book *R for Data Science* (2nd edition).

## Step 2: extract p-values

The first thing to notice is that most reported p-values follow this 3-part pattern:

- The mention of one of the following: p-value, p value, or p
- The sign: equals, =, <, >, or something similar (e.g. of)
- The value, which is: a number, a decimal point, then another number or numbers (e.g. 0.1 or 0.05)
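As a rough illustration of this 3-part structure, the sketch below writes each part as its own capture group and inspects the pieces with `str_match()`. The pattern `three_part` and the test strings are assumptions made for this example; it is a stricter pattern than the one used later in this post.

```r
library(stringr)

# Hypothetical test strings
x <- c("we found p = 0.023.", "x affected y (p<0.05)", "no result here")

# One capture group per part: the mention, the sign, the value
three_part <- "(p.value|p)\\s*(=|<|>|of)\\s*(0\\.\\d+)"

# Character matrix: full match in column 1, then one column per group
parts <- str_match(x, three_part)
parts
```

Rows with no match are filled with NA, which makes it easy to spot abstracts that do not follow the pattern.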

In order to extract the *sign* and *p-value* from *abstracts*, we will use the function `separate_wider_regex()`, which takes 2 important arguments:

- *cols*: the column(s) that we want to separate into new columns. In our case, it is *abstracts*.
- *patterns*: a named character vector where the names become column names and the values are regular expressions that, when grouped, should match all the text in *abstracts*. We can leave some components of the vector unnamed, so they will match but will not be included in the output (as new variables). In our case, we will only name the *sign* and the *p-value* patterns.
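Before applying this to the abstracts, a toy example may help show how named and unnamed patterns behave. The tibble `toy` and its `id` column are invented for this sketch; it assumes tidyr 1.3.0 or later (where `separate_wider_regex()` is available) and R 4.1 or later for the `|>` pipe.

```r
library(tidyr)
library(tibble)

# Hypothetical data: a letter, a hyphen, then a number
toy <- tibble(id = c("x-1", "y-22"))

toy_wide <- toy |>
  separate_wider_regex(
    cols = id,
    # Named patterns become columns; the unnamed "-" is matched and dropped
    patterns = c(letter = "[a-z]", "-", number = "\\d+")
  )
toy_wide
```

The original `id` column is replaced by `letter` and `number`, both stored as character columns.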

    df <- df |>
      separate_wider_regex(
        cols = abstracts,
        patterns = c(".*",
                     "(?:p.value)|p",
                     sign = ".?[<>]?.{0,7}",
                     p_value = "0\\.\\d+",
                     ".*"),
        too_few = "align_start"
      )
    df

    ## A tibble: 7 x 2
    #  sign       p_value
    #  <chr>      <chr>
    #1 " of "     0.05
    #2 " = "      0.023
    #3 " and the" NA
    #4 "<"        0.05
    #5 " > "      0.1
    #6 ""         NA
    #7 "art is a" NA

Here’s an explanation of the patterns used (keep in mind that, when grouped, they should match all the text in *abstracts*):

- **`".*"`:** match a text that starts with zero or more characters of any kind
- **`"(?:p.value)|p"`:** followed by the pattern: p and value separated by any single character, or simply the letter p
- **`sign = ".?[<>]?.{0,7}"`:** followed by 1 character of any kind if it exists, then < or > if they exist, then up to 7 more characters of any kind (and save this pattern in a variable called *sign*)
- **`p_value = "0\\.\\d+"`:** followed by a zero, then a decimal point, then one or more digits (and save this pattern in a variable called *p_value*)
- **`".*"`:** followed by zero or more characters of any kind

Notice that the column *sign* in the output above contains a lot of text that is not useful. Since we only want to know the sign before the p-value (< or >), we can use `str_extract()` to clean this column further:

    df <- df |>
      mutate(sign = str_extract(sign, '[<>]'),
             p_value = as.numeric(p_value))
    df

    ## A tibble: 7 x 2
    #  sign  p_value
    #  <chr>   <dbl>
    #1 NA      0.05
    #2 NA      0.023
    #3 NA     NA
    #4 <       0.05
    #5 >       0.1
    #6 NA     NA
    #7 NA     NA
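To see why the first two rows end up with a missing *sign*, here is a small sketch that applies `str_extract()` directly to a few sign strings copied by hand from the earlier output. The `with_equals` variant is an assumption for illustration and is not what the code above does.

```r
library(stringr)

# Sign strings copied by hand from rows 1, 2, 4 and 5 of the earlier output
raw_sign <- c(" of ", " = ", "<", " > ")

# Keeps only "<" or ">"; anything else becomes NA
strict <- str_extract(raw_sign, "[<>]")
strict

# Assumed variant (not used in this post): also keep "="
with_equals <- str_extract(raw_sign, "[<>=]")
with_equals
```

With the stricter character class, "of" and "=" both collapse to NA, so an exact p-value cannot be told apart from one introduced with words.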

Of course, the real world will be more complex than this. For instance:

- Multiple p-values may be reported in the same abstract. The code above extracts just one. (Here’s a tutorial on extracting multiple occurrences of a pattern).
- Sometimes the results are reported as “statistically significant” without mentioning the words “p-value” or “p”.
- In some cases, p = 0.5 refers to a quantity p other than a p-value.
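For the first limitation, one possible approach is `str_match_all()`, which returns every match rather than only the first. The sketch below is a starting point under simplifying assumptions: the abstract is invented, and the pattern only handles a bare `p` followed by `<`, `>`, or `=`.

```r
library(stringr)

# Hypothetical abstract that reports two p-values
abstract <- "x affected y (p<0.05) but not z (p = 0.31)"

# str_match_all() returns a list with one matrix per input string;
# each matrix has one row per match: full match, sign, value
hits <- str_match_all(abstract, "p\\s*([<>=])\\s*(0\\.\\d+)")[[1]]
hits

# Sign and numeric value of every reported p-value
signs <- hits[, 2]
values <- as.numeric(hits[, 3])
```

The other two limitations cannot be solved by a regular expression alone and would need extra context from the surrounding text.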