Our goal is to extract all p-values reported in these *abstracts*:

    library(tidyverse)

    # Six toy abstracts; the fourth reports two p-values
    df <- tibble(
      abstracts = c("a p-value of 0.05",
                    "we found p = 0.023.",
                    "the p-value and the 95% CI",
                    "0.3 (p > 0.05) and 2.1 (p<0.01)",
                    "r = 0.8 indicates high correlation",
                    "this part is about something else"))

Notice that most reported p-values follow this 3-part pattern:

- The mention of one of the following: p-value, p value, or p
- The sign: equals, =, <, >, or something similar (e.g. of)
- The value, which is: a zero, a decimal point, then one or more digits (e.g. 0.1 or 0.05)

In regular expressions, this pattern can be written as:

    pattern <- "(p.value|p).{1,7}0\\.\\d+"
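
To make the three parts easier to see, the same pattern can be assembled from named pieces. This is only an illustrative sketch: the names `mention`, `sign_part`, `value`, and `examples` are not part of the original code, and three of the abstracts are reused as test strings.

```r
library(stringr)

# The three parts of the pattern, kept separate for readability
mention   <- "(p.value|p)"  # "p-value", "p value", or a bare "p" ("." matches any one character)
sign_part <- ".{1,7}"       # 1 to 7 arbitrary characters, e.g. " = ", "<", or " of "
value     <- "0\\.\\d+"     # a zero, a literal decimal point, then one or more digits

pattern <- str_c(mention, sign_part, value)

examples <- c("a p-value of 0.05",
              "the p-value and the 95% CI",
              "r = 0.8 indicates high correlation")

# Only the first string contains a mention followed closely by a value
str_detect(examples, pattern)
# [1]  TRUE FALSE FALSE
```

Note that the bare `p` alternative matches any letter "p", so on real abstracts a stricter mention (for example one anchored with a word boundary) may be worth considering.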

Then we can call `str_extract_all()` to extract all text that follows this pattern:

    df <- df |>
      mutate(stat_sig = str_extract_all(abstracts, pattern))
    df
    ## A tibble: 6 x 2
    #  abstracts                          stat_sig
    #  <chr>                              <list>
    #1 a p-value of 0.05                  <chr [1]>
    #2 we found p = 0.023.                <chr [1]>
    #3 the p-value and the 95% CI         <chr [0]>
    #4 0.3 (p > 0.05) and 2.1 (p<0.01)    <chr [2]>
    #5 r = 0.8 indicates high correlation <chr [0]>
    #6 this part is about something else  <chr [0]>
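
The reason for `str_extract_all()` rather than `str_extract()` shows up on the fourth abstract, which reports two p-values. A small sketch on that single string (the variable `x` is just an illustrative name):

```r
library(stringr)

pattern <- "(p.value|p).{1,7}0\\.\\d+"
x <- "0.3 (p > 0.05) and 2.1 (p<0.01)"

# str_extract() stops at the first match
str_extract(x, pattern)
# [1] "p > 0.05"

# str_extract_all() returns a list with one character vector per input string
str_extract_all(x, pattern)
# [[1]]
# [1] "p > 0.05" "p<0.01"
```

Because the result is a list, storing it in a tibble produces a list-column, with an empty character vector for strings that contain no match.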

The new column is a list-column, which we can flatten with `unnest()`. Each match gets its own row, and abstracts without a match are dropped:

    df <- unnest(df, stat_sig)
    df
    ## A tibble: 4 x 2
    #  abstracts                       stat_sig
    #  <chr>                           <chr>
    #1 a p-value of 0.05               p-value of 0.05
    #2 we found p = 0.023.             p = 0.023
    #3 0.3 (p > 0.05) and 2.1 (p<0.01) p > 0.05
    #4 0.3 (p > 0.05) and 2.1 (p<0.01) p<0.01
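
If the abstracts without any p-value should stay in the data, `unnest()` has a `keep_empty` argument for that. The following is a minimal sketch on a hand-built two-row tibble (`nested`, `dropped`, and `kept` are illustrative names, not part of the original workflow):

```r
library(tibble)
library(tidyr)

# One abstract with a match and one with an empty result
nested <- tibble(
  abstracts = c("we found p = 0.023.", "this part is about something else"),
  stat_sig  = list("p = 0.023", character(0))
)

# Default behaviour: the row with the empty vector disappears
dropped <- unnest(nested, stat_sig)
nrow(dropped)
# [1] 1

# keep_empty = TRUE keeps that row and fills stat_sig with NA
kept <- unnest(nested, stat_sig, keep_empty = TRUE)
nrow(kept)
# [1] 2
```

Keeping the empty rows can be useful when the share of abstracts that report no p-value is itself of interest.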

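The next step is to extract the sign and the p-value from the *stat_sig* column using `str_extract()`. Because the sign pattern matches only `<` or `>`, p-values reported with `=` or in words get `NA` in the *sign* column: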

    df <- df |>
      mutate(sign = str_extract(stat_sig, "[<>]"),
             p_value = str_extract(stat_sig, "0\\.\\d+"),
             p_value = as.numeric(p_value)) |>
      select(-stat_sig)
    df
    ## A tibble: 4 x 3
    #  abstracts                       sign  p_value
    #  <chr>                           <chr> <dbl>
    #1 a p-value of 0.05               NA    0.05
    #2 we found p = 0.023.             NA    0.023
    #3 0.3 (p > 0.05) and 2.1 (p<0.01) >     0.05
    #4 0.3 (p > 0.05) and 2.1 (p<0.01) <     0.01
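
As a possible variation (not what the code above does), the sign pattern could also include `=`, so that exactly reported p-values are labelled rather than left as `NA`. A minimal sketch on the four extracted strings, with `stat_sig` and `sign` as plain vectors for illustration:

```r
library(stringr)

stat_sig <- c("p-value of 0.05", "p = 0.023", "p > 0.05", "p<0.01")

# Hypothetical variation: add "=" to the character class;
# wording such as "of" still yields NA
sign <- str_extract(stat_sig, "[<>=]")
sign
# [1] NA  "=" ">" "<"
```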

## Further reading