Our goal is to extract all p-values reported in these abstracts:
library(tidyverse)
df <- tibble(
  abstracts = c("a p-value of 0.05",
                "we found p = 0.023.",
                "the p-value and the 95% CI",
                "0.3 (p > 0.05) and 2.1 (p < 0.01)",
                "r = 0.8 indicates high correlation",
                "this part is about something else"))
Notice that most reported p-values follow this 3-part pattern:
- The mention of one of the following: p-value, p value, or p
- The sign: equals, =, <, >, or something similar (e.g. the word "of")
- The value: a leading zero, a decimal point, then one or more digits (e.g. 0.1 or 0.05)
In regular expressions, this pattern can be written as:
pattern <- "(p.value|p).{1,7}0\\.\\d+"
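To see what each part of the pattern contributes, it can help to try it on a few strings with str_detect(). The test strings below are made up for illustration and do not come from the abstracts. They suggest two limits of this pattern: it misses values written without a leading zero, and a bare "p" can match the last letter of an unrelated word.

```r
library(stringr)

pattern <- "(p.value|p).{1,7}0\\.\\d+"

# Hypothetical test strings, chosen to probe each part of the pattern
tests <- c("p-value of 0.05",  # "p.value", then " of ", then the value
           "p value = 0.2",    # "." in "p.value" also matches a space
           "p<0.001",          # a one-character gap is enough
           "p = .05",          # no leading zero, so no match
           "group 0.25")       # false positive: the "p" ending "group"

hits <- str_detect(tests, pattern)
hits
# TRUE TRUE TRUE FALSE TRUE
```

If such false positives matter for your data, one option is to require a word boundary before the "p" (for example "\\b(p.value|p)"), at the cost of a slightly longer pattern.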
Then we can call str_extract_all() to extract every substring that matches this pattern:
df <- df |>
  mutate(stat_sig = str_extract_all(abstracts, pattern))
df
## A tibble: 6 x 2
#  abstracts                          stat_sig
#  <chr>                              <list>
#1 a p-value of 0.05                  <chr [1]>
#2 we found p = 0.023.                <chr [1]>
#3 the p-value and the 95% CI         <chr [0]>
#4 0.3 (p > 0.05) and 2.1 (p < 0.01)  <chr [2]>
#5 r = 0.8 indicates high correlation <chr [0]>
#6 this part is about something else  <chr [0]>
The new column is a list-column, which we can flatten with unnest() so that each match gets its own row:
df <- unnest(df, stat_sig)
df
## A tibble: 4 x 2
#  abstracts                         stat_sig
#  <chr>                             <chr>
#1 a p-value of 0.05                 p-value of 0.05
#2 we found p = 0.023.               p = 0.023
#3 0.3 (p > 0.05) and 2.1 (p < 0.01) p > 0.05
#4 0.3 (p > 0.05) and 2.1 (p < 0.01) p < 0.01
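Only four rows remain because unnest() drops rows whose list element is empty. If abstracts without a p-value should stay in the table (for example, to count them later), tidyr's keep_empty argument is one way to do that. A minimal sketch on a smaller tibble, where the names demo, dropped, and kept are made up for illustration:

```r
library(tibble)
library(tidyr)
library(dplyr)
library(stringr)

pattern <- "(p.value|p).{1,7}0\\.\\d+"

# Two of the abstracts from above: one with a match, one without
demo <- tibble(abstracts = c("we found p = 0.023.",
                             "the p-value and the 95% CI")) |>
  mutate(stat_sig = str_extract_all(abstracts, pattern))

# Default behaviour: the abstract with no match disappears
dropped <- unnest(demo, stat_sig)

# keep_empty = TRUE: the abstract is kept, with NA in stat_sig
kept <- unnest(demo, stat_sig, keep_empty = TRUE)
```

Here dropped has one row and kept has two, the second with a missing stat_sig.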
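The next step is to extract the sign and the p-value from the stat_sig column using str_extract(). Because the sign pattern "[<>]" matches only inequalities, sign is NA for values reported with = or "of":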
df <- df |>
  mutate(sign = str_extract(stat_sig, "[<>]"),
         p_value = str_extract(stat_sig, "0\\.\\d+"),
         p_value = as.numeric(p_value)) |>
  select(-stat_sig)
df
## A tibble: 4 x 3
#  abstracts                         sign  p_value
#  <chr>                             <chr>   <dbl>
#1 a p-value of 0.05                 NA      0.05
#2 we found p = 0.023.               NA      0.023
#3 0.3 (p > 0.05) and 2.1 (p < 0.01) >       0.05
#4 0.3 (p > 0.05) and 2.1 (p < 0.01) <       0.01
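As a variation, the sign and the value can also be captured in a single pass with str_match(), which returns one matrix column per capture group. This is a sketch of an alternative and not the approach used above. Unlike "[<>]", it also records "=" as a sign, and the names parts, signs, and p_values are made up for illustration.

```r
library(stringr)

# The four matched strings from the stat_sig column
stat_sig <- c("p-value of 0.05", "p = 0.023", "p > 0.05", "p < 0.01")

# Group 1: an optional comparison sign; group 2: the value itself
parts <- str_match(stat_sig, "([<>=])?\\s*(0\\.\\d+)")

signs    <- parts[, 2]               # NA when no sign was captured
p_values <- as.numeric(parts[, 3])   # column 1 holds the full match
```

With this pattern, signs is NA for "p-value of 0.05" and "=", ">", "<" for the other three strings.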