Our goal is to extract all p-values reported in these abstracts:
library(tidyverse)

df <- tibble(
  abstracts = c(
    "a p-value of 0.05",
    "we found p = 0.023.",
    "the p-value and the 95% CI",
    "0.3 (p > 0.05) and 2.1 (p<0.01)",
    "r = 0.8 indicates high correlation",
    "this part is about something else"
  )
)
Notice that most reported p-values follow this 3-part pattern:
- The mention of one of the following: p-value, p value, or p
- The sign: equals, =, <, >, or a connecting word (e.g. of)
- The value: a zero, a decimal point, then one or more digits (e.g. 0.1 or 0.05)
In regular expressions, this pattern can be written as:
pattern <- "(p.value|p).{1,7}0\\.\\d+"
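Reading the pattern left to right: `(p.value|p)` matches the mention, `.{1,7}` allows one to seven arbitrary characters for the sign, and `0\\.\\d+` matches the value. The sketch below checks the pattern against the sample abstracts with `str_detect()`. It also illustrates one caveat: a bare `p` can match the last letter of an ordinary word. The example sentence and `strict_pattern` are hypothetical additions for illustration, not part of the original workflow.

```r
library(stringr)

pattern <- "(p.value|p).{1,7}0\\.\\d+"

abstracts <- c(
  "a p-value of 0.05",
  "we found p = 0.023.",
  "the p-value and the 95% CI",
  "0.3 (p > 0.05) and 2.1 (p<0.01)",
  "r = 0.8 indicates high correlation",
  "this part is about something else"
)

# Which abstracts contain at least one match?
detected <- str_detect(abstracts, pattern)
detected
# TRUE TRUE FALSE TRUE FALSE FALSE

# Caveat: the "p" at the end of "group" is treated as a mention.
loose_hit <- str_extract("the group 0.5 mg dose", pattern)
loose_hit
# "p 0.5"

# Hypothetical stricter variant: word boundaries (\\b) require the
# mention to stand alone rather than sit inside a longer word.
strict_pattern <- "\\b(p.value|p)\\b.{1,7}0\\.\\d+"
strict_hit <- str_extract("the group 0.5 mg dose", strict_pattern)
strict_hit
# NA

# The stricter variant still finds the same abstracts as before.
strict_detected <- str_detect(abstracts, strict_pattern)
```

Whether the stricter variant is worth the extra complexity depends on how noisy the real abstracts are; for the six sample strings both patterns give the same result.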
Then we can call str_extract_all() to extract all text that follows this pattern:
df <- df |>
  mutate(stat_sig = str_extract_all(abstracts, pattern))

df
## A tibble: 6 x 2
#  abstracts                          stat_sig
#  <chr>                              <list>
#1 a p-value of 0.05                  <chr [1]>
#2 we found p = 0.023.                <chr [1]>
#3 the p-value and the 95% CI         <chr [0]>
#4 0.3 (p > 0.05) and 2.1 (p<0.01)    <chr [2]>
#5 r = 0.8 indicates high correlation <chr [0]>
#6 this part is about something else  <chr [0]>
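Each element of the result holds as many strings as there are matches in that abstract. As a quick sanity check, a minimal sketch (working on a plain character vector rather than the tibble) can count the matches directly with `str_count()` and compare them with the lengths of the extracted list:

```r
library(stringr)

pattern <- "(p.value|p).{1,7}0\\.\\d+"

abstracts <- c(
  "a p-value of 0.05",
  "we found p = 0.023.",
  "the p-value and the 95% CI",
  "0.3 (p > 0.05) and 2.1 (p<0.01)",
  "r = 0.8 indicates high correlation",
  "this part is about something else"
)

# Number of non-overlapping matches in each abstract
n_matches <- str_count(abstracts, pattern)
n_matches
# 1 1 0 2 0 0

# The same counts, recovered from the list that str_extract_all() returns
n_extracted <- lengths(str_extract_all(abstracts, pattern))
```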
The new column is a list-column, which we can flatten with unnest(). Abstracts with no match are dropped:
df <- unnest(df, stat_sig)

df
## A tibble: 4 x 2
#  abstracts                        stat_sig
#  <chr>                            <chr>
#1 a p-value of 0.05                p-value of 0.05
#2 we found p = 0.023.              p = 0.023
#3 0.3 (p > 0.05) and 2.1 (p<0.01)  p > 0.05
#4 0.3 (p > 0.05) and 2.1 (p<0.01)  p<0.01
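By default, unnest() drops rows whose list element is empty, which is why only four rows remain. If the abstracts without a p-value should stay in the data (for example, to count how many report nothing), one option is the `keep_empty` argument of `tidyr::unnest()`. The sketch below rebuilds the data from scratch so it runs on its own; the name `df_all` is a hypothetical choice for illustration.

```r
library(tidyverse)

pattern <- "(p.value|p).{1,7}0\\.\\d+"

df_all <- tibble(
  abstracts = c(
    "a p-value of 0.05",
    "we found p = 0.023.",
    "the p-value and the 95% CI",
    "0.3 (p > 0.05) and 2.1 (p<0.01)",
    "r = 0.8 indicates high correlation",
    "this part is about something else"
  )
) |>
  mutate(stat_sig = str_extract_all(abstracts, pattern)) |>
  # keep_empty = TRUE keeps abstracts with no match, filling stat_sig with NA
  unnest(stat_sig, keep_empty = TRUE)

df_all
```

With the sample data this gives seven rows: four matches plus three abstracts whose `stat_sig` is `NA`.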
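The next step is to extract the sign and the p-value from the stat_sig column using str_extract(). Because the sign pattern only looks for < or >, values reported with = or a word such as "of" get NA: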
df <- df |>
  mutate(
    sign = str_extract(stat_sig, "[<>]"),
    p_value = str_extract(stat_sig, "0\\.\\d+"),
    p_value = as.numeric(p_value)
  ) |>
  select(-stat_sig)

df
## A tibble: 4 x 3
#  abstracts                        sign  p_value
#  <chr>                            <chr>   <dbl>
#1 a p-value of 0.05                NA      0.05
#2 we found p = 0.023.              NA      0.023
#3 0.3 (p > 0.05) and 2.1 (p<0.01)  >       0.05
#4 0.3 (p > 0.05) and 2.1 (p<0.01)  <       0.01
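If exact values should be distinguishable from values with no recognisable sign, the character class can be widened to include `=`. This is a small variation on the step above, not the original approach; the `results` tibble and its input vector are hypothetical stand-ins for the `stat_sig` column.

```r
library(tidyverse)

# Stand-in for the stat_sig column produced by the unnest() step
stat_sig <- c("p-value of 0.05", "p = 0.023", "p > 0.05", "p<0.01")

results <- tibble(stat_sig = stat_sig) |>
  mutate(
    # "[<>=]" also captures the equals sign; wording such as "of" stays NA
    sign = str_extract(stat_sig, "[<>=]"),
    p_value = as.numeric(str_extract(stat_sig, "0\\.\\d+"))
  )

results
```

Here `sign` comes out as `NA`, `"="`, `">"`, `"<"`, so only the "of" phrasing is left without a sign.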