Extract Numbers from Strings in R

The functions parse_integer(), parse_double(), and parse_number() from the readr package convert a character vector into a numeric vector.

  • Use parse_integer() when each string, taken as a whole, is a valid integer, for example: “1” and “-2”.
  • Use parse_double() when each string, taken as a whole, is a valid number, for example: “1”, “1.2”, and “1e2”.
  • Use parse_number() when you want to extract the first number from a string that also contains non-numeric characters, for example: “text1”.

Here’s an example that compares these three functions. Values that cannot be parsed come back as NA, accompanied by a parsing warning:

library(readr)

n = c('1',
      '1.2',
      '1e2',
      '1,000',
      '1,2',
      '1/2',
      'text-1.2text',
      'text')

parse_integer(n)
#outputs: 1 NA NA NA NA NA NA NA

parse_double(n)
#outputs: 1.0 1.2 100.0 NA NA NA NA NA

parse_number(n)
#outputs: 1.0 1.2 100.0 1000.0 12.0 1.0 -1.2 NA
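
In the output above, '1,2' becomes 12 because parse_number() treats the comma as a grouping mark by default and simply drops it. If the comma is meant as a decimal mark instead (an assumption about the data, not something the example above states), a minimal sketch using the decimal_mark argument of locale() could look like this:

```r
library(readr)

# Assumption: the input uses a comma as the decimal mark (European style).
comma_locale <- locale(decimal_mark = ",")

# Default locale: the comma is a grouping mark and is dropped.
default_result <- parse_number("1,2")

# Custom locale: the comma is read as the decimal separator.
comma_result <- parse_number("1,2", locale = comma_locale)
```

Here default_result is 12 and comma_result is 1.2, so the right choice depends on how the numbers were written in the source data.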

Exercises

1. Extract the number 1000000 from “1 000 000”

The spaces in this string prevent parse_integer() and parse_double() from reading it, so we use parse_number(). On its own, parse_number() stops at the first space, so we also have to tell it that the space is a grouping mark:

n1 = "1 000 000"
parse_number(n1)
#outputs: 1

parse_number(n1,
             locale = locale(grouping_mark = " "))
#outputs: 1e+06

By default, grouping_mark = ",", which is why parse_number("1,000") outputs 1000.
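
The same approach should work for other grouping characters. As a small sketch, assuming a hypothetical input that uses apostrophes as thousands separators:

```r
library(readr)

# Default locale: the comma is the grouping mark.
with_comma <- parse_number("1,000")

# Hypothetical input grouped with apostrophes; the mark is passed explicitly.
with_apostrophe <- parse_number("1'000'000",
                                locale = locale(grouping_mark = "'"))
```

Both calls return plain doubles (1000 and 1e+06), with the grouping characters removed.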

2. Extract the number 123456 from “123-456”

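Called directly, parse_number() stops at the hyphen and returns 123. One simple way to deal with the hyphen is to replace it with an empty string using str_replace() from the stringr package, which replaces only the first match (enough here), and then pass the result to parse_integer():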

n1 = "123-456"
parse_number(n1) #outputs: 123

library(stringr)
n2 = str_replace(n1, "-", "")
# n2 is now: "123456"

parse_integer(n2)
#outputs: 123456
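
Because str_replace() only touches the first match, a string with several hyphens needs str_replace_all() instead. The following sketch uses a made-up input, "123-456-789", to show the difference:

```r
library(readr)
library(stringr)

# Hypothetical input with two hyphens.
multi <- "123-456-789"

# str_replace() removes only the first hyphen.
first_only <- str_replace(multi, "-", "")

# str_replace_all() removes every hyphen.
all_removed <- str_replace_all(multi, "-", "")

# The cleaned string is now a valid integer.
result <- parse_integer(all_removed)
```

After this runs, first_only is "123456-789", all_removed is "123456789", and result is the integer 123456789.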

3. Extract the number 1000 from “1*10^3”

The trick here is to realize that “*10^” can be replaced with “e”, which gives the same number in scientific notation, a format that both parse_double() and parse_number() can read:

n1 = "1*10^3"

library(stringr)
n2 = str_replace(n1, "\\*10\\^", 'e')
# n2 is now "1e3"

parse_double(n2)
#outputs: 1000

Since * and ^ have special meanings in regular expressions, each one has to be escaped with a backslash to match it literally. Inside an R string the backslash itself must be escaped, which is why each escape is written as a double backslash \\.
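
An alternative that avoids escaping altogether is stringr's fixed() helper, which makes the pattern match as a literal string rather than a regular expression. This is a swapped-in variation on the approach above, sketched with the same input:

```r
library(readr)
library(stringr)

n1 <- "1*10^3"

# fixed() matches "*10^" literally, so no backslashes are needed.
n2 <- str_replace(n1, fixed("*10^"), "e")

result <- parse_double(n2)
```

Here n2 is "1e3" and result is 1000, the same as with the escaped regular expression.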
