Getting parts of string using regex and capturing parens in mutate

jameshowison · January 29, 2018, 10:10pm

I often want to use a regex to pull parts of the strings in a chr column into their own columns. It's similar to separate but more general. I can get it to work, but it feels awkward, and I wonder if there's a better way.

library(tidyverse)

df <- tribble(
  ~filename,
  "2008_some_name_author1.xlsx",
  "2008_some_name_author2.xlsx",
  "2008_some_name_author3.xlsx"
)

pattern <- "(\\d+).*_([^_]*)\\.xlsx"

df %>% 
  pull(filename) %>% 
  str_match(pattern)

df %>% 
  # this is ugly: mutate(year = str_match(filename, pattern)[, 2])
  mutate(year = filename %>% str_match(pattern) %>% as.tibble %>% pull(2),
         author = filename %>% str_match(pattern) %>% as.tibble %>% pull(3))
  # really want something like:
  # regex_separate(filename, into = c("year", "author"), pattern)

I've looked over the str_match documentation fairly closely but I haven't seen a usage example like this one. A couple of things:

1. Why str_match and not str_extract? (I think of this as "extracting bits of the string".)
2. Is getting the nth column of the matrix the way to use str_match in this context? If so, is there anything better than the two options above ([,2] and %>% as.tibble %>% pull(2))?
3. Is there any way to do "both at once"? The code above runs the regex twice.
4. Does anyone think something like separate would be useful, using paren capturing rather than splitting? I love that separate is explicit but avoids 'magic numbers' etc. Something like:

df %>% 
   regex_separate(filename, into = c("year", "author"), pattern)

Or am I going about this entirely the wrong way?
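
For reference, here is a minimal sketch of how str_extract and str_match differ on this data, and of one way to run the regex only once. The names `whole`, `m`, and `result` are made up for illustration, and the dot before xlsx is escaped here so that it matches a literal period.

```r
library(tidyverse)

df <- tribble(
  ~filename,
  "2008_some_name_author1.xlsx",
  "2008_some_name_author2.xlsx",
  "2008_some_name_author3.xlsx"
)

pattern <- "(\\d+).*_([^_]*)\\.xlsx"

# str_extract() returns only the full match: one string per input.
whole <- str_extract(df$filename, pattern)

# str_match() returns a character matrix: column 1 is the full match,
# and columns 2 and 3 hold the first and second capture groups.
m <- str_match(df$filename, pattern)

# Run the regex once, then pick out the capture groups by column.
result <- df %>%
  mutate(year = m[, 2], author = m[, 3])
```

This still relies on positional column indices, which is the 'magic numbers' problem mentioned above, but the pattern is evaluated only once.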

markdly · January 29, 2018, 11:22pm

Perhaps tidyr::extract is the function you are looking for?

library(tidyverse)

df <- tribble(
  ~filename,
  "2008_some_name_author1.xlsx",
  "2008_some_name_author2.xlsx",
  "2008_some_name_author3.xlsx"
)

pattern <- "(\\d+).*_([^_]*)\\.xlsx"

extract(df, filename, c("year", "author"), pattern)
#> # A tibble: 3 x 2
#>    year  author
#> * <chr>   <chr>
#> 1  2008 author1
#> 2  2008 author2
#> 3  2008 author3
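
As a possible follow-up, extract also has `remove` and `convert` arguments. The sketch below assumes you want to keep the original filename column and have year parsed as a number; the name `out` is made up for illustration, and the dot before xlsx is escaped so that it matches a literal period.

```r
library(tidyverse)

df <- tribble(
  ~filename,
  "2008_some_name_author1.xlsx",
  "2008_some_name_author2.xlsx",
  "2008_some_name_author3.xlsx"
)

pattern <- "(\\d+).*_([^_]*)\\.xlsx"

# remove = FALSE keeps the filename column alongside the new ones.
# convert = TRUE runs type conversion on the new columns, so year
# becomes an integer while author stays a character column.
out <- tidyr::extract(df, filename, c("year", "author"), pattern,
                      remove = FALSE, convert = TRUE)
```

Writing tidyr::extract in full is a defensive choice: other packages, magrittr for one, also export a function called extract.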

jameshowison · January 30, 2018, 12:24am

Hah, that's it precisely. Thanks! It's kinda nice that I came up with the exact interface that already existed. I reckon that's a mark of consistency in the tidyverse.

I suggested adding a @seealso for tidyr::extract to the documentation for str_match: https://github.com/tidyverse/stringr/pull/212

markdly · February 1, 2018, 3:18am

You're not alone - I had the same sort of 'hiccup' before finding out extract existed too!

Good idea to suggest a @seealso. I've proposed adding one to separate as well: https://github.com/tidyverse/tidyr/pull/402