I often want to use a regex to pull parts of the strings in a chr column into their own columns. It’s similar to separate but more general. I can get it to work, but I find it awkward and I wonder if there’s a better way.
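```r
library(tidyverse)  # tribble(), the dplyr verbs, and stringr's str_match()

df <- tribble(
  ~filename,
  "2008_some_name_author1.xlsx",
  "2008_some_name_author2.xlsx",
  "2008_some_name_author3.xlsx"
)

# Group 1: leading digits (year). Group 2: text after the last underscore (author).
# The dot before "xlsx" is escaped so that it matches a literal "."
pattern <- "(\\d+).*_([^_]*)\\.xlsx"

# str_match() returns a character matrix: full match, then one column per group
df %>% pull(filename) %>% str_match(pattern)

df %>%
  # this is ugly: mutate(author = filename %>% str_match(pattern)[,2])
  mutate(year = filename %>% str_match(pattern) %>% as.tibble %>% pull(2),
         author = filename %>% str_match(pattern) %>% as.tibble %>% pull(3))

# really want something like:
# regex_separate(filename, into = c("year", "author"), pattern)
```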
I’ve looked over the str_match documentation fairly closely but I haven’t seen a usage example like this one. A couple of things:
- Why str_match and not str_extract? (I think of this as “extracting bits of the string”.)
- Is getting the nth column of the matrix the way to use str_match in this context? If so, is there anything better than the two options above ([,2] or %>% as.tibble %>% pull(2))?
- Is there any way to do “both at once”? The code above runs the regex twice.
- Does anyone think something like separate would be useful, using paren capturing rather than splitting? I love that separate is explicit but avoids ‘magic numbers’ etc. Something like:
df %>% regex_separate(filename, into = c("year", "author"), pattern)
Or am I going about this entirely the wrong way?