tidyr::extract() dealing with an optional substring

stringr
regex

#1

Hi,

I am trying to use extract to handle optional substrings. For example,

library(tidyverse)

df <- as.tibble(c(
  "A = X0 -> X1",
  "B = Y1" ))

df %>% extract(value, into = c("variable", "v0", "v1"),
                regex = "(\\w+) = (\\w+) -> (\\w+)")
#> # A tibble: 2 x 3
#>   variable v0    v1   
#>   <chr>    <chr> <chr>
#> 1 A        X0    X1   
#> 2 <NA>     <NA>  <NA>

Created on 2018-09-04 by the reprex package (v0.2.0).

I want to be able to match the second row and extract the values "B" and "Y1" into "variable" and "v1", leaving v0 empty (NA).

This is probably a general regex question beyond my skill level. Please share your suggestions/solutions.

Thanks,

Dong


#2

Hi Dong,

For me, I'd tackle this problem by using separate() a couple of times instead of extract().

library(tidyverse)

df <- as.tibble(c(
  "A = X0 -> X1",
  "B = Y1" ))

df %>% 
  separate(value, into = c("variable", "rhs"), " = ") %>% 
  separate(rhs, into = c("v0", "v1"), " -> ", fill = "left")
#> # A tibble: 2 x 3
#>   variable v0    v1   
#>   <chr>    <chr> <chr>
#> 1 A        X0    X1   
#> 2 B        <NA>  Y1

Created on 2018-09-05 by the reprex package (v0.2.0).

Personally I prefer this slightly more verbose approach as it allows me to keep the regexes simpler too (as my regex skills certainly aren't the greatest)! :slight_smile:
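
To make the role of fill more concrete, here is a minimal sketch comparing fill = "right" with fill = "left" on just the right-hand side. The data frame d and the names right and left are made up for illustration; only separate() and its fill argument come from the example above.

```r
library(tidyr)

# Hypothetical input: only the right-hand sides from the example above
d <- data.frame(rhs = c("X0 -> X1", "Y1"), stringsAsFactors = FALSE)

# fill = "right": when a piece is missing, NA is padded on the right,
# so the lone "Y1" ends up in v0
right <- separate(d, rhs, into = c("v0", "v1"), sep = " -> ", fill = "right")

# fill = "left": NA is padded on the left, so the lone "Y1" ends up in v1
left <- separate(d, rhs, into = c("v0", "v1"), sep = " -> ", fill = "left")

right
left
```

With fill = "left", a row that has only one piece keeps it in the last column, which is what the question asks for.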


#3

Wow! Nice and simple. Many thanks!
I did not know about fill = "left" before, which appears to be the key here.

Still wondering how regex would handle it...


#4

Yes, you can do that with a regex:

library(tidyverse)

df <- as.tibble(c(
  "A = X0 -> X1",
  "B = Y1" ))

df %>% extract(value, into = c("variable", "v0", "v1"),
               regex = "(\\w+) = (\\w+)(?: -> (\\w+))?")
#> # A tibble: 2 x 3
#>   variable v0    v1   
#>   <chr>    <chr> <chr>
#> 1 A        X0    X1   
#> 2 B        Y1    <NA>

Created on 2018-09-05 by the reprex package (v0.2.0).

The two additional tricks I used:

  • (?: ...) is a non-capturing group: it groups part of the pattern without producing its own column in the extraction
  • (...)? makes a group optional
    That way, (?: -> (\\w+))? matches only if there is a word after ->
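
To see what each group captures, here is a minimal sketch using stringr::str_match(). That function is not used in the posts above; it is brought in here only because it shows the full match and every capturing group side by side. The names x, pattern and m are illustrative.

```r
library(stringr)

x <- c("A = X0 -> X1", "B = Y1")

# Same pattern as in the extract() call: three capturing groups, the last
# one wrapped in an optional non-capturing group
pattern <- "(\\w+) = (\\w+)(?: -> (\\w+))?"

# str_match() returns a character matrix: column 1 holds the full match,
# followed by one column per capturing group. The non-capturing group
# (?: ...) adds no column of its own.
m <- str_match(x, pattern)
m
```

For the second string, the optional part does not match, so the column for the third capturing group is NA while the first two groups still capture "B" and "Y1".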

#5

Nice example @cderv! I adjusted your regex slightly so that the optional group (?: ...)? wraps the (\\w+) term before -> rather than the one after it. This puts NA in the v0 column and Y1 in the v1 column (which I think is the desired output):

library(tidyverse)

df <- as.tibble(c(
  "A = X0 -> X1",
  "B = Y1" ))

df %>% extract(value, into = c("variable", "v0", "v1"),
               regex = "(\\w+) = (?:(\\w+) -> )?(\\w+)")
#> # A tibble: 2 x 3
#>   variable v0    v1   
#>   <chr>    <chr> <chr>
#> 1 A        X0    X1   
#> 2 B        <NA>  Y1

Created on 2018-09-06 by the reprex package (v0.2.0).

(BTW, I'm not really familiar with using optional groups, so I learnt a lot by trying to tweak your regex example. Thanks for putting it up as a solution!)


#6

Thank you both @markdly and @cderv. I certainly learned quite a bit from you.


#7

Oh, you're right, it is the first part that is missing! :slight_smile: Good catch!

Regex is powerful, and there are several ways to achieve the same extraction. Which one to use depends on how generic or specific the regex should be here.

Glad I could help!
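
As one illustration of the generic versus specific trade-off, here is a sketch that compares the pattern from post #5 with a looser variant that tolerates irregular spacing and anchors the whole string. The looser pattern and the sample strings are assumptions made up for this example, not something from the original data, and stringr::str_match() is used only to show the captured groups.

```r
library(stringr)

# Hypothetical inputs with irregular spacing around "=" and "->"
x <- c("A=X0->X1", "B  =  Y1")

# Pattern from post #5: requires exactly one space around "=" and "->"
strict <- "(\\w+) = (?:(\\w+) -> )?(\\w+)"

# Looser variant (an assumption): any amount of whitespace, anchored with
# ^ and $ so the whole string has to fit the pattern
flexible <- "^(\\w+)\\s*=\\s*(?:(\\w+)\\s*->\\s*)?(\\w+)$"

m_strict <- str_match(x, strict)
m_flexible <- str_match(x, flexible)

m_strict
m_flexible
```

The strict pattern matches neither string, so every cell is NA, while the flexible one recovers the variable, v0 and v1 for both. The same regex strings can be passed to extract() in place of the original one.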


#8

If your question's been answered, would you mind choosing a solution? It helps other people see which questions still need help, or find solutions if they have similar problems. Here’s how to do it: