tidyr::extract dealing with an optional substring

Dong · September 4, 2018, 11:53pm

Hi,

I am trying to use extract to handle optional substrings. For example,

library(tidyverse)

df <- as.tibble(c(
  "A = X0 -> X1",
  "B = Y1" ))

df %>% extract(value, into = c("variable", "v0", "v1"),
               regex = "(\\w+) = (\\w+) -> (\\w+)")
#> # A tibble: 2 x 3
#>   variable v0    v1   
#>   <chr>    <chr> <chr>
#> 1 A        X0    X1   
#> 2 <NA>     <NA>  <NA>

Created on 2018-09-04 by the reprex package (v0.2.0).

I want to be able to match the second row and extract the values "B" and "Y1" into "variable" and "v1", leaving v0 empty (NA).

This is probably a general regex question beyond my skill level. Please share your suggestions/solutions.

Thanks,

Dong

markdly · September 5, 2018, 12:56am

Hi Dong,

For me, I'd tackle this problem by using separate() a couple of times instead of extract().

library(tidyverse)

df <- as.tibble(c(
  "A = X0 -> X1",
  "B = Y1" ))

df %>% 
  separate(value, into = c("variable", "rhs"), " = ") %>% 
  separate(rhs, into = c("v0", "v1"), " -> ", fill = "left")
#> # A tibble: 2 x 3
#>   variable v0    v1   
#>   <chr>    <chr> <chr>
#> 1 A        X0    X1   
#> 2 B        <NA>  Y1

Created on 2018-09-05 by the reprex package (v0.2.0).

Personally I prefer this slightly more verbose approach as it allows me to keep the regexes simpler too (as my regex skills certainly aren't the greatest)!
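
To see what the fill argument in the second separate() call is doing, here is a minimal sketch that isolates it. The tibble name rhs_only and the two result names are made up for illustration; the strings are the right-hand sides from the example above.

```r
library(tidyr)
library(tibble)

# Hypothetical input: only the right-hand sides from the example
rhs_only <- tibble(rhs = c("X0 -> X1", "Y1"))

# fill = "right" pads missing pieces on the right, so "Y1" lands in v0
padded_right <- separate(rhs_only, rhs, into = c("v0", "v1"),
                         sep = " -> ", fill = "right")

# fill = "left" pads missing pieces on the left, so "Y1" lands in v1
padded_left <- separate(rhs_only, rhs, into = c("v0", "v1"),
                        sep = " -> ", fill = "left")

print(padded_right)
print(padded_left)
```

Without fill, separate() also pads on the right but emits a warning about the missing pieces, so setting it explicitly both silences the warning and controls which column receives the NA.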

Dong · September 5, 2018, 5:18am

Wow! Nice and simple. Many thanks!
I did not know fill = "left" before, which appears to be the key here.

Still wondering how regex would handle it...

cderv · September 5, 2018, 7:24pm

Yes, you can do that with a regex:

library(tidyverse)

df <- as.tibble(c(
  "A = X0 -> X1",
  "B = Y1" ))

df %>% extract(value, into = c("variable", "v0", "v1"),
               regex = "(\\w+) = (\\w+)(?: -> (\\w+))?")
#> # A tibble: 2 x 3
#>   variable v0    v1   
#>   <chr>    <chr> <chr>
#> 1 A        X0    X1   
#> 2 B        Y1    <NA>

Created on 2018-09-05 by the reprex package (v0.2.0).

The two additional tricks I used:

(?: ...) is a non-capturing group: it takes part in the match but is not extracted into a column
(...)? makes a group optional
That way, (?: -> (\\w+))? matches only if there is a -> followed by a word
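
One way to inspect these two tricks outside of extract() is stringr::str_match(), which returns the full match followed by one column per capturing group. This is only a sketch for exploring the pattern; the variable names x and m are arbitrary.

```r
library(stringr)

x <- c("A = X0 -> X1", "B = Y1")

# Column 1 is the whole match; columns 2 to 4 are the three capturing groups.
# The non-capturing (?: ...) wrapper gets no column of its own.
m <- str_match(x, "(\\w+) = (\\w+)(?: -> (\\w+))?")

print(m)

# For "B = Y1" the optional part did not take part in the match,
# so the third capturing group is NA
print(is.na(m[2, 4]))
```

The matrix has four columns rather than five, which shows that only the three (\\w+) groups are captured.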

markdly · September 5, 2018, 11:34pm

Nice example @cderv! I adjusted your regex slightly so the optional group (?: ...)? includes the (\\w+) term before -> rather than the one after it. This puts NA in the v0 column and Y1 in the v1 column (which I think is the desired output):

library(tidyverse)

df <- as.tibble(c(
  "A = X0 -> X1",
  "B = Y1" ))

df %>% extract(value, into = c("variable", "v0", "v1"),
               regex = "(\\w+) = (?:(\\w+) -> )?(\\w+)")
#> # A tibble: 2 x 3
#>   variable v0    v1   
#>   <chr>    <chr> <chr>
#> 1 A        X0    X1   
#> 2 B        <NA>  Y1

Created on 2018-09-06 by the reprex package (v0.2.0).

(BTW, I'm not really familiar with using optional groups, so I learnt a lot by trying to tweak your regex example. Thanks for putting it up as a solution!)

Dong · September 5, 2018, 11:38pm

Thank you both @markdly and @cderv. I certainly learned quite a bit from you.

cderv · September 6, 2018, 5:38am

Oh, you're right, it is the first part that is missing! Good catch!

Regex is powerful, and there are several ways to achieve the same extraction. Which one to use depends on how generic or specific the regex needs to be here.

Glad I could help!
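
As an illustration of the generic versus specific trade-off, here is a sketch of a more tolerant variant of the pattern that allows any amount of whitespace around = and ->. The input strings in messy are made up for this example and are not from the original question.

```r
library(tidyr)
library(tibble)

# Hypothetical inputs with inconsistent spacing
messy <- tibble(value = c("A=X0->X1", "B  =  Y1"))

# \\s* accepts zero or more whitespace characters around each separator;
# the optional non-capturing group still wraps the "v0 ->" part
out <- extract(
  messy, value,
  into = c("variable", "v0", "v1"),
  regex = "(\\w+)\\s*=\\s*(?:(\\w+)\\s*->\\s*)?(\\w+)"
)

print(out)
```

The stricter pattern with literal single spaces would leave both of these rows as NA, which can be useful when you want malformed input to stand out instead of being parsed silently.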

cderv · September 6, 2018, 5:38am

If your question's been answered, would you mind choosing a solution? It helps other people see which questions still need help, or find solutions if they have similar problems.