separate and regular expressions with consecutive delimeters

I have a variable formatted like "V0497-V0508" and I would like to separate it to get a column which is 497 and one which is 508. I tried using separate to no end. Can anyone help with regular expressions? I'm able to do this using str_sub because it is fixed width but would like to learn more about regular expressions and how to account for consecutive delimiters which I think is the problem here, maybe?

library(tidyverse)                                                                      
#> -- Attaching packages ------------------------------------------------------------------------------- tidyverse 1.2.1 --
#> v ggplot2 3.0.0     v purrr   0.2.5
#> v tibble  1.4.2     v dplyr   0.7.5
#> v tidyr   0.8.1     v stringr 1.3.1
#> v readr   1.1.1     v forcats 0.3.0
#> -- Conflicts ---------------------------------------------------------------------------------- tidyverse_conflicts() --
#> x dplyr::filter() masks stats::filter()
#> x dplyr::lag()    masks stats::lag()
testdat <- tibble(VrangeOrig=c("V0497-V0508", "V0868-V0875", "V1010-V1024"))            
                                                                                        
testdat %>%                                                                             
separate(VrangeOrig, into=c('Voriglo', 'Vorighi'), sep="[V\\-]+", remove=F, convert=T)
#> Warning: Expected 2 pieces. Additional pieces discarded in 3 rows [1, 2,
#> 3].
#> # A tibble: 3 x 3
#>   VrangeOrig  Voriglo Vorighi
#>   <chr>       <lgl>     <int>
#> 1 V0497-V0508 NA          497
#> 2 V0868-V0875 NA          868
#> 3 V1010-V1024 NA         1010
                                                                                        
testdat %>%                                                                             
mutate(                                                                                 
Voriglo=str_sub(VrangeOrig, 2, 5) %>% as.numeric,                                       
Vorighi=str_sub(VrangeOrig, 8, 11) %>% as.numeric                                       
)                                                                                       
#> # A tibble: 3 x 3
#>   VrangeOrig  Voriglo Vorighi
#>   <chr>         <dbl>   <dbl>
#> 1 V0497-V0508     497     508
#> 2 V0868-V0875     868     875
#> 3 V1010-V1024    1010    1024

So, the reason that you're getting that first empty column is because you're saying "V is one of the separators, so keep whatever's before it". I'm sure there's a way around that, but I don't happen to know it (so we'll have to wait for someone else on that).

Otherwise, you can only pass a single string (in the form of a regular expression, in this case — see SO thread here) to sep, so you can use the OR operator, |. So, here I'm saying separate on either "V" or "-V".

library(tidyverse)
testdat <- tibble(VrangeOrig = c("V0497-V0508", "V0868-V0875", "V1010-V1024"))

testdat %>%
  separate(VrangeOrig, into = c("blank", "Voriglo", "Vorighi"), 
           sep = "V|\\-V", remove = F, convert = T) %>%
  select(-blank)
#> # A tibble: 3 x 3
#>   VrangeOrig  Voriglo Vorighi
#>   <chr>         <int>   <int>
#> 1 V0497-V0508     497     508
#> 2 V0868-V0875     868     875
#> 3 V1010-V1024    1010    1024

Created on 2018-12-04 by the reprex package (v0.2.1.9000)

2 Likes

Thanks! I learned 2 things - how to specify two operators and how if a delimiter is at the beginning, a blank is made.

1 Like

That blank could be named anything. All I did was select everything but the variable blank at the end using

select(-blank)

Like I said, I'm guessing there's a way around having that first column out there — just not one I know. :slightly_frowning_face:

Would something like this work too? Using gsub to first remove the V and then separating the column, then we don't have to worry about the blank column?

testdat <- tibble(VrangeOrig = c("V0497-V0508", "V0868-V0875", "V1010-V1024"))

testdat %>%
  mutate(VrangeOrig = gsub("V", "", VrangeOrig)) %>%
  separate(VrangeOrig, sep = "-", into = c("Voriglo", "Vorighi"))
#> # A tibble: 3 x 2
#>   Voriglo Vorighi
#>   <chr>   <chr>  
#> 1 0497    0508   
#> 2 0868    0875   
#> 3 1010    1024
2 Likes

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.