separate and regular expressions with consecutive delimeters


#1

I have a variable formatted like "V0497-V0508" and I would like to separate it to get a column which is 497 and one which is 508. I tried using separate to no end. Can anyone help with regular expressions? I'm able to do this using str_sub because it is fixed width but would like to learn more about regular expressions and how to account for consecutive delimiters which I think is the problem here, maybe?

library(tidyverse)                                                                      
#> -- Attaching packages ------------------------------------------------------------------------------- tidyverse 1.2.1 --
#> v ggplot2 3.0.0     v purrr   0.2.5
#> v tibble  1.4.2     v dplyr   0.7.5
#> v tidyr   0.8.1     v stringr 1.3.1
#> v readr   1.1.1     v forcats 0.3.0
#> -- Conflicts ---------------------------------------------------------------------------------- tidyverse_conflicts() --
#> x dplyr::filter() masks stats::filter()
#> x dplyr::lag()    masks stats::lag()
testdat <- tibble(VrangeOrig=c("V0497-V0508", "V0868-V0875", "V1010-V1024"))            
                                                                                        
testdat %>%                                                                             
separate(VrangeOrig, into=c('Voriglo', 'Vorighi'), sep="[V\\-]+", remove=F, convert=T)
#> Warning: Expected 2 pieces. Additional pieces discarded in 3 rows [1, 2,
#> 3].
#> # A tibble: 3 x 3
#>   VrangeOrig  Voriglo Vorighi
#>   <chr>       <lgl>     <int>
#> 1 V0497-V0508 NA          497
#> 2 V0868-V0875 NA          868
#> 3 V1010-V1024 NA         1010
                                                                                        
testdat %>%                                                                             
mutate(                                                                                 
Voriglo=str_sub(VrangeOrig, 2, 5) %>% as.numeric,                                       
Vorighi=str_sub(VrangeOrig, 8, 11) %>% as.numeric                                       
)                                                                                       
#> # A tibble: 3 x 3
#>   VrangeOrig  Voriglo Vorighi
#>   <chr>         <dbl>   <dbl>
#> 1 V0497-V0508     497     508
#> 2 V0868-V0875     868     875
#> 3 V1010-V1024    1010    1024

#2

So, the reason that you're getting that first empty column is because you're saying "V is one of the separators, so keep whatever's before it". I'm sure there's a way around that, but I don't happen to know it (so we'll have to wait for someone else on that).

Otherwise, you can only pass a single string (in the form of a regular expression, in this case — see SO thread here) to sep, so you can use the OR operator, |. So, here I'm saying separate on either "V" or "-V".

library(tidyverse)
testdat <- tibble(VrangeOrig = c("V0497-V0508", "V0868-V0875", "V1010-V1024"))

testdat %>%
  separate(VrangeOrig, into = c("blank", "Voriglo", "Vorighi"), 
           sep = "V|\\-V", remove = F, convert = T) %>%
  select(-blank)
#> # A tibble: 3 x 3
#>   VrangeOrig  Voriglo Vorighi
#>   <chr>         <int>   <int>
#> 1 V0497-V0508     497     508
#> 2 V0868-V0875     868     875
#> 3 V1010-V1024    1010    1024

Created on 2018-12-04 by the reprex package (v0.2.1.9000)


#3

Thanks! I learned 2 things - how to specify two operators and how if a delimiter is at the beginning, a blank is made.


#4

That blank could be named anything. All I did was select everything but the variable blank at the end using

select(-blank)

Like I said, I'm guessing there's a way around having that first column out there — just not one I know. :slightly_frowning_face:


#5

Would something like this work too? Using gsub to first remove the V and then separating the column, then we don't have to worry about the blank column?

testdat <- tibble(VrangeOrig = c("V0497-V0508", "V0868-V0875", "V1010-V1024"))

testdat %>%
  mutate(VrangeOrig = gsub("V", "", VrangeOrig)) %>%
  separate(VrangeOrig, sep = "-", into = c("Voriglo", "Vorighi"))
#> # A tibble: 3 x 2
#>   Voriglo Vorighi
#>   <chr>   <chr>  
#> 1 0497    0508   
#> 2 0868    0875   
#> 3 1010    1024