tidyr::separate_rows trailing separators

tidyr

#1

Is it possible to not generate the empty rows (3, 5, 8) in the below call to separate_rows? It's easy to clean up after with a filter() but I'd prefer to just not get these rows in the first place. I could imagine solving this problem with a tidyr::extract_rows, which does not yet exist. Any way to solve it with the current tidy API?

library(tidyverse)
tib <- tribble(~pi_ids, "1863574; 11844758 (contact);", "1889711;", "9107364 (contact); 1938112;")
tib %>% separate_rows(pi_ids, sep = "(\\(contact\\);|;)")
#> # A tibble: 8 x 1
#>   pi_ids      
#>   <chr>       
#> 1 1863574     
#> 2 " 11844758 "
#> 3 ""          
#> 4 1889711     
#> 5 ""          
#> 6 "9107364 "  
#> 7 " 1938112"  
#> 8 ""

Created on 2018-10-10 by the reprex package (v0.2.1)


#2

I think this is just a consequence of how regex splitting works: match, then split. As you ask for splitting by ; and you have one at the end of each string, the result is an empty string after the split.

The internal split function behind separate_rows is stringi::stri_split_regex, which has an omit_empty argument that is FALSE by default.

# keeping empty
stringi::stri_split_regex("1863574; 11844758 (contact);", 
                          pattern = "(\\(contact\\);|;)")
#> [[1]]
#> [1] "1863574"    " 11844758 " ""
# dropping empty
stringi::stri_split_regex("1863574; 11844758 (contact);", 
                          pattern = "(\\(contact\\);|;)", omit_empty = TRUE)
#> [[1]]
#> [1] "1863574"    " 11844758 "

Created on 2018-10-11 by the reprex package (v0.2.1)

separate_rows uses the default, omit_empty = FALSE. So to use it without getting empty rows, you could:

  • clean the string before splitting, although you have several possible endings ("(contact);" or ";") to remove
  • clean up afterward with filter, as you proposed
  • skip separate_rows and do the splitting yourself, changing the default
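
The third option could look roughly like the sketch below. It assumes the stringi, dplyr and tidyr packages are installed, reuses the tib from the question, and calls stringi::stri_split_regex directly with omit_empty = TRUE before unnesting the resulting list column. The name res is only illustrative.

```r
library(dplyr)
library(tidyr)

tib <- tribble(~pi_ids, "1863574; 11844758 (contact);", "1889711;", "9107364 (contact); 1938112;")

# Split each string ourselves and drop empty pieces at split time,
# then expand the resulting list column into one row per piece.
res <- tib %>%
  mutate(pi_ids = stringi::stri_split_regex(pi_ids, "(\\(contact\\);|;)",
                                            omit_empty = TRUE)) %>%
  unnest(pi_ids)

res
```

This keeps the surrounding whitespace in each piece, exactly as in the original reprex; only the empty strings are gone.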

Also, if you don't want the space before or after the split results, you could add optional space in regex \\s? or trim afterward using mutate.
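
A minimal sketch of the whitespace handling combined with the filter clean-up, assuming dplyr and tidyr are attached. The separator regex here is a hypothetical rewrite of the one in the question (optional "(contact)" before the semicolon, with whitespace swallowed on both sides), and trimws is only a safety net for whitespace the regex cannot reach.

```r
library(dplyr)
library(tidyr)

tib <- tribble(~pi_ids, "1863574; 11844758 (contact);", "1889711;", "9107364 (contact); 1938112;")

res <- tib %>%
  # swallow whitespace around the separator as part of the match
  separate_rows(pi_ids, sep = "\\s*(\\(contact\\))?;\\s*") %>%
  # trim anything left at the very start or end of a piece
  mutate(pi_ids = trimws(pi_ids)) %>%
  # drop the empty strings produced by trailing separators
  filter(pi_ids != "")

res
```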

Hope it helps


#3

As you ask for splitting by ; and you have one at the end of each string, the result is an empty string after the split.

Yeah I guess I should have included that I also tried (\\(contact\\);$|\\(contact\\);|;$|;) in my hopes of swallowing the end-of-line and not getting that empty character after the split, but that does not seem to work. It's surprising to me that, if the separator includes $, there is still an empty character after the split.
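
A small sketch of why the $ anchor does not help, assuming only that stringi is installed: $ is a zero-width assertion, so the separator still matches at the very end of the string, and the split still reports the (empty) piece that follows the last match. The variable names are illustrative.

```r
x <- "1889711;"

# Separator without an anchor
plain    <- stringi::stri_split_regex(x, ";")[[1]]

# Separator anchored to the end of the string: same match, same pieces
anchored <- stringi::stri_split_regex(x, ";$")[[1]]

# Only omit_empty = TRUE removes the trailing empty piece
dropped  <- stringi::stri_split_regex(x, ";$", omit_empty = TRUE)[[1]]

print(plain)
print(anchored)
print(dropped)
```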

the internal split function behind separate_rows is stringi::stri_split_regex

Thanks for pointing me to the underlying function call; when I used F2 to navigate to the tidyr::separate_rows implementation I just got UseMethod("separate_rows") and decided to give up on code navigation and just ask a question.

Unfortunately none of the three workarounds proposed are satisfactory so I guess I'll file some upstream issues.

you could add optional space in regex \\s?

Indeed I swallow whitespace in my code, but I tried to simplify the reprex to focus on the trailing separator issue.


#4

Filed https://github.com/gagolews/stringi/issues/330 and https://github.com/tidyverse/tidyr/issues/503