Split a string in a dataframe column

Andrew · May 4, 2018, 12:04pm

I have a data frame with a column that contains strings with sections separated by underscores. I'd like to create two new columns, with the section after the last underscore in one column and the rest in another. The only reliable way I've found is to use rowwise() as below:


suppressMessages(library(tidyverse))

df <- tibble(
    x = c(
        "a_b",
        "a_b_c",
        "a_b_c_d",
        "a_b_l",
        "a_b_q_r_s_h"
    )
)

df %>%
    mutate(
        splits = strsplit(x, "_")
    ) %>% 
    rowwise() %>% 
    mutate(
        last_value = splits[length(splits)],
        the_rest = paste(splits[1:length(splits)-1], collapse = "_")
    )
#> Source: local data frame [5 x 4]
#> Groups: <by row>
#> 
#> # A tibble: 5 x 4
#>   x           splits    last_value the_rest 
#>   <chr>       <list>    <chr>      <chr>    
#> 1 a_b         <chr [2]> b          a        
#> 2 a_b_c       <chr [3]> c          a_b      
#> 3 a_b_c_d     <chr [4]> d          a_b_c    
#> 4 a_b_l       <chr [3]> l          a_b      
#> 5 a_b_q_r_s_h <chr [6]> h          a_b_q_r_s

In practice I have to do this on 8 - 10 million rows of data, so using rowwise() becomes incredibly slow. I've not been able to find a reliable vectorized solution, so any ideas would be greatly appreciated!

Andrew · May 4, 2018, 12:46pm

I actually managed to speed things up as I have a lot of duplicates in the column I'm trying to split. Doing

group_by(x) %>% nest()

before

rowwise()

reduces the number of rows massively

I'm not going to mark as solved, as it doesn't answer my more general question as to whether there is a reliable vectorized solution.

Tazinho · May 4, 2018, 1:24pm

Within the tidyverse you can approach this problem via stringr::str_split() or tidyr::separate(). (I chose the latter, because it's the obvious high level solution).
It is also advantageous to know a little bit about regular expressions, specifically look aheads, see here for example. Also the stringr cheatsheet is a good place to get an overview.

I modified your test data a bit since it is not clear to me, if the length of your substrings is always one, if it is possible that no underscore will appear in your data or in case of one underscore one side might be "empty".

The regex used in the following solution matches every last underscore in a string that might be followed by a group of characters which doesn't contain an underscore.

Note that a warning is thrown, because no underscore is matched in the first row and hence there is nothing to separate. Also note that in case of one underscore in connection with an "empty side" around the underscore you get an empty string "" in the regarding new column.

suppressMessages(library(tidyverse))

df <- tibble(
  x = c(
    "a",
    "a_",
    "_a",
    "a_b",
    "a_b_c",
    "a_bk_c_d",
    "ak_b_l",
    "a_b_q_r_s_hk"
  )
)

df %>% separate(x, "_(?=[^_]*$)",
                into = c("the_rest", "last_value"), 
                remove = FALSE)
#> Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [1].
#> # A tibble: 8 x 3
#>   x            the_rest  last_value
#>   <chr>        <chr>     <chr>     
#> 1 a            a         <NA>      
#> 2 a_           a         ""        
#> 3 _a           ""        a         
#> 4 a_b          a         b         
#> 5 a_b_c        a_b       c         
#> 6 a_bk_c_d     a_bk_c    d         
#> 7 ak_b_l       ak_b      l         
#> 8 a_b_q_r_s_hk a_b_q_r_s hk

Created on 2018-05-04 by the reprex package (v0.2.0).

Andrew · May 4, 2018, 2:35pm

Thanks! That's exactly what I'm looking for!