split character vector into tibble using regex (but remove nothing!)

mjfrigaard · April 10, 2019, 7:52pm

Hello rstats folks!

I think I've totally forgotten my string manipulation skills. Does anyone know how to match a pattern in a vector, split on that pattern, and still keep the matched text?

library(tidyverse)
string_vector <- "05:52 The birch canoe slid on the smooth planks.05:53 These days a chicken leg is a rare dish..06:59 Four hours of steady work faced us."
# I can find the regular expression to match the time...
stringr::str_view_all(string_vector,
                          # match number format of 00:00
                          pattern = "\\d\\d:\\d\\d")

I tried stringr::str_split() but it removes the regex...how can I split on the pattern and keep the text?

I'd like the result to look like this:

StringTibble <- tibble::tribble(
    ~time, ~text, 
    "05:52", "The birch canoe slid on the smooth planks",
    "05:53", "These days a chicken leg is a rare dish",
    "06:59", "Four hours of steady work faced us."
)
StringTibble
#> # A tibble: 3 x 2
#>   time  text                                     
#>   <chr> <chr>                                    
#> 1 05:52 The birch canoe slid on the smooth planks
#> 2 05:53 These days a chicken leg is a rare dish  
#> 3 06:59 Four hours of steady work faced us.

Thank you in advance!

Created on 2019-04-10 by the reprex package (v0.2.1)

mara · April 10, 2019, 8:06pm

So, the answers I've come upon so far involve creating a dummy separator…

Or, there's a function in this blog post that splits with a type argument, which lets you keep the delimiter on the left or the right of each piece.
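
For what it's worth, the dummy-separator idea can be sketched in base R roughly like this. The marker string `<<SPLIT>>` is my own placeholder (an assumption, not something taken from the linked answers); anything that cannot occur in the data would do.

```r
string_vector <- "05:52 The birch canoe slid on the smooth planks.05:53 These days a chicken leg is a rare dish..06:59 Four hours of steady work faced us."

# Hypothetical marker; assumed never to appear in the real text.
marker <- "<<SPLIT>>"

# Step 1: put the marker in front of every time stamp.
# "\\1" re-inserts the captured time, so nothing is removed.
marked <- gsub("(\\d\\d:\\d\\d)", paste0(marker, "\\1"), string_vector, perl = TRUE)

# Step 2: split on the marker (fixed = TRUE treats it as a literal string).
pieces <- strsplit(marked, marker, fixed = TRUE)[[1]]

# The string starts with a marker, so the first piece is empty; drop it.
pieces <- pieces[nzchar(pieces)]
pieces
```

Each piece still begins with its time stamp, so separating time from text is a second step.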

mjfrigaard · April 10, 2019, 8:33pm

@mara Thank you so much! I hadn't seen the R-Bloggers post, but just finished reading through the SO post.

I will check it out and see if it works!

andresrcs · April 10, 2019, 9:28pm

A little help with the regex

library(stringr)
string_vector <- "05:52 The birch canoe slid on the smooth planks.05:53 These days a chicken leg is a rare dish..06:59 Four hours of steady work faced us."
str_split(string_vector, "(?<=:\\d\\d)\\s|\\.(?=\\d\\d)")
#> [[1]]
#> [1] "05:52"                                    
#> [2] "The birch canoe slid on the smooth planks"
#> [3] "05:53"                                    
#> [4] "These days a chicken leg is a rare dish." 
#> [5] "06:59"                                    
#> [6] "Four hours of steady work faced us."
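
The pattern splits in two places: on the whitespace right after a time stamp (the lookbehind `(?<=:\\d\\d)`), and on a period that is directly followed by two digits (the lookahead `(?=\\d\\d)`). Because the lookarounds match without consuming anything, the times themselves survive the split. One possible way to turn the alternating vector into the two-column tibble is sketched below; it assumes the pieces strictly alternate time, text, time, text, which holds for this example but would break if a time stamp appeared inside a sentence.

```r
library(stringr)
library(tibble)

string_vector <- "05:52 The birch canoe slid on the smooth planks.05:53 These days a chicken leg is a rare dish..06:59 Four hours of steady work faced us."

# Same pattern as above: split on the space after a time stamp,
# or on a period that is directly followed by two digits.
parts <- str_split(string_vector, "(?<=:\\d\\d)\\s|\\.(?=\\d\\d)")[[1]]

# Assumption: odd positions hold the times, even positions hold the text.
is_time <- seq_along(parts) %% 2 == 1

result <- tibble(
  time = parts[is_time],
  text = parts[!is_time]
)
result
```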

mjfrigaard · April 10, 2019, 9:41pm

Ah yes, the lookbehinds! I am not that good with my regex yet. Thank you!

mjfrigaard · April 10, 2019, 9:43pm

SOLUTION:

I used the function presented in the post that @mara linked (thanks so much!).

The entire pipe is presented below for completeness:

strsplit_keep <- function(x,
                     split,
                     type = "remove",
                     perl = FALSE,
                     ...) {
  if (type == "remove") {
    # use base::strsplit
    out <- base::strsplit(x = x, split = split, perl = perl, ...)
  } else if (type == "before") {
    # split before the delimiter and keep it
    out <- base::strsplit(x = x,
                          split = paste0("(?<=.)(?=", split, ")"),
                          perl = TRUE,
                          ...)
  } else if (type == "after") {
    # split after the delimiter and keep it
    out <- base::strsplit(x = x,
                          split = paste0("(?<=", split, ")"),
                          perl = TRUE,
                          ...)
  } else {
    # wrong type input
    stop("type must be remove, after or before!")
  }
  return(out)
}

Now this works (with some additional wrangling):

strsplit_keep(x = string_vector, 
              split = "\\d\\d:\\d\\d",
              type = "before") %>% 
    as_tibble(.name_repair = "unique") %>% 
    # rename this variable to orginal_record
    dplyr::rename(orginal_record = `...1`) %>% 
    # create a dummy variable but keep the time
    tidyr::separate(col = orginal_record,
                    into = c("time", "dummy"),
                    sep = " ",
                    remove = FALSE) %>% 
    # now remove time from orginal_record
    dplyr::mutate(description = stringr::str_remove_all(string = orginal_record,
                                                        pattern = "(\\d\\d:\\d\\d)")) %>% 
    # drop dummy!
    dplyr::select(-dummy)

#> # A tibble: 3 x 3
#>   orginal_record                      time  description                    
#>   <chr>                               <chr> <chr>                          
#> 1 05:52 The birch canoe slid on the … 05:52 " The birch canoe slid on the …
#> 2 05:53 These days a chicken leg is … 05:53 " These days a chicken leg is …
#> 3 06:59 Four hours of steady work fa… 06:59 " Four hours of steady work fa…

Created on 2019-04-10 by the reprex package (v0.2.1)
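
As a possible shortcut, the time and the text can also be pulled out directly with capture groups instead of splitting and then cleaning up. This is only a sketch under the assumption that every record starts with a time in `00:00` format and runs until the next time stamp or the end of the string; it is a different technique from `strsplit_keep()` above, not a replacement the thread settled on.

```r
library(stringr)
library(tibble)

string_vector <- "05:52 The birch canoe slid on the smooth planks.05:53 These days a chicken leg is a rare dish..06:59 Four hours of steady work faced us."

# Group 1: the time stamp.
# Group 2: everything after the following whitespace, matched lazily,
#          up to the next time stamp or the end of the string.
m <- str_match_all(
  string_vector,
  "(\\d\\d:\\d\\d)\\s*(.*?)(?=\\d\\d:\\d\\d|$)"
)[[1]]

# Column 1 of the match matrix is the full match; columns 2 and 3 are the groups.
result <- tibble(
  time = m[, 2],
  text = m[, 3]
)
result
```

Consuming the whitespace with `\\s*` avoids the leading space that shows up in the `description` column above. Trailing periods are kept as they appear in the input.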
