Split string by multiple delimiters

Hi Team,
I can use the strsplit function to split a string on multiple delimiters (see the sample code below). The problem is that the results do not include the delimiters, but I need to know which delimiter each resulting string follows. So I want to add the delimiter back into each string.

I have uploaded a PDF file: the text in black is the current result, and the text in red is what I want to get.

Could you please help me to figure this out?

try.pdf (46.9 KB)
Thanks,
Kai

try <- data.frame(id=c(1:3),
                  testing_str=c(
                    "keyword1 xxxxxxx keyword2 yyyyyyyy keyword3 zzzzzzzz",
                    "keyword2 keiwae;cse keyword1 z,.xcvweir keyword4 mrgksdfgirejk",
                    "keyword3 rtfg.,ertl keyword2 m,asdfieldf keyword4 klaksdiekasdf keyword1 .,;asdjhkasfd"))

out <- strsplit(try$testing_str,'keyword1|keyword2|keyword3|keyword4') 

try2 <- data.frame(try, do.call(rbind, out))

Below is one approach to achieving your desired output. All keywords are first extracted into one column, and then the string is separated into multiple columns (split by each keyword). Finally, the keywords are pasted back into each appropriate column.

library(tidyverse)

try %>%
  mutate(keywords = str_extract_all(testing_str, 'keyword1|keyword2|keyword3|keyword4')) %>%
  separate(testing_str,
           sep = 'keyword1|keyword2|keyword3|keyword4',
           into = c('X1', 'X2', 'X3', 'X4', 'X5'),
           remove = FALSE) %>%
  rowwise() %>%
  mutate(X2 = ifelse(length(keywords) > 0, paste(keywords[[1]], X2), X2),
         X3 = ifelse(length(keywords) > 1, paste(keywords[[2]], X3), X3),
         X4 = ifelse(length(keywords) > 2, paste(keywords[[3]], X4), X4),
         X5 = ifelse(length(keywords) > 3, paste(keywords[[4]], X5), X5)
         ) %>%
  ungroup() %>%
  select(-keywords)
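
The extract / split / paste idea can also be seen on a single string in base R. This is only a minimal sketch, not the pipeline above: the names `x`, `pattern`, `keys`, `pieces`, and `rejoined` are made up for illustration, and it assumes the string starts with a keyword and does not end with one (otherwise the keywords and pieces would not line up one-to-one).

```r
# One of the thread's test strings and the same keyword pattern
x <- "keyword1 xxxxxxx keyword2 yyyyyyyy keyword3 zzzzzzzz"
pattern <- "keyword1|keyword2|keyword3|keyword4"

# Step 1: extract the delimiters themselves, in order of appearance
keys <- regmatches(x, gregexpr(pattern, x))[[1]]

# Step 2: split on the delimiters; the first piece is "" because
# the string starts with a keyword
pieces <- strsplit(x, pattern)[[1]]

# Step 3: paste each keyword back in front of the text that followed it
rejoined <- trimws(paste0(keys, pieces[-1]))
print(rejoined)
# "keyword1 xxxxxxx" "keyword2 yyyyyyyy" "keyword3 zzzzzzzz"
```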

Hi scottyd22,
Thank you for your help. I tried your sample code, but the columns X1, ..., X5 do not seem to be kept in the result. I cannot find any problem in your sample code, so I am confused.
Kai

Can you please copy and share the code you're executing? I just re-ran the code I shared and ended up with the same result in the image.

Hi scottyd22,
I restarted RStudio and ran the code. Here is the result from the console window.
Sorry to bother you,
Kai

> try <- data.frame(id=c(1:3),
+                   testing_str=c(
+                     "keyword1 xxxxxxx keyword2 yyyyyyyy keyword3 zzzzzzzz",
+                     "keyword2 keiwae;cse keyword1 z,.xcvweir keyword4 mrgksdfgirejk",
+                     "keyword3 rtfg.,ertl keyword2 m,asdfieldf keyword4 klaksdiekasdf keyword1 .,;asdjhkasfd"))

> library(tidyverse)

> try %>%
+   mutate(keywords = str_extract_all(testing_str, 'keyword1|keyword2|keyword3|keyword4'))%>%
+   separate(testing_str,
+            sep = 'keyword1|keyword2|keyword3|keyword4',
+            into = c('X1', 'X2', 'X3', 'X4', 'X5'),
+            remove = F) %>%
+   rowwise() %>%
+   mutate(X2 = ifelse(length(keywords) > 0, paste(keywords[[1]], X2), X2),
+          X3 = ifelse(length(keywords) > 1, paste(keywords[[2]], X3), X3),
+          X4 = ifelse(length(keywords) > 2, paste(keywords[[3]], X4), X4),
+          X5 = ifelse(length(keywords) > 3, paste(keywords[[4]], X5), X5)
+   ) %>%
+   ungroup() %>%
+   select(-keywords)

# A tibble: 3 × 7
     id testing_str                                                                       X1    X2    X3    X4    X5
  <int> <chr>                                                                             <chr> <chr> <chr> <chr> <chr>
1     1 keyword1 xxxxxxx keyword2 yyyyyyyy keyword3 zzzzzzzz                              ""    "key… "key… "key… NA
2     2 keyword2 keiwae;cse keyword1 z,.xcvweir keyword4 mrgksdfgirejk                    ""    "key… "key… "key… NA
3     3 keyword3 rtfg.,ertl keyword2 m,asdfieldf keyword4 klaksdiekasdf keyword1 .,;asdj… ""    "key… "key… "key… keyw…
Warning message:
Expected 5 pieces. Missing pieces filled with NA in 2 rows [1, 2].

Sorry, I just saw that the formatting was wrong. Reposting the full R code here:

try <- data.frame(id=c(1:3),
                  testing_str=c(
                    "keyword1 xxxxxxx keyword2 yyyyyyyy keyword3 zzzzzzzz",
                    "keyword2 keiwae;cse keyword1 z,.xcvweir keyword4 mrgksdfgirejk",
                    "keyword3 rtfg.,ertl keyword2 m,asdfieldf keyword4 klaksdiekasdf keyword1 .,;asdjhkasfd"))

library(tidyverse)

try %>%
  mutate(keywords = str_extract_all(testing_str, 'keyword1|keyword2|keyword3|keyword4'))%>%
  separate(testing_str,
           sep = 'keyword1|keyword2|keyword3|keyword4',
           into = c('X1', 'X2', 'X3', 'X4', 'X5'),
           remove = F) %>%
  rowwise() %>%
  mutate(X2 = ifelse(length(keywords) > 0, paste(keywords[[1]], X2), X2),
         X3 = ifelse(length(keywords) > 1, paste(keywords[[2]], X3), X3),
         X4 = ifelse(length(keywords) > 2, paste(keywords[[3]], X4), X4),
         X5 = ifelse(length(keywords) > 3, paste(keywords[[4]], X5), X5)
  ) %>%
  ungroup() %>%
  select(-keywords)

It looks like the values are there (3 x 7 tibble). Try adding View() to the end.

... %>%
select(-keywords) %>%
View()
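
If opening the viewer is not convenient, the console print can also be widened. The sketch below uses a small made-up tibble (`tb` and its columns are hypothetical, not the thread's data) to show that a truncated print is only a display matter: all columns are still present.

```r
library(tibble)

# Hypothetical wide tibble: one long text column crowds the display
tb <- tibble(
  id = 1:2,
  long_text = strrep("x", 120),
  X1 = c("a", "b"),
  X2 = c("c", "d")
)

print(tb)               # default print: values may be abbreviated with "…"
print(tb, width = Inf)  # print every column at full width
glimpse(tb)             # transposed view, one line per column
```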

Hi scottyd22,

I can see the result, and this is what I wanted. After running this code, I tried to put the result into a data frame by adding "try2 <- as.data.frame(try)" at the end of the code, but try2 still holds the original values.

Could you please tell me how to save the result as a data frame?

Thank you,
Kai

Excellent! You can assign it to an object as a data frame using the following:

try2 <- try %>%
  mutate(keywords = str_extract_all(testing_str, 'keyword1|keyword2|keyword3|keyword4')) %>%
  separate(testing_str,
           sep = 'keyword1|keyword2|keyword3|keyword4',
           into = c('X1', 'X2', 'X3', 'X4', 'X5'),
           remove = FALSE) %>%
  rowwise() %>%
  mutate(X2 = ifelse(length(keywords) > 0, paste(keywords[[1]], X2), X2),
         X3 = ifelse(length(keywords) > 1, paste(keywords[[2]], X3), X3),
         X4 = ifelse(length(keywords) > 2, paste(keywords[[3]], X4), X4),
         X5 = ifelse(length(keywords) > 3, paste(keywords[[4]], X5), X5)
  ) %>%
  ungroup() %>%
  select(-keywords) %>%
  as.data.frame()
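
The reason as.data.frame(try) returned the original values is that a pipeline does not modify its input; it only returns a new object, which is lost unless it is assigned. A minimal sketch with a made-up data frame (`df`, `df2`, and `double_id` are hypothetical names):

```r
library(dplyr)

df <- data.frame(id = 1:3)

# Without assignment, the result is printed and then discarded
df %>% mutate(double_id = id * 2)

# df itself is unchanged
names(df)

# Assigning the pipeline's result keeps the new column
df2 <- df %>%
  mutate(double_id = id * 2) %>%
  as.data.frame()

names(df2)
```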

Hi scottyd22,
Woo! It works very well !!
Thank you very much for your help!
Kai


Instead of splitting by or extracting keywords, you can add an anchor before each of them and separate by that anchor:

try %>% 
  as_tibble() %>%
  mutate(testing_str = str_replace_all(
    testing_str,
    "(?=keyword1|keyword2|keyword3|keyword4)",
    "-_-"
  )) %>%
  separate(testing_str, sep = "-_-", into = paste0("X", 1:5)) %>% 
  mutate(across(starts_with("X"), trimws)) # to remove leading and trailing whitespace
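
To see what the lookahead does, here is a small sketch on a single string (the names `x`, `marked`, and `parts` are made up for illustration). The pattern `(?=...)` matches the empty position just before a keyword, so the replacement inserts the marker without consuming the keyword itself.

```r
library(stringr)

x <- "keyword1 xxxxxxx keyword2 yyyyyyyy"

# Insert the marker in front of every keyword; the keyword is kept
marked <- str_replace_all(x, "(?=keyword1|keyword2)", "-_-")
print(marked)
# "-_-keyword1 xxxxxxx -_-keyword2 yyyyyyyy"

# Splitting on the marker leaves each keyword attached to its text;
# the first piece is "" because the string starts with a marker
parts <- str_split(marked, fixed("-_-"))[[1]]
print(parts)
# "" "keyword1 xxxxxxx " "keyword2 yyyyyyyy"
```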

If you don't know the number of new columns to create in advance, I'd do as follows:

k <- str_replace_all(try$testing_str, '(?=keyword1|keyword2|keyword3|keyword4)', "-_-") 
s <- strsplit(k, '-_-') 

tibble(data = s) %>% 
  unnest_wider(col = data, names_sep = "") %>% 
  # the last 2 lines are optional
  # you can use bind_cols() to add the columns from
  # the original dataset
  mutate(across(starts_with("data"), trimws)) %>% 
  rowid_to_column("id")

Hi arangaca,
This is a better solution because it needs less code. But in the end, I still want to keep the original testing_str for double-checking, and delete it later.
How can I do this in your sample?
Thank you,
Kai

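For the first solution, simply add remove = FALSE in separate(). Note that the kept testing_str will contain the "-_-" markers, because the mutate() step overwrote it.
For the second solution, replace the last line with bind_cols(try, .).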
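
Put together, both variants might look like the sketch below. It is an illustration under a few assumptions rather than the exact code from the replies: the first variant writes the marked string into a hypothetical helper column `marked` so that testing_str stays untouched, `fill = "right"` is added only to silence the "missing pieces" warning, and `marker_pattern`, `res1`, and `res2` are made-up names.

```r
library(tidyverse)

try <- data.frame(
  id = 1:3,
  testing_str = c(
    "keyword1 xxxxxxx keyword2 yyyyyyyy keyword3 zzzzzzzz",
    "keyword2 keiwae;cse keyword1 z,.xcvweir keyword4 mrgksdfgirejk",
    "keyword3 rtfg.,ertl keyword2 m,asdfieldf keyword4 klaksdiekasdf keyword1 .,;asdjhkasfd"
  )
)

marker_pattern <- "(?=keyword1|keyword2|keyword3|keyword4)"

# First solution: mark a copy of the string, then separate the copy
res1 <- try %>%
  as_tibble() %>%
  mutate(marked = str_replace_all(testing_str, marker_pattern, "-_-")) %>%
  separate(marked, sep = "-_-", into = paste0("X", 1:5), fill = "right") %>%
  mutate(across(starts_with("X"), trimws))

# Second solution: number of pieces not known in advance
k <- str_replace_all(try$testing_str, marker_pattern, "-_-")
s <- strsplit(k, "-_-")

res2 <- tibble(data = s) %>%
  unnest_wider(col = data, names_sep = "") %>%
  mutate(across(starts_with("data"), trimws)) %>%
  bind_cols(try, .)   # original id and testing_str come first
```

Once the check is done, the original column can be dropped with select(-testing_str).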

Many thanks, arangaca! It works well.
Best,
Kai
