tidyr::separate() at first whitespace

I'm importing data that I need to separate into two columns. I'm having trouble making the separator match only the first whitespace: using \\s splits on every whitespace and discards the rest of the column. Normally I would use a split operator or drop the g flag on a regex, but I don't know how to do that here.

So, as an example:

library(tidyr)

fruits <- data.frame(
  col = c("apples and oranges and pears and bananas", 
          "pineapples and mangos and guavas")
)

separate(fruits, col, into = c("first", "rest"), sep = "\\s")
       first rest
1     apples  and 
2 pineapples  and
Warning message:
Expected 2 pieces. Additional pieces discarded in 2 rows [1, 2]. 

I would have expected:

       first rest
1     apples  and oranges and pears and bananas
2 pineapples  and mangos and guavas

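I would replace the first space with some other character and then separate on that. str_replace() only replaces the first match, so the remaining spaces are left alone. Just pick a placeholder character that doesn't otherwise occur in the data.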

library(dplyr)
library(stringr)
library(tidyr)
fruits <- data.frame(
  col = c("apples and oranges and pears and bananas", 
          "pineapples and mangos and guavas")
)
fruits2 <- fruits %>% mutate(col = str_replace(col, "\\s", "|")) %>% 
  separate(col, into = c("first", "rest"), sep = "\\|")
fruits2$first
#> [1] "apples"     "pineapples"
fruits2$rest
#> [1] "and oranges and pears and bananas" "and mangos and guavas"

Created on 2019-03-15 by the reprex package (v0.2.1)


I think you're just missing extra = "merge". With it, separate() produces at most length(into) pieces, so all the "leftovers" are merged into the last column you created with into.

separate(fruits, col, into = c("first", "rest"), sep = "\\s",
         extra = "merge")

       first                              rest
1     apples and oranges and pears and bananas
2 pineapples             and mangos and guavas

(I apparently answered a very similar question on Stack Overflow in 2016. I don't remember this, so it's a good thing it came up when I searched! :rofl:)
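To make the effect of extra = "merge" a bit more concrete, here is a small sketch (not from the original posts) that runs the same data with two and then three names in into. The column name "second" and the object names two and three are only for illustration.

```r
library(tidyr)

fruits <- data.frame(
  col = c("apples and oranges and pears and bananas",
          "pineapples and mangos and guavas")
)

# Two names in `into`: one split, at the first whitespace
two <- separate(fruits, col, into = c("first", "rest"), sep = "\\s",
                extra = "merge")

# Three names in `into`: two splits; the remainder still ends up
# in the last column ("second" is a made-up name for this example)
three <- separate(fruits, col, into = c("first", "second", "rest"),
                  sep = "\\s", extra = "merge")

two$rest
three$rest
```

In both cases the number of pieces follows from the number of names in into, so the regex itself doesn't need to know anything about "first match only".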


I think @aosmith's solution is the winner for this case, but here is another solution using str_extract() instead of separate(), just for variety's sake.

library(dplyr)
library(stringr)

fruits <- data.frame(
    col = c("apples and oranges and pears and bananas", 
            "pineapples and mangos and guavas")
)

fruits %>% 
    mutate(first = str_extract(col, "^[^\\s]+"),
           rest = str_extract(col, "\\s.+")) %>% 
    select(-col)
#>        first                               rest
#> 1     apples  and oranges and pears and bananas
#> 2 pineapples              and mangos and guavas
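Note that in the output above rest keeps its leading space, because the "\\s.+" pattern starts matching at the whitespace itself. If that matters, one possible variation (a sketch, not part of the original reply) is tidyr::extract() with two capture groups, which leaves the separating whitespace out of both columns. The object name fruits3 is just for illustration.

```r
library(tidyr)

fruits <- data.frame(
  col = c("apples and oranges and pears and bananas",
          "pineapples and mangos and guavas")
)

# Group 1: the leading run of non-whitespace characters
# Group 2: everything after the whitespace that follows it
fruits3 <- tidyr::extract(fruits, col, into = c("first", "rest"),
                          regex = "^(\\S+)\\s+(.*)$")

fruits3$first
fruits3$rest
```

Calling it as tidyr::extract() avoids a clash with magrittr::extract() if both packages happen to be loaded.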
