separate(): more than one "-"

I would like to use separate() or any other function to generate two new variables: one containing the company name and the other containing the item name (Sales or Profit). Note that a company name may itself contain one or more hyphens (-). Thanks in advance!

name_item
Apple Inc-Sales
Orange-Inc-Sales
Apple-Orange-Inc-Sales
Apple-Orange-Mango-Inc-Sales
Apple Inc-Profit
Orange-Inc-Profit
Apple-Orange-Inc-Profit
Apple-Orange-Mango-Inc-Profit

library(tidyverse)
toy_data <- tibble(
  name_item = c(
    "Apple Inc-Sales", "Orange-Inc-Sales", "Apple-Orange-Inc-Sales", "Apple-Orange-Mango-Inc-Sales",
    "Apple Inc-Profit", "Orange-Inc-Profit", "Apple-Orange-Inc-Profit", "Apple-Orange-Mango-Inc-Profit"
  )
) 

Using a regular expression with a look-ahead condition seems to work.

library(tidyr)

toy_data <- tibble(
  name_item = c(
    "Apple Inc-Sales", "Orange-Inc-Sales", "Apple-Orange-Inc-Sales", "Apple-Orange-Mango-Inc-Sales",
    "Apple Inc-Profit", "Orange-Inc-Profit", "Apple-Orange-Inc-Profit", "Apple-Orange-Mango-Inc-Profit"
  )
)

toy_data <- separate(toy_data, col = "name_item", into = c("Company", "Type"), sep = "-(?=[^-]+$)")
toy_data
#> # A tibble: 8 x 2
#>   Company                Type  
#>   <chr>                  <chr> 
#> 1 Apple Inc              Sales 
#> 2 Orange-Inc             Sales 
#> 3 Apple-Orange-Inc       Sales 
#> 4 Apple-Orange-Mango-Inc Sales 
#> 5 Apple Inc              Profit
#> 6 Orange-Inc             Profit
#> 7 Apple-Orange-Inc       Profit
#> 8 Apple-Orange-Mango-Inc Profit

Created on 2020-08-05 by the reprex package (v0.3.0)
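
As a possible alternative (not part of the reprex above), the same split can be sketched with tidyr's extract(), which takes one regular expression with a capture group per output column. This is only a sketch: the shortened two-row toy_data and the name result are made up for illustration. Because .* is greedy, the first group runs up to the last hyphen.

```r
library(tidyr)

# Shortened, hypothetical version of the toy data from the question
toy_data <- tibble(
  name_item = c("Apple Inc-Sales", "Apple-Orange-Mango-Inc-Profit")
)

# Group 1: everything up to the last hyphen (greedy .*)
# Group 2: the trailing run of non-hyphen characters
result <- extract(
  toy_data,
  col = name_item,
  into = c("Company", "Type"),
  regex = "^(.*)-([^-]+)$"
)
result
```

Here result should have Company = "Apple Inc", "Apple-Orange-Mango-Inc" and Type = "Sales", "Profit".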

Thanks a lot @FJCC! Could you please help me understand this part: sep = "-(?=[^-]+$)"? Beginner here.

The sep argument is the pattern, written as a regular expression, at which the input is split into pieces. In this case, we want to split at the last hyphen. I defined the last hyphen as "a hyphen after which there are no more hyphens". To understand the regular expression I used to represent that, I will work from the inside out.
[^-] means any character that is not a hyphen
[^-]+ means one or more of any character that is not a hyphen
The $ represents the end of the input, so
[^-]+$ means one or more of any character that is not a hyphen followed by the end of the input
The structure (?= ) is a look-ahead assertion: it describes text that must follow the position the regex engine is looking at, but that text does not count as part of what has been matched. The entire regular expression
-(?=[^-]+$) means a hyphen followed by one or more of any character that is not a hyphen followed by the end of the input
That is a somewhat complicated way to say "the last hyphen".
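
To see the pattern pick out only the last hyphen, here is a small base R sketch. The vector x and the names last_hyphen and pieces are made up for illustration. In base R functions, perl = TRUE is needed for the look-ahead syntax to be accepted.

```r
x <- c("Apple Inc-Sales", "Apple-Orange-Inc-Sales")
pattern <- "-(?=[^-]+$)"

# Position of the only hyphen that is followed by no further hyphens
last_hyphen <- as.integer(regexpr(pattern, x, perl = TRUE))
last_hyphen
#> [1] 10 17

# Splitting on the same pattern leaves the earlier hyphens untouched
pieces <- strsplit(x, pattern, perl = TRUE)
pieces[[2]]
#> [1] "Apple-Orange-Inc" "Sales"
```

The look-ahead part is checked but not consumed, so only the hyphen itself is removed when splitting.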

Many thanks for the detailed explanation!
