Split text column into two parts

robinsones · January 21, 2018, 8:27pm

I have a two column dataset from a survey, the first column the question and the second the response. Many of the question names represent the two parts of the question - the main topic of the question (in the reprex, "JobFactor" and "WorkChallenge") and then the specific job factor or work challenge that question is asking (e.g. "Diversity", "Remote", etc.).

I'd like to split those two parts into two different columns (see desired_df). In good news, all the words in the question are noted with a capital; on the other hand, the word length of the two parts varies, so I can't just extract the first two words from each.

library(tibble)
#> Warning: package 'tibble' was built under R version 3.4.1
original_df <- tribble(
  ~variable, ~response,
  "JobFactorDiversity", "Important", 
  "JobFactorRemote", "Not Important", 
  "JobFactorDiversity", "Not Important", 
  "WorkChallengeFrequency", "Frequent", 
  "WorkChallengeDirtyData", "Rarely" 
)

desired_df <- tribble( 
  ~factor, ~aspect, ~response, 
  "JobFactor", "Diversity", "Important", 
  "JobFactor", "Remote", "Not Important", 
  "JobFactor", "Diversity", "Not Important", 
  "WorkChallenge", "Frequency", "Frequent", 
  "WorkChallenge", "DirtyData", "Rarely" 
)

jennybryan · January 21, 2018, 9:18pm

One way is to make it approachable via tidyr::separate(). This depends on you knowing all possible values for your factor variable.

library(tidyverse)

original_df <- tribble(
  ~variable, ~response,
  "JobFactorDiversity", "Important", 
  "JobFactorRemote", "Not Important", 
  "JobFactorDiversity", "Not Important", 
  "WorkChallengeFrequency", "Frequent", 
  "WorkChallengeDirtyData", "Rarely" 
)

prefixes <- c("JobFactor", "WorkChallenge")
prefixes <- paste(prefixes, collapse = "|")
prefixes <- paste0("(", prefixes, ")")

original_df %>%
  mutate(variable = sub(prefixes, "\\1|", variable)) %>% 
  separate(variable, into = c("factor", "aspect"))
#> # A tibble: 5 x 3
#>   factor        aspect    response     
#> * <chr>         <chr>     <chr>        
#> 1 JobFactor     Diversity Important    
#> 2 JobFactor     Remote    Not Important
#> 3 JobFactor     Diversity Not Important
#> 4 WorkChallenge Frequency Frequent     
#> 5 WorkChallenge DirtyData Rarely

Created on 2018-01-21 by the reprex package (v0.1.1.9000).

danr · January 21, 2018, 9:33pm

Here is an alternative approach that uses a bit more of the tidyverse

suppressPackageStartupMessages(library(tidyverse))

original_df <- tribble(
    ~variable, ~response,
    "JobFactorDiversity", "Important", 
    "JobFactorRemote", "Not Important", 
    "JobFactorDiversity", "Not Important", 
    "WorkChallengeFrequency", "Frequent", 
    "WorkChallengeDirtyData", "Rarely" 
)

desired_df <- tribble( 
    ~factor, ~aspect, ~response, 
    "JobFactor", "Diversity", "Important", 
    "JobFactor", "Remote", "Not Important", 
    "JobFactor", "Diversity", "Not Important", 
    # I'm guessing that your desired_df had a type
    # Expectations should be Frequency
    "WorkChallenge", "Frequency", "Frequent", 
    "WorkChallenge", "DirtyData", "Rarely" 
)

aspects <- "JobFactor|WorkChallenge"
responses <- glue::glue("({aspects})(.*)")

t2 <- dplyr::transmute(
    original_df, factor = 
        purrr::map_chr(variable, ~ stringr::str_extract(., aspects)),
    aspect = 
        purrr::map_chr(variable, ~ stringr::str_match(., responses)[[3]]),
 response = response)
identical(desired_df, t2)
#> [1] TRUE

robinsones · January 22, 2018, 3:06am

Thanks Jenny! Dan's is a bit more robust as this way actually fails when the aspect has a "" in it (which is does for one case, where the aspect is "A/B"). But always good to be reminded of tidyr::separate.