Creating categorical variable using multiple columns and conditions (start with)

I'm a little stuck here, and I would really appreciate any help to solve this!

I have a data set with three different columns/variables that I would like to use to create a new categorical variable (preferably using tidyverse or dplyr):

code1 <- c("E1003", "30024", "E41202", "60034")
code2 <- c("X3323", "A1234", "7972", "5555")
code3 <- c("Z2232", "A1234", "E41202", "9999")

df <- data.frame(code1, code2, code3)

For any of code1, code2, or code3, I would like to create a new categorical variable (cat_var) based on the following conditions:
cat_var = "group 1" if code1, code2, or code3 starts with "E100", "A123", or "99"
cat_var = "group 2" if code1, code2, or code3 starts with "79", "E41", or "300"
cat_var = "group 3" if code1, code2, or code3 starts with "Z2", "55", or "X33"

I tried the following script, but it didn't work:

all_columns = c("code1 ", "code2 ", "code3")

df_new <- df %>%
mutate(cat_var=case_when((starts_with(all_columns , c("E100", "A123", "99"))) ~ "group 1",
                 (starts_with(all_columns , c( "79", "E41", "300"))) ~ "group 2",
                 (starts_with(all_columns , c( "Z2", "55", "X33"))) ~ "group 3"))

I get the following error:
"Error in mutate():
! Problem while computing cat_var = case_when(...).
Caused by error in peek_vars():
! starts_with() must be used within a selecting function.
i See https://tidyselect.r-lib.org/reference/faq-selection-context.html.
Run rlang::last_error() to see where the error occurred."

Thanks in advance!

Is this what you were trying to build?

code1 <- c("E1003", "30024", "E41202", "60034")
code2 <- c("X3323", "A1234", "7972", "5555")
code3 <- c("Z2232", "A1234", "E41202", "9999")

df <- data.frame(code1, code2, code3)


# reshape to long format, then label each value by exact match
df %>% pivot_longer(everything()) %>% 
  mutate(cat_var=if_else(
    value %in% c("E1003","A1234","9999"),"group1",if_else(
      value %in% c("7972","E41202","30024"),"group2","group3"
    )
  ))


This is more of a what problem than a how problem.

code1 <- c("E1003", "30024", "E41202", "60034")
code2 <- c("X3323", "A1234", "7972", "5555")
code3 <- c("Z2232", "A1234", "E41202", "9999")

DF <- data.frame(code1, code2, code3)
DF
#>    code1 code2  code3
#> 1  E1003 X3323  Z2232
#> 2  30024 A1234  A1234
#> 3 E41202  7972 E41202
#> 4  60034  5555   9999

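# regex matching rules
group1 <-  "^E100|^A123|^99"
group2 <-  "^79|^E41|^300"
group3 <-  "^Z2|^55|^X33"

# The rules overlap, so a row can match more than one group:
# DF[1,] matches group1 and group3
# DF[2,] matches group1 and group2
# DF[3,] matches group2
# DF[4,] matches group1 and group3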
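
To check which rows hit which rule, something like the following base R sketch could be used. It assumes the third rule was meant to be "^X33", as in the original question; `rules` and `hits` are names made up for illustration.

```r
# Same data as above
DF <- data.frame(
  code1 = c("E1003", "30024", "E41202", "60034"),
  code2 = c("X3323", "A1234", "7972", "5555"),
  code3 = c("Z2232", "A1234", "E41202", "9999")
)

# One regex per group; "^X33" is assumed from the original question
rules <- c(
  group1 = "^E100|^A123|^99",
  group2 = "^79|^E41|^300",
  group3 = "^Z2|^55|^X33"
)

# For each rule, flag the rows where at least one code column matches
hits <- sapply(rules, function(pattern) {
  apply(DF, 1, function(row) any(grepl(pattern, row)))
})

# Logical matrix: one row per row of DF, one column per group
hits
```

Any row with more than one TRUE is ambiguous, so the rules need a priority order before a single cat_var can be assigned.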

Hi RYann,

In your script, you're specifying the exact codes that you'd like to use for grouping. However, I need to extract any observations that start with a particular code. For example, instead of "E1003", I need to extract records that START with "E10". Can your script be modified to extract records that start with particular characters?

Thanks!

Sorry, I thought you were just too lazy to type out the full codes :)

Just use the regex from the other comment. Grep everything that starts with whatever you need ("^E100").
I am not near my laptop, so I can't test and run the command again, but it should work just fine using the regex. You do not need the 'starts_with()' function for this one, as it is normally used on column names and not on values within a column, if I remember correctly.
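
A small sketch of that difference, using a made-up data frame (the `other` column and the names `code_cols`, `literal_hit`, and `regex_hit` are only for illustration):

```r
library(dplyr)

df <- data.frame(
  code1 = c("E1003", "30024"),
  code2 = c("X3323", "A1234"),
  other = c(1, 2)
)

# starts_with() is a selection helper: it picks COLUMNS by their names
code_cols <- df %>% select(starts_with("code"))
names(code_cols)

# To test the VALUES inside a column, use a string function instead
literal_hit <- startsWith(df$code1, "E100")      # base R, literal prefix
regex_hit   <- grepl("^E100|^A123", df$code2)    # base R, regex anchored with ^
```

This is why the original attempt fails: inside mutate() there is no column selection going on, so starts_with() has nothing to select from.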

Thank you! Running into a problem though... I get the following error message when I run the script below:
image

You probably misplaced a bracket. R is all about finding your missing, or extra, brackets/commas. The code above should meet your request and works just fine.

library(tidyverse)
code1 <- c("E1003", "30024", "E41202", "60034")
code2 <- c("X3323", "A1234", "7972", "5555")
code3 <- c("Z2232", "A1234", "E41202", "9999")

df <- data.frame(code1, code2, code3)

# %in% only does exact matching, so the prefixes are tested with str_detect()
df %>% pivot_longer(everything()) %>% 
  mutate(cat_var=if_else(
    str_detect(value, "^E100|^A123|^99"),"group1",if_else(
      str_detect(value, "^79|^E41|^300"),"group2","group3"
    )
  ))

The idea is to create the new variable (cat_var) using two conditions (the third one is not necessary, but of course you can add it as another 'if_else()').
You tell R to search the values in the variable 'value' (created by the 'pivot_longer()' command) using regex and assign them to groups (1, 2, 3).
The regex ^ means "look for it at the beginning of the string". There is a great regex cheatsheet in the stringr package documentation.
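
If one cat_var per original row is needed instead of the long format, a possible sketch is to combine case_when() with if_any() and str_detect(). This assumes a reasonably recent dplyr (one where if_any() is available) and assumes that group 1 should win over group 2, and group 2 over group 3, when a row matches several rules; that priority order is not stated in the question.

```r
library(dplyr)
library(stringr)

df <- data.frame(
  code1 = c("E1003", "30024", "E41202", "60034"),
  code2 = c("X3323", "A1234", "7972", "5555"),
  code3 = c("Z2232", "A1234", "E41202", "9999")
)

df_new <- df %>%
  mutate(cat_var = case_when(
    # case_when() uses the first condition that is TRUE for each row
    if_any(code1:code3, ~ str_detect(.x, "^E100|^A123|^99")) ~ "group 1",
    if_any(code1:code3, ~ str_detect(.x, "^79|^E41|^300"))   ~ "group 2",
    if_any(code1:code3, ~ str_detect(.x, "^Z2|^55|^X33"))    ~ "group 3",
    # rows that match none of the prefixes stay NA
    TRUE ~ NA_character_
  ))

df_new
```

Reordering the conditions changes which group an ambiguous row ends up in, so the order should follow whatever priority makes sense for the data.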
