I would suggest an alternative approach using tidyverse functions:
- Create your example data (note: for your future postings, best would be that you actually provide example data, this way it makes it easier for the community to help you, see this link: FAQ: How to do a minimal reproducible example ( reprex ) for beginners)
data <- tibble(reference = c("US2", "L1_US24", "US2_0", "US24", "US245", "US245", "US24 L", "US3"))
- Create a dictionary of sort,
#I assume from your example that ...
#everything that contains US2 or variations thereof (excluding any digit directly after US2) is category 1
#everything that contains US24 ... is category 2
#everything that contains US245 ... is category 3
#everything that contains US3 ... is category 5
# ... you can add more of course
reference_dictionary <- tibble(short_reference = c("US2", "US24", "US245", "US3"), category = c(1, 2, 3, 5))
- In your dataframe, create column that extracts the "short_reference" (I named it
USX_reference in this case, I could have called it short_reference like in the reference_dictionary object but I wanted to show the structure should you have differing column names), then a left_join between the two dataframes and remove the USX_reference column that is not necessary any longer.
data %>% mutate(USX_reference = str_extract(reference, "US\\d+")) %>%
left_join(reference_dictionary, by = c("USX_reference" = "short_reference")) %>%
I think this is a more sustainable way to operate to add any potential new categories in your reference_dictionary object.
Hope it helps.