Create patterns list from df values for use in case_when

First time caller! Reprex attempted below, let me know if improperly created.

My goal is to mutate a data frame "saint" using a long list of case_when patterns. I can manually create the "cases_site" list of patterns and splice it into case_when inside mutate to create the "saint_mutated" data frame, but I want to populate that list from a much longer data frame in the form of "df_patterns".


library("tidyverse")

#not used yet, want to populate cases_site list from this data
df_patterns <- tibble(
  match_string = c(".*site1.com", ".*site2.com", ".*site3.com"),
  site = c("villas", "brands", "club")
)

saint <- tibble(
  Key = c("site1.com", "a.site1.com", "site2.com", "site2.com/b", "site3.com")
)

#manually built, works fine
cases_site <- list(
  !! str_detect(saint$Key, ".*site1.com") ~ "villas",
  !! str_detect(saint$Key, ".*site2.com") ~ "brands",
  !! str_detect(saint$Key, ".*site3.com") ~ "club"
)

saint_mutated <- saint %>%
  mutate(Site = case_when(!!! cases_site))

Created on 2019-08-25 by the reprex package (v0.3.0)

Thanks for the quick reply! Good solution, I have used patterns with str_detect before, but stringi is new to me.

Works great except for row 4, where the Key "site2.com/b" was replaced with "brands/b", likely because of my regex patterns. Ideally, the regex would match regardless of what comes before or after the pattern, so a Key of "site2.com/anything-long-here" would result in a Site of just "brands". I edited the pattern to ".*site2.com.*" so that it matches the whole string, and it seemed to work. Any feedback on that change?
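A minimal sketch of why row 4 behaves this way, assuming the earlier (deleted) answer called stri_replace_all_regex with the original patterns: only the matched part of the string is replaced, so anything after "site2.com" survives unless the pattern also consumes it. The last call, with an escaped dot, is an optional tightening and not something the thread itself used.

```r
library(stringi)

key <- "site2.com/b"

# The original pattern stops matching after "site2.com",
# so only that part of the Key is replaced.
stri_replace_all_regex(key, ".*site2.com", "brands")
#> [1] "brands/b"

# A trailing ".*" consumes the rest of the string,
# so the whole Key is replaced.
stri_replace_all_regex(key, ".*site2.com.*", "brands")
#> [1] "brands"

# Escaping the dot makes it match a literal "." instead of any character.
stri_replace_all_regex(key, ".*site2\\.com.*", "brands")
#> [1] "brands"
```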

I didn't notice this while posting, and now can't figure out a better solution. I deleted my earlier post because of this issue.

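If modifying the patterns is alright for your use case, then it should be OK. Instead of editing each pattern by hand to add .* before and after it, you can wrap the patterns with paste0 inside the function call as follows (the leading .* is redundant here because match_string already starts with it, but it is harmless):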

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(stringi)

df_patterns <- tibble(match_string = c(".*site1.com", ".*site2.com", ".*site3.com"),
                      site = c("villas", "brands", "club"))

saint <- tibble(Key = c("site1.com", "a.site1.com", "site2.com", "site2.com/b", "site3.com"))

saint %>%
  mutate(Site = stri_replace_all_regex(str = Key,
                                       pattern = paste0(".*", df_patterns$match_string, ".*"),
                                       replacement = df_patterns$site,
                                       vectorize_all = FALSE))
#> # A tibble: 5 x 2
#>   Key         Site  
#>   <chr>       <chr> 
#> 1 site1.com   villas
#> 2 a.site1.com villas
#> 3 site2.com   brands
#> 4 site2.com/b brands
#> 5 site3.com   club

Thanks again Anirban. Modifying the patterns worked out.

One issue I still have is how the two approaches differ for non-matches. With case_when you can specify what the non-match result will be (NA in my preferred case), but stri_replace_all_regex always keeps the current value if none of the patterns match, which is problematic inside a mutate.
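One possible way to get NA for non-matches while keeping the stringi approach is to check first whether any pattern matches and only keep the replacement when one does. This is a sketch, not a tested recommendation from the thread: the "other.org" Key and the names any_pattern and saint_na are made up for illustration.

```r
library(dplyr)
library(stringi)

df_patterns <- tibble(match_string = c(".*site1.com", ".*site2.com", ".*site3.com"),
                      site = c("villas", "brands", "club"))

# "other.org" is a hypothetical non-matching Key added for illustration.
saint <- tibble(Key = c("site1.com", "site2.com/b", "other.org"))

# One alternation that matches when any of the patterns matches.
any_pattern <- paste0("(", df_patterns$match_string, ")", collapse = "|")

saint_na <- saint %>%
  mutate(Site = if_else(
    stri_detect_regex(Key, any_pattern),
    # Same replacement as before, applied pattern by pattern.
    stri_replace_all_regex(Key,
                           pattern = paste0(df_patterns$match_string, ".*"),
                           replacement = df_patterns$site,
                           vectorize_all = FALSE),
    # Keys that match no pattern become NA instead of keeping their value.
    NA_character_
  ))
```

With this data, Site should come out as "villas", "brands" and NA.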

Anyone have suggestions on how to read in the df_patterns data frame into the cases_site list?
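One way this might be done, sketched here under the assumption that purrr and rlang are available (both come with the tidyverse): build one unevaluated `condition ~ value` expression per row of df_patterns and splice the resulting list into case_when with !!!. The "other.org" Key is a hypothetical non-match added to show the NA result; everything else reuses the names from the original post.

```r
library(dplyr)
library(purrr)
library(stringr)

df_patterns <- tibble(
  match_string = c(".*site1.com", ".*site2.com", ".*site3.com"),
  site = c("villas", "brands", "club")
)

# "other.org" is a hypothetical non-matching Key added for illustration.
saint <- tibble(
  Key = c("site1.com", "a.site1.com", "site2.com", "site2.com/b", "site3.com", "other.org")
)

# Build one unevaluated `condition ~ value` expression per row of df_patterns.
# `Key` is left as a bare symbol so it resolves to the column inside mutate().
cases_site <- map2(
  df_patterns$match_string,
  df_patterns$site,
  function(pattern, site) rlang::expr(str_detect(Key, !!pattern) ~ !!site)
)

saint_mutated <- saint %>%
  mutate(Site = case_when(!!!cases_site))
```

Because str_detect only needs the pattern to occur somewhere in the Key, "site2.com/b" maps to "brands" without a trailing .*, and a Key that matches nothing falls through to case_when's default of NA.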
