tidyverse solution for "not any of"

adometrius · December 7, 2020, 8:08pm

I am trying to find a solution that I think requires the use of some "not any of" function. I have a large set of applications that run in several different sites. Some of these sites are closing and I am trying to find a nice solution to determine which sites the applications can move to. Here is a look at some example data:

   App_ID Current_Site  Env
1       1        Alpha  Dev
2       1        Alpha Test
3       1        Bravo Prod
4       1      Charlie Prod
5       2        Alpha  Dev
6       2        Alpha Test
7       2        Delta Prod
8       3        Bravo  Dev
9       3      Charlie Test
10      3        Bravo Prod
11      3        Delta Prod
12      4        Bravo  Dev
13      4      Charlie Test
14      4         Echo Prod

The "Alpha" and "Beta" sites represent sites that are closing while "Charlie", "Delta", and "Echo" will remain open. Geographic Diversity by environment is also something that needs to be preserved (or implemented if not already). For simple geographic reasons, I first would plan to move anything from site Alpha to Charlie, and from site Bravo to Delta. But if this move causes the geographic diversity requirement to break, I would like to attempt to select an alternate site.

library(dplyr)  
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
  App_ID <- c("1","1","1","1","2","2","2","3","3","3","3","4","4","4")
  Current_Site <- c("Alpha","Alpha","Bravo","Charlie","Alpha","Alpha","Delta","Bravo","Charlie","Bravo","Delta","Bravo","Charlie","Echo")
  Env <- c("Dev", "Test", "Prod", "Prod", "Dev", "Test", "Prod", "Dev", "Test", "Prod", "Prod", "Dev", "Test", "Prod")
  df <- data.frame(App_ID, Current_Site, Env)
  df2 <- df %>% 
    mutate(Target_Site = case_when(
      Current_Site == "Alpha" ~ "Charlie",
      Current_Site == "Bravo" ~ "Delta",
      Current_Site == "Charlie" ~ "NONE",
      Current_Site == "Delta" ~ "NONE"
    )
    ) %>% 
    group_by(App_ID, Env) %>% 
    mutate(Backup_Target_Site = ifelse(any(Current_Site %in% Target_Site), "Do Something", "NONE")) %>% 
    ungroup()
Created on 2020-12-07 by the reprex package (v0.3.0)

Here is the output:

> df2
# A tibble: 14 x 5
   App_ID Current_Site Env   Target_Site Backup_Target_Site
   <fct>  <fct>        <fct> <chr>       <chr>             
 1 1      Alpha        Dev   Charlie     NONE              
 2 1      Alpha        Test  Charlie     NONE              
 3 1      Bravo        Prod  Delta       NONE              
 4 1      Charlie      Prod  NONE        NONE              
 5 2      Alpha        Dev   Charlie     NONE              
 6 2      Alpha        Test  Charlie     NONE              
 7 2      Delta        Prod  NONE        NONE              
 8 3      Bravo        Dev   Delta       NONE              
 9 3      Charlie      Test  NONE        NONE              
10 3      Bravo        Prod  Delta       Do Something      
11 3      Delta        Prod  NONE        Do Something      
12 4      Bravo        Dev   Delta       NONE              
13 4      Charlie      Test  NONE        NONE              
14 4      Echo         Prod  NONE        NONE

What I am struggling with is to figure out the right function for "Do Something". As you can see for App 3, the initial choice to move from Bravo to Delta for the prod environment would prevent geographic diversity. I would like to randomly select any site between "Charlie", "Delta", and "Echo" so long as that site does not already exist within 'Current_Site" for the group (App_ID, Env).

I have tried to include various negated forms of "any", "contains", "matches", etc. but they are either unsuccessful or throw an error. Is there a simple tidyverse solution for this?

AlexisW · December 7, 2020, 11:34pm

I would go with making a separate list of possible sites:

Possible_Sites <- c("Charlie","Delta","Echo")

If the geographic diversity criterion is broken, then you can take these Possible_Sites, and remove the Target_Sites that are already used with setdiff(). Since you may have several, you can just use sample() to randomly take one. So this becomes:

df %>% 
  mutate(Target_Site = case_when(
    Current_Site == "Alpha" ~ "Charlie",
    Current_Site == "Bravo" ~ "Delta",
    Current_Site == "Charlie" ~ "NONE",
    Current_Site == "Delta" ~ "NONE"
  )
  ) %>% 
  group_by(App_ID, Env) %>% 
  mutate(Backup_Target_Site = ifelse(any(Current_Site %in% Target_Site),
                                     sample(setdiff(Possible_Sites, Target_Site),1),
                                     "NONE")) %>% 
  ungroup()
#> # A tibble: 14 x 5
#>    App_ID Current_Site Env   Target_Site Backup_Target_Site
#>    <chr>  <chr>        <chr> <chr>       <chr>             
#>  1 1      Alpha        Dev   Charlie     NONE              
#>  2 1      Alpha        Test  Charlie     NONE              
#>  3 1      Bravo        Prod  Delta       NONE              
#>  4 1      Charlie      Prod  NONE        NONE              
#>  5 2      Alpha        Dev   Charlie     NONE              
#>  6 2      Alpha        Test  Charlie     NONE              
#>  7 2      Delta        Prod  NONE        NONE              
#>  8 3      Bravo        Dev   Delta       NONE              
#>  9 3      Charlie      Test  NONE        NONE              
#> 10 3      Bravo        Prod  Delta       Charlie           
#> 11 3      Delta        Prod  NONE        Charlie           
#> 12 4      Bravo        Dev   Delta       NONE              
#> 13 4      Charlie      Test  NONE        NONE              
#> 14 4      Alpha        Prod  Charlie     NONE

^{Created on 2020-12-07 by the reprex package (v0.3.0)}

Where the rows 10 and 11 may have Charlie or Echo randomly at each run.

At this point there is still one thing unsatisfying, which is the case where no value is available, in that case the entire pipeline fails with an error. If this is a case that can occur, you will need to be able to recover from the error, for example putting the code in a separate function:


library(dplyr)  
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

App_ID <- c("1","1","1","1","2","2","2","3","3","3","3","3","4","4","4")
Current_Site <- c("Alpha","Alpha","Bravo","Charlie","Alpha","Alpha","Delta","Bravo","Charlie","Bravo","Delta","Alpha","Bravo","Charlie","Alpha")
Env <- c("Dev", "Test", "Prod", "Prod", "Dev", "Test", "Prod", "Dev", "Test", "Prod", "Prod","Prod", "Dev", "Test", "Prod")
df <- data.frame(App_ID, Current_Site, Env)


get_site <- function(Target_Site){
  tryCatch(sample(setdiff(Possible_Sites, Target_Site),1),
           error = function(e) "PROBLEM")
}

Possible_Sites <- c("Charlie","Delta")
df %>% 
  mutate(Target_Site = case_when(
    Current_Site == "Alpha" ~ "Charlie",
    Current_Site == "Bravo" ~ "Delta",
    Current_Site == "Charlie" ~ "NONE",
    Current_Site == "Delta" ~ "NONE"
  )
  ) %>% 
  group_by(App_ID, Env) %>% 
  mutate(Backup_Target_Site = ifelse(any(Current_Site %in% Target_Site),
                                     get_site(Target_Site), "NONE")) %>% 
  ungroup()
#> # A tibble: 15 x 5
#>    App_ID Current_Site Env   Target_Site Backup_Target_Site
#>    <chr>  <chr>        <chr> <chr>       <chr>             
#>  1 1      Alpha        Dev   Charlie     NONE              
#>  2 1      Alpha        Test  Charlie     NONE              
#>  3 1      Bravo        Prod  Delta       NONE              
#>  4 1      Charlie      Prod  NONE        NONE              
#>  5 2      Alpha        Dev   Charlie     NONE              
#>  6 2      Alpha        Test  Charlie     NONE              
#>  7 2      Delta        Prod  NONE        NONE              
#>  8 3      Bravo        Dev   Delta       NONE              
#>  9 3      Charlie      Test  NONE        NONE              
#> 10 3      Bravo        Prod  Delta       PROBLEM           
#> 11 3      Delta        Prod  NONE        PROBLEM           
#> 12 3      Alpha        Prod  Charlie     PROBLEM           
#> 13 4      Bravo        Dev   Delta       NONE              
#> 14 4      Charlie      Test  NONE        NONE              
#> 15 4      Alpha        Prod  Charlie     NONE

^{Created on 2020-12-07 by the reprex package (v0.3.0)}

(in this example I added a row with Alpha->Charlie to make it unavailable, and removed Echo from the list of possibilities)

adometrius · December 8, 2020, 1:05am

Thanks Alexis. I actually considered a similar approach of creating a list of the target sites. I was just hoping to find a solution that would be a bit more self-contained (if that's the right term). But this is definitely one approach.

AlexisW · December 8, 2020, 1:21am

In your example, how should the mutate() line be aware that "Echo" exists and is a possibility, if you don't have a global list?

I'm not saying it's impossible, just that the information has to come from somewhere; with the approach you have here I don't see how you can use that information without defining it somewhere.

adometrius · December 8, 2020, 1:27am

Well I was thinking one possibility would be to treat Current_Site as a factor and use the levels. But then I need a way to exclude the closing sites within the mutate or deal with them in a subsequent action.

In my dataset, "Current_Site" includes every possible site at least once, so it would be included as a factor level.

AlexisW · December 8, 2020, 2:30am

In that case Target_Site should contain every possible site, except those that closed. So in that case you can easily adapt the previous code:

get_site <- function(Target_Site){
  tryCatch(sample(setdiff(levels(Target_Site), Target_Site),1),
           error = function(e) "PROBLEM")
}

df %>% 
  mutate(Target_Site = as.factor(case_when(
    Current_Site == "Alpha" ~ "Charlie",
    Current_Site == "Bravo" ~ "Delta",
    Current_Site == "Charlie" ~ "NONE",
    Current_Site == "Delta" ~ "NONE"
  ))
  ) %>% 
  group_by(App_ID, Env) %>% 
  mutate(Backup_Target_Site = ifelse(any(Current_Site %in% Target_Site),
                                     get_site(Target_Site), "NONE")) %>% 
  ungroup()

In your example data, Echo is not represented in Current_Site, so it won't appear in Target_site.

adometrius · December 8, 2020, 3:56am

Echo was actually in the output I shared but you're correct that it wasn't in the data that formed the initial dataframe. Just a miss when generating the reprex data.

system · December 29, 2020, 3:56am

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.