A better way to summarize semi joining rows

aampohl · July 23, 2018, 5:52pm

Is there a better way to do this? I want to add a variable that says whether each row in a data frame semi-joins with another data frame. I have a function:

library(dplyr)

# Flag each row of df1 by whether it has a match in df2
can_semi_join <- function(df1, df2, by, result_colname) {
  bind_rows(
    semi_join(df1, df2, by) %>% mutate(!!result_colname := TRUE),
    anti_join(df1, df2, by) %>% mutate(!!result_colname := FALSE)
  )
}

I really don't like this solution, since it's basically doing the same set of matching operations twice. I feel like this must be a function that already exists somewhere and I just can't find it in the documentation.

Leon · July 24, 2018, 1:01pm

It's easier to figure out what's going on if you supply a reproducible example. The following could be one:

library(dplyr)

set.seed(498729)

# Matched rows come first (TRUE), then unmatched rows (FALSE)
can_semi_join <- function(df1, df2, by, result_colname) {
  bind_rows(
    semi_join(df1, df2, by) %>% mutate(!!result_colname := TRUE),
    anti_join(df1, df2, by) %>% mutate(!!result_colname := FALSE)
  )
}

X <- tibble(v = sample(LETTERS, 10))
Y <- tibble(v = sample(LETTERS, 10))
can_semi_join(X, Y, "v", "res")
# A tibble: 10 x 2
   v     res  
   <chr> <lgl>
 1 W     TRUE 
 2 D     TRUE 
 3 B     FALSE
 4 X     FALSE
 5 T     FALSE
 6 Y     FALSE
 7 J     FALSE
 8 C     FALSE
 9 U     FALSE
10 P     FALSE

You can get the same result like so, which also keeps your rows in their original order:

X %>% mutate(res = v %in% Y$v)
# A tibble: 10 x 2
   v     res  
   <chr> <lgl>
 1 B     FALSE
 2 X     FALSE
 3 T     FALSE
 4 W     TRUE 
 5 D     TRUE 
 6 Y     FALSE
 7 J     FALSE
 8 C     FALSE
 9 U     FALSE
10 P     FALSE

This is equivalent to the following, if you want to avoid the df$var notation:

X %>% mutate(res = v %in% (Y %>% pull(v)))

What you seem to be describing is the intersection, which exists as the function intersect() in base R. But that returns the shared elements rather than assigning TRUE or FALSE to each element, which is what you're looking for. Hope that helps.
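
To illustrate the difference, here is a small sketch in base R. The vectors x and y are made-up values for illustration, not the sampled data above:

```r
# Hypothetical example vectors (not the data from the example above)
x <- c("B", "W", "D", "Y")
y <- c("D", "Q", "W")

# intersect() returns the shared elements themselves
common <- intersect(x, y)

# %in% returns one TRUE/FALSE per element of x, in x's original order
flags <- x %in% y

print(common)
print(flags)
```

Here common has only two elements, while flags has the same length as x, which is why %in% fits inside mutate().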

aampohl · July 25, 2018, 1:47pm

Thanks, that's helpful. I'm also going to try it with a multi-column "by" argument. I'm not sure this will work the same in that case, so I'll test it. Either way, I'm happy I wasn't overlooking one of the more obscure functions in dplyr or tidyr.
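
For the multi-column case, one possible approach is a single left_join against the distinct key combinations. This is only a sketch under a few assumptions: flag_semi_join is a hypothetical helper name, the key columns have the same names in both data frames, result_colname does not clash with an existing column, and a dplyr version that provides all_of() is installed. The data frames A and B are made up for illustration.

```r
library(dplyr)

# Hypothetical helper: flags rows of df1 that match df2 on all `by` columns
flag_semi_join <- function(df1, df2, by, result_colname) {
  # Distinct key combinations from df2, each tagged TRUE
  keys <- df2 %>%
    select(all_of(by)) %>%
    distinct() %>%
    mutate(!!result_colname := TRUE)

  # left_join keeps every row of df1 in its original order;
  # rows without a match get NA, which coalesce() turns into FALSE
  df1 %>%
    left_join(keys, by = by) %>%
    mutate(!!result_colname := coalesce(.data[[result_colname]], FALSE))
}

# Made-up data with a two-column key
A <- tibble(g = c("a", "a", "b"), n = c(1, 2, 1))
B <- tibble(g = c("a", "b"), n = c(2, 1))
out <- flag_semi_join(A, B, c("g", "n"), "res")
print(out)
```

Because the keys are de-duplicated before joining, the join cannot multiply rows of df1, and the matching is done once rather than twice.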