Why does combining count & inner_join outperform add_count?

dplyr

#1

I have a function in a package that identifies duplicates. In interactive analysis, I’d use dplyr::n(), but that’s not suitable for programming. So I used dplyr::count(), then dplyr::inner_join() to attach the resulting column to my original df.

In looking to optimize speed of this function, I tried dplyr::add_count(). This omits the join.

However, this new version appears to be significantly slower than the count + inner_join approach.

Any thoughts on why this is the case? And to get at my underlying question, what’s the fastest way to append counts by group to a data.frame?

Here are two toy functions to illustrate this and a microbenchmark comparison:

library(dplyr)
# Creates a new data.frame of counts with count(), then joins it back to original df with inner_join
dupes_using_count_join <- function(dat, ...){
  group_var <- quos(...)
  counts <- dat %>%
    dplyr::count(!!!group_var)
  
  dupes <- suppressMessages(dplyr::inner_join(counts, dat))
  
  dupes %>%
    dplyr::filter(n > 1) %>%
    dplyr::ungroup()
  
}

# Simply calls add_count()
dupes_using_add_count <- function(dat, ...){
  group_var <- quos(...)
  counts <- dat %>%
    dplyr::add_count(!!!group_var)

  dupes <- counts %>%
    dplyr::filter(n > 1) %>%
  #  dplyr::select(!!!group_var, n, dplyr::everything()) %>% # to match order in the other function this is needed,
                                                             # leave out for more apples-to-apples performance comparison
    dplyr::ungroup()
  
  dupes
}

add_count() is slower than creating a new data.frame with count() and joining it:


medium_data <- data.frame(
  a = rep(1:1000, 100),
  b = rep("a", 100000),
  c = runif(100000)
) 

microbenchmark(
  add_count = medium_data %>% dupes_using_add_count(a, c),
  count_join = medium_data %>% dupes_using_count_join(a, c),
  times = 50L
)

Unit: milliseconds
       expr      min       lq     mean   median       uq      max neval
  add_count 257.9555 283.3635 340.3743 297.7306 337.7274 686.3821    50
 count_join 170.3765 181.4930 217.1137 197.2449 221.5970 425.1343    50

The gap persists if only counting the first variable, though it’s smaller:

microbenchmark(
  add_count = medium_data %>% dupes_using_add_count(a),
  count_join = medium_data %>% dupes_using_count_join(a),
  times = 50L
)

Unit: milliseconds
       expr      min       lq     mean   median       uq       max neval
  add_count 19.11190 20.81208 26.84203 21.89196 24.48095 210.42617    50
 count_join 16.05706 18.07260 21.85136 19.34054 24.16837  53.73083    50

#3

Possibly related, an issue @winston submitted re. filtering grouped data being esp. slow: