Why does combining count & inner_join outperform add_count?

samfirke · January 10, 2018, 4:26pm

I have a function in a package that identifies duplicates. In interactive analysis, I'd use dplyr::n(), but that's not suitable for programming. So I used dplyr::count(), then dplyr::inner_join() to attach the resulting column to my original df.

To speed up this function, I tried dplyr::add_count(), which omits the join.

However, this new version appears to be significantly slower than the count + inner_join approach.

Any thoughts on why this is the case? And to get at my underlying question, what's the fastest way to append counts by group to a data.frame?

Here are two toy functions to illustrate this and a microbenchmark comparison:

library(dplyr)
library(microbenchmark) # provides microbenchmark(), used in the comparisons below
# Creates a new data.frame of counts with count(), then joins it back to original df with inner_join
dupes_using_count_join <- function(dat, ...){
  group_var <- quos(...)
  counts <- dat %>%
    dplyr::count(!!!group_var)
  
  dupes <- suppressMessages(dplyr::inner_join(counts, dat))
  
  dupes %>%
    dplyr::filter(n > 1) %>%
    dplyr::ungroup()
  
}

# Simply calls add_count()
dupes_using_add_count <- function(dat, ...){
  group_var <- quos(...)
  counts <- dat %>%
    dplyr::add_count(!!!group_var)

  dupes <- counts %>%
    dplyr::filter(n > 1) %>%
  #  dplyr::select(!!!group_var, n, dplyr::everything()) %>% # to match order in the other function this is needed,
                                                             # leave out for more apples-to-apples performance comparison
    dplyr::ungroup()
  
  dupes
}

add_count() is slower than creating a new data.frame with count() and joining it:


medium_data <- data.frame(
  a = rep(1:1000, 100),
  b = rep("a", 100000),
  c = runif(100000)
) 

microbenchmark(
  add_count = medium_data %>% dupes_using_add_count(a, c),
  count_join = medium_data %>% dupes_using_count_join(a, c),
  times = 50L
)

Unit: milliseconds
       expr      min       lq     mean   median       uq      max neval
  add_count 257.9555 283.3635 340.3743 297.7306 337.7274 686.3821    50
 count_join 170.3765 181.4930 217.1137 197.2449 221.5970 425.1343    50

The gap persists if only counting the first variable, though it's smaller:

microbenchmark(
  add_count = medium_data %>% dupes_using_add_count(a),
  count_join = medium_data %>% dupes_using_count_join(a),
  times = 50L
)

Unit: milliseconds
       expr      min       lq     mean   median       uq       max neval
  add_count 19.11190 20.81208 26.84203 21.89196 24.48095 210.42617    50
 count_join 16.05706 18.07260 21.85136 19.34054 24.16837  53.73083    50
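
For anyone reproducing this, here is a minimal sketch of three ways to append per-group counts, checked against each other on a tiny data frame. The `toy` data and the `via_*` names are hypothetical and not part of the benchmark above. The sketch calls n() with dplyr attached, so it is meant for interactive comparison rather than package code. It makes no claim about which variant is fastest; that would need its own microbenchmark run.

```r
library(dplyr)

# Hypothetical toy data, much smaller than medium_data
toy <- data.frame(
  a = c(1, 1, 2, 3, 3, 3),
  c = c("x", "x", "y", "z", "z", "w"),
  stringsAsFactors = FALSE
)

# 1. add_count(), as in dupes_using_add_count()
via_add_count <- toy %>% add_count(a, c)

# 2. Explicit group_by() + mutate(n = n()), ungrouped afterwards
via_mutate <- toy %>%
  group_by(a, c) %>%
  mutate(n = n()) %>%
  ungroup()

# 3. Base R: ave() applies length() within each (a, c) combination
via_ave <- toy
via_ave$n <- as.integer(ave(seq_len(nrow(toy)), toy$a, toy$c, FUN = length))

# All three should agree row by row
stopifnot(
  identical(via_mutate$n, via_add_count$n),
  identical(via_ave$n, via_add_count$n)
)
```

Each variant keeps the original row order, so the count columns can be compared directly.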

mara · January 10, 2018, 7:49pm

Possibly related: an issue @winston submitted about filtering grouped data being especially slow:

github.com/tidyverse/dplyr

Filtering grouped data is slow

Opened by wch, 09:15PM - 08 Jan 18 UTC; closed 04:57PM - 26 Jan 18 UTC. Label: performance.

EDIT: I've added a much simpler example at the top.

I've found that filtering grouped data is slow when there are many groups, even when the filter condition is orthogonal to the grouping. It's possible I'm hoping for too much intelligence here -- that dplyr can detect when the grouping is relevant for the filtering condition(s).

Example:

```R
dat <- tibble(
  g = rep(1:1e5, length.out = 1e6),
  x = rnorm(1e6)
)

system.time({
  dat %>% filter(x > 0)
})
#    user  system elapsed
#   0.012   0.000   0.012

system.time({
  dat %>% group_by(g) %>% filter(x > 0)
})
#    user  system elapsed
#   2.325   0.016   2.344
```

EDIT: Original example below:

I have a gist here with data: https://gist.github.com/wch/15bce85635d7e035126681f81900fa47

To reproduce, clone the gist, enter the directory, and run this code:

```R
x <- readRDS("x.rds")
# # A tibble: 102,524 x 4
# # Groups:   Package, Version [60,215]
#    Package Version type         n
#    <chr>   <chr>   <chr>    <int>
#  1 A3      0.9.1   Depends      2
#  2 A3      0.9.1   Suggests     2
#  3 A3      0.9.2   Depends      2
#  4 A3      0.9.2   Suggests     2
#  5 A3      1.0.0   Depends      2
#  6 A3      1.0.0   Suggests     2
#  7 abbyyR  0.1     Imports      2
#  8 abbyyR  0.2     Imports      5
#  9 abbyyR  0.2.1   Imports      5
# 10 abbyyR  0.2.1   Suggests     2
# # ... with 102,514 more rows

system.time({
  x %>% filter(Package == "shiny", Version == "1.0.5")
})
#    user  system elapsed
#   7.385   0.107   7.588

system.time({
  x %>% ungroup() %>% filter(Package == "shiny", Version == "1.0.5")
})
#    user  system elapsed
#   0.003   0.001   0.004
```
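
One way to check whether that issue could apply to the benchmark in the first post is to look at whether the data frame that reaches filter() is still grouped. This is a minimal sketch, assuming a small hypothetical data frame `toy` and hypothetical variable names. is_grouped_df() and group_vars() are dplyr functions. The answers may differ between dplyr versions, so no expected output is shown.

```r
library(dplyr)

# Hypothetical toy data standing in for medium_data
toy <- data.frame(
  a = c(1, 1, 2, 3, 3),
  c = c(0.1, 0.1, 0.2, 0.3, 0.4)
)

# Intermediate result of the add_count() approach
after_add_count <- toy %>% add_count(a, c)

# Intermediate result of the count() + inner_join() approach
after_count_join <- suppressMessages(inner_join(count(toy, a, c), toy))

# Is either intermediate result grouped when filter() would run?
is_grouped_df(after_add_count)
is_grouped_df(after_count_join)

# If it is grouped, which variables is it grouped by?
group_vars(after_add_count)
```

If one intermediate result turns out to be grouped and the other is not, calling ungroup() before filter() in the slower function would be a cheap experiment to rerun under microbenchmark.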