I have a function in a package that identifies duplicates. In interactive analysis, I'd use dplyr::n(), but that's not suitable for programming. So I used dplyr::count(), then dplyr::inner_join() to attach the resulting column to my original df.
In looking to optimize speed of this function, I tried dplyr::add_count(). This omits the join.
However, this new version appears to be significantly slower than the count + inner_join approach.
Any thoughts on why this is the case? And to get at my underlying question, what's the fastest way to append counts by group to a data.frame?
Here are two toy functions to illustrate this and a microbenchmark comparison:
library(dplyr)
# Creates a new data.frame of counts with count(), then joins it back to original df with inner_join
dupes_using_count_join <- function(dat, ...){
group_var <- quos(...)
counts <- dat %>%
dplyr::count(!!!group_var)
dupes <- suppressMessages(dplyr::inner_join(counts, dat))
dupes %>%
dplyr::filter(n > 1) %>%
dplyr::ungroup()
}
# Simply calls add_count()
dupes_using_add_count <- function(dat, ...){
group_var <- quos(...)
counts <- dat %>%
dplyr::add_count(!!!group_var)
dupes <- counts %>%
dplyr::filter(n > 1) %>%
# dplyr::select(!!!group_var, n, dplyr::everything()) %>% # to match order in the other function this is needed,
# leave out for more apples-to-apples performance comparison
dplyr::ungroup()
dupes
}
add_count()
is slower than creating a new data.frame with count()
and joining it:
medium_data <- data.frame(
a = rep(1:1000, 100),
b = rep("a", 100000),
c = runif(100000)
)
microbenchmark(
add_count = medium_data %>% dupes_using_add_count(a, c),
count_join = medium_data %>% dupes_using_count_join(a, c),
times = 50L
)
Unit: milliseconds
expr min lq mean median uq max neval
add_count 257.9555 283.3635 340.3743 297.7306 337.7274 686.3821 50
count_join 170.3765 181.4930 217.1137 197.2449 221.5970 425.1343 50
The gap persists if only counting the first variable, though it's smaller:
microbenchmark(
add_count = medium_data %>% dupes_using_add_count(a),
count_join = medium_data %>% dupes_using_count_join(a),
times = 50L
)
Unit: milliseconds
expr min lq mean median uq max neval
add_count 19.11190 20.81208 26.84203 21.89196 24.48095 210.42617 50
count_join 16.05706 18.07260 21.85136 19.34054 24.16837 53.73083 50