I wrote a small function that computes the max of a column `x` for each level of a factor `group`, and then adds a new column `max(x) - x`. I thought that precomputing the max would be faster, but apparently it's not:
``` r
ngroups <- 100
nsamples <- 10000
foo <- data.frame(
  group = factor(rep(seq(1, ngroups), each = nsamples)),
  x = runif(ngroups * nsamples, 0, nsamples)
)
library(microbenchmark)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(magrittr)

add_y <- function(dataset){
  dataset %<>%
    group_by(group) %>%
    mutate(max_x = max(x), y = max_x - x) %>%
    select(-max_x) %>%
    ungroup
}

add_y_old <- function(dataset){
  dataset %<>%
    group_by(group) %>%
    mutate(y = max(x) - x) %>%
    ungroup
}

microbenchmark(add_y(foo), add_y_old(foo), times = 500)
#> Unit: milliseconds
#>            expr      min        lq      mean    median        uq      max neval
#>      add_y(foo) 90.12524 100.13843 123.93570 128.28206 139.78511 271.1208   500
#>  add_y_old(foo) 39.32997  42.48748  53.04996  46.04068  54.07861 194.4961   500
```
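As a sanity check on what the grouped `mutate` computes, the same `y` column can be obtained in base R with `ave()` on a toy data frame (`foo2` is a made-up example here, used only to compare results, not speed):

``` r
# Same y as add_y_old(): per-group max of x, broadcast to full length, minus x
foo2 <- data.frame(group = factor(rep(c("a", "b"), each = 3)),
                   x = c(1, 5, 3, 2, 8, 4))
foo2$y <- ave(foo2$x, foo2$group, FUN = max) - foo2$x
foo2$y
#> [1] 4 0 2 6 0 4
```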
You could argue that I'm not really precomputing the max, since I'm doing both operations in the same `mutate` call. But even using a dedicated `mutate` for the precomputation doesn't change the results:
``` r
# same data and libraries as above

add_y <- function(dataset){
  dataset %<>%
    group_by(group) %>%
    mutate(max_x = max(x)) %>%
    mutate(y = max_x - x) %>%
    select(-max_x) %>%
    ungroup
}

add_y_old <- function(dataset){
  dataset %<>%
    group_by(group) %>%
    mutate(y = max(x) - x) %>%
    ungroup
}

microbenchmark(add_y(foo), add_y_old(foo), times = 500)
#> Unit: milliseconds
#>            expr      min        lq      mean    median       uq      max neval
#>      add_y(foo) 90.04137 114.59789 144.40293 134.06358 154.4794 588.1589   500
#>  add_y_old(foo) 39.27818  43.90406  59.99724  50.58359  66.7829 315.2597   500
```
I'm a bit surprised: why is this happening? I'd like to understand why `add_y_old` is faster than `add_y`, so that I know how to write efficient `dplyr` code. Feel free to suggest any modification to my coding style which you feel may lead to better `dplyr` code.
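One more variant that came to mind, sketched here without benchmark numbers: precompute the per-group maxima with `summarise()` and join them back (`add_y_join` is a hypothetical name, not one of the functions benchmarked above):

``` r
library(dplyr)

# Untested-for-speed sketch: one max per group via summarise(),
# then broadcast back onto the full data frame with a join.
add_y_join <- function(dataset) {
  maxes <- dataset %>%
    group_by(group) %>%
    summarise(max_x = max(x))
  dataset %>%
    left_join(maxes, by = "group") %>%
    mutate(y = max_x - x) %>%
    select(-max_x)
}
```

I have no idea whether the join overhead would eat the savings; I mention it only in case it's relevant to the answer.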