Code with mutate() that precomputes the max is slower than code that doesn't

dplyr

#1

I wrote a small function that computes the max of a column x for each level of a factor group, and then computes a new column max(x) - x. I thought that precomputing the max would be faster, but apparently it's not:

ngroups <- 100
nsamples <- 10000

foo <- data.frame(group = factor(rep(seq(1, ngroups), each = nsamples)), x = runif(ngroups*nsamples, 0, nsamples))

library(microbenchmark)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(magrittr)

add_y <- function(dataset){
  dataset %<>% group_by(group) %>% mutate(max_x = max(x), y = max_x - x) %>%
  select(-max_x) %>% ungroup
}

add_y_old <- function(dataset){
  dataset %<>% group_by(group) %>% mutate(y = max(x) - x) %>% ungroup
}

microbenchmark(add_y(foo), add_y_old(foo), times = 500)
#> Unit: milliseconds
#>            expr      min        lq      mean    median        uq      max neval
#>      add_y(foo) 90.12524 100.13843 123.93570 128.28206 139.78511 271.1208   500
#>  add_y_old(foo) 39.32997  42.48748  53.04996  46.04068  54.07861 194.4961   500

You could argue that I'm not really precomputing the max, since I'm doing both operations in the same mutate. But even using a dedicated mutate doesn't change the results:

ngroups <- 100
nsamples <- 10000

foo <- data.frame(group = factor(rep(seq(1, ngroups), each = nsamples)), x = runif(ngroups*nsamples, 0, nsamples))

library(microbenchmark)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(magrittr)

add_y <- function(dataset){
  dataset %<>% group_by(group) %>% mutate(max_x = max(x)) %>% mutate(y = max_x - x) %>%
  select(-max_x) %>% ungroup
}

add_y_old <- function(dataset){
  dataset %<>% group_by(group) %>% mutate(y = max(x) - x) %>% ungroup
}

microbenchmark(add_y(foo), add_y_old(foo), times = 500)
#> Unit: milliseconds
#>            expr      min        lq      mean    median       uq      max neval
#>      add_y(foo) 90.04137 114.59789 144.40293 134.06358 154.4794 588.1589   500
#>  add_y_old(foo) 39.27818  43.90406  59.99724  50.58359  66.7829 315.2597   500

I'm a bit surprised: why is this happening? I'd like to understand why add_y_old is faster than add_y, so that I know how to write efficient dplyr code. Feel free to suggest any modifications to my coding style that you feel may lead to better dplyr code.


#2

You are not just precomputing the max, you are also storing it in a column. When the precomputed value is not stored in a column, the difference disappears:

add_y_pre = function(d) d %>% group_by(group) %>% 
  mutate(y = {mx = max(x); mx - x})

microbenchmark(add_y(foo), add_y_old(foo), add_y_pre(foo), times = 5)

Unit: milliseconds
           expr       min        lq      mean    median        uq       max neval cld
     add_y(foo) 102.55205 107.84356 127.23669 125.59458 126.01820 174.17505     5   b
 add_y_old(foo)  45.60042  47.25717  57.92837  52.48044  55.34842  88.95540     5  a 
 add_y_pre(foo)  44.33566  51.38463  62.80592  58.12068  65.18519  95.00344     5  a 
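
If you want to see where the extra time goes, one option is to benchmark just the step that creates (and then drops) the max_x column on its own, something like the sketch below (timings will vary on your machine):

add_max_only = function(d) d %>% group_by(group) %>%
  mutate(max_x = max(x)) %>% select(-max_x) %>% ungroup

microbenchmark(add_max_only(foo), times = 5)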

Btw, you can use scale() for this too, e.g.

add_y_scale = function(d) d %>% group_by(group) %>% 
  mutate(y = -scale(x, center = max(x), scale = FALSE))
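
One thing to watch with that approach: scale() returns a one-column matrix, so y ends up as a matrix column. If you want a plain numeric vector, you can wrap the call in as.numeric(), something like:

add_y_scale_vec = function(d) d %>% group_by(group) %>% 
  mutate(y = as.numeric(-scale(x, center = max(x), scale = FALSE)))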