Code with mutate() that precomputes the max is slower than code that doesn't

I wrote a small function that computes the max of a column x for each level of a factor group and then adds a new column max(x) - x. I thought that precomputing the max would be faster, but apparently it's not:

ngroups <- 100
nsamples <- 10000

foo <- data.frame(group = factor(rep(seq(1, ngroups), each = nsamples)), x = runif(ngroups*nsamples, 0, nsamples))

library(microbenchmark)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(magrittr)

add_y <- function(dataset){
  dataset %<>% group_by(group) %>% mutate(max_x = max(x), y = max_x - x) %>%
  select(-max_x) %>% ungroup
}

add_y_old <- function(dataset){
  dataset %<>% group_by(group) %>% mutate(y = max(x) - x) %>% ungroup
}

microbenchmark(add_y(foo), add_y_old(foo), times = 500)
#> Unit: milliseconds
#>            expr      min        lq      mean    median        uq      max neval
#>      add_y(foo) 90.12524 100.13843 123.93570 128.28206 139.78511 271.1208   500
#>  add_y_old(foo) 39.32997  42.48748  53.04996  46.04068  54.07861 194.4961   500

You could argue that I'm not really precomputing the max, since I'm doing both operations in the same mutate. But even using a dedicated mutate doesn't change the results:

ngroups <- 100
nsamples <- 10000

foo <- data.frame(group = factor(rep(seq(1, ngroups), each = nsamples)), x = runif(ngroups*nsamples, 0, nsamples))

library(microbenchmark)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(magrittr)

add_y <- function(dataset){
  dataset %<>% group_by(group) %>% mutate(max_x = max(x)) %>% mutate(y = max_x - x) %>%
  select(-max_x) %>% ungroup
}

add_y_old <- function(dataset){
  dataset %<>% group_by(group) %>% mutate(y = max(x) - x) %>% ungroup
}

microbenchmark(add_y(foo), add_y_old(foo), times = 500)
#> Unit: milliseconds
#>            expr      min        lq      mean    median       uq      max neval
#>      add_y(foo) 90.04137 114.59789 144.40293 134.06358 154.4794 588.1589   500
#>  add_y_old(foo) 39.27818  43.90406  59.99724  50.58359  66.7829 315.2597   500

I'm a bit surprised. Why is this happening? I'd like to understand why add_y_old is faster than add_y, so that I know how to write efficient dplyr code. Feel free to suggest any modifications to my coding style that you feel may lead to better dplyr code.

You are not really precomputing anything: mutate(y = max(x) - x) already evaluates max(x) only once per group. What the first version adds is the cost of materializing a full-length max_x column and then dropping it again with select(-max_x). When you precompute the max without storing it in a column, the difference disappears:

add_y_pre = function(d) d %>% group_by(group) %>% 
  mutate(y = {mx = max(x); mx - x})
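
The braces just let mutate() keep the group max in a local variable instead of a column. As a quick sanity check (a sketch using the foo data frame from above; the y_ref and y_pre names are only for illustration), both formulations give the same result:

foo %>%
  group_by(group) %>%
  mutate(y_ref = max(x) - x,                 # original formulation
         y_pre = {mx = max(x); mx - x}) %>%  # max kept in a local variable, not a column
  ungroup() %>%
  summarise(same = all(y_ref == y_pre))      # should be a 1 x 1 tibble containing TRUE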

microbenchmark(add_y(foo), add_y_old(foo), add_y_pre(foo), times = 5)

Unit: milliseconds
           expr       min        lq      mean    median        uq       max neval cld
     add_y(foo) 102.55205 107.84356 127.23669 125.59458 126.01820 174.17505     5   b
 add_y_old(foo)  45.60042  47.25717  57.92837  52.48044  55.34842  88.95540     5  a 
 add_y_pre(foo)  44.33566  51.38463  62.80592  58.12068  65.18519  95.00344     5  a 

Btw, you can also use scale() for this:

add_y_scale = function(d) d %>% group_by(group) %>% 
  mutate(y = -scale(x, center = max(x), scale = FALSE))
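
One caveat (my addition, not part of the original suggestion): scale() returns a one-column matrix, so y ends up as a matrix column inside the data frame. Wrapping the call in as.numeric() keeps y as a plain numeric vector; the add_y_scale_num name below is just illustrative:

add_y_scale_num = function(d) d %>% group_by(group) %>%
  mutate(y = as.numeric(-scale(x, center = max(x), scale = FALSE))) %>%  # as.numeric() drops the matrix dimensions
  ungroup()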