The .data pronoun, summarise and speed

dplyr

#1

I'm writing a package that makes heavy use of dplyr::summarise on grouped data frames. To make it robust (and to stop R CMD check complaining about undefined global variables) I need to use the .data pronoun. However, I'm finding that using .data inside summarise slows things down considerably on large datasets. So, should I be using something else? The reprex below illustrates the issue:

(Note there are ~ 33,000,000 rows of data, so don't try it if you're short on RAM!)

suppressPackageStartupMessages(library(tidyverse))
library(microbenchmark)

counter <- 0
df <- list()

for (i in 1:10) {
  for (j in 1:12) {
    for (k in 1:50) {
      counter <- counter + 1
      df[[counter]] <- tibble(
        name = "name",
        id   = seq(1,5500),
        day  = i,
        time = j,
        mbr  = k,
        fcst = rnorm(5500)
      ) 
    }
  }
}
df <- bind_rows(df)

no_pronoun <- function(data) {
  data %>% 
  group_by(.data$name, .data$id, .data$day, .data$time) %>% 
  summarise(fcst = mean(fcst))
}

with_pronoun <-  function(data) {
  data %>% 
  group_by(.data$name, .data$id, .data$day, .data$time) %>% 
  summarise(fcst = mean(.data$fcst))
}

microbenchmark(no_pronoun(df), with_pronoun(df), times = 1)
#> Unit: seconds
#>              expr       min        lq      mean    median        uq
#>    no_pronoun(df)  4.292153  4.292153  4.292153  4.292153  4.292153
#>  with_pronoun(df) 39.685421 39.685421 39.685421 39.685421 39.685421
#>        max neval
#>   4.292153     1
#>  39.685421     1

#2

Another way to avoid the R CMD check issue is to make the column names look like bound variables in your functions. In your no_pronoun() function, add a few lines like these near the top:

name <- NULL # make sure does not look like an unbound reference
id <- NULL # make sure does not look like an unbound reference
day <- NULL # make sure does not look like an unbound reference
time <- NULL # make sure does not look like an unbound reference
fcst <- NULL # make sure does not look like an unbound reference

Thus one can write no-pronoun style code that still passes R CMD check.
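Put together, the approach might look like the sketch below. The function name no_pronoun_checked() is made up for illustration, and the sketch assumes the data frame really does contain these columns: inside group_by() and summarise() the columns of the data are looked up before the function's local variables, so the NULL bindings are only there to keep the static check quiet.

```r
library(dplyr)

# Hypothetical variant of no_pronoun() with the NULL bindings added
no_pronoun_checked <- function(data) {
  # Local bindings so R CMD check does not report undefined globals.
  # Columns of `data` mask these NULLs inside the dplyr verbs.
  name <- id <- day <- time <- fcst <- NULL

  data %>%
    group_by(name, id, day, time) %>%
    summarise(fcst = mean(fcst))
}
```

Note that if a column is missing from the data, the NULL is picked up silently instead, so this trick trades away some of the safety that .data provides.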

However, I do understand you wanted to use the pronoun for reasons in addition to the R CMD CHECK issue.


#3

Or for even more speed, give data.table a try (notice that all but one of the column names are passed as strings, and I am sure you can parameterize the last one too).


suppressPackageStartupMessages(library("tidyverse"))
library("microbenchmark")
library("data.table")
#> 
#> Attaching package: 'data.table'
#> The following objects are masked from 'package:dplyr':
#> 
#>     between, first, last
#> The following object is masked from 'package:purrr':
#> 
#>     transpose

counter <- 0
df <- list()

for (i in 1:10) {
  for (j in 1:12) {
    for (k in 1:50) {
      counter <- counter + 1
      df[[counter]] <- tibble(
        name = "name",
        id   = seq(1,5500),
        day  = i,
        time = j,
        mbr  = k,
        fcst = rnorm(5500)
      )
    }
  }
}
df <- bind_rows(df)

no_pronoun <- function(data) {
  data %>%
    group_by(name, id, day, time) %>%
    summarise(fcst = mean(fcst))
}

with_pronoun <-  function(data) {
  data %>%
    group_by(.data$name, .data$id, .data$day, .data$time) %>%
    summarise(fcst = mean(.data$fcst))
}

with_data.table <- function(data) {
  dT <- as.data.table(data)
  dT[ , j = list("fcst" = mean(fcst)), by = c("name", "id", "day", "time")]
}

microbenchmark(no_pronoun(df), with_pronoun(df), with_data.table(df), times = 5)
#> Unit: seconds
#>                 expr       min        lq      mean    median        uq
#>       no_pronoun(df)  6.349059  6.806152  7.063423  6.953012  7.245070
#>     with_pronoun(df) 38.487364 38.640724 39.462571 38.995131 40.088538
#>  with_data.table(df)  3.319591  3.445687  4.167679  4.542396  4.616276
#>        max neval
#>   7.963823     5
#>  41.101098     5
#>   4.914446     5

#4

Interesting results. I keep thinking that I need to look into data.table much more, but then I see benchmarks such as this one and the results are not as impressive as I imagined they would be (considering data.table's notoriously idiosyncratic syntax).

Are there classes of problems where data.table is significantly (and consistently) faster than dplyr? Rolling joins are one that I know of, but are there others?


#5

The difference generally becomes apparent when dealing with much larger sets of data.

The biggest difference in my common usage comes when comparing dplyr::group_by() to the data.table equivalent, e.g. when generating millions of groups. However, this might not apply to your data.

The other major performance gain (RAM usage and speed) is in not copying unnecessarily when mutating the dataframe/table/tibble. This alone can make the difference between something running and crashing. Again, this may not apply to your use cases.
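As a minimal sketch of what "not copying" means here (the toy table and the names dt and same_table are made up for illustration): data.table's := operator adds or updates a column by reference, so a second name pointing at the same table sees the change without any copy being made.

```r
library(data.table)

dt <- data.table(id = 1:3, fcst = c(0.5, 1.5, 2.5))
same_table <- dt  # plain assignment: a second name for the same table, not a copy

# `:=` adds the column in place instead of building a modified copy
dt[, fcst_sq := fcst^2]

# The new column is visible through both names
names(same_table)
```

The flip side is that you need an explicit copy(dt) when you do want an independent table.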


#6

I guess checking for the expected column names and making sure they exist as variables inside the function will certainly avoid the problem of silently making use of global variables if they exist, which is what I and R CMD check want to avoid. It does seem like a bit of a fudge though!
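A rough sketch of that kind of up-front check, in base R only. The helper name check_columns() and its default column list are assumptions made for illustration, not something taken from the package:

```r
# Hypothetical helper: fail early if any expected column is absent
check_columns <- function(data,
                          required = c("name", "id", "day", "time", "fcst")) {
  missing_cols <- setdiff(required, names(data))
  if (length(missing_cols) > 0) {
    stop(
      "Missing required column(s): ",
      paste(missing_cols, collapse = ", "),
      call. = FALSE
    )
  }
  invisible(data)
}
```

Calling check_columns(data) as the first line of each summarising function would turn a silent lookup of a global variable into an explicit error.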

I used data.table in a previous version of this package, but have my reasons (not necessarily good reasons!) for not using it this time around.


#7

data.table has a benchmarking wiki in its GitHub repo that may shed some light on your question.


#8

In my experience data.table is usually faster than dplyr, and in many cases much faster. Even in the first example data.table was 2 to 10 times faster, depending on which variation of dplyr you compare it to. And if it isn't obvious going in which variations of dplyr are slow, you can't be sure you aren't using a slow one in your own work (it would be a pain to try every dplyr syntax for every problem and then settle on the least slow variation). Another source of speed variation in dplyr is grouped filtering.

Here is an example where it is routinely 10x faster over a wide range of problem sizes: http://www.win-vector.com/blog/2018/06/rqdatatable-rquery-powered-by-data-table/ .

And the new rqdatatable package lets one use a piped-Codd style syntax if that is what you are used to.


suppressPackageStartupMessages(library("tidyverse"))
library("microbenchmark")
library("data.table")
#> 
#> Attaching package: 'data.table'
#> The following objects are masked from 'package:dplyr':
#> 
#>     between, first, last
#> The following object is masked from 'package:purrr':
#> 
#>     transpose
library("rqdatatable")
#> Loading required package: rquery

counter <- 0
df <- list()

for (i in 1:10) {
  for (j in 1:12) {
    for (k in 1:50) {
      counter <- counter + 1
      df[[counter]] <- tibble(
        name = "name",
        id   = seq(1,5500),
        day  = i,
        time = j,
        mbr  = k,
        fcst = rnorm(5500)
      )
    }
  }
}
df <- bind_rows(df)

no_pronoun <- function(data) {
  data %>%
    group_by(name, id, day, time) %>%
    summarise(fcst = mean(fcst))
}

with_pronoun <-  function(data) {
  data %>%
    group_by(.data$name, .data$id, .data$day, .data$time) %>%
    summarise(fcst = mean(.data$fcst))
}

with_data.table <- function(data) {
  dT <- as.data.table(data)
  dT[ , j = list("fcst" = mean(fcst)), by = c("name", "id", "day", "time")]
}

with_rqdatatable <- function(data) {
  local_td(data) %.>%
    project_nse(.,
                groupby = c("name", "id", "day", "time"),
                fcst = mean(fcst)) %.>%
    ex_data_table(.)
}

microbenchmark(no_pronoun(df),
               with_pronoun(df),
               with_data.table(df),
               with_rqdatatable(df),
               times = 5)
#> Unit: seconds
#>                  expr       min        lq      mean    median        uq
#>        no_pronoun(df)  6.948234  7.856568  8.616359  8.514090  9.569410
#>      with_pronoun(df) 39.785763 40.689116 41.957031 41.584065 41.850011
#>   with_data.table(df)  3.442202  3.703897  4.000662  3.790576  4.043888
#>  with_rqdatatable(df)  3.045504  3.314732  3.997600  3.982759  4.755187
#>        max neval
#>  10.193493     5
#>  45.876202     5
#>   5.022747     5
#>   4.889819     5

#9

Hi, I agree with @Andrew, utils::globalVariables() may be the best way to go about this.
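For anyone who hasn't used it, a minimal sketch: the call goes at the top level of a file in the package's R/ directory (the file name R/globals.R is only a common convention, assumed here), and it lists the column names that R CMD check should stop reporting as undefined globals.

```r
# R/globals.R (file name is an assumption)
# Declare the column names used unquoted in dplyr verbs so that
# R CMD check does not flag them as undefined global variables.
utils::globalVariables(c("name", "id", "day", "time", "fcst"))
```

This only silences the check for those names across the whole package; it does not protect against a same-named global variable being picked up at run time.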


#10

You could check out the seplyr package. I'm not well-versed in how it works; I just think of it like wrappers around the tidyeval code we all hate writing. Most packages don't need non-standard evaluation; only super-generalized metatools (like dplyr or ggplot2) make good use of it. For the rest, the author should have a good idea what the inputs and outputs will be.

If you expect the input to have a specific structure, consider wrapping those expectations in an S3 subclass of data.frame. Also, it's dangerous to use $-subsetting on objects not created inside the function; it can do partial matching. If data were a data.frame with a column named daylight but no column named day, then data$day would return the daylight column.

# Of course, change the class name to whatever's appropriate
prognosis <- function(name, id, day, time, mbr, fcst) {
  output <- tibble(
    name = name,
    id   = id,
    day  = day,
    time = time,
    mbr  = mbr,
    fcst = fcst
  )
  class(output) <- c("prognosis", class(output))
  output
}

standard_evaluation_form <- function(data) {
  stopifnot(inherits(data, "prognosis"))
  data %>% 
    seplyr::group_by_se(c("name", "id", "day", "time")) %>% 
    summarise_at(.vars = "fcst", .funs = mean)
}
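The partial-matching hazard mentioned above can be seen in a small base R sketch (the toy data frame df_base is made up for illustration). It uses a plain data.frame; tibbles are stricter and do not partial-match with $.

```r
# A data.frame with a `daylight` column but no `day` column
df_base <- data.frame(daylight = c(8, 9, 10))

# `$` does partial matching on data.frames, so this picks up `daylight`
partial <- df_base$day

# `[[` matches exactly by default, so a missing column gives NULL
exact <- df_base[["day"]]

identical(partial, df_base$daylight)
is.null(exact)
```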

I'm a staunch believer in data.table (executing arbitrary expressions inside subsets is amazing), but there are plenty of reasons to use dplyr. No judgement.


#11

seplyr author here. seplyr is indeed just thin wrappers on dplyr using rlang. In fact printing the seplyr methods is a good way to remember how to use some rlang conventions (at least how they were described at the time I wrote seplyr, rlang has changed notation recommendations a few times).

The re-writing solution I am promoting is wrapr::let(); it pre-dates rlang and is really neat.
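A small sketch of the idea, assuming dplyr and wrapr are installed. The function mean_by() and the placeholder names GROUP and VALUE are made up for illustration; let() substitutes the placeholders with the column names supplied as strings before the expression is evaluated.

```r
library(dplyr)
library(wrapr)

# Hypothetical helper: group by one column and average another,
# with both column names passed in as strings
mean_by <- function(data, group_col, value_col) {
  let(
    c(GROUP = group_col, VALUE = value_col),
    data %>%
      group_by(GROUP) %>%
      summarise(fcst_mean = mean(VALUE))
  )
}
```

For the data in this thread, a call might look like mean_by(df, "id", "fcst"); the code that finally runs refers to the columns by their real names, so no pronoun is involved.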