Hi all,
I've noticed that dplyr::rowwise() is back on the table ("rowwise() is no longer questioning", from https://github.com/tidyverse/dplyr/blob/master/NEWS.md).
I am happy about that since the syntax is sleek, but I wonder whether there is any reason to believe that rowwise()-based workflows could become much faster in the near future?
For now, the small benchmark below, based on the not-fully-vectorized function dplyr::between(), shows that purrr::pmap() remains much more efficient than rowwise() (in both CPU time and memory) when datasets get relatively large:
library(tidyverse)
set.seed(1)
iris_big <- as_tibble(iris[sample(1:nrow(iris), 5e+5, replace = TRUE), ])
iris_big$Sepal.Width <- iris_big$Sepal.Width + 2 # for test below not to be just TRUE
test_big <- bench::mark(
  vectorised_between = {
    iris_big %>%
      mutate(test = Sepal.Width >= Petal.Length & Sepal.Width <= Sepal.Length)
  },
  pmap_between = {
    iris_big %>%
      mutate(test = pmap_lgl(list(Sepal.Width, Petal.Length, Sepal.Length), between))
  },
  rowwise_between = {
    iris_big %>%
      rowwise() %>%
      mutate(test = between(Sepal.Width, Petal.Length, Sepal.Length)) %>%
      ungroup()
  },
  iterations = 10
)
test_big
#> # A tibble: 3 x 6
#>   expression              min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>         <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 vectorised_between  22.69ms  36.83ms   24.9       24.3MB    17.4
#> 2 pmap_between          1.17s    1.26s    0.794     19.1MB     4.69
#> 3 rowwise_between       9.83s    10.3s    0.0972   101.9MB     3.73
plot(test_big)
My immediate interest is that I will soon be trying to convert SPSS users to R, and these people deal only with large datasets. I wonder whether I could spare them the "purrring"...
PS: I used between() here only as an example, and I do know that many tasks can be vectorized.