Hi all,
I've noticed that dplyr::rowwise() is back on the table ("rowwise() is no longer questioning", from dplyr/NEWS.md at main · tidyverse/dplyr · GitHub). I am happy about that, since the syntax is sleek, but I wonder whether there is any reason to believe that rowwise()-based workflows could become much faster in the near future.
For now, the following small benchmark, based on the non-(fully)-vectorized function dplyr::between(), shows that purrr::pmap() remains much more efficient (in terms of both CPU and memory) when datasets get relatively large:
```
library(tidyverse)
set.seed(1)
iris_big <- as_tibble(iris[sample(1:nrow(iris), 5e+5, replace = TRUE), ])
iris_big$Sepal.Width <- iris_big$Sepal.Width + 2 # for test below not to be just TRUE

test_big <- bench::mark(
  vectorised_between = {
    iris_big %>%
      mutate(test = Sepal.Width >= Petal.Length & Sepal.Width <= Sepal.Length)
  },
  pmap_between = {
    iris_big %>%
      mutate(test = pmap_lgl(list(Sepal.Width, Petal.Length, Sepal.Length), between))
  },
  rowwise_between = {
    iris_big %>%
      rowwise() %>%
      mutate(test = between(Sepal.Width, Petal.Length, Sepal.Length)) %>%
      ungroup()
  },
  iterations = 10
)

test_big
#> # A tibble: 3 x 6
#>   expression              min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>         <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 vectorised_between  22.69ms  36.83ms   24.9       24.3MB    17.4
#> 2 pmap_between          1.17s    1.26s    0.794     19.1MB     4.69
#> 3 rowwise_between       9.83s    10.3s    0.0972   101.9MB     3.73

plot(test_big)
```
My immediate interest: I will soon be attempting to convert SPSS users to R, and these people deal only with large datasets. I wonder whether I could spare them the "purrring"...
PS: I used between() here only as an example; I do know that many tasks can be vectorized.
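To make concrete what "fully vectorised" buys you here, a minimal base-R sketch with made-up vectors (not taken from the benchmark above) of the same per-row test written two ways:

```r
# Minimal base-R sketch (made-up data): is x[i] between left[i] and right[i]?
x     <- c(2, 5, 9)
left  <- c(1, 6, 8)
right <- c(3, 7, 10)

# Fully vectorised: one pass over the vectors, no per-row dispatch.
test_vec <- x >= left & x <= right

# Per-row equivalent, analogous to pmap_lgl(list(x, left, right), between):
test_map <- mapply(function(x, l, r) x >= l && x <= r, x, left, right)

identical(test_vec, test_map) # TRUE: same answer, very different cost at scale
```

Both give the same logical vector; the per-row version just pays a function-call overhead for every row, which is exactly what the benchmark above measures.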
Perhaps the answer to my question is yes. See this dplyr issue (opened 5 Jun 2019, closed 11 Jul 2022; labels: performance, grouping):

I've been studying and writing about grouped operations over the past couple of months, and one thing that has become apparent to me is that ordering the data can, in at least some circumstances, provide a massive performance improvement.
This is primarily due to `data.table` sharing its radix sort with base R in 3.3.0, which makes sorting much faster than it used to be.
I think `dplyr` could take advantage of this pretty easily as evidenced by:
```
# R3.6.0 w/ dplyr 0.8.1
set.seed(42)
library(dplyr)
library(tibble)
x <- runif(1e7)
grp <- sample(1e6, 1e7, replace=TRUE)
tib <- tibble(grp, x)
system.time(a <- tib %>% group_by(grp) %>% summarise(sum(x)))
##    user  system elapsed
##  10.369   0.311  10.793
system.time({
o <- order(grp)
tibo <- tibble(grp=grp[o], x=x[o])
b <- tibo %>% group_by(grp) %>% summarise(sum(x))
})
##    user  system elapsed
##   3.120   0.328   3.474
all.equal(a, b)
## [1] TRUE
```
The slow step is `group_by`.
I'm guessing much of the dplyr code was written prior to the `data.table` radix
sort becoming part of `order`, so I imagine that may have guided some of the
original design decisions. It seems now that it is an easy win to add a
pre-order to `dplyr`. There is a penalty for cases where the data is already
ordered, but it is small, and as written above there is some additional memory
usage.
I have not tested this thoroughly, so your mileage may vary across input types.
However the benefit is substantial enough in some cases that it may be worth
spending some time exploring the broader applicability of the change.
I am not familiar with C++, so I have not dug into the sources to narrow down the problem, but I'm assuming it's a manifestation of microarchitectural factors similar to those that affect `split`, discussed in [this blog post][1].
This does not appear related to #4334 AFAICT from looking at memory usage via `gc()`:
```
> library(tibble)
> x <- runif(1e7)
> grp <- sample(1e6, 1e7, replace=TRUE)
> tib <- tibble(grp, x)
> gc()
           used  (Mb) gc trigger  (Mb) limit (Mb) max used  (Mb)
Ncells   491743  26.3     939424  50.2         NA   939424  50.2
Vcells 15868290 121.1   25428491 194.1      16384 15915118 121.5
> system.time(a <- tib %>% group_by(grp) %>% summarise(sum(x)))
   user  system elapsed
 11.301   0.404  11.866
> gc()
           used  (Mb) gc trigger  (Mb) limit (Mb) max used  (Mb)
Ncells   496226  26.6    2478724 132.4         NA  1697148  90.7
Vcells 17377752 132.6   36793026 280.8      16384 25214184 192.4
> system.time({
+   o <- order(grp)
+   tibo <- tibble(grp=grp[o], x=x[o])
+   b <- tibo %>% group_by(grp) %>% summarise(sum(x))
+ })
   user  system elapsed
  3.472   0.270   3.777
> gc()
           used  (Mb) gc trigger  (Mb) limit (Mb) max used  (Mb)
Ncells   496253  26.6    1554928  83.1         NA  1943660 103.9
Vcells 38877692 296.7   67779293 517.2      16384 46702085 356.4
> object.size(tib)
120000984 bytes
```
[1]: https://www.brodieg.com/2019/05/17/pixie-dust/
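The pre-ordering effect in the quoted issue can be sketched without dplyr at all. A minimal base-R version (much smaller sizes than the issue's example, so it runs quickly; timings will of course differ) showing that aggregating after an explicit radix `order()` gives the same result:

```r
# Base-R sketch of the issue's pre-ordering trick (smaller, made-up sizes).
set.seed(42)
x   <- runif(1e5)
grp <- sample(1e4, 1e5, replace = TRUE)

# Grouped sum on the unordered data:
a <- tapply(x, grp, sum)

# Pre-order by group first; order() uses data.table's radix sort
# (contributed to base R in 3.3.0), so this step is cheap.
o <- order(grp, method = "radix")
b <- tapply(x[o], grp[o], sum)

all.equal(a, b) # TRUE: ordering changes the cost, not the result
```

Because the radix sort is stable, elements within each group keep their original relative order, so the grouped sums come out identical either way.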
Nothing to add here except good luck and have fun converting the SPSS users!

Thanks for opening this topic, it's an interesting one. I too enjoy the rowwise()/ungroup() workflow.
Closed by system on March 1, 2020. This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.