Thanks @andresrcs,
In general, on other datasets, I am a big fan of pivoting things to get tidy data but here the situation is a little more tricky.
First, the datasets I am dealing with can be quite large (could be a few GB) and they are roughly squared (as wide as long).
Second (although not as crucial), I would prefer not to change the data structure since other people are working on them using different tools. So calls to pivot_longer would probably involve pivot_wider counter calls.
For these reasons (but mostly the first one), pivoting does not seem to be a good option and I would rather stick to an iterative row-wise approach if possible. I am just looking for a solution that is visually pleasing and that does not look too low level.
FYI, I have compared the 3 approaches below.
Admittedly, the comparison may be a bit biased since I was not able to figure out a way to unpivot things properly without triggering warnings, so perhaps it could be done more efficiently. In any case, the cost triggered by pivoting is huge both in terms of CPU and memory footprint (and the code does not become more simple IMO):
options(warn = -1)
fn_pivot <- function(x){
x %>%
pivot_longer(
everything(),
names_to = c(".value", "index"),
names_pattern = "(.)(.)"
) %>%
mutate(ab = a + b) %>%
pivot_wider(
names_from = index,
values_from = - index,
names_sep = "") %>%
unnest(everything())
}
fn_rowwise <- function(x) {
for(i in 1:2) {
x %<>%
mutate(!!sym(glue("ab{i}")) := !!sym(glue("a{i}")) + !!sym(glue("b{i}")))
}
x
}
fn_pmap <- function(x) {
for(i in 1:2) {
x %<>%
mutate(!!sym(glue("ab{i}")) := pmap_int(foo %>% select(num_range("a", i), num_range("b", i)), ~ sum(c(...))))
}
x
}
library(magrittr)
library(tidyverse)
library(glue)
#>
#> Attaching package: 'glue'
#> The following object is masked from 'package:dplyr':
#>
#> collapse
foo <- tibble(a1 = 1:2, b1 = 3:4, a2 = 5:6, b2 = 7:8)
fn_rowwise(foo)
#> # A tibble: 2 x 6
#> a1 b1 a2 b2 ab1 ab2
#> <int> <int> <int> <int> <int> <int>
#> 1 1 3 5 7 4 12
#> 2 2 4 6 8 6 14
fn_pivot(foo)
#> # A tibble: 2 x 6
#> a1 a2 b1 b2 ab1 ab2
#> <int> <int> <int> <int> <int> <int>
#> 1 1 5 3 7 4 12
#> 2 2 6 4 8 6 14
fn_pmap(foo)
#> # A tibble: 2 x 6
#> a1 b1 a2 b2 ab1 ab2
#> <int> <int> <int> <int> <int> <int>
#> 1 1 3 5 7 4 12
#> 2 2 4 6 8 6 14
bench::mark(fn_rowwise(foo), fn_pivot(foo), fn_pmap(foo))
#> # A tibble: 3 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 fn_rowwise(foo) 1.08ms 1.28ms 554. 2.73KB 17.7
#> 2 fn_pivot(foo) 18.02ms 20.14ms 37.8 100.27KB 27.5
#> 3 fn_pmap(foo) 3.26ms 3.86ms 200. 11KB 18.8