Iterative work on many columns: alternatives to !!sym(glue( := !!sym(glue( ?

courtiol · October 10, 2019, 6:54am

Hi all,

I have to repeat many times some operations that iterates over many columns.
The data frame I need to work on contains thousands of columns .

Here is a simplified example:

library(magrittr)
library(tidyverse)
library(glue)

foo <- tibble(a1 = 1:2, b1 = 3:4, a2 = 5:6, b2 = 7:8)

for(i in 1:2) {
  foo %<>%
    mutate(!!sym(glue("ab{i}")) := !!sym(glue("a{i}")) + !!sym(glue("b{i}")))
}
foo
#> # A tibble: 2 x 6
#>      a1    b1    a2    b2   ab1   ab2
#>   <int> <int> <int> <int> <int> <int>
#> 1     1     3     5     7     4    12
#> 2     2     4     6     8     6    14

foo <- tibble(a1 = 1:2, b1 = 3:4, a2 = 5:6, b2 = 7:8)

This works very well and it is pretty efficient, but I wonder if I miss a high level function from {tidyselect} or elsewhere doing that exact job?

I use the code to communicate with others and I find the syntax !!sym(glue( := !!sym(glue( a little too intimidating for the folks I communicate with my folks...

Any suggestions?

I tried alternatives such as the following (which I failed to convert into a mutate_at) but I could not find anything convincing...

for(i in 1:2) {
  foo %<>%
    mutate(!!sym(glue("ab{i}")) := pmap_dbl(foo %>% select(num_range("a", i), num_range("b", i)), ~ sum(c(...))))
}
foo
#> # A tibble: 2 x 6
#>      a1    b1    a2    b2   ab1   ab2
#>   <int> <int> <int> <int> <dbl> <dbl>
#> 1     1     3     5     7     4    12
#> 2     2     4     6     8     6    14

Many thanks,

Alex

andresrcs · October 11, 2019, 3:21am

I don't have a solution for you but a question instead, I know this could be computationally expensive but It wouldn't be better at long term to extract the information encoded into the variable names? I have the suspicion you have oversimplified your task and I'm not seeing the full picture.

library(tidyverse)

foo <- tibble(a1 = 1:2, b1 = 3:4, a2 = 5:6, b2 = 7:8)

foo %>% 
    pivot_longer(
        everything(),
        names_to = c(".value", "index"),
        names_pattern = "(.)(.)"
    ) %>% 
    mutate(ab = a + b)
#> # A tibble: 4 x 4
#>   index     a     b    ab
#>   <chr> <int> <int> <int>
#> 1 1         1     3     4
#> 2 2         5     7    12
#> 3 1         2     4     6
#> 4 2         6     8    14

courtiol · October 11, 2019, 7:27am

Thanks @andresrcs,

In general, on other datasets, I am a big fan of pivoting things to get tidy data but here the situation is a little more tricky.

First, the datasets I am dealing with can be quite large (could be a few GB) and they are roughly squared (as wide as long).

Second (although not as crucial), I would prefer not to change the data structure since other people are working on them using different tools. So calls to pivot_longer would probably involve pivot_wider counter calls.

For these reasons (but mostly the first one), pivoting does not seem to be a good option and I would rather stick to an iterative row-wise approach if possible. I am just looking for a solution that is visually pleasing and that does not look too low level.

FYI, I have compared the 3 approaches below.

Admittedly, the comparison may be a bit biased since I was not able to figure out a way to unpivot things properly without triggering warnings, so perhaps it could be done more efficiently. In any case, the cost triggered by pivoting is huge both in terms of CPU and memory footprint (and the code does not become more simple IMO):


options(warn = -1)
  
fn_pivot <- function(x){
  x %>% 
    pivot_longer(
      everything(),
      names_to = c(".value", "index"),
      names_pattern = "(.)(.)"
    ) %>% 
    mutate(ab = a + b) %>%
    pivot_wider(
      names_from = index,
      values_from = - index,
      names_sep = "") %>%
    unnest(everything())
}

fn_rowwise <- function(x) {
  for(i in 1:2) {
    x %<>%
      mutate(!!sym(glue("ab{i}")) := !!sym(glue("a{i}")) + !!sym(glue("b{i}")))
  }
x
}

fn_pmap <- function(x) {
  for(i in 1:2) {
    x %<>%
      mutate(!!sym(glue("ab{i}")) := pmap_int(foo %>% select(num_range("a", i), num_range("b", i)), ~ sum(c(...))))
    }
  x
}
  
library(magrittr)
library(tidyverse)
library(glue)
#> 
#> Attaching package: 'glue'
#> The following object is masked from 'package:dplyr':
#> 
#>     collapse

foo <- tibble(a1 = 1:2, b1 = 3:4, a2 = 5:6, b2 = 7:8)

fn_rowwise(foo)
#> # A tibble: 2 x 6
#>      a1    b1    a2    b2   ab1   ab2
#>   <int> <int> <int> <int> <int> <int>
#> 1     1     3     5     7     4    12
#> 2     2     4     6     8     6    14
fn_pivot(foo)
#> # A tibble: 2 x 6
#>      a1    a2    b1    b2   ab1   ab2
#>   <int> <int> <int> <int> <int> <int>
#> 1     1     5     3     7     4    12
#> 2     2     6     4     8     6    14
fn_pmap(foo)
#> # A tibble: 2 x 6
#>      a1    b1    a2    b2   ab1   ab2
#>   <int> <int> <int> <int> <int> <int>
#> 1     1     3     5     7     4    12
#> 2     2     4     6     8     6    14

bench::mark(fn_rowwise(foo), fn_pivot(foo), fn_pmap(foo))
#> # A tibble: 3 x 6
#>   expression           min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 fn_rowwise(foo)   1.08ms   1.28ms     554.     2.73KB     17.7
#> 2 fn_pivot(foo)    18.02ms  20.14ms      37.8  100.27KB     27.5
#> 3 fn_pmap(foo)      3.26ms   3.86ms     200.       11KB     18.8

system · November 1, 2019, 7:41am

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.