How to properly and dynamically use variables in dplyr pipelines?

pomchip · January 6, 2023, 2:56am

Hi,

The following reprex simplifies an issue I am having while working on a fairly complex function. The function has an argument that accepts a dynamic list (in the general sense, not the R sense) of one or more unquoted variables. After a necessary validation and transformation step, this "list" of variables becomes a character vector, which must then be used for various summarizations of the data. If the vector contains a single variable, using !!sym() works, but it fails when the vector contains more than one.
How would you suggest handling this situation properly?

Thanks

require(tidyverse)
set.seed(12345)
df <- data.frame(
  a = rnorm(100),
  x = sprintf('x%s', sample(1:3, 100, replace = TRUE)),
  y = sprintf('y%s', sample(1:3, 100, replace = TRUE)),
  z = sprintf('z%s', sample(1:3, 100, replace = TRUE))
)

f <- function(data, by){
  
  # Assume that by must be validated/filtered and is ultimately transformed into
  # a character
  by <- data %>% dplyr::select( {{ by }} ) %>% names()
  
  data %>% 
    dplyr::group_by( !!sym(by) ) %>% 
    dplyr::summarize(
      mean = mean(a)
    )
  
}

df %>% f(by = z)
df %>% f(by = c(y, z))
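
The failure in the second call can be reproduced outside the pipeline. The sketch below assumes only that rlang is installed; the names by_one, by_many, one_symbol, many_result and many_symbols are made up for illustration. It shows that sym() accepts a single string, while syms() is the counterpart that takes a longer character vector.

```r
library(rlang)

by_one  <- "z"          # hypothetical single grouping variable
by_many <- c("y", "z")  # hypothetical vector of two grouping variables

# sym() expects exactly one string and returns one symbol
one_symbol <- sym(by_one)

# With a vector of length two, sym() raises an error;
# tryCatch() turns that error into a marker value so the script keeps running
many_result <- tryCatch(sym(by_many), error = function(e) "error")

# syms() converts each element separately and returns a list of symbols
many_symbols <- syms(by_many)
```

Since syms() returns a list, it would need to be spliced with !!! rather than unquoted with !!.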

AyushBipinPatel · January 6, 2023, 6:58am

Hello @pomchip ,

I think I was able to work out what you need from this blog (Bang Bang – How to program with dplyr | R-bloggers). Below is the code that gets the desired output:

> set.seed(12345)
> df <- data.frame(
+   a = rnorm(100),
+   x = sprintf('x%s', sample(1:3, 100, replace = TRUE)),
+   y = sprintf('y%s', sample(1:3, 100, replace = TRUE)),
+   z = sprintf('z%s', sample(1:3, 100, replace = TRUE))
+ )
> 
> f <- function(data,...){ # need to use ... as we need arbitrary number of vars to group by
+   
+   # Assume that by must be validated/filtered and is ultimately transformed into
+   # a character
+   by <- enquos(..., .named = TRUE) # capture the ... arguments as a named list of quosures
+   
+   data %>% 
+     dplyr::group_by( !!!by ) %>%  # big bang instead of bang-bang
+     dplyr::summarize(
+       mean = mean(a)
+     )
+   
+ }
> 
> df %>% f( z)
# A tibble: 3 × 2
  z      mean
  <chr> <dbl>
1 z1    0.190
2 z2    0.439
3 z3    0.176
> df %>% f( y,z)
`summarise()` has grouped output by 'y'. You can override using the `.groups` argument.
# A tibble: 9 × 3
# Groups:   y [3]
  y     z       mean
  <chr> <chr>  <dbl>
1 y1    z1     0.262
2 y1    z2     0.677
3 y1    z3     0.224
4 y2    z1    -0.138
5 y2    z2    -0.200
6 y2    z3     0.112
7 y3    z1     0.505
8 y3    z2     0.620
9 y3    z3     0.169
>

All credit for the solution goes to the blog linked at the beginning; it was easy to build the solution from it, since it works through essentially the same example that you need in your case.

Hope this helps
Ayush

pomchip · January 6, 2023, 10:21am

Thanks @AyushBipinPatel for your input

I read about and tried to use quosures. However, I could not find a way to perform the necessary validation and manipulation of the by argument once it is transformed into a quosure. For instance, how would one filter or re-order the "list" of variables inside the function? (It is important to me that this re-ordering is performed inside the function rather than relying on the user to enter the argument in this order.)

For instance, assuming that alphabetic order is necessary (which is an over-simplification of the re-ordering that I need to perform), how does one ensure that df %>% f(by = c(z, x, y)) returns the same output as df %>% f(by = c(x, y, z))?
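
One possible way to re-order captured quosures is sketched below, under the assumption that the variables are passed through ... rather than wrapped in c(). The function name f_dots and the result names out_zxy and out_xyz are hypothetical. The sketch relies on enquos(..., .named = TRUE) returning a named list, which can be subset and ordered by name like any other list.

```r
library(dplyr)
library(rlang)

set.seed(12345)
df <- data.frame(
  a = rnorm(100),
  x = sprintf('x%s', sample(1:3, 100, replace = TRUE)),
  y = sprintf('y%s', sample(1:3, 100, replace = TRUE)),
  z = sprintf('z%s', sample(1:3, 100, replace = TRUE))
)

# f_dots is a hypothetical variant of f() that takes the grouping variables via ...
f_dots <- function(data, ...) {
  # Capture the arguments as a named list of quosures
  by <- enquos(..., .named = TRUE)

  # Re-order the quosures alphabetically by their names
  by <- by[order(names(by))]

  data %>%
    group_by(!!!by) %>%
    summarize(mean = mean(a), .groups = "drop")
}

out_zxy <- df %>% f_dots(z, x, y)
out_xyz <- df %>% f_dots(x, y, z)
```

Filtering would follow the same pattern, for example by keeping only the elements whose names appear in names(data).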

nirgrahamuk · January 6, 2023, 11:12am

The changes from what you had are minimal; simply change
!!sym(by) to !!!syms(by)
The by names can be sorted beforehand.
See:

require(tidyverse)
set.seed(12345)
df <- data.frame(
  a = rnorm(100),
  x = sprintf('x%s', sample(1:3, 100, replace = TRUE)),
  y = sprintf('y%s', sample(1:3, 100, replace = TRUE)),
  z = sprintf('z%s', sample(1:3, 100, replace = TRUE))
)

f <- function(data, by){
  
  # Assume that by must be validated/filtered and is ultimately transformed into
  # a character
  by <- data %>% dplyr::select( {{ by }} ) %>% names() %>% sort()
  
  data %>% 
    dplyr::group_by( !!!syms(by) ) %>% 
    dplyr::summarize(
      mean = mean(a)
    ) %>% ungroup()
  
}

suppressMessages(
  identical(
    df %>% f(by = c(y, z)),
    df %>% f(by = c(z, y))
  )
)
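
As an alternative to splicing symbols, dplyr also lets a character vector drive the grouping through across() and all_of(). The sketch below assumes the same validation step as above; the name f_across and the result names res_yz and res_zy are hypothetical.

```r
library(dplyr)

set.seed(12345)
df <- data.frame(
  a = rnorm(100),
  x = sprintf('x%s', sample(1:3, 100, replace = TRUE)),
  y = sprintf('y%s', sample(1:3, 100, replace = TRUE)),
  z = sprintf('z%s', sample(1:3, 100, replace = TRUE))
)

# f_across is a hypothetical variant of f() that groups by a character vector
f_across <- function(data, by) {
  # Resolve the unquoted selection to column names, then sort them
  by <- data %>% select({{ by }}) %>% names() %>% sort()

  data %>%
    # all_of() selects the columns named in the character vector
    group_by(across(all_of(by))) %>%
    # .groups = "drop" returns an ungrouped tibble and silences the message
    summarize(mean = mean(a), .groups = "drop")
}

res_yz <- df %>% f_across(by = c(y, z))
res_zy <- df %>% f_across(by = c(z, y))
```

Using .groups = "drop" here plays the same role as the trailing ungroup() and also removes the need for suppressMessages().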

pomchip · January 6, 2023, 2:02pm

Thanks @nirgrahamuk !

I was so close
