Need help using tidyeval with dplyr

dplyr
tidyeval

#1
# Problem:  I am trying to use dplyr  within a function
#           call where the function parameters are a dataframe
#           name, and variable names within the dataframe.  The
#           function needs to accommodate different dataframes and
#           variable names so that it is generalized for use with
#           any dataframe.  
#
#           Here is my simple example:


# Load tidyverse library
  require(tidyverse)
#> Loading required package: tidyverse
#> Loading tidyverse: ggplot2
#> Loading tidyverse: tibble
#> Loading tidyverse: tidyr
#> Loading tidyverse: readr
#> Loading tidyverse: purrr
#> Loading tidyverse: dplyr
#> Conflicts with tidy packages ----------------------------------------------
#> filter(): dplyr, stats
#> lag():    dplyr, stats
  
  # Example:   summing columns of a dataframe.  This example uses 
  #            mutate and passes unquoted variable names
  
  # Data
  df.out <- data.frame(x=1:50, y=100:51)
  
  # Function
  df.function <- function(df, var1, var2){
    var1 <- enquo(var1)
    var2 <- enquo(var2)
    # create a third variable that is a sum of the
    # first two
    df.new <- df %>% mutate(z = UQ(var1) + UQ(var2))
    return(df.new)
  }
  
  # Function Call
  df.augmented <- df.function(df.out, x, y)
  df.augmented
#>     x   y   z
#> 1   1 100 101
#> 2   2  99 101
#> 3   3  98 101
#> 4   4  97 101
#> 5   5  96 101
#> 6   6  95 101
#> 7   7  94 101
#> 8   8  93 101
#> 9   9  92 101
#> 10 10  91 101
#> 11 11  90 101
#> 12 12  89 101
#> 13 13  88 101
#> 14 14  87 101
#> 15 15  86 101
#> 16 16  85 101
#> 17 17  84 101
#> 18 18  83 101
#> 19 19  82 101
#> 20 20  81 101
#> 21 21  80 101
#> 22 22  79 101
#> 23 23  78 101
#> 24 24  77 101
#> 25 25  76 101
#> 26 26  75 101
#> 27 27  74 101
#> 28 28  73 101
#> 29 29  72 101
#> 30 30  71 101
#> 31 31  70 101
#> 32 32  69 101
#> 33 33  68 101
#> 34 34  67 101
#> 35 35  66 101
#> 36 36  65 101
#> 37 37  64 101
#> 38 38  63 101
#> 39 39  62 101
#> 40 40  61 101
#> 41 41  60 101
#> 42 42  59 101
#> 43 43  58 101
#> 44 44  57 101
#> 45 45  56 101
#> 46 46  55 101
#> 47 47  54 101
#> 48 48  53 101
#> 49 49  52 101
#> 50 50  51 101
  
  # Question: My code seems overly complicated in 
  #           terms of converting the unquoted input parameters to
  #           quoted values using enquo, and then unquoting again
  #           using UQ in the call to mutate.  It is the only way
  #           I could get this to work for arbitrary dataframe and variable
  #           names.  Is there a way to do this without using enquo
  #           and UQ?
  #
  # Thanks

#2

Note that the !! operator is equivalent to UQ(): both unquote an expression that was captured with enquo().

You could then write mutate(z = !!var1 + !!var2)

Otherwise your code seems OK. It is consistent with the "Programming with dplyr" vignette: capture the argument as a quosure with enquo() inside the function, then unquote it with !! where needed.
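To make the suggestion concrete, here is a minimal sketch of the original function written with !! instead of UQ(). It also shows a second variant using the embrace operator {{ }}, which is not mentioned in the thread and assumes a newer rlang (0.4.0 or later). The function names sum_bang and sum_embrace are made up for illustration.

```r
library(dplyr)

df.out <- data.frame(x = 1:50, y = 100:51)

# Variant 1: capture with enquo(), unquote with !!
# (parentheses make the precedence of !! explicit)
sum_bang <- function(df, var1, var2) {
  var1 <- enquo(var1)
  var2 <- enquo(var2)
  df %>% mutate(z = (!!var1) + (!!var2))
}

# Variant 2 (assumes rlang >= 0.4.0): {{ }} captures and unquotes in one step
sum_embrace <- function(df, var1, var2) {
  df %>% mutate(z = {{ var1 }} + {{ var2 }})
}

res_bang <- sum_bang(df.out, x, y)
res_embrace <- sum_embrace(df.out, x, y)
```

Both variants should give the same z column as the UQ() version; the second one avoids the explicit enquo() step entirely.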


#3

Another approach that extends this a bit is to use the ellipsis and quos(). This quick example uses filter and rowSums, but I would be interested to know whether it could be collapsed into a single function call that avoids filter.

library(tidyverse)

df <- data.frame(x=1:50, y=100:51)
df2 <- data.frame(x=1:50, y=100:51, z=101:150)

# Function
df.function <- function(df, ...) {
  vars <- quos(...)
  
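  # create a new variable that is the sum of the input columns
  # (note: filter() expects logical conditions, so with bare column
  #  names such as x and y, select() is probably what is needed here)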
  df.new <- df %>% 
    filter(!!!vars) %>%
    mutate(z1 = rowSums(.))
  
  return(df.new)
}

df.result <- df.function(df, x, y)
df.result2 <- df.function(df2, x, y, z)

#4

When generalizing to more than two columns, I think you can avoid filter. Useful tidyeval tools are quos(...) to capture the arguments and !!! to unquote-splice them. You can use them inside dplyr functions such as select. Here is how I would do it:

library(dplyr, warn.conflicts = FALSE)

# tibble() replaces the deprecated data_frame()
df1 <- tibble(x=1:50, y=100:51)
df2 <- tibble(x=1:50, y=100:51, z=101:150)

rowsum_df <- function(df, ...) {
  var <- quos(...)
  df %>% 
    mutate(z1 = select(., !!!var) %>% rowSums())
}

rowsum_df(df1, x, y)
#> # A tibble: 50 x 3
#>        x     y    z1
#>    <int> <int> <dbl>
#>  1     1   100   101
#>  2     2    99   101
#>  3     3    98   101
#>  4     4    97   101
#>  5     5    96   101
#>  6     6    95   101
#>  7     7    94   101
#>  8     8    93   101
#>  9     9    92   101
#> 10    10    91   101
#> # ... with 40 more rows
rowsum_df(df2, x, y)
#> # A tibble: 50 x 4
#>        x     y     z    z1
#>    <int> <int> <int> <dbl>
#>  1     1   100   101   101
#>  2     2    99   102   101
#>  3     3    98   103   101
#>  4     4    97   104   101
#>  5     5    96   105   101
#>  6     6    95   106   101
#>  7     7    94   107   101
#>  8     8    93   108   101
#>  9     9    92   109   101
#> 10    10    91   110   101
#> # ... with 40 more rows
rowsum_df(df2, x, y, z)
#> # A tibble: 50 x 4
#>        x     y     z    z1
#>    <int> <int> <int> <dbl>
#>  1     1   100   101   202
#>  2     2    99   102   203
#>  3     3    98   103   204
#>  4     4    97   104   205
#>  5     5    96   105   206
#>  6     6    95   106   207
#>  7     7    94   107   208
#>  8     8    93   108   209
#>  9     9    92   109   210
#> 10    10    91   110   211
#> # ... with 40 more rows
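As a possible simplification of rowsum_df, the dots can also be forwarded straight to select() without going through quos() and !!!. This is only a sketch: rowsum_dots and selected_cols are hypothetical names, and the selection is done before mutate() so the intermediate result is easy to inspect.

```r
library(dplyr, warn.conflicts = FALSE)

df2 <- tibble(x = 1:50, y = 100:51, z = 101:150)

rowsum_dots <- function(df, ...) {
  # forward the dots unchanged; select() does the capturing itself
  selected_cols <- select(df, ...)
  df %>% mutate(z1 = rowSums(selected_cols))
}

res <- rowsum_dots(df2, x, y, z)
```

Forwarding works here because nothing in the function needs to look at the individual arguments; once you do need them one by one, capturing them explicitly is the more natural route.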

#5

Thanks for the reply…

Yes, I did realize that you can use !! in place of UQ(). I only did it the longer way to avoid confusing the negation operator with UQ(). I can see myself forgetting the distinction later on down the road when I am trying to explain my function to others.


#6

Thanks for your reply.

Using the ellipsis seems like a good option. But then how would I access the individual variable names that are passed into … ?

In actual practice I would need to be able to test for values of the individual variables passed into …
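One possible way to get at the individual arguments, sketched here under the assumption that a reasonably recent rlang is installed: enquos(...) returns a list with one quosure per argument, so each element can be labelled and evaluated against the data frame separately. The function check_vars and the "any value above 75" test are invented purely for illustration.

```r
library(rlang)

check_vars <- function(df, ...) {
  vars <- enquos(...)  # list of quosures, one per argument in ...

  # recover the name of each variable as a string, e.g. "x", "y"
  labels <- unname(vapply(vars, as_label, character(1)))

  # evaluate each captured variable against the data frame
  values <- lapply(vars, eval_tidy, data = df)
  names(values) <- labels

  # example test on each individual variable (hypothetical rule)
  vapply(values, function(v) any(v > 75), logical(1))
}

df.out <- data.frame(x = 1:50, y = 100:51)
flags <- check_vars(df.out, x, y)
```

Here flags is a named logical vector with one entry per variable, so individual results can be read off as flags["x"] or flags["y"].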


#7

Bear in mind that functions should only be as complicated as necessary. Instead of taking a data.frame and column names, then looking for those columns in the data, just ask for the vectors.

vector_function <- function(var1, var2){
  var1 + var2
}

df.augmented <- df.out %>%
  mutate(z = vector_function(x, y))

Non-standard evaluation is only necessary for extremely general functions. And even then, a vector-input function is often the better choice in keeping things general. With the example above, df.function would overwrite any existing z column, while vector_function allows the user to specify the column name.
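The same vector-based idea extends to any number of columns while still leaving the output column name to the caller. A minimal sketch, where vector_sum and the column name total are hypothetical:

```r
library(dplyr, warn.conflicts = FALSE)

# takes any number of equal-length numeric vectors and adds them element-wise
vector_sum <- function(...) {
  Reduce(`+`, list(...))
}

df2 <- data.frame(x = 1:50, y = 100:51, z = 101:150)

# the caller picks the new column name, so the existing z is left untouched
res <- df2 %>% mutate(total = vector_sum(x, y, z))
```

No tidyeval is involved: mutate() resolves x, y and z to vectors before vector_sum() ever sees them.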

Sometimes, narrowly focused functions could benefit from NSE if they exploit the fact that it’s not evaluated. For example, the dbplyr package allows statements to be executed in a database instead of R because it catches arguments before evaluation.

Unless there’s a benefit to using NSE, see if you can make the function vector-based. Vector-based functions are simpler to write, debug, understand, and use inside other functions.