Passing named list to mutate (and probably other dplyr verbs)

dplyr

#1

Hi,

I want to write a function that is given a named list which is then passed on to mutate() in a way that each element of the list is an argument to mutate(). I cannot get this right, either with the new quotation/quasi-quotation syntax or with the old mutate_() and would appreciate some help.

Small example:

foo <- function(x, args) {
  args <- enquote(args)
  mutate(x, UQS(args))
}
foo(mtcars, args=list(cyl2=cyl*2))

foo <- function(x, args) {
  mutate_(x, .dots=args)
}
foo(mtcars, args=list(cyl2=cyl*2))

In both cases I get object 'cyl' not found when cyl exists in mtcars. I suppose the expression is not evaluated in the correct environment but I am not sure why.

PS: I know that I could just use foo <- function(x, ...) and then pass ... to mutate() but in the real scenario, I need to use ... for something else.

PPS: I cannot know in advance the names and expressions that will be passed on to mutate so I cannot quote them separately (as in the examples in vignette("programming")).

Thanks in advance!


#2

Thought I had a solution, but I didn’t.


#3

enquo is the right direction I think, but that causes the cyl2 to be a list column that repeats the vector cyl * 2 for each row. I believe it’s evaluating the cyl * 2 too early, turning the mutate call into mutate(mtcars, cyl2 = [long vector]). The following works, but you have to do the quo up front:

foo <- function(x, args) {
  mutate(x, !!!(args))
}

foo(mtcars, args=list(cyl2 = quo(cyl*2)))

There’s a right way to do this, but getting the list to evaluate in the right context at the right time is eluding me at the moment.


#4

I spent way too much time on this. The key for me here was adding quo(element_in_the_list) for each element in the list. Essentially it iterates through the args list without evaluating the elements. In the example cyl*2 is replaced with quo(cyl*2) then evaluated to become a quosure. Then it works like the programming vignette.

library(rlang)
library(dplyr)
library(tibble)

mtcars <- as_tibble(mtcars)

foo <- function(x, args) {
  args2 <- rlang::enquo(args)
  args_quoed <- rlang::lang_args(args2) %>% 
    purrr::map(~expr(quo(!!.x)) %>% eval_tidy)
  mutate(x, !!! args_quoed)
}

foo(mtcars %>% group_by(gear), args = list(cyl2=cyl*2, y = mean(carb)))

foo(mtcars, args = list(cyl2=cyl*2, y = mean(carb)))


#5

This is all great! Thanks everyone and @davis in particular! So, based on the various answers, the simplest syntax I could come up with is

foo <- function(x, args) {
  library("rlang")
  library("dplyr")
  a <- enquo(args) %>%
    lang_args() %>%
    lapply(function(x) {quo(UQ(x))})
  mutate(x, UQS(a))
}
foo(mtcars, args=list(cyl2=cyl*2, y=mpg/2))

Still, since args is a list, I feel there should be a way to use lapply() or an equivalent (I am not super familiar with purrr yet) directly on it but I can never make it work. The current syntax is still somewhat convoluted.

Overall, I’ve tried to wrap my head around the quasiquotation/quosures etc. several times already and some things are still eluding me. It is a really tough paradigm!


#6

Okay I think I’ve finally figured this out completely. Don’t take this for gospel, but this is my understanding. Here is my updated solution, I’ll explain.

foo <- function(x, args) {
  args_call    <- rlang::enexpr(args)
  list_of_args <- rlang::lang_args(args_call)
  mutate(x, !!! list_of_args)
}

From what I can tell, args is not technically a list yet; it’s a promise that will create a list. Once you use lapply() (or any other function that forces the promise) on it directly, you force R to attempt to evaluate the expression cyl * 2. Outside the mutate() call, that expression won’t have the correct environment to evaluate in, so it fails.
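
A minimal base-R sketch of that promise behaviour, assuming nothing beyond base R. The helper names peek() and force_it() and the variable undefined_col are made up for illustration: substitute() looks at the expression behind the promise without forcing it, while lapply() forces it and fails.

```r
# Hypothetical helpers, for illustration only.

# substitute() returns the unevaluated expression behind the promise.
peek <- function(args) {
  substitute(args)
}

# lapply() needs the actual list, so it forces the promise.
force_it <- function(args) {
  lapply(args, identity)
}

# undefined_col stands in for a data frame column such as cyl:
# it does not exist in the calling environment.
captured <- peek(list(cyl2 = undefined_col * 2))
print(captured)

# Forcing the same argument tries to evaluate undefined_col * 2 and errors.
result <- tryCatch(
  force_it(list(cyl2 = undefined_col * 2)),
  error = function(e) "forcing the promise failed"
)
print(result)
```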

What we really want is to just turn the user’s call, list(cyl2=cyl*2, y=mpg/2) , into a real list of named arguments, but without evaluating the arguments. This is exactly what lang_args does to start with. It sees the call, extracts the arguments without evaluating them, and returns them to you in a real named list. That’s all we need to pass to !!!.
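
Roughly what that extraction amounts to, sketched in base R rather than rlang so it does not depend on a particular rlang version (I believe lang_args() was later renamed call_args(), but treat that as an assumption). A call behaves like a list: its first element is the function being called, and the rest are the unevaluated arguments.

```r
# The user's call, captured without evaluating it.
user_call <- quote(list(cyl2 = cyl * 2, y = mpg / 2))

# Element 1 is the function being called; the rest are its arguments.
fn_name   <- user_call[[1]]
arg_exprs <- as.list(user_call)[-1]

print(fn_name)
print(names(arg_exprs))
str(arg_exprs)

# Each argument is still an unevaluated call, so nothing has looked up
# cyl or mpg yet. Evaluating one against a data frame finds the column.
cyl2 <- eval(arg_exprs$cyl2, mtcars)
```

Since the function name in position 1 is simply dropped, it does not matter what that function is, as long as the arguments are named.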

I’m also not sure we even need enquo() here. It seems that enexpr() works just the same because we don’t need the environment the call was created in, which is why I used enexpr() above.

So to summarise:

  1. Capture the user’s call with enexpr().
  2. Extract the named arguments with lang_args(), turning them into a named list of arguments that can be used with !!!.

The hilarious thing is that since we never actually evaluate the list function, it can really be anything.

# This works with `c()`
foo(mtcars %>% group_by(gear), args = c(cyl2=cyl*2, y = mean(carb)))

# This works with `not_a_function()`
foo(mtcars, args = not_a_function(cyl2=cyl*2, y = mean(carb)))

#7

I figured that there were some useful functions in rlang – good job finding and deciphering them, @davis. I would strongly encourage you to post this as a Q&A to Stack Overflow (with some editing to make it a single question and single answer), or maybe let @jiho post the question and you can answer it. It doesn’t look like this situation is currently addressed there, and it shows an elegant use of rlang/tidyeval.


#8

+1 on this being SO-worthy. I mentioned this to the inestimable @drob this past week, and he said (I’m paraphrasing here) that this should be SO-legal/fine as long as there’s no mutually-assured-up-voting ring (which would send up red flags).


#9

If @jiho posts it on SO and pings me on here with the link, I’ll add everything as an answer!


#10

Thanks again @davis for the final implementation! Indeed enexpr and lang_args make it work. It’s now implemented in my function.

Regarding posting on SO, I can do it but isn’t this forum indexed by Google too? If all the good/interesting answers are reposted on SO, what would be the point for RStudio to maintain this forum instead of just directing people towards SO?


#11

It’s most natural to use ...:

library(rlang)
library(dplyr)

foo1 <- function(x, ...) {
  args_quo <- rlang::quos(...)
  mutate(x, !!! args_quo)
}

mtcars %>% 
  as_tibble() %>%
  group_by(gear) %>% 
  foo1(cyl2 = cyl*2, y = mean(carb))

If you don’t want to do that for some reason, I’d recommend requesting the user use quos() rather than list():

foo2 <- function(x, args) {
  mutate(x, !!! args)
}

mtcars %>% 
  as_tibble() %>%
  group_by(gear) %>% 
  foo2(quos(cyl2 = cyl*2, y = mean(carb)))

That’s basically what dplyr does with funs() and vars().

Unfortunately there’s no way to retrieve the correct environments from the call to list(), so I would advise against that approach.


#12

No need to cross-post if you don’t want to, but (imho) the goal is never to replace SO— that’s a Q&A that’s not geared toward discussion, opinion, etc.

Here’s a thread on it— there’s no canonical answer, just didn’t want to take up too much space in this thread!


#13

Clearly we are all still learning here. I definitely retract my earlier statement where I said “we don’t need the environment the argument list was created in.” We do!

To expand upon Hadley’s point, let’s shoot ourselves in the foot with my (not so good) approach.

library(dplyr)
library(rlang)

mtcars_tbl <- as_tibble(mtcars)

# Say we want to use this variable in the mutate call.
important_var <- 4

foo_bad <- function(x, args) {
  # But it also happens to be defined here 
  # because the function designer felt like using it
  important_var <- 5
  args_call    <- rlang::enexpr(args)
  list_of_args <- rlang::lang_args(args_call)
  mutate(x, !!! list_of_args)
}

# No environment has been captured with list(), so we don't know which
# important_var to use. I think by default mutate() then finds 
# the first one it sees while working its way back up the function calls?
# That would be `important_var <- 5`, which is not what the user wants!

# Look how cyl2 = cyl * 5   (not cyl * 4 like we wanted)
foo_bad(mtcars_tbl, list(cyl2 = cyl * important_var))
#> # A tibble: 32 x 12
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb  cyl2
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1  21.0  6.00   160 110    3.90  2.62  16.5  0     1.00  4.00  4.00  30.0
#>  2  21.0  6.00   160 110    3.90  2.88  17.0  0     1.00  4.00  4.00  30.0
#>  3  22.8  4.00   108  93.0  3.85  2.32  18.6  1.00  1.00  4.00  1.00  20.0
#>  4  21.4  6.00   258 110    3.08  3.22  19.4  1.00  0     3.00  1.00  30.0
#>  5  18.7  8.00   360 175    3.15  3.44  17.0  0     0     3.00  2.00  40.0
#>  6  18.1  6.00   225 105    2.76  3.46  20.2  1.00  0     3.00  1.00  30.0
#>  7  14.3  8.00   360 245    3.21  3.57  15.8  0     0     3.00  4.00  40.0
#>  8  24.4  4.00   147  62.0  3.69  3.19  20.0  1.00  0     4.00  2.00  20.0
#>  9  22.8  4.00   141  95.0  3.92  3.15  22.9  1.00  0     4.00  2.00  20.0
#> 10  19.2  6.00   168 123    3.92  3.44  18.3  1.00  0     4.00  4.00  30.0
#> # ... with 22 more rows


# Here we are going to use quos instead of list, like Hadley advises
foo_good <- function(x, args) {
  mutate(x, !!! args)
}

# Importantly, we capture the environment where important_var is defined using
# quos(). The call to mutate() now KNOWS that it should be 4, not 5
# because the environment has been dragged along in the quosure
foo_good(mtcars_tbl, quos(cyl2 = cyl * important_var))
#> # A tibble: 32 x 12
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb  cyl2
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1  21.0  6.00   160 110    3.90  2.62  16.5  0     1.00  4.00  4.00  24.0
#>  2  21.0  6.00   160 110    3.90  2.88  17.0  0     1.00  4.00  4.00  24.0
#>  3  22.8  4.00   108  93.0  3.85  2.32  18.6  1.00  1.00  4.00  1.00  16.0
#>  4  21.4  6.00   258 110    3.08  3.22  19.4  1.00  0     3.00  1.00  24.0
#>  5  18.7  8.00   360 175    3.15  3.44  17.0  0     0     3.00  2.00  32.0
#>  6  18.1  6.00   225 105    2.76  3.46  20.2  1.00  0     3.00  1.00  24.0
#>  7  14.3  8.00   360 245    3.21  3.57  15.8  0     0     3.00  4.00  32.0
#>  8  24.4  4.00   147  62.0  3.69  3.19  20.0  1.00  0     4.00  2.00  16.0
#>  9  22.8  4.00   141  95.0  3.92  3.15  22.9  1.00  0     4.00  2.00  16.0
#> 10  19.2  6.00   168 123    3.92  3.44  18.3  1.00  0     4.00  4.00  24.0
#> # ... with 22 more rows

I think this is actually really important to understand in depth, so I’m thankful for this thread!
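
For anyone who wants to see the environment travelling with the quosure in isolation, here is a small sketch without dplyr. The function make_quo() and its local value are invented for the illustration; quo() and eval_tidy() are from rlang.

```r
library(rlang)

important_var <- 4

# Hypothetical function that builds a quosure inside its own environment,
# where important_var has a different value.
make_quo <- function() {
  important_var <- 5
  quo(important_var * 10)
}

q_inner <- make_quo()
q_outer <- quo(important_var * 10)

# Same expression, different environments attached.
inner_value <- eval_tidy(q_inner)
outer_value <- eval_tidy(q_outer)

print(inner_value)
print(outer_value)
```

Each quosure is evaluated in the environment where it was created, which is why calling quos() at the call site picks up the caller’s important_var rather than the one inside the function.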