When should non-standard evaluation be used and why?

When using the separate() function from tidyr with colleagues who were new to the tidyverse (and R), I tried to explain why its arguments are provided the way they are, and became curious about when non-standard evaluation should be used (in functions) and why.

With tidyr::separate(), for example, the column to be separated (the argument col) is provided without quotes, whereas the names of the columns to separate it into are provided in a character vector:

library(tidyr)
library(dplyr, warn.conflicts = FALSE)
df <- data.frame(x = c(NA, "a.b", "a.d", "b.c"))

df
#>      x
#> 1 <NA>
#> 2  a.b
#> 3  a.d
#> 4  b.c
df %>% separate(x, c("A", "B"))
#>      A    B
#> 1 <NA> <NA>
#> 2    a    b
#> 3    a    d
#> 4    b    c

I don't think this is idiosyncratic only to separate(), though maybe it is and there is a unique reason why.

I thought one reason may be that the column to be separated exists in the data frame, whereas the columns it is to be separated into do not exist (yet), and so that may be why the new column names are provided in a vector. However, for other functions, like select() and mutate() in dplyr, the names of new variables / columns are provided without quotes, e.g. dplyr::mutate(iris, Sepal.Area = Sepal.Length * Sepal.Width).

I ask in part out of curiosity and also because I would like to be consistent with use of non-standard evaluation by others and its use in tidyverse packages. I also ask because while there are good discussions and resources around the why of non-standard evaluation (via tidyeval) and the how, I am less familiar with tips on the when.

Thank you for your pointers or feedback.


I don't know why NSE isn't used for the new column names, but here is a proof of concept, the function taa, showing that it would at least be possible to use unquoted column names instead of a character vector.

suppressPackageStartupMessages(library(tidyverse))

df <- data.frame(x = c(NA, "a.b", "a.d", "b.c"))

taa <- function(df, x, ...) {
    # Capture the bare names passed via ... as symbols, then convert them to strings
    values <- unname(vapply(rlang::ensyms(...), rlang::as_string, character(1)))
    # Embrace x so the caller's column is passed on, not a column literally named "x"
    separate(df, {{ x }}, values)
}
df %>% taa(x, A, B)
#>      A    B
#> 1 <NA> <NA>
#> 2    a    b
#> 3    a    d
#> 4    b    c

From what I understand, you are correct that names of non-existent columns are generally provided as strings, while names of existing columns are provided as bare names. The mutate/select/other assignment-type exception arises because the left side of = in a function call is an argument name, which R takes literally rather than evaluating: it is normally written as a bare name (a quoted literal such as "foo" = 1 is also accepted), but it cannot be a variable or expression that produces the name. That part is consistent with R in general, and getting around it requires a different operator, like the := operator used in tidyeval.
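To illustrate the point about :=, here is a minimal sketch, assuming dplyr is installed. The variable new_name is hypothetical and only there for illustration. It shows that a plain = uses whatever is written on its left side as the name, whereas !! combined with := lets the name come from a variable.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical variable holding the new column's name
new_name <- "Sepal.Area"

# With `=`, the left side is taken literally: the column is called "new_name"
literal <- mutate(iris, new_name = Sepal.Length * Sepal.Width)

# With `!!` and `:=`, the value stored in new_name becomes the column name
unquoted <- mutate(iris, !!new_name := Sepal.Length * Sepal.Width)

tail(names(literal), 1)
#> [1] "new_name"
tail(names(unquoted), 1)
#> [1] "Sepal.Area"
```

So the bare name on the left of = is less an NSE design choice and more ordinary R argument naming; := is only needed when the name has to be computed.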

The other somewhat confusing function is gather. From the documentation for the key and value arguments:

I've been trying to remember to avoid doing that in my own code.


separate's into parameter is mostly a matter of practicality, in that it needs to take a vector of a length of more than one, and that's hard to do with tidy eval unless you use a helper function more complicated than c. Really the names could be passed to the ... parameter, which is currently defunct, but I assume historically it had a purpose and thus couldn't be used for names.

The dplyr::select (and company) usage is consistent with base R, where parameter names never need to be quoted, e.g.

c(foo = 1)
#> foo 
#>   1

list(bar = 2)
#> $bar
#> [1] 2

data.frame(baz = 3)
#>   baz
#> 1   3

The tidyr NSE behavior of taking bare names for not-yet-existent variables is syntactically problematic, as they look like variables but don't refer to anything; it is unlikely to change soon, though.

I'm probably missing something here, but I don't see how passing a bare symbol name that doesn't refer to anything, like this

foo(a)

where a doesn't exist in context, is conceptually any different from

a <- NA
foo(a)

In both cases a doesn't refer to anything... the NA indicates that the value of a is "not available". The only difference is how you test to see if a refers to something.
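For anyone who wants to try the two cases side by side, here is a small base R sketch. The functions foo_se and foo_nse and the name undefined_thing are hypothetical, made up for illustration, and undefined_thing is assumed not to exist anywhere on the search path. It is not meant to settle the conceptual question, only to show what R itself does in each case.

```r
# Hypothetical functions, for illustration only
foo_se  <- function(x) x                       # standard evaluation: uses the value of x
foo_nse <- function(x) deparse(substitute(x))  # NSE: captures the expression, never evaluates it

# Case 1: the symbol is not bound to anything
se_missing  <- tryCatch(foo_se(undefined_thing), error = function(e) "error")
nse_missing <- foo_nse(undefined_thing)

# Case 2: the symbol is bound to NA
a <- NA
se_na  <- foo_se(a)
nse_na <- foo_nse(a)

se_missing
#> [1] "error"
nse_missing
#> [1] "undefined_thing"
se_na
#> [1] NA
nse_na
#> [1] "a"
```

Under standard evaluation the two cases behave differently (an error versus a value of NA), while the NSE version treats both the same because it only ever looks at the name.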

I understand that keeping some kind of consistency in an API is important, but once you go to NSE it seems like that cause is lost. Also, with packages coming from so many places, I just don't see how in practice a consistent API will ever exist anyhow. But, again, maybe I am missing something in this picture...