Should tidyeval be abandoned?

Maybe this can help with the mental model for :=.

You can’t do any funny business on the left-hand side of a definition that uses an = sign (like f(arg = value)); the left-hand side has to be a literal name. But sometimes we want to create that name programmatically, by evaluating an expression and using the resulting value as the name. := provides an alternative to = for exactly this purpose: it allows unquoting on the left-hand side.
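A minimal sketch of that in action (the new column name below is just made up for illustration):

library(dplyr)

# The name we want to create programmatically
new_name <- "mpg_per_cyl"

# With = the column would literally be called "new_name";
# := lets us unquote on the left-hand side instead.
mtcars %>%
  mutate(!!new_name := mpg / cyl) %>%
  head()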

6 Likes

That works nicely, @tjmahr!

:+1: I think this is an inherent challenge with anything "new", including when it's dead on (:point_down:). There are a limited number of explanations out there because, well, it's new. Even if everyone ends up with a shared mental model, the analogues and examples that get us there will always vary (that whole Humboldt's dictum jam, "in order to understand we must already in some sense have understood").

I'd encourage everyone to keep on writing up their tidyeval foibles and successes. Not only does it solidify the information in your mind, but it's super helpful to read, too! (So, basically, I'm selfish.)

2 Likes

I previously used the sqldf package to select fields and summarise data frames, and now for the most part use dplyr for its versatility. I work for an emergency service and report at a number of different levels within the organisation - from individual fire station level at the most basic to whole-country at the top, with several other levels in between.

In sqldf it is straightforward to generalise a function selecting some form of aggregate - a simple count of all incidents attended, say - by passing the grouping level to the function as a string. Because SQL statements in sqldf are just strings, it is easy to use standard string composition techniques to generate the required GROUP BY clauses.
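To make that concrete, here is roughly what the sqldf approach looks like (the incidents table and its column names are invented for illustration):

library(sqldf)

# The grouping level is just a string, so the query can be pasted together
count_by <- function(dat, group_field) {
  query <- paste0(
    "SELECT ", group_field, ", COUNT(*) AS n_incidents ",
    "FROM dat GROUP BY ", group_field
  )
  sqldf(query)
}

# count_by(incidents, "station")
# count_by(incidents, "region")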

I found it much more difficult to generalise similar aggregation functions in dplyr more recently. I read and re-read the dplyr programming guide referred to earlier in this thread. After much trial and error with the techniques listed there, I was eventually able to implement a generalised function that takes the names of the fields to group by in unquoted form. Compared to the ease with which I can manipulate field names in a SQL statement, I found the whole approach much more difficult, conceptually and in practice. The end result works just fine, and allows me to apply the same function at different grouping levels in dplyr, but for a while I thought I was going to be better off going back to sqldf, or using a mix of the two.
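For comparison, this is roughly the shape such a generalised dplyr version can take (written against the dplyr 0.7-era tidyeval tools; names are again invented):

library(dplyr)

# Grouping fields are passed bare and captured as a list of quosures
count_by <- function(dat, ...) {
  group_vars <- quos(...)
  dat %>%
    group_by(!!!group_vars) %>%
    summarise(n_incidents = n())
}

# count_by(incidents, station)
# count_by(incidents, region, year)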

I've more than thirty years of programming experience in a number of languages, and have been a database developer in SQL for just as long. I can honestly say that working out how to use quosures to pass column names to dplyr functions has been one of the most difficult tasks I've tackled in the last few years. It was not intuitive for me, and doing away with the string-variants of the dplyr functions certainly made implementing functions doing no more than summary aggregations at different levels more difficult than it needed to be.

5 Likes

My sense, personally, is that tidyeval is in a better place for sophisticated package developers and a bit hard to reason about for function authors a level of abstraction “up” from there. I think this is true of the documentation as well.

I have found that things like as.symbol are very useful, but it turns out that for other scenarios where I thought I needed SE, I could use functions like select_at.
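For example, something along these lines (a sketch, not from any particular package):

library(dplyr)

col <- "Sepal.Length"

# SE route: turn the string into a symbol and unquote it
iris %>% summarise(avg = mean(!!as.symbol(col)))

# Often simpler: the scoped _at verbs accept character vectors directly
iris %>% select_at(c("Species", col)) %>% head()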

Ultimately, I hope that over time we get a few more common paths outlined and simplified, or go back to having SE convenience versions of common dplyr and tidyr functions. I feel like the current “tidyeval cookbook” either doesn’t exist or uses scenarios and examples that don’t align with my common tasks.

4 Likes

I think standard-evaluation versions of the dplyr verbs that take a list as input would be pretty nice. That would make them easy to program with, without needing any fancy stuff, and it would be pretty flexible too.
Passing in things like everything() would work naturally, since you can just put the function call in the list.

Does .data itself cause an R CMD check note? If so, do you have to import a function from rlang?

I ask because when I use this, I see the notes "no visible binding for global variable ‘.data’" and "Undefined global functions or variables: .data".

Yes, you’ll need to import .data from rlang.
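In roxygen2 terms that is typically just a one-liner, sketched here:

# In one of your package's R files:
#' @importFrom rlang .data
NULL

# which generates this line in NAMESPACE:
# importFrom(rlang, .data)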

1 Like

Just since it has not been mentioned, I think tidyeval really shines when you get to something complex like NSE for summarize. Although it is probably best to avoid awful column names, they are a simple example of the kind of thing that can make programming summaries over a column quite difficult. In this example, I find tidyeval head-and-shoulders above lazyeval, and my code is much simplified; I am quite grateful. Despite the learning curve, tidyeval works naturally in these types of examples, and because the quoting is not standard evaluation, it correctly handles whatever column names I throw at it.

library(dplyr)

mydata <- data_frame('AwF~uL%C-olumn' = c(1, 2, 3, 4, 5))

# Summarize naturally
mydata %>% summarize(min = min(`AwF~uL%C-olumn`), sum = sum(`AwF~uL%C-olumn`))

# Programmatically
compute_def <- c(min = quo(min(`AwF~uL%C-olumn`)), sum = quo(sum(`AwF~uL%C-olumn`)))

mydata %>% summarize(!!!compute_def)
1 Like

But with dplyr being an essential workhorse package, and the general guidance that copy/pasting code more than 2 or 3 times means you should be thinking about creating functions, it's going to become much more "typical" for a wider range of users with less programming experience.

2 Likes

As a function-writing human who is not a "programmer" (per se) and is still kind of on Bambi legs with tidyeval, I'd posit that most of these functions don't actually involve the more nuanced aspects of tidyeval/NSE etc…

n = 1, just one jackanapes' opinion!

5 Likes

That's what I was going to mention. There are several common levels of "functionizing" tidyverse analysis that I see:

  1. Much code re-use can use a function that assumes a standard name for a column. Input files are often standardized, and when they aren't, changing column names is typically the least of your worries. This doesn't require any tidyeval, since the column name can just be used as-is.
  2. For those cases where the column names do change, all you really have to learn for most cases is quo()/enquo() and UQ()/!! (see the small sketch after this list). I think this can be summarized as "keep the name from evaluating until later, then evaluate it with the right context." I expect that people will generally wrap their head around this.
  3. I won't try to guess what the most common next-level scenario is (such as dealing with ..., or extracting strings from column names with quo_name, or something else). What I will say is that many of these uses will generally follow "cookbook" style programming, so as long as the most common scenarios are included in the documentation (and slightly less common scenarios are covered on StackOverflow), people should be able to incorporate them about as easily as the "user-friendly" tidyeval alternatives.
  4. Going deeper than this, you are definitely no longer a "typical user", as @jennybryan stated. If you've followed this list down the rabbit hole, however, you have probably already developed some intuition about tidyeval based on the previous levels. Some of that intuition will need correcting, but good documentation can provide that.
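To illustrate level 2, a minimal sketch (the function and columns are made up):

library(dplyr)

# The column to summarise varies, so quote it with enquo()
# and unquote it with !! at the point of use.
mean_by_gear <- function(dat, col) {
  col <- enquo(col)
  dat %>%
    group_by(gear) %>%
    summarise(avg = mean(!!col))
}

mean_by_gear(mtcars, mpg)
mean_by_gear(mtcars, disp)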
5 Likes

Tidy eval seems daunting and non-intuitive at first but it is actually very simple (though simple does not mean easy). It is rewarding to learn it because once it clicks you will be able to solve many problems using a few simple concepts.

In a way it's like learning how to program with functions or how to use functions as arguments or return values. For a beginner this is a big learning step but an important one. Once you get functional programming you can solve a whole range of problems with elegant solutions. Tidy eval has that same kind of power.

8 Likes

I've gotten my head around basic tidy eval and use it regularly now. The most common place I see people running into challenges is where there are competing quotation systems. Typically this means strings vs quosures.

When I store a list of parameters in open code I now store them as quosures rather than strings:

# group_by is happy to accept a list of quosures
sepaldims <- quos(Species, Sepal.Length)
iris_sepal <- iris %>%
  group_by(!!!sepaldims) %>%
  summarize(sepal_width = mean(Sepal.Width), petal_width = mean(Petal.Width))

These even allow modification without re-specifying parameters:

petaldims <- quos(!!!sepaldims, Petal.Length)
iris_petal <- iris %>%
  group_by(!!!petaldims) %>%
  summarize(sepal_width = mean(Sepal.Width), petal_width = mean(Petal.Width))

But dplyr::left_join still expects its join columns as strings: by = c("Species", "Sepal.Length"). If I want to supply these programmatically, the best solution I found was by = sapply(sepaldims, quo_text). Consider this a plug for abstracting quo_text to lists of quosures.

@stewart.ross mentioned SQL where we often pass parameters as a string with comma-separated values. In code that feeds an SQL query I try not to fight this and simply pass strings. If I read parameters from a table I end up with a list of strings. To use this in tidy eval functions I rely on syms() to convert to a list of symbols and !!!syms() within the function call.
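To make the two directions concrete (building on the sepaldims example above; the join itself is hypothetical):

library(dplyr)
library(rlang)

# quosures -> strings, for functions that want a character by = argument
key_cols <- sapply(sepaldims, quo_text)
# some_table %>% left_join(other_table, by = key_cols)

# strings -> symbols, for tidy eval verbs
group_cols <- c("Species", "Sepal.Length")
iris %>%
  group_by(!!!syms(group_cols)) %>%
  summarise(mean_petal = mean(Petal.Width))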

I figured out how to write functions that accept bare expressions: col <- enquo(col) enables mtcars %>% count(!!col). But then you realize that ggplot(aes((!!col), n)) does not work (yet). One solution is ggplot(aes_string(quo_text(col), "n")) but this can get verbose if there are multiple parameters. I think this is where most people end up because aes_string was a good work-around when using lazyeval elsewhere and passing parameters as strings. It took a while before I realized that you can simply use ggplot(aes_(col, ~n)). I think once you get your head wrapped around NSE you forget to use SE.
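Putting that pattern together, roughly (a sketch; the function name is made up):

library(dplyr)
library(ggplot2)
library(rlang)

count_and_plot <- function(dat, col) {
  col <- enquo(col)
  counts <- dat %>% count(!!col)
  # aes() can't unquote the quosure here (yet), but aes_() accepts
  # quoted expressions, so pass the quosure plus a one-sided formula.
  ggplot(counts, aes_(col, ~n)) + geom_col()
}

count_and_plot(mtcars, cyl)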

Overall, Programming with dplyr does a great job explaining how to live within a tidy eval universe. It's both conceptual and a cookbook. It would help if the recipes included more on how to interface with alternate quoting systems. And any caveats about practices to avoid (like passing around lists of strings?).

2 Likes

What is cool about R and the tidyverse in general is that you can write code that is easy to read and easy to explain.
The only exception would be tidyeval. I wouldn't dare question that it's the right theoretical approach, but I wouldn't be caught in a million years trying to explain it to a colleague either (assuming I understood it).
I don't think that better documentation is the solution.
My personal opinion is that dplyr was built for quick analysis, and it is the absolute best tool for that. For programming, we would need a package built for that purpose.

4 Likes

What is cool about R and the tidyverse in general is that you can write code that is easy to read and easy to explain.
The only exception would be tidyeval

Highlighting this point was the key reason for this thread. (OP here)

I don’t think that better documentation is the solution

So far, I have to agree.

So, my only recommendation would be to label tidyeval as beta and continue working to make the syntax easier.

If we set in stone the current iteration, then my opinion is that we will be “stuck” with something that could perhaps be better.

The other reason for this thread was to see what the community thinks, which I am happy to see above.

1 Like

My gut feeling is to agree with @Artichaud1 and @pavopax.

The main selling point of the various tidyverse packages is how intuitive and easily understandable they are. tidyeval is the opposite, with its strange functions, never mind !! and !!!.

It may well be that the vast majority of users never have to use this syntax, but a great number will be put off even trying to learn it in its current form.

2 Likes

@pavopax @martin.R @Artichaud1

I would be very interested to hear your opinion on this draft of the new programming vignette: http://rpubs.com/lionel-/programming-draft

It takes a different approach and starts with quasiquotation and construction of R expressions. Hopefully this is a more intuitive way of explaining tidy eval.

11 Likes

Tremendous effort, very well done. I was able to follow...I think you made it as accessible as possible.

I mean, for now, it could be that tidyeval is the way to go.

But speaking only for myself, I do not need to know about quosures, quasiquotation and all that stuff. I totally understand that it is beautiful and keeps the verse tidy and pure, and I admire the genius of it, but at the end of the day I won't be using it...it distracts me too much from my purpose.

3 Likes

I've been taking a few swings at the current version of the programming vignette, and comparing it with the docs for substitute() has really helped (since, according to the vignette, enquo() serves a similar purpose).
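To show what I mean, the two capture mechanisms side by side (a sketch; the function names are made up):

library(dplyr)
library(rlang)

# Base R: capture the unevaluated argument with substitute(),
# then evaluate it with the data frame as the context.
mean_base <- function(dat, col) {
  col <- substitute(col)
  mean(eval(col, dat))
}

# Tidy eval: capture a quosure with enquo() and unquote it with !!.
mean_tidy <- function(dat, col) {
  col <- enquo(col)
  dat %>% summarise(avg = mean(!!col))
}

mean_base(mtcars, mpg)
mean_tidy(mtcars, mpg)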

I'm not a super big fan of the names quo and enquo, since they're not very descriptive, but I think the bigger problem with this section ("Different expressions") is that it introduces a bunch of functions like quo that ultimately turn out not to be the solution. I understand that the vignette is trying to pose this as a problem-solving opportunity where the reader methodically introduces different possible methods, but it's frustrating to find out at the end that some of the tools I was just introduced to don't actually contribute to the solution at all.

EDIT: I'm just going through the draft you posted, @lionel, and I think it's really strong so far. The first two sections have given me a good understanding of how NSE works in base R, and the examples (especially the ones using eval with different contexts) show really clearly how this stuff can be used by dplyr to achieve its magic.

The Quasiquotation sections feel much better set up. The previous tools, using quote and eval, had been already been introduced successfully, and when you introduced !!, I understood that clearly too—well enough that, when you then tried to use it in the following example, I found myself thinking, "But that's going to give us a string, not a symbol..." about two sentences before you wrote it. So I think this is on the right track :smile: