Dplyr programming vignette and dplyr


#1

The dplyr programming vignette uses the following example to illustrate that the dplyr::filter function is referentially opaque:

dplyr::filter(df, x == y) 

and says that depending on context it might be evaluated in any of the following ways.

(BTW I know what this means and why dplyr::filter is referentially opaque… this is a question about the example used to illustrate this point)

df[df$x == df$y, ]
df[df$x == y, ]
df[x == df$y, ]
df[x == y, ]

I think the point of this example is that dplyr::filter will interpret x == y differently depending on the environment it is executed in.

But the vignette doesn’t show a use case where one of the last three interpretations is actually used. For example, both of the following usages of dplyr::filter produce the same result:


library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
# make a simple tibble
df1 <- tibble::tibble( c1 = c(1,2,3), c2 = c(1,2,4))
df1
#> # A tibble: 3 x 2
#>      c1    c2
#>   <dbl> <dbl>
#> 1     1     1
#> 2     2     2
#> 3     3     4

# filter for rows where c1 and c2 are the same
dplyr::filter(df1, c1 == c2)
#> # A tibble: 2 x 2
#>      c1    c2
#>   <dbl> <dbl>
#> 1     1     1
#> 2     2     2

#
# now create a variable c2 and set to 1
# this changes the global environment 
c2 = 1

# however this call of dplyr::filter produces
# the same result as the first one
dplyr::filter(df1, c1 == c2)
#> # A tibble: 2 x 2
#>      c1    c2
#>   <dbl> <dbl>
#> 1     1     1
#> 2     2     2

Can anyone provide a use case where dplyr::filter(df1, c1 == c2) is interpreted as something other than the following?

df1[df1$c1 == df1$c2, ]

Thanks,
Dan


#2

I would assume that it only happens when the name referenced is not part of the input data frame. So, if you defined a third variable, c3 <- 1, then referenced that, it would not throw up an error about c3 not being found. The point being that, since the results will depend on the columns in the data frame, using dplyr::filter in a function where you don’t know what the input data frame looks like needs special care.
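To make that concrete, here is a rough sketch of the situation. The helper keep_matching and the two tibbles are made-up names for illustration, not anything from the vignette, and it assumes a global y happens to exist:

```r
library(dplyr)

# Hypothetical helper: filters on x == y without knowing the input's columns
keep_matching <- function(df) {
  filter(df, x == y)
}

y <- 1  # a variable outside the data that shares its name with a column

has_y <- tibble(x = c(1, 2, 3), y = c(1, 2, 4))
no_y  <- tibble(x = c(1, 2, 3))

res_col <- keep_matching(has_y)  # y is the column: keeps rows where x == y
res_env <- keep_matching(no_y)   # no y column: silently uses the outer y

nrow(res_col)
nrow(res_env)
```

The same function body gives two different comparisons depending on what the input contains, which is why it needs care inside a function.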


#3

Right, that would throw an error. But it wouldn’t interpret it as something other than df1[df1$c1 == df1$c2, ]. I think throwing an error is the only way it would fail; I can’t see a way for it to, for example, interpret the input as df1[df1$c1 == c2, ] as the vignette says it might. I’m guessing the example in the vignette was a spur-of-the-moment thing that may not have been tested?

I.e., if you have a failure, don’t just show the code, show the code failing :grinning:


#4

This is what I’m talking about:

suppressPackageStartupMessages(library(tidyverse))
df1 <- tibble(c1 = c(1,2,3), c2 = c(1,2,4))
k1 <- 1
filter(df1, c1 == k1)
#> # A tibble: 1 x 2
#>      c1    c2
#>   <dbl> <dbl>
#> 1     1     1

So, filter(df1, c1 == k1) is not being interpreted as filter(df1, df1$c1 == df1$k1), as k1 is not in the data frame.


#5

Hi, is this what you’re looking for?

> library(dplyr)
> df1 <- tibble(c1 = c(1,2,3), c2 = c(1,2,4))
> c2 <- 3
> filter(df1, c1 == c2)
# A tibble: 2 x 2
     c1    c2
  <dbl> <dbl>
1     1     1
2     2     2
> filter(df1, c1 == !!c2)
# A tibble: 1 x 2
     c1    c2
  <dbl> <dbl>
1     3     4

#6

If c2 is not a column in the data-frame, then it would be interpreted like that.
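A small sketch of that case, assuming a made-up tibble df_no_c2 that deliberately has no c2 column:

```r
library(dplyr)

# Hypothetical data: there is a c1 column but no c2 column
df_no_c2 <- tibble(c1 = c(1, 2, 3), other = c(1, 2, 4))
c2 <- 3

# c2 is not in the data, so it is looked up outside the data frame,
# i.e. this behaves like df_no_c2[df_no_c2$c1 == c2, ]
res <- filter(df_no_c2, c1 == c2)
res
```

Here res should contain only the row where c1 is 3.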


#7

Yes, I see that now, thanks @nick


#8

Yes, that would do it too. I misunderstood the example. It would help if either @edgararuiz’s or @nick’s example were included in the vignette.

Thanks,
Dan


#9

@edgararuiz @nick

On second thought I still think the filter(c1 == c2) example in the vignette is confusing at best. It says specifically:

dplyr code is ambiguous. Depending on what variables are defined where, filter(df, x == y) could be equivalent to any of:

df[df$x == df$y, ]
df[df$x == y, ]
df[x == df$y, ]
df[x == y, ]

So the example of filter(df1, c1 == !!c2) isn’t what the vignette is talking about, nor is passing in anything but filter(c1 == c2). I still don’t see any use case where dplyr::filter(df1, c1 == c2), which is the case the vignette is talking about, can be interpreted by filter as anything but df1[df1$c1 == df1$c2, ].

I don’t mean to be nitpicky, but when you are just learning things like this, as I am, confusing examples like this (IMHO of course) send you down rat holes trying to figure out what they are trying to show.

And then again, maybe I’m completely missing the point…

The short beginning of the vignette explains that the entire argument list may end up being interpreted by the function instead of by R, so you can’t take what is in the argument list at face value. And the parts that follow this filter(c1 == c2) example do a good job of explaining why this is the case.

So I think the filter(c1 == c2) example should just be removed from the vignette or replaced with working code that shows filter(c1 == c2) being interpreted in different ways.

But thanks again @edgararuiz and @nick, your comments are really helping me understand tidyeval and quosures.

Dan


#10

It sounds like you already have a pretty good understanding of how filter is working, but I still think that something is missing from this discussion - that all four forms described in the book can mean something different.

I’m having trouble describing it, but hopefully this code makes it a little more clear. I can’t really think of when the 4th case (df[x == y, ]) is actually useful, but I think the point the book is trying to make is that it is technically valid R code.

suppressPackageStartupMessages(library(dplyr))

# Case 1: df[df$x == df$y, ]
df <- tibble(x = c(1, 2, 3), y = c(1, 2, 4))
filter(df, x == y)
#> # A tibble: 2 x 2
#>       x     y
#>   <dbl> <dbl>
#> 1     1     1
#> 2     2     2
identical(
  filter(df, x == y), 
  df[df$x == df$y, ]
)
#> [1] TRUE
rm(df)

# Case 2: df[df$x == y, ]
df <- tibble(x = c(1, 2, 3), b = c(1, 5, 6))
y <- c(1, 2, 4)
filter(df, x == y)
#> # A tibble: 2 x 2
#>       x     b
#>   <dbl> <dbl>
#> 1     1     1
#> 2     2     5
identical(
  filter(df, x == y), 
  df[df$x == y, ]
)
#> [1] TRUE
rm(df, y)

# Case 3: df[x == df$y, ]
df <- tibble(a = c(0, 2, 7), y = c(1, 2, 4))
x <- c(1, 2, 3)
filter(df, x == y)
#> # A tibble: 2 x 2
#>       a     y
#>   <dbl> <dbl>
#> 1     0     1
#> 2     2     2
identical(
  filter(df, x == y), 
  df[x == df$y, ]
)
#> [1] TRUE
rm(df, x)

# Case 4: df[x == y, ]
df <- tibble(a = c(0, 2, 7), b = c(1, 2, 4))
x <- c(1, 2, 3)
y <- c(1, 2, 4)
filter(df, x == y)
#> # A tibble: 2 x 2
#>       a     b
#>   <dbl> <dbl>
#> 1     0     1
#> 2     2     2
identical(
  filter(df, x == y), 
  df[x == y, ]
)
#> [1] TRUE

#11

Thanks @gergness, you’ve given me some more things to think about.

Your examples are good ones; I hadn’t thought of those cases. But they highlight a problem that is mentioned later in the vignette, and I don’t think they highlight problems introduced by having the function parse the input args instead of letting R do it. This “surprising” evaluation of y can happen in any function that uses a variable it never initializes itself when that variable exists in an ancestor environment of the function, for example:

b <- 3
f8 <- function(a) {
  # b is not defined locally yet, so the right-hand b comes from the enclosing environment
  b <- b + 3
  a + b
}

f8(3)
#> [1] 9

Later the vignette shows that the use of something like filter(df, x == y) is an accident waiting to happen, because if the tibble lacks a y (or x) column, a y variable from the surrounding context will be used instead. The vignette shows that the .data pronoun should be used to prevent a surprising evaluation of x or y, which is what is happening in some of your examples… So the vignette is using an anti-pattern to try to show how something in tidyeval works, which I don’t think is a good way to introduce a concept.

filter(df, .data$x == .data$y)

In general you should use filter(df, .data$x == .data$y) if you intend to compare columns so an error will be thrown if the tibble lacks an x or y column.
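As a rough sketch of the difference (the tibble no_y and the outer variable y are made up for illustration):

```r
library(dplyr)

# Hypothetical data with no y column, plus a y defined outside the data
no_y <- tibble(x = c(1, 2, 3))
y <- 1

# Bare names: silently falls back to the y outside the data frame
silent <- filter(no_y, x == y)

# .data pronoun: a missing column becomes an error instead of a silent fallback
loud <- tryCatch(
  filter(no_y, .data$x == .data$y),
  error = function(e) e
)

nrow(silent)
inherits(loud, "error")
```

The bare version quietly returns a row, while the .data version fails, which is usually what you want inside a function or package.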

The example as shown at the beginning of the vignette, filter(df, x == y), really can’t be understood by someone who is just learning about tidyeval at that point in the explanation, nor does it give them any real information. It’s really an “inside baseball” example. That’s why I think it is confusing.

I think the vignette is a really good and useful intro to tidyeval, but it is very hard to write something for a new user when you already know all the ins and outs of a language or framework.

Dan


#12

It’s an antipattern, maybe, for programming with dplyr but not for general R usage. That non-standard evaluation is a very good feature of R and makes users’ lives much easier, but it can’t prevent users from shooting themselves in the foot.


#13

Think of it as if everything you pass to a non-standard eval (NSE) function will be evaluated in the data frame’s environment. If there’s no variable there by the specified name, the function will look in the parent environments until it finds it or is forced to error out. This is how NSE has always worked in R (see ?with), and is in keeping with the lexical scoping of environments.
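A quick base R sketch of that lookup order using with(); the names d and y are just for illustration:

```r
y <- 10  # lives outside the data frame

d <- data.frame(x = 1:3)

# x is found in the data frame; y is not, so the lookup continues outward
res_outer <- with(d, x + y)

# once y is also a column, the column masks the outer variable
d$y <- c(1, 2, 3)
res_inner <- with(d, x + y)

res_outer
res_inner
```

The expression x + y is identical in both calls; only the contents of d decide which y is used.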

The new tidy eval escapes are really built for adjusting where that evaluation looks. Most of the time you only need .data if you’ve got another variable in a nested environment (usually belonging to a function) of the same name (likely a modified copy to be compared) and you need to reach up to the version in the NSE environment or to limit scoping to the NSE environment, should it matter. !! is mostly needed when writing new tidy eval functions, but can also be used to escape the NSE environment and look upwards for a variable.
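Here is a small sketch of that name-clash situation. The function rows_matching is hypothetical; its argument c2 deliberately shares a name with a column:

```r
library(dplyr)

df1 <- tibble(c1 = c(1, 2, 3), c2 = c(1, 2, 4))

# Hypothetical function whose argument has the same name as a column
rows_matching <- function(data, c2) {
  list(
    masked  = filter(data, c1 == c2),        # bare c2 resolves to the column
    pronoun = filter(data, c1 == .data$c2),  # explicitly the column
    escaped = filter(data, c1 == !!c2)       # explicitly the argument's value
  )
}

out <- rows_matching(df1, c2 = 3)
sapply(out, nrow)
```

The first two elements compare the two columns, and only the !! version compares c1 against the value passed in.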

Everyone could always use either .data or !!, but frankly it’s a lot of work and a waste of keystrokes for most interactive usage, just like prefixing every function with some_package::. The likelihood of an inadvertent name clash when everything is stored in data frames (as it should be!) that doesn’t error out due to type/length is negligible. That’s not the case for programmatic work (packages, the odd very-flexible Shiny app), where tools for managing scope are important.

Ultimately, the criticism is correct: what filter(df, x == y) returns depends on what df contains and possibly what the enclosing environments contain. But that’s true of any call in a scoped environment (including .GlobalEnv; run pryr::parenvs(all = TRUE)). In fact, the call already depends on scoping by naming the data frame df, which is also the name of a function in stats. Even in base R, we can still escape the function so we can write

df(1, 1, 1)
#> [1] 0.1591549

# here `c(df, df)` is a list of two copies of a function
do.call(mapply, c(df, df))
#> Error in dots[[1L]][[1L]]: object of type 'closure' is not subsettable

df <- data.frame(x = 1:3, 
                 df1 = 1:3, 
                 df2 = 1:3)

# here it's two copies of a data frame
do.call(mapply, c(df, df))
#> Error in match.fun(FUN): argument "FUN" is missing, with no default

# here it's escaped so it's a list of a function and a data frame
do.call(mapply, c(stats::df, df))
#> [1] 0.15915494 0.11111111 0.06891611

Scoping is everywhere, but almost always works without incident:

library(dplyr)

pryr::parenvs()
#>   label                         name
#> 1 <environment: 0x7ff212060b80> ""  
#> 2 <environment: R_GlobalEnv>    ""

with(data.frame(x = 1), pryr::parenvs())
#>   label                         name
#> 1 <environment: 0x7ff2129d5950> ""  
#> 2 <environment: 0x7ff212060b80> ""  
#> 3 <environment: R_GlobalEnv>    ""

data.frame() %>% {pryr::parenvs()}
#>   label                         name
#> 1 <environment: 0x7ff2108c5b78> ""  
#> 2 <environment: 0x7ff212060b80> ""  
#> 3 <environment: R_GlobalEnv>    ""

data.frame(x = 1) %>% 
    mutate(x = list(pryr::parenvs())) %>% 
    purrr::pluck('x', 1)
#>   label                         name
#> 1 <environment: 0x7ff210926710> ""  
#> 2 <environment: 0x7ff210911ce8> ""  
#> 3 <environment: 0x7ff2109076a8> ""  
#> 4 <environment: 0x7ff21099e8d0> ""  
#> 5 <environment: 0x7ff212060b80> ""  
#> 6 <environment: R_GlobalEnv>    ""

In fact, all those stacks have one extra environment compared to calling them interactively, because I evaluated everything with reprex, which evaluated everything in a clean environment.

Should new useRs be taught how tidy eval is an extension of lexical scoping? Well, no, not at first; they should be taught to stick their variable names in the place they need so they can analyze their data. When they run into a case where it actually matters, they can learn more—or just pick a new variable name and go on in blissful ignorance a while longer.


#14

Thanks @alistaire, your examples and explanations are very helpful to my understanding of NSE, tidy_eval and friends.

I hope this thread hasn’t taken up too much of anyone’s time, but it has been incredibly helpful to me. Thanks to everyone who jumped on it.

I completely agree with your points about minimizing typing and reducing complexity… and that there is a difference between a user of a package and a developer of a package. Users of packages should not need a deep understanding of how the plumbing works, but (IMHO of course) developers of packages should.

My comments were in the context of me trying to figure out how to develop a package (and learn R at the same time). I just don’t see how you can develop reliable code for a package without a deep understanding of how NSE works … except maybe for simple packages that don’t use NSE.

I’m a programmer, not a statistician, and over time I’ve found that I had to “poke” at a lot of data. I’ve done lots of kinds of programming, from about as low level as you can go to database and relational database programming…

I see R as going through a transition similar to the one SQL went through. Initially SQL was tightly maintained by database admins whose job was to guarantee the integrity of the database. Over time it moved into general programming for people (like me) who had to poke at data. It’s common now to find devops who both develop applications that use databases and also fill the role of database admin. This probably reduces the integrity of the databases involved, but it greatly reduces app development time and increases the apps’ utility.

I think things like the tidyverse are going to accelerate R’s move into traditional programming for people, like me, who poke at data… no doubt it will lead to misapplied stats and data interpretation (like plots with 2 Y axes… but I finally understand why that’s bad), but it will also speed up the development of “data poking” apps and increase their utility.

Thanks again,
Dan

P.S. The RStudio Community is a great place to learn about R… better, I think, than SO, because the participants of the RC seem to have much more of a willingness to focus on passing on knowledge than those on SO.


#15

I think the above has been useful too! I want to point out that most packages should not use NSE. dplyr does because fluid interactive use has been prioritized to an extreme level. It’s worth it in this case, because “poking” data is such a common task. In the absence of dplyr, I’ve seen lots of suboptimal wild-caught code where useRs refer to rows and columns by number or create copies of variables in the global env, just to avoid, e.g., repeating the data frame name.

@danr, I don’t know if your first package is one of the ones that really requires NSE or not. It sure makes development harder. But if it does and you’re game … well you’re starting this game at a high difficulty level :grin:


#16

:flushed: irresistible historical aside: William Whewell (who effectively coined the term “Scientist” in 1834) offered a ~tongue-in-cheek alternative of “nature-pokers”

From Scientist: The Story of a Word

So, it’s basically just a fluke that there’s a field known as data science, rather than data nature-poking. @jennybryan, I hope this is the first step in your advocacy for the latter!

Sydney Ross, B.Sc., Ph.D., “Scientist: The Story of a Word,” Annals of Science, Vol. 18, Iss. 2, 1962


#18

This is not working for me.

> filter(df1, c1 == !!c2)

Error in quos(...) : object 'c2' not found

packageVersion('dplyr')
[1] ‘0.7.4’


#19

Try updating to the development versions of dplyr and rlang using devtools::install_github().

suppressPackageStartupMessages(library(dplyr))
df1 <- tibble(c1 = c(1,2,3), c2 = c(1,2,4))
c2 <- 3
filter(df1, c1 == c2)
#> # A tibble: 2 x 2
#>      c1    c2
#>   <dbl> <dbl>
#> 1    1.    1.
#> 2    2.    2.

filter(df1, c1 == !!c2)
#> # A tibble: 1 x 2
#>      c1    c2
#>   <dbl> <dbl>
#> 1    3.    4.

Created on 2018-04-17 by the reprex package (v0.2.0).