Inconsistent results of 'all_of()' in dplyr 1.0.0 functions?

pieterjanvc · July 16, 2020, 2:01am

Hi there,

I was trying to answer another post on this forum where I was forced to forget all my old tidy eval knowledge and try to implement the new dplyr 1.0.0 logic.

It didn't go very well, and it took me over an hour to wrap my head around it, but I guess that's just because I'm not used to the new functions yet. Anyway, I found what appears to be an inconsistency in how all_of() behaves in different dplyr functions and was wondering if anyone could clarify this.

all_of and select()

library(dplyr, warn.conflicts = FALSE)

myData = data.frame(x = c("A", "B"), y = 1:6)
myColumn = "x"
myData %>% select(all_of(myColumn))
#>   x
#> 1 A
#> 2 B
#> 3 A
#> 4 B
#> 5 A
#> 6 B

Here, all_of() can be used directly inside select(), because it converts a string (or a vector of strings) into a column selection.

all_of and group_by()

myData = data.frame(x = c("A", "B"), y = 1:6)
myColumn = "x"

myData %>% group_by(all_of(myColumn)) %>% summarise(y = sum(y))
#> # A tibble: 1 x 2
#>   `all_of(myColumn)`     y
#>   <chr>              <int>
#> 1 x                     21

This output is not what I expected: everything ends up in a single group. After a long search and some trial and error, I fixed it by wrapping all_of() in the across() function:

myData = data.frame(x = c("A", "B"), y = 1:6)
myColumn = "x"

myData %>% group_by(across(all_of(myColumn))) %>% summarise(y = sum(y))
#> # A tibble: 2 x 2
#>   x         y
#>   <chr> <int>
#> 1 A         9
#> 2 B        12

Can someone explain why I need the across() wrapper for group_by(), but not for select()?

Thanks!
PJ

nirgrahamuk · July 16, 2020, 8:35am

I think the answer is that all_of() is part of tidyselect and is designed only for use in selection verbs, not in other verbs. For group_by() you would use across(), I think.

siddharthprabhu · July 16, 2020, 10:32am

nirgrahamuk is correct. But just in case you're interested in a more in-depth explanation, here's the long version.

There are two distinct "flavours" of verbs in dplyr: selection verbs and action verbs.

Selection verbs (like select or rename) understand names and positions. For example, when we type select(iris, Sepal.Length:Petal.Length), select is actually translating those names into their corresponding positions. Thus the result is the same as if we had typed select(iris, 1:3). These verbs support the use of tidyselect helpers such as starts_with() or all_of().
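To make the name-to-position translation concrete, here is a minimal sketch using the built-in iris data. It only assumes dplyr is attached; by_name and by_position are illustrative names of my own, not anything from the posts above.

```r
library(dplyr, warn.conflicts = FALSE)

# Select a range of columns by name ...
by_name <- select(iris, Sepal.Length:Petal.Length)

# ... and the same range by position.
by_position <- select(iris, 1:3)

# Both calls pick the same three columns.
names(by_name)
identical(by_name, by_position)
```

If the two selections really are equivalent, identical() returns TRUE.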

Action verbs like mutate or summarise, on the other hand, generate new vectors. Supplying column positions to these verbs does not make sense. In fact, group_by() is also an action verb, because you can generate new grouping variables on the fly, like so:

library(dplyr, warn.conflicts = FALSE)

iris %>% 
  group_by(Sepal.Length > 5) %>% 
  summarise(n = n())
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 2 x 2
#>   `Sepal.Length > 5`     n
#>   <lgl>              <int>
#> 1 FALSE                 32
#> 2 TRUE                 118

Created on 2020-07-16 by the reprex package (v0.3.0)

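Given this background, it should be easy to understand why supplying all_of() to group_by() didn't work. group_by() treated all_of(myColumn) as an ordinary expression, evaluated it to the string "x", and created a new grouping column named all_of(myColumn) by recycling that single value to the number of rows. Every row therefore landed in the same group.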
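The same effect can be reproduced without all_of() at all, which may make the mechanism easier to see. This is only a sketch: I pass a literal string to group_by() instead, partly because (as far as I know) newer tidyselect releases warn when all_of() is called outside a selection context. The name constant_groups is illustrative.

```r
library(dplyr, warn.conflicts = FALSE)

myData <- data.frame(x = c("A", "B"), y = 1:6)

# "x" is just an expression that evaluates to a length-1 character vector.
# group_by() recycles it into a constant grouping column, so all six rows
# fall into one group.
constant_groups <- myData %>%
  group_by("x") %>%
  summarise(y = sum(y))

nrow(constant_groups)  # 1
constant_groups$y      # 21
```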

What we need are variants of these action verbs that understand selections, namely the scoped _at variants, which take selections through the vars() helper function. So you could use group_by_at() instead.

library(dplyr, warn.conflicts = FALSE)

myData <- data.frame(x = c("A", "B"), y = 1:6)

myColumn <- "x"

myData %>% group_by_at(all_of(myColumn)) %>% summarise(y = sum(y))
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 2 x 2
#>   x         y
#>   <chr> <int>
#> 1 A         9
#> 2 B        12

Created on 2020-07-16 by the reprex package (v0.3.0)

And since across() is the modern re-imagination of the scoped variants, it's hardly surprising that it understands selections.
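For what it's worth, the across(all_of()) pattern also extends to several grouping columns held in one character vector. Here is a hedged sketch on the built-in mtcars data; group_cols, result and mean_mpg are names I made up for illustration.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical vector of grouping column names stored as strings.
group_cols <- c("cyl", "am")

result <- mtcars %>%
  group_by(across(all_of(group_cols))) %>%
  summarise(mean_mpg = mean(mpg), .groups = "drop")

# One row per observed combination of cyl and am.
names(result)
nrow(result)
```

Setting .groups = "drop" just returns an ungrouped tibble and silences the regrouping message; leave it out if you want the default behaviour.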

pieterjanvc · July 16, 2020, 12:36pm

Hi,

Thanks both for the reply!

@siddharthprabhu, I really appreciate the explanation! It makes so much more sense now. I had never thought about selection and action verbs, and I didn't know that group_by() could actually create new groups (that's cool!).

Just to confirm: is the way I wrote the code, with group_by(across(all_of( ))), the correct and currently preferred structure?

Thanks again!
PJ

siddharthprabhu · July 16, 2020, 12:48pm

Yes, I believe so. Using all_of() is recommended when passing variable names stored as strings to selection verbs, because it avoids ambiguity between a data variable (a column) and an environment variable that share the same name. I've supplied an example of this below.

library(dplyr, warn.conflicts = FALSE)

name <- c("height", "mass")

# Since starwars contains a data variable called name, it has priority and the
# environment variable name will not be used.
select(starwars, name)
#> # A tibble: 87 x 1
#>    name              
#>    <chr>             
#>  1 Luke Skywalker    
#>  2 C-3PO             
#>  3 R2-D2             
#>  4 Darth Vader       
#>  5 Leia Organa       
#>  6 Owen Lars         
#>  7 Beru Whitesun lars
#>  8 R5-D4             
#>  9 Biggs Darklighter 
#> 10 Obi-Wan Kenobi    
#> # ... with 77 more rows

# To disambiguate and force the environment variable, use all_of().
select(starwars, all_of(name))
#> # A tibble: 87 x 2
#>    height  mass
#>     <int> <dbl>
#>  1    172    77
#>  2    167    75
#>  3     96    32
#>  4    202   136
#>  5    150    49
#>  6    178   120
#>  7    165    75
#>  8     97    32
#>  9    183    84
#> 10    182    77
#> # ... with 77 more rows

Created on 2020-07-16 by the reprex package (v0.3.0)
