Is there a dplyr function which corresponds to the pattern group_by + slice + ungroup?

Andrea · August 25, 2018, 10:01am

I use group_by multiple times in my code, which is great because it's very useful! However, with great power comes great responsibility: I am responsible for ungrouping the tibble, which I sometimes forget to do. Functions such as add_count and add_tally are excellent in this respect because they free the user from the burden of remembering to ungroup every time.

In my use case, I often need to summarize my dataframe by retaining only the first or the last element of each group, i.e., my summary function is slice. Example:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(magrittr)

# generate sample data
n <- 100
ngroups <- 10
my_df <- tibble(x = runif(n*ngroups), 
                  y = rnorm(n*ngroups), 
                  group = rep(LETTERS[1:ngroups], each = n)
                )

# slice
my_df %<>% 
  group_by(group) %>%
  slice(1) %>%
  ungroup

my_df
#> # A tibble: 10 x 3
#>         x      y group
#>     <dbl>  <dbl> <chr>
#>  1 0.136   0.647 A    
#>  2 0.606   1.22  B    
#>  3 0.919  -0.712 C    
#>  4 0.0421  0.634 D    
#>  5 0.199  -0.229 E    
#>  6 0.413  -0.343 F    
#>  7 0.699  -0.750 G    
#>  8 0.725  -0.183 H    
#>  9 0.722  -0.172 I    
#> 10 0.158  -1.13  J

Created on 2018-08-25 by the reprex package (v0.2.0).

As you can see, I used the group_by + slice() + ungroup pattern.

Question: is there a dplyr function which corresponds to this pattern? If not, is there some useful trick to avoid forgetting to ungroup in such a situation? Of course my real use case is much more complex, i.e., the function is longer and not always based on pipes (I don't use pipes for very long dplyr workflows, or for functions which need to be called a large number of times).

cderv · August 25, 2018, 10:16am

A trick to avoid forgetting to ungroup is to create a wrapper function using tidyeval. Reusing your example, something like this:

library(dplyr, warn.conflicts = FALSE)

# generate sample data
n <- 100
ngroups <- 10
set.seed(1000)
my_df <- tibble(x = runif(n*ngroups), 
                y = rnorm(n*ngroups), 
                group = rep(LETTERS[1:ngroups], each = n)
)

# slice
my_df %>% 
  group_by(group) %>%
  slice(1) %>%
  ungroup
#> # A tibble: 10 x 3
#>         x      y group
#>     <dbl>  <dbl> <chr>
#>  1 0.328   0.692 A    
#>  2 0.599  -0.151 B    
#>  3 0.0854  1.24  C    
#>  4 0.170   1.39  D    
#>  5 0.821  -0.239 E    
#>  6 0.724   2.08  F    
#>  7 0.735  -2.33  G    
#>  8 0.342   0.136 H    
#>  9 0.256  -0.724 I    
#> 10 0.947   0.247 J

# wrapper 
slice_by_group <- function(df, group, ...) {
  group <- enquo(group)
  group_by(df, !! group) %>%
    slice(...) %>%
    ungroup()
}

my_df %>%
  slice_by_group(group, 1)
#> # A tibble: 10 x 3
#>         x      y group
#>     <dbl>  <dbl> <chr>
#>  1 0.328   0.692 A    
#>  2 0.599  -0.151 B    
#>  3 0.0854  1.24  C    
#>  4 0.170   1.39  D    
#>  5 0.821  -0.239 E    
#>  6 0.724   2.08  F    
#>  7 0.735  -2.33  G    
#>  8 0.342   0.136 H    
#>  9 0.256  -0.724 I    
#> 10 0.947   0.247 J

Created on 2018-08-25 by the reprex package (v0.2.0).
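
For the specific case of keeping only the first row of each group, a possible shortcut is distinct() with .keep_all = TRUE, which keeps the first row found for each value of group and returns an ungrouped tibble. As far as I know, dplyr 1.1.0 and later also let slice_head() take a by argument for per-call grouping. A minimal sketch under those assumptions (first_rows and reference are just illustrative names, and the slice_head() call is guarded by a version check):

```r
library(dplyr, warn.conflicts = FALSE)

# same sample data as above
set.seed(1000)
n <- 100
ngroups <- 10
my_df <- tibble(x = runif(n * ngroups),
                y = rnorm(n * ngroups),
                group = rep(LETTERS[1:ngroups], each = n))

# reference result: the explicit group_by + slice + ungroup pattern
reference <- my_df %>%
  group_by(group) %>%
  slice(1) %>%
  ungroup()

# distinct() keeps the first row found for each value of `group`
first_rows <- distinct(my_df, group, .keep_all = TRUE)

# check that both approaches select the same rows and that nothing stays grouped
identical(first_rows$x, reference$x)
is_grouped_df(first_rows)

# assumption: the `by` argument of slice_head() needs dplyr >= 1.1.0
if (packageVersion("dplyr") >= "1.1.0") {
  first_rows_by <- slice_head(my_df, n = 1, by = group)
}
```

This only covers "first row per group"; for arbitrary slice() indices the wrapper above remains the more general option.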

cderv · August 25, 2018, 10:23am

You can also work with list columns using purrr and tidyr. First, nest your data, and then work on the list column as needed. If it is not a long workflow, you'll have to unnest pretty quickly, just as you would ungroup.

library(dplyr, warn.conflicts = FALSE)
library(tidyr)
library(purrr)
# generate sample data
n <- 100
ngroups <- 10
set.seed(1000)
my_df <- tibble(x = runif(n*ngroups), 
                y = rnorm(n*ngroups), 
                group = rep(LETTERS[1:ngroups], each = n)
)

my_df %>%
  nest(-group) %>%
  mutate(data_sliced = map(data, ~ slice(.x, 1)))
#> # A tibble: 10 x 3
#>    group data               data_sliced     
#>    <chr> <list>             <list>          
#>  1 A     <tibble [100 x 2]> <tibble [1 x 2]>
#>  2 B     <tibble [100 x 2]> <tibble [1 x 2]>
#>  3 C     <tibble [100 x 2]> <tibble [1 x 2]>
#>  4 D     <tibble [100 x 2]> <tibble [1 x 2]>
#>  5 E     <tibble [100 x 2]> <tibble [1 x 2]>
#>  6 F     <tibble [100 x 2]> <tibble [1 x 2]>
#>  7 G     <tibble [100 x 2]> <tibble [1 x 2]>
#>  8 H     <tibble [100 x 2]> <tibble [1 x 2]>
#>  9 I     <tibble [100 x 2]> <tibble [1 x 2]>
#> 10 J     <tibble [100 x 2]> <tibble [1 x 2]>

Created on 2018-08-25 by the reprex package (v0.2.0).

For more information, look at the recent webinar from RStudio
https://www.rstudio.com/resources/videos/how-to-work-with-list-columns/
and the other one
https://www.rstudio.com/resources/webinars/thinking-inside-the-box-you-can-do-that-inside-a-data-frame/

Andrea · August 25, 2018, 5:11pm

Really nice! Thanks! Just a question: I've seen quite a few tidyeval answers lately (which is great, because it's about time I learn it for good). However, the verb used to "capture" the input variable (not sure if it's the right term) is always different. Your answer uses enquo:

group <- enquo(group)
group_by(df, !! group) %>%

This answer to another question uses enexpr:

my_col <- enexpr(my_col)
output <- data %>% 
    mutate(!!my_col := as.integer(!!my_col))

Both your answer and the other answer then apply !! to the result of enquo/enexpr. Is there a reason to prefer enquo or enexpr?

Andrea · August 25, 2018, 5:12pm

Hmmm, sorry but I don't understand what nest achieves here. The tibble you created is not the one I was looking for. Is something missing in this answer? Or (more probably) am I missing something?

alistaire · August 25, 2018, 5:38pm

You can unnest and get the same result:

library(tidyverse)
set.seed(47)

n <- 100
ngroups <- 10
my_df <- tibble(x = runif(n*ngroups), 
                y = rnorm(n*ngroups), 
                group = rep(LETTERS[1:ngroups], each = n))

nested <- my_df %>% 
    nest(-group) %>% 
    mutate(data = map(data, slice, 1)) %>% 
    unnest()

grouped <- my_df %>% 
    group_by(group) %>% 
    slice(1) %>% 
    ungroup()

all_equal(nested, grouped)
#> [1] TRUE

nested
#> # A tibble: 10 x 3
#>    group     x      y
#>    <chr> <dbl>  <dbl>
#>  1 A     0.977  0.603
#>  2 B     0.834 -0.575
#>  3 C     0.443  0.626
#>  4 D     0.984  1.05 
#>  5 E     0.965 -2.39 
#>  6 F     0.195  1.35 
#>  7 G     0.921 -0.846
#>  8 H     0.194 -0.517
#>  9 I     0.555 -1.52 
#> 10 J     0.329 -1.27

grouped
#> # A tibble: 10 x 3
#>        x      y group
#>    <dbl>  <dbl> <chr>
#>  1 0.977  0.603 A    
#>  2 0.834 -0.575 B    
#>  3 0.443  0.626 C    
#>  4 0.984  1.05  D    
#>  5 0.965 -2.39  E    
#>  6 0.195  1.35  F    
#>  7 0.921 -0.846 G    
#>  8 0.194 -0.517 H    
#>  9 0.555 -1.52  I    
#> 10 0.329 -1.27  J

In this particular context it's not more concise, but there are certainly cases where the nesting idiom is a convenient alternative to grouping.
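
With tidyr 1.0.0 and later, I believe nest(-group) and a bare unnest() emit deprecation warnings; the newer interface names the list column and the column to unnest explicitly. A sketch of the same comparison under that assumption (nested_new is just an illustrative name):

```r
library(dplyr, warn.conflicts = FALSE)
library(tidyr)
library(purrr)

set.seed(47)
n <- 100
ngroups <- 10
my_df <- tibble(x = runif(n * ngroups),
                y = rnorm(n * ngroups),
                group = rep(LETTERS[1:ngroups], each = n))

nested_new <- my_df %>%
  nest(data = c(x, y)) %>%                 # one row per group, list column `data`
  mutate(data = map(data, slice, 1)) %>%   # keep the first row of each nested tibble
  unnest(data)                             # back to a flat, ungrouped tibble

grouped <- my_df %>%
  group_by(group) %>%
  slice(1) %>%
  ungroup()

# compare column by column, since `group` comes first after nesting
identical(nested_new$x, grouped$x) && identical(nested_new$y, grouped$y)
```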

mara · August 25, 2018, 7:19pm

Though this is basically an aside from the rest of this question, there's a helpful table and rule of thumb in Hadley's "Tidy evaluation: programming with ggplot2" slide deck.

Andrea · August 26, 2018, 7:27am

Thank you very much, @mara! I still don't fully understand the difference (enquo includes the user environment, ok, but when should I prefer to capture the user's environment and when not?). Anyway, the last slide of your presentation clearly states that the preferred pattern is enquo + !! (bang bang! I like the name), so I will use it from now on.

PS this pitch is great! It helped me a lot, and people with more knowledge of Computer Science and Metaprogramming than me will probably have all their doubts cleared. If someone is keeping a repository of tidyeval resources here on this community, this pitch should definitely be included, in case it's not there.

cderv · August 26, 2018, 8:39am

A quosure created with quo includes the environment in which the variable should be evaluated. With expr, you just have the expression, which is evaluated wherever you choose to evaluate it. A small example:

dummy_fun <- function(x, y = 3) {
  res_enquo <- rlang::eval_tidy(to_print <- rlang::enquo(x))
  message("When using quo/enquo, you have the environment:")
  print(to_print)
  res_enexpr <- rlang::eval_tidy(to_print <- rlang::enexpr(x))
  message("When using expr/enexpr, you just have the symbol (or name):")
  print(to_print)
  message(glue::glue("Inside this function, y value is {y}",
             "With enquo, results is : {res_enquo}",
             "With enexpr, results is : {res_enexpr}", 
             .sep = "\n"))
}

# define a value of y outside d
y <- 2
dummy_fun(y, y = 10)
#> When using quo/enquo, you have the environment:
#> <quosure>
#>   expr: ^y
#>   env:  global
#> When using expr/enexpr, you just have the symbol (or name):
#> y
#> Inside this function, y value is 10
#> With enquo, results is : 2
#> With enexpr, results is : 10
dummy_fun(y, y = 5)
#> When using quo/enquo, you have the environment:
#> <quosure>
#>   expr: ^y
#>   env:  global
#> When using expr/enexpr, you just have the symbol (or name):
#> y
#> Inside this function, y value is 5
#> With enquo, results is : 2
#> With enexpr, results is : 5

You see that if you use enquo, you get the content of the argument x provided by the user plus the associated environment, so the result is the y that I defined outside the function. With enexpr, you get the content of x that I provided, namely y, but no information about the associated environment. So when it is evaluated with eval_tidy here (the same applies with !!), it is evaluated in the function's environment, where I deliberately defined a y variable through the function's argument.

So, when you want the user's argument to be evaluated as defined elsewhere by the user, you need a quosure, i.e. enquo. When you just need an expression or a symbol to be evaluated in a context you define yourself, you need enexpr. Hope this is clearer!

To help you get started with tidyeval, there is friendlyeval, which helps you program with dplyr and lets you convert your code to rlang syntax with an RStudio addin.
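
With rlang 0.4.0 or later (an assumption about your installed version), the enquo() + !! pair can, as I understand it, be written in one step with the embrace operator {{ }}. A sketch of the earlier wrapper rewritten that way; slice_by_group_embrace is a made-up name:

```r
library(dplyr, warn.conflicts = FALSE)

# same idea as the earlier wrapper, with {{ group }} standing in for
# enquo(group) followed by !! group
slice_by_group_embrace <- function(df, group, ...) {
  df %>%
    group_by({{ group }}) %>%
    slice(...) %>%
    ungroup()
}

set.seed(1000)
my_df <- tibble(x = runif(1000),
                y = rnorm(1000),
                group = rep(LETTERS[1:10], each = 100))

res <- my_df %>%
  slice_by_group_embrace(group, 1)
```

Because {{ }} captures a quosure, the user's environment travels with the argument, just as with enquo().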

mara · August 26, 2018, 9:59am

It's in here:

mdsumner · August 28, 2018, 1:14pm

Would that be more general with the slice first, so that grouping can be optional or multiple?

function(df, .slice, ...) {
  .slice <- enquo(.slice)
  group_by(df, ...) %>%
    slice(!! .slice) %>%
    ungroup()
}

Apologies if I'm off-base, I really appreciated your example and got a lot from it!

cderv · August 28, 2018, 3:36pm

Yes, you're right. If you want to create a more generalized function, it would work.

library(dplyr, warn.conflicts = FALSE)
#> Warning: package 'dplyr' was built under R version 3.4.4

# generate sample data
n <- 100
ngroups <- 10
set.seed(1000)
my_df <- tibble(x = runif(n*ngroups), 
                y = rnorm(n*ngroups), 
                group = rep(LETTERS[1:ngroups], each = n)
)

slice_by_group <- function(df, .slice, ...) {
  .slice <- enquo(.slice)
  group_by(df, ...) %>%
    slice(!! .slice) %>%
    ungroup()
}

# one group
my_df %>%
  slice_by_group(1, group) %>%
  head()
#> # A tibble: 6 x 3
#>        x      y group
#>    <dbl>  <dbl> <chr>
#> 1 0.328   0.692 A    
#> 2 0.599  -0.151 B    
#> 3 0.0854  1.24  C    
#> 4 0.170   1.39  D    
#> 5 0.821  -0.239 E    
#> 6 0.724   2.08  F
my_df %>%
  slice_by_group(1:2, group) %>%
  head()
#> Warning: package 'bindrcpp' was built under R version 3.4.4
#> # A tibble: 6 x 3
#>        x      y group
#>    <dbl>  <dbl> <chr>
#> 1 0.328   0.692 A    
#> 2 0.759   0.389 A    
#> 3 0.599  -0.151 B    
#> 4 0.452  -0.105 B    
#> 5 0.0854  1.24  C    
#> 6 0.332  -1.28  C

# two groups
new_var <- sample(1:2, size = nrow(my_df), replace = TRUE)
my_df %>%
  mutate(group2 = new_var) %>%
  slice_by_group(1, group, group2) %>%
  head()
#> # A tibble: 6 x 4
#>        x      y group group2
#>    <dbl>  <dbl> <chr>  <int>
#> 1 0.759   0.389 A          1
#> 2 0.328   0.692 A          2
#> 3 0.599  -0.151 B          1
#> 4 0.452  -0.105 B          2
#> 5 0.332  -1.28  C          1
#> 6 0.0854  1.24  C          2

my_df %>%
  mutate(group2 = new_var) %>%
  slice_by_group(c(1,3), group, group2) %>%
  head()
#> # A tibble: 6 x 4
#>       x      y group group2
#>   <dbl>  <dbl> <chr>  <int>
#> 1 0.759  0.389 A          1
#> 2 0.691 -1.95  A          1
#> 3 0.328  0.692 A          2
#> 4 0.866  0.524 A          2
#> 5 0.599 -0.151 B          1
#> 6 0.846 -2.04  B          1

Created on 2018-08-28 by the reprex package (v0.2.0).

Glad I could help you understand!

cderv · August 28, 2018, 4:59pm

However, @mdsumner, I wonder if it shouldn't be:

slice_by_group <- function(df, .slice, ...) {
  .slice <- enquo(.slice)
  groups <- enquos(...)
  group_by(df, !!! groups) %>%
    slice(!! .slice) %>%
    ungroup()
}

But I am still unsure of the added value: it gives the same results. Maybe it matters with verbs other than group_by, or in a use case where you want to "play" with the arguments captured as quosures. Something I need to dig into!
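
One situation where capturing the dots with enquos() could pay off is when the wrapper needs to look at the grouping expressions before using them, for example to count them. A rough sketch under that assumption (the name slice_by_group_verbose and the message text are made up for illustration):

```r
library(dplyr, warn.conflicts = FALSE)
library(rlang)

slice_by_group_verbose <- function(df, .slice, ...) {
  .slice <- enquo(.slice)
  groups <- enquos(...)   # a list of quosures, one per grouping expression
  if (length(groups) == 0) {
    message("No grouping variables supplied: slicing the whole data frame")
  }
  group_by(df, !!! groups) %>%
    slice(!! .slice) %>%
    ungroup()
}

df <- tibble(g = c("a", "a", "b"), v = 1:3)
with_group <- slice_by_group_verbose(df, 1, g)  # first row of each group
no_group   <- slice_by_group_verbose(df, 1)     # first row overall, with a message
```

If the wrapper never inspects or modifies the grouping expressions, passing ... straight to group_by() is simpler and should behave the same.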