Is ungroup() recommended after every group_by()?

dplyr

#1

I once had a problem that was solved with ungroup(), so I started using it all the time, but I’m wondering whether it’s really necessary. Would love to hear what others do.


#2

I tend to use ungroup() after every group_by() for a few reasons:

  1. Avoid potential unintended errors due to the grouping.
  2. Makes pipes more readable by explicitly pointing out places where the data is being operated on according to groups.
  3. I like to save transformed datasets as .Rdata objects to speed up loading times for scripts I run often and Shiny apps. The groupings are retained in such objects. By ensuring that I always ungroup, I avoid situations where I load an .Rdata object a year later and struggle with a problem not realizing a grouping has been applied.
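
A minimal sketch of point 3, assuming dplyr is installed; the object names `grouped` and `clean` and the temporary file are illustrative only. It saves a grouped data frame, reloads it, and checks for leftover grouping with is_grouped_df() and group_vars():

```r
library(dplyr)

# A grouped data frame, standing in for a "transformed dataset".
grouped <- data.frame(Titanic) %>%
  group_by(Class)

# Round-trip through an .RData file (a temporary path, for illustration).
path <- tempfile(fileext = ".RData")
save(grouped, file = path)
rm(grouped)
load(path)

# The grouping survives the round trip.
is_grouped_df(grouped)  # TRUE
group_vars(grouped)     # "Class"

# Ungrouping before saving (or right after loading) avoids the surprise.
clean <- ungroup(grouped)
is_grouped_df(clean)    # FALSE
```

Checking group_vars() right after load() is a cheap way to find out whether an old object still carries a grouping.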

#3

Hi,

That’s very helpful… I will continue to ungroup. I was going to ask if you had an example in which not doing so caused a problem but I answered my own question (by chance). Here’s a MWE that reproduces the error I got:

> data.frame(Titanic) %>% 
     group_by(Class, Age) %>% 
     summarize(Freq = sum(Freq)) %>% 
     mutate(Class = reorder(Class, Freq))
Error in mutate_impl(.data, dots) : 
  Column `Class` can't be modified because it's a grouping variable

Interestingly, though, it doesn’t happen with just one group_by variable:

> data.frame(Titanic) %>% 
     group_by(Class) %>% 
     summarize(Freq = sum(Freq)) %>%  
     mutate(Class = reorder(Class, Freq))
# A tibble: 4 x 2
  Class  Freq
  <fct> <dbl>
1 1st     325
2 2nd     285
3 3rd     706
4 Crew    885

I don’t fully comprehend what it means for a variable to be a grouping variable, but this is at least a start.
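
A small sketch of why the two cases differ, assuming dplyr's default behavior in which summarize() drops the last grouping variable from its result. The names `two` and `one` are illustrative; recent dplyr versions may also print an informational message about `.groups`:

```r
library(dplyr)

titanic <- data.frame(Titanic)

# Two grouping variables: summarize() peels off the last one (Age),
# so the result is still grouped by Class.
two <- titanic %>%
  group_by(Class, Age) %>%
  summarize(Freq = sum(Freq))

# One grouping variable: summarize() peels it off, leaving no grouping.
one <- titanic %>%
  group_by(Class) %>%
  summarize(Freq = sum(Freq))

group_vars(two)  # "Class"
group_vars(one)  # character(0)
```

With `two`, Class is still a grouping variable when mutate() runs, which is what the error message complains about; with `one`, nothing is grouped any more.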

EDIT (in response to @danr’s post): To sum up the context: I am not asking for help debugging this code. I know that the problem is that I didn’t ungroup(). The point is to illustrate why it’s important to use ungroup().


#4

group_by adds metadata to a data.frame that marks how rows should be grouped. As long as that metadata is there you won’t be able to change the factors of the columns involved in the grouping. See the following examples.

You should use a reproducible example for your code. See:

With your code as it stands, it isn’t possible to tell if you meant to use plyr::summarize or dplyr::summarize.

Also, a reprex makes it possible for us to just copy and paste your code and run it in the same environment that you did. Everyone here is answering questions on their own time, so we ask that you do what you can to minimize that time… a reprex is the best way to do that.

suppressPackageStartupMessages(library(dplyr))

# first of all dplyr::group_by adds meta-data to
# the data.frame that other functions, like
# dplyr::summarize, use when they do calculations

t1 <- data.frame(Titanic) %>%
   group_by(Class, Age)

# notice that the meta-data show how rows
# should be grouped
str(t1)
#> Classes 'grouped_df', 'tbl_df', 'tbl' and 'data.frame':  32 obs. of  5 variables:
#>  $ Class   : Factor w/ 4 levels "1st","2nd","3rd",..: 1 2 3 4 1 2 3 4 1 2 ...
#>  $ Sex     : Factor w/ 2 levels "Male","Female": 1 1 1 1 2 2 2 2 1 1 ...
#>  $ Age     : Factor w/ 2 levels "Child","Adult": 1 1 1 1 1 1 1 1 2 2 ...
#>  $ Survived: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ Freq    : num  0 0 35 0 0 0 17 0 118 154 ...
#>  - attr(*, "vars")= chr  "Class" "Age"
#>  - attr(*, "drop")= logi TRUE
#>  - attr(*, "indices")=List of 8
#>   ..$ : int  0 4 16 20
#>   ..$ : int  8 12 24 28
#>   ..$ : int  1 5 17 21
#>   ..$ : int  9 13 25 29
#>   ..$ : int  2 6 18 22
#>   ..$ : int  10 14 26 30
#>   ..$ : int  3 7 19 23
#>   ..$ : int  11 15 27 31
#>  - attr(*, "group_sizes")= int  4 4 4 4 4 4 4 4
#>  - attr(*, "biggest_group_size")= int 4
#>  - attr(*, "labels")='data.frame':   8 obs. of  2 variables:
#>   ..$ Class: Factor w/ 4 levels "1st","2nd","3rd",..: 1 1 2 2 3 3 4 4
#>   ..$ Age  : Factor w/ 2 levels "Child","Adult": 1 2 1 2 1 2 1 2
#>   ..- attr(*, "vars")= chr  "Class" "Age"
#>   ..- attr(*, "drop")= logi TRUE

Created on 2018-02-16 by the reprex package (v0.2.0).
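
The attributes printed by str() are internal details and may be laid out differently in other dplyr versions. As a hedged alternative, the same information can be read through dplyr's accessor functions; this sketch rebuilds `t1` from the example so that it runs on its own:

```r
library(dplyr)

t1 <- data.frame(Titanic) %>%
  group_by(Class, Age)

# Which columns define the groups?
group_vars(t1)  # "Class" "Age"

# How many groups are there, and how many rows are in each?
n_groups(t1)    # 8
group_size(t1)  # 4 4 4 4 4 4 4 4
```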

suppressPackageStartupMessages(library(dplyr))

# dplyr::summarize passes along that information,
# minus the last grouping variable (Age)
t2 <- data.frame(Titanic) %>% 
    group_by(Class, Age) %>% 
    summarize(Freq = sum(Freq))
t2
#> # A tibble: 8 x 3
#> # Groups:   Class [?]
#>   Class Age     Freq
#>   <fct> <fct>  <dbl>
#> 1 1st   Child   6.00
#> 2 1st   Adult 319   
#> 3 2nd   Child  24.0 
#> 4 2nd   Adult 261   
#> 5 3rd   Child  79.0 
#> 6 3rd   Adult 627   
#> 7 Crew  Child   0   
#> 8 Crew  Adult 885

str(t2)
#> Classes 'grouped_df', 'tbl_df', 'tbl' and 'data.frame':  8 obs. of  3 variables:
#>  $ Class: Factor w/ 4 levels "1st","2nd","3rd",..: 1 1 2 2 3 3 4 4
#>  $ Age  : Factor w/ 2 levels "Child","Adult": 1 2 1 2 1 2 1 2
#>  $ Freq : num  6 319 24 261 79 627 0 885
#>  - attr(*, "vars")= chr "Class"
#>  - attr(*, "drop")= logi TRUE

Created on 2018-02-16 by the reprex package (v0.2.0).

# the following fails because mutate is trying to
# change one of the columns used by group_by
# and it can see that because of the meta-data
# passed through by dplyr::summarize
suppressPackageStartupMessages(library(dplyr))
t3 <- data.frame(Titanic) %>% 
    group_by(Class, Age) %>% 
    summarize(Freq = sum(Freq)) %>%
    mutate(Class = reorder(Class, Freq))
#> Error in mutate_impl(.data, dots): Column `Class` can't be modified because it's a grouping variable

Created on 2018-02-16 by the reprex package (v0.2.0).

# ungroup removes any grouping meta-data
suppressPackageStartupMessages(library(dplyr))
t4 <- data.frame(Titanic) %>% 
    group_by(Class, Age) %>% 
    ungroup()

# notice there is no grouping meta-data in t4
str(t4)
#> Classes 'tbl_df', 'tbl' and 'data.frame':    32 obs. of  5 variables:
#>  $ Class   : Factor w/ 4 levels "1st","2nd","3rd",..: 1 2 3 4 1 2 3 4 1 2 ...
#>  $ Sex     : Factor w/ 2 levels "Male","Female": 1 1 1 1 2 2 2 2 1 1 ...
#>  $ Age     : Factor w/ 2 levels "Child","Adult": 1 1 1 1 1 1 1 1 2 2 ...
#>  $ Survived: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ Freq    : num  0 0 35 0 0 0 17 0 118 154 ...

Created on 2018-02-16 by the reprex package (v0.2.0).

# so calling ungroup() before running mutate
# lets the factor levels be changed
suppressPackageStartupMessages(library(dplyr))
t5 <- data.frame(Titanic) %>% 
    group_by(Class, Age) %>% 
    summarize(Freq = sum(Freq)) %>%
    ungroup() %>%
    mutate(Class = reorder(Class, Freq))

str(t5)
#> Classes 'tbl_df', 'tbl' and 'data.frame':    8 obs. of  3 variables:
#>  $ Class: Factor w/ 4 levels "2nd","1st","3rd",..: 2 2 1 1 3 3 4 4
#>   ..- attr(*, "scores")= num [1:4(1d)] 162 142 353 442
#>   .. ..- attr(*, "dimnames")=List of 1
#>   .. .. ..$ : chr  "1st" "2nd" "3rd" "Crew"
#>  $ Age  : Factor w/ 2 levels "Child","Adult": 1 2 1 2 1 2 1 2
#>  $ Freq : num  6 319 24 261 79 627 0 885

Created on 2018-02-16 by the reprex package (v0.2.0).
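
A variation that was not part of the original examples: assuming dplyr 1.0.0 or later, summarize() accepts a `.groups` argument, and `.groups = "drop"` returns an ungrouped result without a separate ungroup() step. The name `t6` is illustrative:

```r
library(dplyr)

t6 <- data.frame(Titanic) %>%
  group_by(Class, Age) %>%
  # drop all grouping in the summarized result (needs dplyr >= 1.0.0)
  summarize(Freq = sum(Freq), .groups = "drop") %>%
  mutate(Class = reorder(Class, Freq))

is_grouped_df(t6)  # FALSE
levels(t6$Class)   # "2nd" "1st" "3rd" "Crew"
```

The explicit ungroup() call still has the advantage of working on older dplyr versions and of being visible in the pipe.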


#5

Yes, I figured out from the error message that I can’t reorder factor levels if the variable is grouped. That does not provide a full explanation of what you can and can’t do when variables are grouped.

With all due respect for your time and attempt to answer the question, I think your long explanation of what a MWE is is overkill in this situation for a few reasons:

  1. I posted the question under the “tidyverse” and even added a dplyr tag. plyr is not part of the tidyverse.
  2. The code I posted was meant as an example of what doesn’t work with grouped variables. People who don’t ungroup might be interested in this example. There is no reason that anyone should have to run that code, only observe that you can’t always reorder factor levels with grouped variables.
  3. Most importantly, the RStudio community forum is meant to be a friendly place where you don’t get the kind of answers that you get on Stack Overflow. It’s off-putting to be told “Everyone here is answering questions on their own time so we ask that you do what you can to minimize that time.”

#6

In my experience, the most common error that results from grouping is the one you showed. You can’t perform any operations on the grouping variables, meaning they can’t be mutated or summarized. I tend to deal with this issue when I’m using ggplot2 to visualize a dataset. I often struggle to get the labels just right using calculations, so I’ll typically create a summarized view of the dataset to calculate labels that I’ll need in my visual (the values for averages, medians, etc.). Whenever I have problems getting the summary view to work, it’s typically because I applied some sort of grouping to the main dataset and forgot to remove it.
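
A hedged illustration of that last situation, using the Titanic data from earlier in the thread; the names `main`, `by_accident`, and `overall` are made up for the example. A leftover grouping silently changes the shape of a summary meant to produce one overall label value:

```r
library(dplyr)

# The "main dataset", with a grouping applied earlier and forgotten.
main <- data.frame(Titanic) %>%
  group_by(Class)

# Intended: a single overall mean. Actual: one row per Class.
by_accident <- main %>%
  summarize(mean_freq = mean(Freq))

# Removing the grouping first gives the single-row summary.
overall <- main %>%
  ungroup() %>%
  summarize(mean_freq = mean(Freq))

nrow(by_accident)  # 4
nrow(overall)      # 1
```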


#7

That’s exactly what I was doing: changing factor levels for a plot. Ungroup to the rescue.