Is ungroup() recommended after every group_by()?

I once had a problem that was solved with ungroup(), so I started using it all the time, but I'm wondering if it's really necessary. Would love to hear what others do.


I tend to use ungroup() after every group_by() for a few reasons:

  1. Avoid potential unintended errors due to the grouping.
  2. Makes pipes more readable by explicitly pointing out places where the data is being operated on according to groups.
  3. I like to save transformed datasets as .Rdata objects to speed up loading times for scripts I run often and Shiny apps. The groupings are retained in such objects. By ensuring that I always ungroup, I avoid situations where I load an .Rdata object a year later and struggle with a problem not realizing a grouping has been applied.
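
A minimal sketch of point 3, assuming dplyr is installed and using the built-in Titanic data. The temporary file path and the object name `grouped` are illustrative only. It shows that the grouping survives a save()/load() round trip and how to check for it with group_vars():

```r
suppressPackageStartupMessages(library(dplyr))

# Group a data frame and save it without ungrouping first
grouped <- data.frame(Titanic) %>%
  group_by(Class, Age)

path <- tempfile(fileext = ".RData")  # illustrative temporary file
save(grouped, file = path)
rm(grouped)

# Loading the file restores `grouped`, grouping metadata included
load(path)
is_grouped_df(grouped)   # TRUE
group_vars(grouped)      # "Class" "Age"

# Ungrouping before saving (or right after loading) avoids the surprise
clean <- ungroup(grouped)
is_grouped_df(clean)     # FALSE
```

Calling group_vars() on a freshly loaded object is a cheap way to find out whether a forgotten grouping is still attached.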

Hi,

That's very helpful... I will continue to ungroup. I was going to ask if you had an example in which not doing so caused a problem but I answered my own question (by chance). Here's a MWE that reproduces the error I got:

> data.frame(Titanic) %>% 
     group_by(Class, Age) %>% 
     summarize(Freq = sum(Freq)) %>% 
     mutate(Class = reorder(Class, Freq))
Error in mutate_impl(.data, dots) : 
  Column `Class` can't be modified because it's a grouping variable

Note that it doesn't happen with just one group_by variable since summarize() removes the last grouping variable:

> data.frame(Titanic) %>% 
     group_by(Class) %>% 
     summarize(Freq = sum(Freq)) %>%  
     mutate(Class = reorder(Class, Freq))
# A tibble: 4 x 2
  Class  Freq
  <fct> <dbl>
1 1st     325
2 2nd     285
3 3rd     706
4 Crew    885
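
To see which grouping is left after summarize(), group_vars() can be called on each result. This is only a sketch: the object names are made up for illustration, and the `.groups` argument in the last step assumes dplyr 1.0.0 or later (newer versions may also print a message about the remaining grouping in the two-variable case).

```r
suppressPackageStartupMessages(library(dplyr))

# Two grouping variables: summarize() drops only the last one (Age)
two_keys <- data.frame(Titanic) %>%
  group_by(Class, Age) %>%
  summarize(Freq = sum(Freq))
group_vars(two_keys)   # "Class"

# One grouping variable: nothing is left to group by
one_key <- data.frame(Titanic) %>%
  group_by(Class) %>%
  summarize(Freq = sum(Freq))
group_vars(one_key)    # character(0)

# Assumption: dplyr >= 1.0.0, where summarize() accepts `.groups`.
# `.groups = "drop"` returns an ungrouped result in one step.
dropped <- data.frame(Titanic) %>%
  group_by(Class, Age) %>%
  summarize(Freq = sum(Freq), .groups = "drop")
group_vars(dropped)    # character(0)
```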

EDIT (in response to @danr's post): To sum up the context: I am not asking for help debugging this code. I know that the problem is that I didn't ungroup(). The point is to illustrate why it's important to use ungroup().


group_by() adds metadata to a data.frame that marks how rows should be grouped. As long as that metadata is there, you won't be able to modify the columns involved in the grouping, which includes changing their factor levels. See the following examples.

You should use a reproducible example for your code. See:

https://www.jessemaegan.com/post/so-you-ve-been-asked-to-make-a-reprex

As it stands, it isn't possible to tell from your code whether you meant to use plyr::summarize or dplyr::summarize.

A reprex also makes it possible for us to just copy and paste your code and run it in the same environment that you did. Everyone here is answering questions on their own time, so we ask that you do what you can to minimize that time... a reprex is the best way to do that.

suppressPackageStartupMessages(library(dplyr))

# first of all, dplyr::group_by adds meta-data to
# the data.frame that other functions, like
# dplyr::summarize, use when they do calculations

t1 <- data.frame(Titanic) %>%
   group_by(Class, Age)

# notice that the meta-data show how rows
# should be grouped
str(t1)
#> Classes 'grouped_df', 'tbl_df', 'tbl' and 'data.frame':  32 obs. of  5 variables:
#>  $ Class   : Factor w/ 4 levels "1st","2nd","3rd",..: 1 2 3 4 1 2 3 4 1 2 ...
#>  $ Sex     : Factor w/ 2 levels "Male","Female": 1 1 1 1 2 2 2 2 1 1 ...
#>  $ Age     : Factor w/ 2 levels "Child","Adult": 1 1 1 1 1 1 1 1 2 2 ...
#>  $ Survived: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ Freq    : num  0 0 35 0 0 0 17 0 118 154 ...
#>  - attr(*, "vars")= chr  "Class" "Age"
#>  - attr(*, "drop")= logi TRUE
#>  - attr(*, "indices")=List of 8
#>   ..$ : int  0 4 16 20
#>   ..$ : int  8 12 24 28
#>   ..$ : int  1 5 17 21
#>   ..$ : int  9 13 25 29
#>   ..$ : int  2 6 18 22
#>   ..$ : int  10 14 26 30
#>   ..$ : int  3 7 19 23
#>   ..$ : int  11 15 27 31
#>  - attr(*, "group_sizes")= int  4 4 4 4 4 4 4 4
#>  - attr(*, "biggest_group_size")= int 4
#>  - attr(*, "labels")='data.frame':   8 obs. of  2 variables:
#>   ..$ Class: Factor w/ 4 levels "1st","2nd","3rd",..: 1 1 2 2 3 3 4 4
#>   ..$ Age  : Factor w/ 2 levels "Child","Adult": 1 2 1 2 1 2 1 2
#>   ..- attr(*, "vars")= chr  "Class" "Age"
#>   ..- attr(*, "drop")= logi TRUE

Created on 2018-02-16 by the reprex package (v0.2.0).

suppressPackageStartupMessages(library(dplyr))

# dplyr::summarize passes along that information
t2 <- data.frame(Titanic) %>% 
    group_by(Class, Age) %>% 
    summarize(Freq = sum(Freq))
t2
#> # A tibble: 8 x 3
#> # Groups:   Class [?]
#>   Class Age     Freq
#>   <fct> <fct>  <dbl>
#> 1 1st   Child   6.00
#> 2 1st   Adult 319   
#> 3 2nd   Child  24.0 
#> 4 2nd   Adult 261   
#> 5 3rd   Child  79.0 
#> 6 3rd   Adult 627   
#> 7 Crew  Child   0   
#> 8 Crew  Adult 885

str(t2)
#> Classes 'grouped_df', 'tbl_df', 'tbl' and 'data.frame':  8 obs. of  3 variables:
#>  $ Class: Factor w/ 4 levels "1st","2nd","3rd",..: 1 1 2 2 3 3 4 4
#>  $ Age  : Factor w/ 2 levels "Child","Adult": 1 2 1 2 1 2 1 2
#>  $ Freq : num  6 319 24 261 79 627 0 885
#>  - attr(*, "vars")= chr "Class"
#>  - attr(*, "drop")= logi TRUE

Created on 2018-02-16 by the reprex package (v0.2.0).

# the following fails because mutate is trying to
# change one of the columns used by group_by
# and it can see that because of the meta-data
# passed through by dplyr::summarize
suppressPackageStartupMessages(library(dplyr))
t3 <- data.frame(Titanic) %>% 
    group_by(Class, Age) %>% 
    summarize(Freq = sum(Freq)) %>%
    mutate(Class = reorder(Class, Freq))
#> Error in mutate_impl(.data, dots): Column `Class` can't be modified because it's a grouping variable

Created on 2018-02-16 by the reprex package (v0.2.0).

# ungroup removes any grouping meta-data
suppressPackageStartupMessages(library(dplyr))
t4 <- data.frame(Titanic) %>% 
    group_by(Class, Age) %>% 
    ungroup()

# notice there is no grouping meta-data in t4
str(t4)
#> Classes 'tbl_df', 'tbl' and 'data.frame':    32 obs. of  5 variables:
#>  $ Class   : Factor w/ 4 levels "1st","2nd","3rd",..: 1 2 3 4 1 2 3 4 1 2 ...
#>  $ Sex     : Factor w/ 2 levels "Male","Female": 1 1 1 1 2 2 2 2 1 1 ...
#>  $ Age     : Factor w/ 2 levels "Child","Adult": 1 1 1 1 1 1 1 1 2 2 ...
#>  $ Survived: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ Freq    : num  0 0 35 0 0 0 17 0 118 154 ...

Created on 2018-02-16 by the reprex package (v0.2.0).

# so calling ungroup() before running mutate()
# lets the factor levels be changed
suppressPackageStartupMessages(library(dplyr))
t5 <- data.frame(Titanic) %>% 
    group_by(Class, Age) %>% 
    summarize(Freq = sum(Freq)) %>%
    ungroup() %>%
    mutate(Class = reorder(Class, Freq))

str(t5)
#> Classes 'tbl_df', 'tbl' and 'data.frame':    8 obs. of  3 variables:
#>  $ Class: Factor w/ 4 levels "2nd","1st","3rd",..: 2 2 1 1 3 3 4 4
#>   ..- attr(*, "scores")= num [1:4(1d)] 162 142 353 442
#>   .. ..- attr(*, "dimnames")=List of 1
#>   .. .. ..$ : chr  "1st" "2nd" "3rd" "Crew"
#>  $ Age  : Factor w/ 2 levels "Child","Adult": 1 2 1 2 1 2 1 2
#>  $ Freq : num  6 319 24 261 79 627 0 885

Created on 2018-02-16 by the reprex package (v0.2.0).


In my experience, the most common error that results from grouping is the one you showed. You can't perform any operations on the grouping variables, meaning they can't be mutated or summarized. I tend to run into this when I'm using ggplot2 to visualize a dataset. I often struggle to get the labels just right using calculations, so I'll typically create a summarized view of the dataset to calculate the labels I need in my visual (the values for averages, medians, etc.). Whenever I have problems getting the summary view to work, it's typically because I applied some sort of grouping to the main dataset and forgot to remove it.
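
A leftover grouping can also change a summary view silently, with no error at all. The sketch below uses the Titanic data from earlier in the thread; the names `by_class_age`, `per_class`, and `overall` are made up for illustration.

```r
suppressPackageStartupMessages(library(dplyr))

# After this step the result is still grouped by Class
by_class_age <- data.frame(Titanic) %>%
  group_by(Class, Age) %>%
  summarize(Freq = sum(Freq))

# Forgot to ungroup: the "overall" total comes back once per Class
per_class <- by_class_age %>%
  summarize(total = sum(Freq))
nrow(per_class)   # 4

# Ungrouping first gives the single overall value
overall <- by_class_age %>%
  ungroup() %>%
  summarize(total = sum(Freq))
nrow(overall)     # 1
overall$total     # 2201
```

If a label calculation returns more rows than expected, checking group_vars() on the input is a quick first step.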


That's exactly what I was doing: changing factor levels for a plot. Ungroup to the rescue.