Why doesn't group_by %>% purrr::f() behave the same way as summarise()?

bragks · November 7, 2018, 8:57am

Hi!
I'm working my way through Advanced R, and at some basic level I think I understand why this doesn't work, but could someone explain why the methods below return different output? Thank you!

library(purrr)
library(dplyr, warn.conflicts = FALSE)

mtcars %>% 
  group_by(cyl) %>% 
  summarise_if(is.double, mean)
#> # A tibble: 3 x 11
#>     cyl   mpg  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1     4  26.7  105.  82.6  4.07  2.29  19.1 0.909 0.727  4.09  1.55
#> 2     6  19.7  183. 122.   3.59  3.12  18.0 0.571 0.429  3.86  3.43
#> 3     8  15.1  353. 209.   3.23  4.00  16.8 0     0.143  3.29  3.5

mtcars %>% 
  group_by(cyl) %>%
  modify_if(is.double, mean) %>% 
  head(3)
#> # A tibble: 3 x 11
#> # Groups:   cyl [1]
#>     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  20.1  6.19  231.  147.  3.60  3.22  17.8 0.438 0.406  3.69  2.81
#> 2  20.1  6.19  231.  147.  3.60  3.22  17.8 0.438 0.406  3.69  2.81
#> 3  20.1  6.19  231.  147.  3.60  3.22  17.8 0.438 0.406  3.69  2.81

mtcars %>% 
  group_by(cyl) %>% 
  map_df(mean)
#> # A tibble: 1 x 11
#>     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  20.1  6.19  231.  147.  3.60  3.22  17.8 0.438 0.406  3.69  2.81

Created on 2018-11-07 by the reprex package (v0.2.1)

Edit: I guess this comes down to an overall understanding of when to actually prefer/use purrr.

Another example: when and why should I use one method over the other?

library(purrr)
library(dplyr, warn.conflicts = FALSE)

modify_if(mtcars, is.double, ~ .x * 2) %>% head(2)
#>               mpg cyl disp  hp drat   wt  qsec vs am gear carb
#> Mazda RX4      42  12  320 220  7.8 5.24 32.92  0  2    8    8
#> Mazda RX4 Wag  42  12  320 220  7.8 5.75 34.04  0  2    8    8
mutate_if(mtcars, is.double, ~ .x * 2) %>% head(2)
#>   mpg cyl disp  hp drat   wt  qsec vs am gear carb
#> 1  42  12  320 220  7.8 5.24 32.92  0  2    8    8
#> 2  42  12  320 220  7.8 5.75 34.04  0  2    8    8

Created on 2018-11-07 by the reprex package (v0.2.1)

prosoitos · November 7, 2018, 10:05am

The function descriptions are very helpful (for some of them at least).

1st case:

mtcars %>% 
  group_by(cyl) %>% 
  summarise_if(is.double, mean)
#> # A tibble: 3 x 11
#>     cyl   mpg  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1     4  26.7  105.  82.6  4.07  2.29  19.1 0.909 0.727  4.09  1.55
#> 2     6  19.7  183. 122.   3.59  3.12  18.0 0.571 0.429  3.86  3.43
#> 3     8  15.1  353. 209.   3.23  4.00  16.8 0     0.143  3.29  3.5

From summarise() help file:

‘summarise()’ is typically used on grouped data created by
‘group_by()’. The output will have one row for each group.

So summarise() works at the group level.

‘summarise_if()’ operates on columns for which a predicate returns ‘TRUE’.

So you will get a mean for each group for every column of type double (here, all of them; the grouping column cyl is kept as the group label rather than averaged), and you get only one row per group (here 3).

2nd case:

mtcars %>% 
  group_by(cyl) %>%
  modify_if(is.double, mean) %>% 
  head(3)
#> # A tibble: 3 x 11
#> # Groups:   cyl [1]
#>     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  20.1  6.19  231.  147.  3.60  3.22  17.8 0.438 0.406  3.69  2.81
#> 2  20.1  6.19  231.  147.  3.60  3.22  17.8 0.438 0.406  3.69  2.81
#> 3  20.1  6.19  231.  147.  3.60  3.22  17.8 0.438 0.406  3.69  2.81

From modify() help file:

‘modify()’ is a short-cut for ‘x[ ] <- map(x, .f); return(x)’.
‘modify_if()’ only modifies the elements of ‘x’ that satisfy a
predicate and leaves the others unchanged.

So modify() replaces every value in a column with the mean of that whole column, and you get a data frame with the same number of rows as your input. Groups have no effect.

And since you have doubles everywhere, you would have gotten the same thing simply with:

mtcars %>% 
  group_by(cyl) %>%
  modify(mean) %>% 
  head(3)
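
To make the help-file shortcut concrete, here is a minimal sketch that spells out the `x[] <- map(x, .f)` form by hand on the ungrouped mtcars. The names `col_means` and `by_hand` are just for illustration:

```r
library(purrr)

# map() walks over the columns and returns a list of 11 length-1 means.
col_means <- map(mtcars, mean)

# Assigning into x[] keeps the data frame's shape and row names;
# each length-1 mean is recycled down its column.
by_hand <- mtcars
by_hand[] <- col_means

head(by_hand, 3)
```

Every row ends up identical because each column holds a single repeated value, which is the pattern in the output above.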

3rd case:

mtcars %>% 
  group_by(cyl) %>% 
  map_df(mean)
#> # A tibble: 1 x 11
#>     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  20.1  6.19  231.  147.  3.60  3.22  17.8 0.438 0.406  3.69  2.81

map() will not take the groups into account either, but unlike modify(), it does not "modify" the input in place. Instead, it returns the mean of each whole column as the output (so only one mean per variable).

map() would have returned a list, but with map_df() you get a data frame with those means. So only one row.
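
If you do want per-group means out of purrr, one option is to do the grouping yourself before mapping. This is only a sketch of that idea, assuming base `split()` as a stand-in for `group_by()`; `by_cyl` and `group_means` are illustrative names:

```r
library(purrr)

# split() gives a named list of three data frames: "4", "6" and "8".
by_cyl <- split(mtcars, mtcars$cyl)

# Outer map(): one pass per group.
# Inner map_dbl(): the mean of every column within that group.
per_group <- map(by_cyl, ~ map_dbl(.x, mean))

# Stack the three named vectors into a 3 x 11 matrix, one row per group.
group_means <- do.call(rbind, per_group)
group_means
```

The rows should line up with the summarise_if() result from the 1st case, which is a reminder that purrr only loops over whatever list you hand it.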

You could have added a 4th case:

mtcars %>% 
  group_by(cyl) %>% 
  mutate_if(is.double, mean)

# A tibble: 32 x 11
# Groups:   cyl [3]
     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1  19.7     6  183. 122.   3.59  3.12  18.0 0.571 0.429  3.86  3.43
 2  19.7     6  183. 122.   3.59  3.12  18.0 0.571 0.429  3.86  3.43
 3  26.7     4  105.  82.6  4.07  2.29  19.1 0.909 0.727  4.09  1.55
 4  19.7     6  183. 122.   3.59  3.12  18.0 0.571 0.429  3.86  3.43
 5  15.1     8  353. 209.   3.23  4.00  16.8 0     0.143  3.29  3.5 
 6  19.7     6  183. 122.   3.59  3.12  18.0 0.571 0.429  3.86  3.43
 7  15.1     8  353. 209.   3.23  4.00  16.8 0     0.143  3.29  3.5 
 8  26.7     4  105.  82.6  4.07  2.29  19.1 0.909 0.727  4.09  1.55
 9  26.7     4  105.  82.6  4.07  2.29  19.1 0.909 0.727  4.09  1.55
10  19.7     6  183. 122.   3.59  3.12  18.0 0.571 0.429  3.86  3.43
# ... with 22 more rows

mutate(), like summarise(), takes the groups into account, but it does not collapse the data frame down to one "summary" row per group. Instead, all the rows are kept (as with modify()), and each row gets the mean of its own group.
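
The "keep every row, fill in the group statistic" idea also exists in base R as `ave()`, which may help separate the concept from dplyr. A small sketch for a single column (`mpg_group_mean` is an illustrative name):

```r
# ave() returns one value per input row: the statistic of that row's group.
# Its default statistic is mean.
mpg_group_mean <- ave(mtcars$mpg, mtcars$cyl)

length(mpg_group_mean)          # same length as the input column
length(unique(mpg_group_mean))  # one distinct value per cyl group
head(mpg_group_mean)
```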

As for your additional question in your edit:

It doesn't really matter. The main difference between modify() and mutate() is that the former does not take groups into account while the latter does. But since you are not using group_by() in that last example, the outputs are pretty much the same (except that mutate() gets rid of the row names and modify() does not). So I guess you could pick one or the other depending on whether you want to keep the row names.
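
If the row names matter, a way to sidestep the difference entirely is to move them into an ordinary column first. A minimal sketch, where `cars` and the column name `model` are my own choices:

```r
library(purrr)

# Store the row names as a regular column so no verb can silently drop them.
cars <- data.frame(model = rownames(mtcars), mtcars, row.names = NULL)

# `model` is not a double, so the predicate skips it and it passes through unchanged.
doubled <- modify_if(cars, is.double, ~ .x * 2)

head(doubled, 2)
```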

jcblum · November 7, 2018, 8:17pm

From the dplyr README

dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges

dplyr is designed to abstract over how the data is stored. That means as well as working with local data frames, you can also work with remote database tables, using exactly the same R code.

From the purrr README:

purrr enhances R’s functional programming (FP) toolkit by providing a complete and consistent set of tools for working with functions and vectors

So as I see it, purrr and dplyr have different focuses. (Obvious disclaimer: I don't set the direction of either of these packages, so this is an outsider's perspective!)

dplyr is focused on manipulating data frames, and on doing so in a way that can generalize to other rectangular, mixed-type data objects.

purrr is focused on implementing functional programming tools more broadly (e.g., lots of clever ways of looping). purrr functions operate on data frames thanks to the fact that data frames are just fancy lists, but purrr functions aren't designed around rectangular data the way dplyr is.

If I'm manipulating data frames, I start with dplyr, but keep some of the tools from purrr in my back pocket for the more exotic problems. And don't forget tidyr! nest() + purrr and gather()/spread() + dplyr are like my magical keys for unlocking a ton of weird data manipulation puzzles!
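
For a feel of the nest() + purrr combination, here is a minimal sketch on mtcars. The column names `n_cars` and `mean_mpg` are made up for the illustration, and the `ungroup()` is there so the result behaves the same whether or not nest() keeps the grouping:

```r
library(dplyr, warn.conflicts = FALSE)
library(tidyr)
library(purrr)

nested <- mtcars %>%
  group_by(cyl) %>%
  nest() %>%      # one row per cyl; everything else is packed into `data`
  ungroup()

# Each element of `data` is itself a data frame, so purrr can loop over them
# while dplyr keeps the results lined up with the right group.
nested <- nested %>%
  mutate(
    n_cars   = map_int(data, nrow),
    mean_mpg = map_dbl(data, ~ mean(.x$mpg))
  )

nested
```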

rensa · November 7, 2018, 9:39pm

One thing that took me a while to internalise is that most tidyverse functions are designed to operate on data frames, while most purrr functions are designed to operate on vectors or lists.

The fact that data frames also happen to be lists (of columns) makes this a lot more confusing.
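
A quick way to see the "data frames are lists of columns" point for yourself (a small sketch; `col_means` is an illustrative name):

```r
library(purrr)

is.list(mtcars)   # a data frame is a list under the hood
length(mtcars)    # so its length is the number of columns, not rows

# purrr therefore iterates column by column, exactly as it would over any list.
col_means <- map_dbl(mtcars, mean)
col_means
```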

bragks · November 8, 2018, 11:23am

Thank you guys for clarifying! I feel kind of dim asking questions like this, but for a newbie, reading the documentation sometimes feels like reading Dostoevsky backwards, in Russian.

bragks · November 8, 2018, 11:34am

You wouldn't happen to have any worked examples of this? I feel like the concept of nesting in general is ok to understand, but I'm having some trouble seeing when I would use it.

rensa · November 8, 2018, 10:12pm

Jenny Bryan has written an excellent tutorial on list-columns and purrr, and I'm planning on writing a vignette for my new package on the weekend that also covers some of the use-cases of nesting.

jcblum · November 8, 2018, 10:33pm

Good question! It took me a while to wrap my head around the possibilities of nest()/unnest(), too (and list-columns in general). There are a good number of examples floating around this site, but sadly they're not easily browsable as such. Here, in no particular order, are a few I dug up from my hazy memory (so obviously totally biased towards threads I've posted in):

Another fantastic Jenny Bryan resource that touches on nesting:

mfherman · November 8, 2018, 10:44pm

The new DataCamp course, Machine Learning in the Tidyverse, is both excellent and makes extensive use of list-columns! Even if you don't have a DataCamp membership, the first chapter is available for free and could be a good place to start. From the description of Chapter 1:

This chapter will introduce you to the backbone of machine learning in the tidyverse, the List Column Workflow (LCW). The LCW will empower you to work with many models in one dataframe.
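
To give a rough idea of what "many models in one dataframe" can look like, here is a sketch on mtcars. It is not taken from the course; the model formula `mpg ~ wt` and the column names `fit` and `slope` are assumptions chosen for the illustration:

```r
library(dplyr, warn.conflicts = FALSE)
library(tidyr)
library(purrr)

models <- mtcars %>%
  group_by(cyl) %>%
  nest() %>%
  ungroup() %>%
  mutate(
    # One linear model per cyl group, stored in a list-column.
    fit   = map(data, ~ lm(mpg ~ wt, data = .x)),
    # Pull a single number back out of each model.
    slope = map_dbl(fit, ~ coef(.x)[["wt"]])
  )

select(models, cyl, slope)
```

The data, the fitted model and the extracted coefficient stay side by side in one row per group, so nothing can get out of sync.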

jcblum · November 11, 2018, 12:10am

@cderv just linked another great list-column learning resource over in a related topic:

I particularly like how Garrett’s webinar ties list columns in conceptually with the larger picture of the tidyverse and R.

bragks · November 15, 2018, 11:05am

Thank you again! I've gone through the DataCamp course and this webinar by Garrett Grolemund. Both are excellent resources, and I'm already using purrr in my daily workflow!

jcblum · November 22, 2018, 9:20pm

Hooray! Leaving this absolutely superfluous reply because I felt like a heart did not sufficiently convey my excitement at hearing that we helped you get to the purrr moment.