Is nest() + mutate() + map() + unnest() really the best alternative to dplyr::do()?

dplyr::do

#1

How do you take any function, e.g. f(.data, ...), and make it work with grouped data?
dplyr::do() does this elegantly. The approach via nest() + mutate() + map() + unnest() also works, but it isn't exactly equivalent and it buries the intent of this specific task.

Is there a succinct/evocative way to tackle this problem?

Here is my attempt:

library(tidyverse)

# I want to make any function work with grouped data.

# For example:
first_row <- function(.x, to_chr = FALSE) {
  first <- .x[1, ]
  if (to_chr) {
    first[] <- lapply(first, as.character)
  }
  
  tibble::as_tibble(first)
}

# Pulls the first row of a dataframe
first_row(mtcars)
#> # A tibble: 1 x 11
#>     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#> * <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1    21     6   160   110   3.9  2.62  16.5     0     1     4     4

# Optionally converts all columns to character
first_row(mtcars, to_chr = TRUE)
#> # A tibble: 1 x 11
#>   mpg   cyl   disp  hp    drat  wt    qsec  vs    am    gear  carb 
#> * <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 21    6     160   110   3.9   2.62  16.46 0     1     4     4

# To make it work with grouped data I can use dplyr::do()
mtcars %>% 
  group_by(cyl) %>% 
  do(first_row(., to_chr = TRUE))
#> # A tibble: 3 x 11
#> # Groups:   cyl [3]
#>   mpg   cyl   disp  hp    drat  wt    qsec  vs    am    gear  carb 
#>   <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 22.8  4     108   93    3.85  2.32  18.61 1     1     4     1    
#> 2 21    6     160   110   3.9   2.62  16.46 0     1     4     4    
#> 3 18.7  8     360   175   3.15  3.44  17.02 0     0     3     2

# But do() will be deprecated in favor of nest() + map() http://bit.ly/2uyzcwa 
mtcars %>% 
  group_by(cyl) %>% 
  nest() %>% 
  mutate(., data = map(data, first_row, to_chr = TRUE)) %>% 
  unnest()
#> # A tibble: 3 x 11
#>     cyl mpg   disp  hp    drat  wt    qsec  vs    am    gear  carb 
#>   <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1     6 21    160   110   3.9   2.62  16.46 0     1     4     4    
#> 2     4 22.8  108   93    3.85  2.32  18.61 1     1     4     1    
#> 3     8 18.7  360   175   3.15  3.44  17.02 0     0     3     2

# This isn't exactly the same because `cyl` wasn't converted to character.
# But worse, this approach buries my intention. Let's make it more evocative
by_group <- function(.x, .f, ...) {
  grouped <- tidyr::nest(.x)
  out <- dplyr::mutate(grouped, data = purrr::map(.data$data, .f, ...))
  tidyr::unnest(out)
}

mtcars %>% 
  group_by(cyl) %>% 
  by_group(first_row)
#> # A tibble: 3 x 11
#>     cyl   mpg  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1     6  21     160   110  3.9   2.62  16.5     0     1     4     4
#> 2     4  22.8   108    93  3.85  2.32  18.6     1     1     4     1
#> 3     8  18.7   360   175  3.15  3.44  17.0     0     0     3     2

mtcars %>% 
  group_by(cyl) %>% 
  by_group(first_row, to_chr = TRUE)
#> # A tibble: 3 x 11
#>     cyl mpg   disp  hp    drat  wt    qsec  vs    am    gear  carb 
#>   <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1     6 21    160   110   3.9   2.62  16.46 0     1     4     4    
#> 2     4 22.8  108   93    3.85  2.32  18.61 1     1     4     1    
#> 3     8 18.7  360   175   3.15  3.44  17.02 0     0     3     2

# This doesn't solve the problem with `cyl` but makes the code clearer.

Created on 2018-07-14 by the reprex package (v0.2.0.9000).
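
For readers on a later dplyr release: as far as I know, dplyr 0.8.1 and up ship group_modify(), which applies a function to each group and binds the results back together. The sketch below assumes such a version is installed and reuses first_row() from the reprex above (with as_tibble() in place of as.tibble()). Note that it behaves like the nest() route, not like do(): the grouping column is not passed to the function, so `cyl` stays numeric.

```r
library(dplyr)

# Same helper as in the reprex above
first_row <- function(.x, to_chr = FALSE) {
  first <- .x[1, ]
  if (to_chr) {
    first[] <- lapply(first, as.character)
  }
  tibble::as_tibble(first)
}

# group_modify() calls the lambda once per group; `.x` is that group's rows
# without the grouping column, and the group keys are added back afterwards
result <- mtcars %>%
  group_by(cyl) %>%
  group_modify(~ first_row(.x, to_chr = TRUE))

result
```

The formula wrapper matters here: group_modify() passes the group key as a second argument, so handing it first_row directly would send the key into `to_chr`.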


#2

Here is a polished version of my solution -- now preserving groups.

Help file: https://forestgeo.github.io/fgeo.tool/reference/by_group.html
Tests: https://github.com/forestgeo/fgeo.tool/blob/master/tests/testthat/test-by_group.R


#3

Since you want to modify the grouping/nesting variable as well, another option is to skip the nest() step altogether and use a straightforward split-apply-combine approach.

mtcars %>% 
  group_by(cyl) %>%
  split(group_indices(.)) %>% 
  purrr::map_df(first_row, to_chr = TRUE) %>% 
  ungroup()
#> # A tibble: 3 x 11
#>   mpg   cyl   disp  hp    drat  wt    qsec  vs    am    gear  carb 
#>   <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 22.8  4     108   93    3.85  2.32  18.61 1     1     4     1    
#> 2 21    6     160   110   3.9   2.62  16.46 0     1     4     4    
#> 3 18.7  8     360   175   3.15  3.44  17.02 0     0     3     2

#4

Now that I think about it, maybe a nest_by() helper or something similar would help: it would nest() using the grouping variables, but the only column left outside the list column would be a group index. Like:

# Hypothetical code

mtcars %>% 
  nest_by(cyl)
#> # A tibble: 3 x 2
#>   .nest_id data              
#>      <int> <list>            
#> 1        2 <tibble [7 x 11]> 
#> 2        1 <tibble [11 x 11]>
#> 3        3 <tibble [14 x 11]>

(By the way, to produce the mocked-up output above, I used:

library(tidyverse)
mtcars %>% 
  group_by(cyl) %>%
  tibble::add_column(.nest_id = group_indices(.)) %>% 
  ungroup() %>% 
  nest(-.nest_id)
#> # A tibble: 3 x 2
#>   .nest_id data              
#>      <int> <list>            
#> 1        2 <tibble [7 x 11]> 
#> 2        1 <tibble [11 x 11]>
#> 3        3 <tibble [14 x 11]>

)


#5

Thanks @tjmahr (https://forestgeo.github.io/fgeo.tool/reference/by_group.html#acknowledgments)! This idea helps me a lot and lets me avoid depending on tidyr (I'm already importing purrr and dplyr).
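
For anyone following along, here is a minimal sketch of what a tidyr-free helper built on the split-apply-combine idea could look like. The name by_group_split() is hypothetical and this is not the fgeo.tool implementation; it assumes the function returns a data frame that still contains the grouping columns.

```r
library(dplyr)
library(purrr)

# Same helper as in the original reprex
first_row <- function(.x, to_chr = FALSE) {
  first <- .x[1, ]
  if (to_chr) {
    first[] <- lapply(first, as.character)
  }
  tibble::as_tibble(first)
}

# Hypothetical helper: apply .f to each group, then restore the grouping
by_group_split <- function(.x, .f, ...) {
  group_names <- group_vars(.x)
  # Split on the group index; each piece is an ungrouped data frame
  pieces <- split(ungroup(.x), group_indices(.x))
  # Apply .f to every piece and stack the results
  out <- bind_rows(map(pieces, .f, ...))
  # Regroup by the original variables (they must survive .f)
  grouped_df(out, group_names)
}

result <- mtcars %>%
  group_by(cyl) %>%
  by_group_split(first_row, to_chr = TRUE)

result
```

Because each piece keeps its grouping column, `cyl` is converted to character here, which matches what do() produced.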


#6

There used to be by_slice() and by_row() functions in purrr. They were super powerful, but complex, and the implementation was slow compared to equivalent methods. They still exist in their new home in purrrlyr, but I don't believe they will see further development there, as the nest/list-column approach is now the preferred idiom.


#7

FYI: dplyr::nest_by() is coming:
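
A rough sketch of how nest_by() could be applied to the example from this thread, assuming dplyr 1.0.0 or later is installed. As I understand it, nest_by() returns a row-wise data frame with a `data` list column, so summarise() sees one group's tibble at a time. The grouping column is kept outside `data`, so `cyl` again stays numeric.

```r
library(dplyr)

# Same helper as in the original reprex
first_row <- function(.x, to_chr = FALSE) {
  first <- .x[1, ]
  if (to_chr) {
    first[] <- lapply(first, as.character)
  }
  tibble::as_tibble(first)
}

# One row per value of cyl, with the remaining columns nested in `data`
nested <- nest_by(mtcars, cyl)

# Row-wise evaluation: `data` is the tibble for the current row.
# An unnamed data frame result is unpacked into separate columns.
result <- nested %>%
  summarise(first_row(data, to_chr = TRUE), .groups = "drop")

result
```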