Is nest() + mutate() + map() + unnest() really the best alternative to dplyr::do()?

mauro_lepore · July 14, 2018, 1:06pm

How do you take any function, e.g. f(.data, ...), and make it work with grouped data?
dplyr::do() does this elegantly. The approach via nest() + mutate() + map() + unnest() also works but isn't exactly the same and buries the intention of this specific task.

Is there a succinct/evocative way to tackle this problem?

Here is my attempt:

library(tidyverse)

# I want to make any function work with grouped data.

# For example:
first_row <- function(.x, to_chr = FALSE) {
  first <- .x[1, ]
  if (to_chr) {
    first[] <- lapply(first, as.character)
  }
  
  # as_tibble() is the non-deprecated spelling of as.tibble()
  tibble::as_tibble(first)
}

# Pulls the first row of a dataframe
first_row(mtcars)
#> # A tibble: 1 x 11
#>     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#> * <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1    21     6   160   110   3.9  2.62  16.5     0     1     4     4

# Optionally converts all columns to character
first_row(mtcars, to_chr = TRUE)
#> # A tibble: 1 x 11
#>   mpg   cyl   disp  hp    drat  wt    qsec  vs    am    gear  carb 
#> * <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 21    6     160   110   3.9   2.62  16.46 0     1     4     4

# To make it work with grouped data I can use dplyr::do()
mtcars %>% 
  group_by(cyl) %>% 
  do(first_row(., to_chr = TRUE))
#> # A tibble: 3 x 11
#> # Groups:   cyl [3]
#>   mpg   cyl   disp  hp    drat  wt    qsec  vs    am    gear  carb 
#>   <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 22.8  4     108   93    3.85  2.32  18.61 1     1     4     1    
#> 2 21    6     160   110   3.9   2.62  16.46 0     1     4     4    
#> 3 18.7  8     360   175   3.15  3.44  17.02 0     0     3     2

# But do() will be deprecated in favor of nest() + map() http://bit.ly/2uyzcwa 
mtcars %>% 
  group_by(cyl) %>% 
  nest() %>% 
  mutate(., data = map(data, first_row, to_chr = TRUE)) %>% 
  unnest()
#> # A tibble: 3 x 11
#>     cyl mpg   disp  hp    drat  wt    qsec  vs    am    gear  carb 
#>   <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1     6 21    160   110   3.9   2.62  16.46 0     1     4     4    
#> 2     4 22.8  108   93    3.85  2.32  18.61 1     1     4     1    
#> 3     8 18.7  360   175   3.15  3.44  17.02 0     0     3     2

# This isn't exactly the same because `cyl` wasn't converted to character.
# But worse, this approach buries my intention. Let's make it more evocative
by_group <- function(.x, .f, ...) {
  grouped <- tidyr::nest(.x)
  out <- dplyr::mutate(grouped, data = purrr::map(.data$data, .f, ...))
  tidyr::unnest(out)
}

mtcars %>% 
  group_by(cyl) %>% 
  by_group(first_row)
#> # A tibble: 3 x 11
#>     cyl   mpg  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1     6  21     160   110  3.9   2.62  16.5     0     1     4     4
#> 2     4  22.8   108    93  3.85  2.32  18.6     1     1     4     1
#> 3     8  18.7   360   175  3.15  3.44  17.0     0     0     3     2

mtcars %>% 
  group_by(cyl) %>% 
  by_group(first_row, to_chr = TRUE)
#> # A tibble: 3 x 11
#>     cyl mpg   disp  hp    drat  wt    qsec  vs    am    gear  carb 
#>   <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1     6 21    160   110   3.9   2.62  16.46 0     1     4     4    
#> 2     4 22.8  108   93    3.85  2.32  18.61 1     1     4     1    
#> 3     8 18.7  360   175   3.15  3.44  17.02 0     0     3     2

# This doesn't solve the problem with `cyl` but makes the code clearer.

Created on 2018-07-14 by the reprex package (v0.2.0.9000).

mauro_lepore · July 14, 2018, 3:06pm

Here is a polished version of my solution -- now preserving groups.

Helpfile: https://forestgeo.github.io/fgeo.tool/reference/by_group.html
Tests: https://github.com/forestgeo/fgeo.tool/blob/master/tests/testthat/test-by_group.R

tjmahr · July 14, 2018, 3:12pm

There's the option of skipping the nest() step because you want to modify the grouping/nesting variable as well. This is a straightforward split-apply-combine approach.

mtcars %>% 
  group_by(cyl) %>%
  split(group_indices(.)) %>% 
  purrr::map_df(first_row, to_chr = TRUE) %>% 
  ungroup()
#> # A tibble: 3 x 11
#>   mpg   cyl   disp  hp    drat  wt    qsec  vs    am    gear  carb 
#>   <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 22.8  4     108   93    3.85  2.32  18.61 1     1     4     1    
#> 2 21    6     160   110   3.9   2.62  16.46 0     1     4     4    
#> 3 18.7  8     360   175   3.15  3.44  17.02 0     0     3     2
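
For comparison, the same split-apply-combine idea can be sketched in base R alone. This is only an illustration: `first_row_chr()` is a hypothetical stand-in for `first_row(., to_chr = TRUE)`, and it returns a plain data frame rather than a tibble.

```r
# Hypothetical helper: first row of a data frame, all columns as character
first_row_chr <- function(df) {
  first <- df[1, , drop = FALSE]
  first[] <- lapply(first, as.character)
  first
}

pieces   <- split(mtcars, mtcars$cyl)      # split: one data frame per cyl value
applied  <- lapply(pieces, first_row_chr)  # apply: the function sees every column, cyl included
combined <- do.call(rbind, applied)        # combine: stack the one-row results

combined
```

Because each piece still contains `cyl`, the grouping column is converted along with the rest, which is the behaviour do() had.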

tjmahr · July 14, 2018, 3:23pm

Now that I think about it, maybe a nest_by() helper would be useful here: it would nest() by the grouping variables, but the only column left outside the list-column would be a group index. Like:

# Hypothetical code

mtcars %>% 
  nest_by(cyl)
#> # A tibble: 3 x 2
#>   .nest_id data              
#>      <int> <list>            
#> 1        2 <tibble [7 x 11]> 
#> 2        1 <tibble [11 x 11]>
#> 3        3 <tibble [14 x 11]>

(By the way, to produce the mocked-up output above, I used:

library(tidyverse)
mtcars %>% 
  group_by(cyl) %>%
  tibble::add_column(.nest_id = group_indices(.)) %>% 
  ungroup() %>% 
  nest(-.nest_id)
#> # A tibble: 3 x 2
#>   .nest_id data              
#>      <int> <list>            
#> 1        2 <tibble [7 x 11]> 
#> 2        1 <tibble [11 x 11]>
#> 3        3 <tibble [14 x 11]>

)

mauro_lepore · July 14, 2018, 3:51pm

Thanks @tjmahr (https://forestgeo.github.io/fgeo.tool/reference/by_group.html#acknowledgments)! This idea helps me a lot and lets me avoid depending on tidyr (I'm already importing purrr and dplyr).
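
A minimal sketch of what a tidyr-free, group-preserving helper could look like, built on the split idea above. This is not the fgeo.tool implementation: the name `by_group_sketch()` is hypothetical, and it assumes `.f` takes a data frame and returns a data frame that still contains the grouping columns.

```r
library(dplyr)

# Hypothetical helper, not the fgeo.tool version
by_group_sketch <- function(.x, .f, ...) {
  stopifnot(dplyr::is_grouped_df(.x))
  vars <- dplyr::group_vars(.x)

  # Split into one ungrouped piece per group, keeping the grouping columns
  pieces <- split(dplyr::ungroup(.x), dplyr::group_indices(.x))

  # Apply .f to each piece, then stack the results
  out <- dplyr::bind_rows(lapply(pieces, .f, ...))

  # Restore the original grouping
  dplyr::group_by(out, !!!rlang::syms(vars))
}

res <- mtcars %>%
  group_by(cyl) %>%
  by_group_sketch(head, n = 1)
```

Since `.f` receives the grouping columns too, a function such as `first_row(., to_chr = TRUE)` would also convert `cyl`.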

alistaire · July 14, 2018, 11:49pm

Should I move away from do() and rowwise()?

by_group <- function(.x, .f, ...) {
  grouped <- tidyr::nest(.x)
  out <- dplyr::mutate(grouped, data = purrr::map(.data$data, .f, ...))
  tidyr::unnest(out)
}

There used to be by_slice and by_row functions in purrr. They were super powerful, but complex and the implementation was slow compared to equivalent methods. They still exist in their new home in purrrlyr, but I don't believe they will see further development there, as the nest/list columns approach is now the preferred idiom.
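
For readers on a more recent dplyr: as far as I know, later releases (0.8.1 onward) added group_modify(), which applies a function to each group of a grouped data frame and binds the results. A hedged sketch follows; `first_row_chr()` is a hypothetical stand-in for `first_row(., to_chr = TRUE)`.

```r
library(dplyr)

# Hypothetical helper: first row of a data frame, all columns as character
first_row_chr <- function(df) {
  first <- df[1, ]
  first[] <- lapply(first, as.character)
  first
}

result <- mtcars %>%
  group_by(cyl) %>%
  group_modify(~ first_row_chr(.x))  # .x is the group's rows without the grouping column

result
```

Like the nest() + map() approach, this leaves `cyl` numeric, because the grouping column is not passed to the function.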

mauro_lepore · July 16, 2018, 6:06pm

FYI: dplyr::nest_by() is coming:

https://github.com/tidyverse/dplyr/blob/d3ded01ac854cbc847afcb1434f49118cea852e8/R/nest_by.R#L25-L33

#' @examples
#' starwars %>%
#'   nest_by(species, homeworld)
#'
#' starwars %>%
#'   nest_by_at(vars(ends_with("_color")))
#'
#' starwars %>%
#'   nest_by_if(is.numeric)
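
The examples above come from a development commit. To my knowledge, the nest_by() that later shipped in dplyr 1.0.0 behaves somewhat differently: it returns a row-wise tibble with the grouping columns plus a `data` list-column. A small sketch under that assumption:

```r
library(dplyr)

# One row per cyl value; the remaining columns are nested in the list-column `data`
nested <- mtcars %>%
  nest_by(cyl)

# Each row is its own group, so `data` refers to that row's data frame
firsts <- nested %>%
  summarise(head(data, 1))

firsts
```

Here too the grouping column stays outside the nested data, so a function applied to `data` cannot change `cyl`.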