Pros and cons of split() vs nest() with map() workflows

tidyr
purrr

#1

In examples using the purrr::map() family of functions, I see both split() and nest() being used for generating the inputs to the map function.

Questions:

  • What are the pros and cons of the two approaches?
  • Should either approach be preferred or recommended over the other, in general or for particular problem types?

Context:
I am helping develop best practices for my R-using colleagues, most of whom are new to purrr, so I want to get them started in the "right" direction. I've thought about it, but I can't come up with a good reason to suggest one over the other, so I wanted to see if the community could share their opinions or the reasons for their preference.

My personal preference is nest(), and it seems to me to be slightly more flexible and transparent - albeit a bit more verbose. However, that preference may be due to my underdeveloped "base" R skills (e.g. direct manipulation of lists).

If this question is too "opinion-y" for this forum, I'm happy to withdraw it. :grin:

Here is an example of almost the same analysis done with both approaches. (I couldn't quickly figure out how to get a column with cyl in the output of the split()-based method.)

> library(dplyr)
> library(tidyr)
> library(purrr)
> 
> # use nest() 
> mtcars %>% 
+   group_by(cyl) %>% 
+   nest() %>% 
+   mutate(mod_obj   = map(data, ~lm(mpg ~ wt, data = .x)),
+          summaries = map(mod_obj, broom::glance)) %>%
+   select(cyl, summaries) %>% 
+   unnest(summaries)
# A tibble: 3 x 12
    cyl r.squared adj.r.squared    sigma statistic    p.value    df    logLik      AIC      BIC  deviance df.residual
  <dbl>     <dbl>         <dbl>    <dbl>     <dbl>      <dbl> <int>     <dbl>    <dbl>    <dbl>     <dbl>       <int>
1     6 0.4645102     0.3574122 1.165202  4.337245 0.09175766     2  -9.82518 25.65036 25.48809  6.788481           5
2     4 0.5086326     0.4540362 3.332283  9.316233 0.01374278     2 -27.74487 61.48974 62.68342 99.936983           9
3     8 0.4229655     0.3748793 2.024091  8.795985 0.01179281     2 -28.65778 63.31555 65.23272 49.163336          12
> 
> # use split()
> mtcars %>% 
+   split(.$cyl) %>% 
+   map(~lm(mpg ~ wt, data = .)) %>% 
+   map(~broom::glance(.)) %>% 
+   reduce(bind_rows)
  r.squared adj.r.squared    sigma statistic    p.value df    logLik      AIC      BIC  deviance df.residual
1 0.5086326     0.4540362 3.332283  9.316233 0.01374278  2 -27.74487 61.48974 62.68342 99.936983           9
2 0.4645102     0.3574122 1.165202  4.337245 0.09175766  2  -9.82518 25.65036 25.48809  6.788481           5
3 0.4229655     0.3748793 2.024091  8.795985 0.01179281  2 -28.65778 63.31555 65.23272 49.163336          12
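(For what it's worth, one way to recover the cyl column in the split()-based method: split() stores each group's value as the list element's name, and those names survive the map() calls, so bind_rows(.id = "cyl") can turn them back into a column. A sketch; note the .id column comes back as character, so it needs converting if you want it numeric:)

```r
library(dplyr)
library(purrr)

mtcars %>% 
  split(.$cyl) %>%                      # list names are "4", "6", "8"
  map(~lm(mpg ~ wt, data = .x)) %>%     # names are preserved by map()
  map(broom::glance) %>% 
  bind_rows(.id = "cyl") %>%            # names become a "cyl" column
  mutate(cyl = as.numeric(cyl))
```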

#2

Personally, I like the nest() method as well. I think the big benefit of the nest method is that you can keep everything organized nicely. Looking at your example, the noticeable difference is that the nest method kept track of which cyl each model's results were for. To take it one step further, say you wanted to get both the broom::glance() output and the broom::tidy() results for each model. This is easy to do, and easy to keep organized, with nest():

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyr)
library(purrr)

# use nest() assigning it to model_results
model_results <- mtcars %>% 
 group_by(cyl) %>% 
 nest() %>% 
 mutate(mod_obj = map(data, ~lm(mpg ~ wt, data = .x)),
        summaries = map(mod_obj, broom::glance),
        model_coef = map(mod_obj, broom::tidy)) 
  
model_results
#> # A tibble: 3 x 5
#>     cyl data               mod_obj  summaries             model_coef      
#>   <dbl> <list>             <list>   <list>                <list>          
#> 1    6. <tibble [7 x 10]>  <S3: lm> <data.frame [1 x 11]> <data.frame [2 ~
#> 2    4. <tibble [11 x 10]> <S3: lm> <data.frame [1 x 11]> <data.frame [2 ~
#> 3    8. <tibble [14 x 10]> <S3: lm> <data.frame [1 x 11]> <data.frame [2 ~
  
# now we can access both the model summaries AND 
# the model coefficients
model_results %>% 
  unnest(summaries, .drop = TRUE)
#> # A tibble: 3 x 12
#>     cyl r.squared adj.r.squared sigma statistic p.value    df logLik   AIC
#>   <dbl>     <dbl>         <dbl> <dbl>     <dbl>   <dbl> <int>  <dbl> <dbl>
#> 1    6.     0.465         0.357  1.17      4.34  0.0918     2  -9.83  25.7
#> 2    4.     0.509         0.454  3.33      9.32  0.0137     2 -27.7   61.5
#> 3    8.     0.423         0.375  2.02      8.80  0.0118     2 -28.7   63.3
#> # ... with 3 more variables: BIC <dbl>, deviance <dbl>, df.residual <int>
  

model_results %>% 
  unnest(model_coef, .drop = TRUE)
#> # A tibble: 6 x 6
#>     cyl term        estimate std.error statistic    p.value
#>   <dbl> <chr>          <dbl>     <dbl>     <dbl>      <dbl>
#> 1    6. (Intercept)    28.4      4.18       6.79 0.00105   
#> 2    6. wt             -2.78     1.33      -2.08 0.0918    
#> 3    4. (Intercept)    39.6      4.35       9.10 0.00000777
#> 4    4. wt             -5.65     1.85      -3.05 0.0137    
#> 5    8. (Intercept)    23.9      3.01       7.94 0.00000405
#> 6    8. wt             -2.19     0.739     -2.97 0.0118

Created on 2018-03-23 by the reprex package (v0.2.0).

While you can certainly do all of this with the split method, the nest method allows for easier organization of more complex operations and pipelines.
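To illustrate that organizational difference: here is a sketch of how the same glance-plus-tidy analysis might look with split(). The names `models`, `summaries`, and `model_coef` are my own for illustration. Nothing here is hard, but you end up managing three parallel objects yourself instead of one tibble with everything side by side:

```r
library(dplyr)
library(purrr)

# one named list of models, split by cyl
models <- mtcars %>% 
  split(.$cyl) %>% 
  map(~lm(mpg ~ wt, data = .x))

# two separate result objects; the list names carry cyl along
summaries  <- map(models, broom::glance) %>% bind_rows(.id = "cyl")
model_coef <- map(models, broom::tidy)   %>% bind_rows(.id = "cyl")
```

With nest(), those would all be columns of a single data frame, which is what keeps larger pipelines tidy.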

As for the appropriateness of the post, I think that this is perfect for this forum. One of the main purposes is for R/tidyverse users to have these exact sorts of discussions!


#3

This pull request and the linked blog posts discuss the split vs nest choice:


#4

Hi @tbradley,

Fully agree that nest() and map() is a powerful combination.
Recently, though, I have become somewhat frustrated with it, as it negatively impacts the speed of my operations. My next step is to try parallel processing with the multidplyr package to see whether that improves speed significantly. In your experience, is nest() plus map() the most powerful combination for analyses like the one above, or do other packages offer a similar approach with improved performance?
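One other option worth trying, offered as a hedged sketch rather than a benchmark: the furrr package provides future_map() as a parallel drop-in for purrr::map(), so a nest()-based pipeline needs almost no changes. Whether it actually helps depends on the work per group; for many small groups (like mtcars split by cyl), the parallel overhead can easily exceed the savings.

```r
library(dplyr)
library(tidyr)
library(future)
library(furrr)

plan(multisession, workers = 4)   # choose a parallel backend

mtcars %>% 
  group_by(cyl) %>% 
  nest() %>% 
  # future_map() has the same interface as map(),
  # but runs each group's model fit on a worker
  mutate(mod_obj = future_map(data, ~lm(mpg ~ wt, data = .x)))
```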