Pros and cons of split() vs nest() with map() workflows

tbradley · March 23, 2018, 7:37pm

Personally, I like the nest() method as well. I think its big benefit is that it keeps everything nicely organized. In your example, the noticeable difference is that the nest() version keeps track of which cyl value each model's results belong to. To take it one step further, say you wanted both the broom::glance() output and the broom::tidy() results for each model. This is easy to do, and to keep organized, with nest():

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyr)
library(purrr)

# nest the data by cyl, fit one model per group, and store everything in model_results
model_results <- mtcars %>% 
 group_by(cyl) %>% 
 nest() %>% 
 mutate(mod_obj = map(data, ~lm(mpg ~ wt, data = .x)),
        summaries = map(mod_obj, broom::glance),
        model_coef = map(mod_obj, broom::tidy)) 
  
model_results
#> # A tibble: 3 x 5
#>     cyl data               mod_obj  summaries             model_coef      
#>   <dbl> <list>             <list>   <list>                <list>          
#> 1    6. <tibble [7 x 10]>  <S3: lm> <data.frame [1 x 11]> <data.frame [2 ~
#> 2    4. <tibble [11 x 10]> <S3: lm> <data.frame [1 x 11]> <data.frame [2 ~
#> 3    8. <tibble [14 x 10]> <S3: lm> <data.frame [1 x 11]> <data.frame [2 ~
  
# now we can access both the model summaries AND 
# the model coefficients
model_results %>% 
  unnest(summaries, .drop = TRUE)
#> # A tibble: 3 x 12
#>     cyl r.squared adj.r.squared sigma statistic p.value    df logLik   AIC
#>   <dbl>     <dbl>         <dbl> <dbl>     <dbl>   <dbl> <int>  <dbl> <dbl>
#> 1    6.     0.465         0.357  1.17      4.34  0.0918     2  -9.83  25.7
#> 2    4.     0.509         0.454  3.33      9.32  0.0137     2 -27.7   61.5
#> 3    8.     0.423         0.375  2.02      8.80  0.0118     2 -28.7   63.3
#> # ... with 3 more variables: BIC <dbl>, deviance <dbl>, df.residual <int>
  

model_results %>% 
  unnest(model_coef, .drop = TRUE)
#> # A tibble: 6 x 6
#>     cyl term        estimate std.error statistic    p.value
#>   <dbl> <chr>          <dbl>     <dbl>     <dbl>      <dbl>
#> 1    6. (Intercept)    28.4      4.18       6.79 0.00105   
#> 2    6. wt             -2.78     1.33      -2.08 0.0918    
#> 3    4. (Intercept)    39.6      4.35       9.10 0.00000777
#> 4    4. wt             -5.65     1.85      -3.05 0.0137    
#> 5    8. (Intercept)    23.9      3.01       7.94 0.00000405
#> 6    8. wt             -2.19     0.739     -2.97 0.0118

Created on 2018-03-23 by the reprex package (v0.2.0).
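
A note for anyone running this on a newer setup: the .drop argument to unnest() was deprecated in tidyr 1.0.0. A minimal sketch of the same pipeline without it is below, assuming tidyr >= 1.0.0 and that broom is installed. The ungroup() call and the select() step are my additions, not part of the reprex above; select() keeps only cyl and the list-column being unnested, which is roughly what .drop = TRUE used to do.

```r
library(dplyr)
library(tidyr)
library(purrr)

# same nested fit as above; ungroup() (an addition) returns a plain tibble
model_results <- mtcars %>%
  group_by(cyl) %>%
  nest() %>%
  ungroup() %>%
  mutate(mod_obj = map(data, ~ lm(mpg ~ wt, data = .x)),
         summaries = map(mod_obj, broom::glance),
         model_coef = map(mod_obj, broom::tidy))

# keep the grouping column plus one list-column, then unnest it
model_summaries <- model_results %>%
  select(cyl, summaries) %>%
  unnest(summaries)

model_coefs <- model_results %>%
  select(cyl, model_coef) %>%
  unnest(model_coef)
```

The number of columns glance() returns can differ between broom versions, so the printed dimensions may not match the output shown above exactly.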

While you can certainly do all of this with the split() method, the nest() method makes it easier to organize more complex operations and pipelines.
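
For comparison, here is a rough sketch of how the split() version of the same task might look. This is my own illustration rather than the code from the original question; the object names (by_cyl, models, split_summaries, split_coefs) are made up for the example, and it assumes purrr, dplyr, and broom are installed.

```r
library(purrr)

# split() returns a named list of data frames, one per cyl value
by_cyl <- split(mtcars, mtcars$cyl)

# fit one model per list element
models <- map(by_cyl, ~ lm(mpg ~ wt, data = .x))

# the group label lives only in the list names, so it has to be
# carried back into each result by hand with .id
split_summaries <- map_dfr(models, broom::glance, .id = "cyl")
split_coefs <- map_dfr(models, broom::tidy, .id = "cyl")
```

Here the data, the models, and each set of results end up in separate objects that are tied together only by list names, and cyl comes back as a character column instead of a number. With nest(), all of that stays in one tibble, row by row.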

As for the appropriateness of the post, I think it is perfect for this forum. One of the forum's main purposes is for R/tidyverse users to have exactly these sorts of discussions!