In examples using the purrr::map() family of functions, I see both split() and nest() being used to generate the inputs to the map function.
Questions:
- What are the pros and cons of the two approaches?
- Should either approach be preferred or recommended over the other, in general or for particular problem types?
Context:
I am helping develop best practices for my R-using colleagues, most of whom are new to purrr, so I want to get them started in the "right" direction. I've thought about it but can't come up with a good reason to suggest one over the other, so I wanted to see whether the community could share opinions or reasons for their preferences.
My personal preference is nest(), and it seems to me to be slightly more flexible and transparent - albeit a bit more verbose. However, that preference may be due to my underdeveloped "base" R skills (e.g. direct manipulation of lists).
If this question is too "opinion-y" for this forum, I'm happy to withdraw it.
Here is an example of almost the same analysis done with both approaches. (I couldn't quickly figure out how to get a column with cyl in the output of the split()-based method.)
> library(dplyr)
> library(tidyr)
> library(purrr)
>
> # use nest()
> mtcars %>%
+ group_by(cyl) %>%
+ nest() %>%
+ mutate(mod_obj = map(data, ~lm(mpg ~ wt, data = .x)),
+ summaries = map(mod_obj, broom::glance)) %>%
+ select(cyl, summaries) %>%
+ unnest(summaries)
# A tibble: 3 x 12
cyl r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <int>
1 6 0.4645102 0.3574122 1.165202 4.337245 0.09175766 2 -9.82518 25.65036 25.48809 6.788481 5
2 4 0.5086326 0.4540362 3.332283 9.316233 0.01374278 2 -27.74487 61.48974 62.68342 99.936983 9
3 8 0.4229655 0.3748793 2.024091 8.795985 0.01179281 2 -28.65778 63.31555 65.23272 49.163336 12
>
> # use split()
> mtcars %>%
+ split(.$cyl) %>%
+ map(~lm(mpg ~ wt, data = .)) %>%
+ map(~broom::glance(.)) %>%
+ reduce(bind_rows)
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual
1 0.5086326 0.4540362 3.332283 9.316233 0.01374278 2 -27.74487 61.48974 62.68342 99.936983 9
2 0.4645102 0.3574122 1.165202 4.337245 0.09175766 2 -9.82518 25.65036 25.48809 6.788481 5
3 0.4229655 0.3748793 2.024091 8.795985 0.01179281 2 -28.65778 63.31555 65.23272 49.163336 12
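One possible way to keep cyl in the split()-based output is sketched below. It assumes that split() names each list element after its group value, and that dplyr::bind_rows() can turn those names into a column through its .id argument. The name by_cyl is only illustrative.

```r
library(dplyr)
library(purrr)

# split() returns a list named by the grouping values ("4", "6", "8").
# bind_rows(.id = "cyl") copies those names into a new "cyl" column,
# so the group label survives the row-binding step.
by_cyl <- mtcars %>%
  split(.$cyl) %>%
  map(~ lm(mpg ~ wt, data = .x)) %>%
  map(broom::glance) %>%
  bind_rows(.id = "cyl")

by_cyl
```

Note that cyl comes back as a character column here, because it is taken from the list names, whereas the nest() version keeps it as a double.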