unnest removes rsample attributes which prevents some analysis on bootstrapped models for grouped data

This post is inspired by `unnest` removes `rsample` attributes · Issue #688 · tidyverse/tidyr · GitHub.
The post uses code inspired from Bootstrap Confidence Intervals • rsample.

Let's say I want to create bootstraps, fit models to those, and then analyze the models. That process is documented well. However, I can't seem to perform on grouped data the same post-model analyses that I can perform on ungrouped data. At least not in the context of using functions like int_*.

Let's look at the analysis with one group:

library(tidymodels)
library(nlstools)

# data
O2Ka <- O2K %>% mutate(group = "A")

ggplot(O2Ka, aes(x = t, y = VO2)) + 
    geom_point()

# build formula
nlin_formula <-  
    as.formula(
        VO2 ~ (t <= 5.883) * VO2rest + 
            (t > 5.883) * 
            (VO2rest + (VO2peak - VO2rest) * (1 - exp(-(t - 5.883) / mu)))
    )

# Starting values from visual inspection
start_values <- list(VO2rest = 400, VO2peak = 1600, mu = 1)

single_nls <- nls(nlin_formula, start = start_values, data = O2Ka) 

tidy(single_nls)

# Will be used to fit the models to different bootstrap data sets:
fit_nls_to_bootstraps <- function(split, ...) {
    # We could check for convergence, make new parameters, etc.
    nls(formula = nlin_formula,
        data = analysis(split),
        start = start_values, ...)
}

set.seed(462)

nlin_boot_no_grouping <-
    bootstraps(O2Ka, times = 1000, apparent = TRUE) %>%
    mutate(models = map(splits, fit_nls_to_bootstraps),
           coefs = map(models, tidy))

nlin_boot_no_grouping
# Notice it says "# Bootstrap sampling with apparent sample" above table output

ci_no_grouping <- int_pctl(nlin_boot_no_grouping, coefs)
ci_no_grouping

That last step of using int_pctl() works fine.

But try this on grouped data, and it throws an error.

# Let's have two groups now (even if just using the same data frame)

O2Kb <- O2K %>% mutate(group = "B")

O2Kg <- bind_rows(O2Ka, O2Kb)

# need bootstrap function to make it work in the nested workflow
bootstrapped <- function(x) bootstraps(x, times = 1000, apparent = TRUE)

set.seed(462)

nlin_boot_nested_grouping <- 
    O2Kg %>% 
    group_nest(group) %>% 
    mutate(straps = map(data, bootstrapped)) %>%
    # the next unnest step seems to remove the rsample attributes of the bootstraps
    unnest(straps) %>% 
    mutate(models = map(splits, fit_nls_to_bootstraps),
           coefs = map(models, tidy))

nlin_boot_nested_grouping
# compare to
nlin_boot_no_grouping

# Percentile intervals don't work
ci_nested_grouping <- int_pctl(nlin_boot_nested_grouping, coefs)
# Error `.data` should be an `rset` object generated from `bootstraps()`

There are work arounds, of course, but I wanted to see if anyone else has encoutered this problematic workflow as I can't seem to find much on it through internet searches.

One possible workaround:

nlin_boot_nested_grouping %>%
        unnest(coefs) %>%
        group_by(group, term) %>%
        summarise(lwr_CI = quantile(estimate, 0.025),
                  estimate = median(estimate),
                  upr_CI = quantile(estimate, 0.975))

Are you using grouping just because you want separate models per group?

You can try to group_by(...) %>% group_split() instead to create a list of data sets, then map your model fitting function over the list.

So I think what you are suggesting is that instead of creating a list within a data frame through the typical nested approach, to create a standalone list through group_split().

O2Kg %>% 
    group_split(group) %>% 
    map(bootstrapped) %>% 
    map(~ map(.$splits, fit_nls_to_bootstraps)) %>% 
    map(~ map(., tidy))

My issue with this approach is that I'm creating lots of lists that are not organized into a tidy data frame. So once I get to the end to summarize results or use int_pctl(), all I have is a list of model coefficients and no original models or bootstraps to refer to as I would with mutate in a data frame.

My other issue with this approach is that it has to use map inside map in order to apply, for example, the fitting function to each split in a list of splits in each list of splits in the list of data sets.

Or maybe you could share code on what you meant?

Something like this? Maybe this is similar to what you've been doing. Storing data and models in a tibble is very powerful for creating lots of different models, although granted, you have to use a lot of map calls.

library(tidyverse)
library(modelr)

iris_mod <- iris %>%
  nest(data = -Species) %>%
  mutate(
    form  = list(Sepal.Length ~ .),
    model = map2(form, data, lm),
    coeff = map(model, broom::tidy),
    rmse  = map2_dbl(model, data, rmse)
  )

iris_mod
#> # A tibble: 3 x 6
#>   Species    data              form      model  coeff             rmse
#>   <fct>      <list>            <list>    <list> <list>           <dbl>
#> 1 setosa     <tibble [50 x 4]> <formula> <lm>   <tibble [4 x 5]> 0.227
#> 2 versicolor <tibble [50 x 4]> <formula> <lm>   <tibble [4 x 5]> 0.321
#> 3 virginica  <tibble [50 x 4]> <formula> <lm>   <tibble [4 x 5]> 0.305

iris_mod %>% select(Species, coeff) %>% unnest(coeff)
#> # A tibble: 12 x 6
#>    Species    term         estimate std.error statistic  p.value
#>    <fct>      <chr>           <dbl>     <dbl>     <dbl>    <dbl>
#>  1 setosa     (Intercept)     2.35     0.393      5.99  3.03e- 7
#>  2 setosa     Sepal.Width     0.655    0.0924     7.08  6.83e- 9
#>  3 setosa     Petal.Length    0.238    0.208      1.14  2.59e- 1
#>  4 setosa     Petal.Width     0.252    0.347      0.727 4.71e- 1
#>  5 versicolor (Intercept)     1.90     0.507      3.74  5.11e- 4
#>  6 versicolor Sepal.Width     0.387    0.205      1.89  6.49e- 2
#>  7 versicolor Petal.Length    0.908    0.165      5.49  1.67e- 6
#>  8 versicolor Petal.Width    -0.679    0.435     -1.56  1.26e- 1
#>  9 virginica  (Intercept)     0.700    0.534      1.31  1.96e- 1
#> 10 virginica  Sepal.Width     0.330    0.174      1.89  6.44e- 2
#> 11 virginica  Petal.Length    0.946    0.0907    10.4   1.07e-13
#> 12 virginica  Petal.Width    -0.170    0.198     -0.857 3.96e- 1

Created on 2021-07-29 by the reprex package (v1.0.0)

I appreciate your help. I know how to use nest and mutate. But I'm mostly interested in getting the int_pctl()call to work in the bootstrap workflow on multiple groups within a dataset. The tutorial in the link uses bootstraps through rsample, but fails to do so on grouped data.

The iris reprex you provided skips any sort of bootstrapping.

Maybe try something like this. Keep everything nested and use a map inside a map?

library(tidyverse)
library(modelr)

iris_mod <- iris %>%
  nest(data = -Species) %>%
  mutate(
    form  = list(Sepal.Length ~ .),
    boot_data = map(data, bootstrap, 10),
    boot_model = map2(form %>% map(list),
                      boot_data %>% map(pluck, "strap"), 
                      map2, 
                      lm),   # this is a little crazy...
    boot_coeff = boot_model %>% map(map_dfr, broom::tidy)
    #boot_coef2 = map2(boot_data, boot_coeff, int_pctl)  # try something like this
  )

iris_mod
#> # A tibble: 3 x 6
#>   Species    data            form     boot_data       boot_model boot_coeff     
#>   <fct>      <list>          <list>   <list>          <list>     <list>         
#> 1 setosa     <tibble [50 x ~ <formul~ <tibble [10 x ~ <list [10~ <tibble [40 x ~
#> 2 versicolor <tibble [50 x ~ <formul~ <tibble [10 x ~ <list [10~ <tibble [40 x ~
#> 3 virginica  <tibble [50 x ~ <formul~ <tibble [10 x ~ <list [10~ <tibble [40 x ~

Created on 2021-07-29 by the reprex package (v1.0.0)

I don't have tidymodels installed currently so I couldn't test it.

I think you could do something along these lines :


set.seed(462)

nlin_boot_grouping <- 
  O2Kg %>% 
  group_split(group) %>% set_names(unique(O2Kg$group)) %>% map(
  ~{
    bootstraps(.x, times = 1000, apparent = TRUE) %>%
    mutate(models = map(splits, fit_nls_to_bootstraps)) %>%
    mutate(  coefs = map(models, tidy))
   })

nlin_boot_nested_grouping

ci_grouping <- imap_dfr(nlin_boot_grouping ,
                            ~int_pctl(.x, coefs) %>% 
                                 mutate(group=.y) %>% 
                                         relocate(group))

ci_grouping

@nirgrahamuk This is very close to what I was looking for. I had not learned yet about the imap_* functions, and so I think your solution is closest to what I had in mind.

Thank you.

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.