unnest removes rsample attributes which prevents some analysis on bootstrapped models for grouped data

awprc · July 29, 2021, 2:42pm

This post is inspired by `unnest` removes `rsample` attributes · Issue #688 · tidyverse/tidyr · GitHub.
The post uses code inspired from Bootstrap Confidence Intervals • rsample.

Let's say I want to create bootstraps, fit models to those, and then analyze the models. That process is documented well. However, I can't seem to perform on grouped data the same post-model analyses that I can perform on ungrouped data. At least not in the context of using functions like int_*.

Let's look at the analysis with one group:

library(tidymodels)
library(nlstools)

# data
O2Ka <- O2K %>% mutate(group = "A")

ggplot(O2Ka, aes(x = t, y = VO2)) + 
    geom_point()

# build formula
nlin_formula <-  
    as.formula(
        VO2 ~ (t <= 5.883) * VO2rest + 
            (t > 5.883) * 
            (VO2rest + (VO2peak - VO2rest) * (1 - exp(-(t - 5.883) / mu)))
    )

# Starting values from visual inspection
start_values <- list(VO2rest = 400, VO2peak = 1600, mu = 1)

single_nls <- nls(nlin_formula, start = start_values, data = O2Ka) 

tidy(single_nls)

# Will be used to fit the models to different bootstrap data sets:
fit_nls_to_bootstraps <- function(split, ...) {
    # We could check for convergence, make new parameters, etc.
    nls(formula = nlin_formula,
        data = analysis(split),
        start = start_values, ...)
}

set.seed(462)

nlin_boot_no_grouping <-
    bootstraps(O2Ka, times = 1000, apparent = TRUE) %>%
    mutate(models = map(splits, fit_nls_to_bootstraps),
           coefs = map(models, tidy))

nlin_boot_no_grouping
# Notice it says "# Bootstrap sampling with apparent sample" above table output

ci_no_grouping <- int_pctl(nlin_boot_no_grouping, coefs)
ci_no_grouping

That last step of using int_pctl() works fine.

But try this on grouped data, and it throws an error.

# Let's have two groups now (even if just using the same data frame)

O2Kb <- O2K %>% mutate(group = "B")

O2Kg <- bind_rows(O2Ka, O2Kb)

# need bootstrap function to make it work in the nested workflow
bootstrapped <- function(x) bootstraps(x, times = 1000, apparent = TRUE)

set.seed(462)

nlin_boot_nested_grouping <- 
    O2Kg %>% 
    group_nest(group) %>% 
    mutate(straps = map(data, bootstrapped)) %>%
    # the next unnest step seems to remove the rsample attributes of the bootstraps
    unnest(straps) %>% 
    mutate(models = map(splits, fit_nls_to_bootstraps),
           coefs = map(models, tidy))

nlin_boot_nested_grouping
# compare to
nlin_boot_no_grouping

# Percentile intervals don't work
ci_nested_grouping <- int_pctl(nlin_boot_nested_grouping, coefs)
# Error `.data` should be an `rset` object generated from `bootstraps()`

There are work arounds, of course, but I wanted to see if anyone else has encoutered this problematic workflow as I can't seem to find much on it through internet searches.

One possible workaround:

nlin_boot_nested_grouping %>%
        unnest(coefs) %>%
        group_by(group, term) %>%
        summarise(lwr_CI = quantile(estimate, 0.025),
                  estimate = median(estimate),
                  upr_CI = quantile(estimate, 0.975))

arthur.t · July 29, 2021, 5:21pm

Are you using grouping just because you want separate models per group?

You can try to group_by(...) %>% group_split() instead to create a list of data sets, then map your model fitting function over the list.

awprc · July 29, 2021, 6:46pm

So I think what you are suggesting is that instead of creating a list within a data frame through the typical nested approach, to create a standalone list through group_split().

O2Kg %>% 
    group_split(group) %>% 
    map(bootstrapped) %>% 
    map(~ map(.$splits, fit_nls_to_bootstraps)) %>% 
    map(~ map(., tidy))

My issue with this approach is that I'm creating lots of lists that are not organized into a tidy data frame. So once I get to the end to summarize results or use int_pctl(), all I have is a list of model coefficients and no original models or bootstraps to refer to as I would with mutate in a data frame.

My other issue with this approach is that it has to use map inside map in order to apply, for example, the fitting function to each split in a list of splits in each list of splits in the list of data sets.

Or maybe you could share code on what you meant?

arthur.t · July 29, 2021, 7:11pm

Something like this? Maybe this is similar to what you've been doing. Storing data and models in a tibble is very powerful for creating lots of different models, although granted, you have to use a lot of map calls.

library(tidyverse)
library(modelr)

iris_mod <- iris %>%
  nest(data = -Species) %>%
  mutate(
    form  = list(Sepal.Length ~ .),
    model = map2(form, data, lm),
    coeff = map(model, broom::tidy),
    rmse  = map2_dbl(model, data, rmse)
  )

iris_mod
#> # A tibble: 3 x 6
#>   Species    data              form      model  coeff             rmse
#>   <fct>      <list>            <list>    <list> <list>           <dbl>
#> 1 setosa     <tibble [50 x 4]> <formula> <lm>   <tibble [4 x 5]> 0.227
#> 2 versicolor <tibble [50 x 4]> <formula> <lm>   <tibble [4 x 5]> 0.321
#> 3 virginica  <tibble [50 x 4]> <formula> <lm>   <tibble [4 x 5]> 0.305

iris_mod %>% select(Species, coeff) %>% unnest(coeff)
#> # A tibble: 12 x 6
#>    Species    term         estimate std.error statistic  p.value
#>    <fct>      <chr>           <dbl>     <dbl>     <dbl>    <dbl>
#>  1 setosa     (Intercept)     2.35     0.393      5.99  3.03e- 7
#>  2 setosa     Sepal.Width     0.655    0.0924     7.08  6.83e- 9
#>  3 setosa     Petal.Length    0.238    0.208      1.14  2.59e- 1
#>  4 setosa     Petal.Width     0.252    0.347      0.727 4.71e- 1
#>  5 versicolor (Intercept)     1.90     0.507      3.74  5.11e- 4
#>  6 versicolor Sepal.Width     0.387    0.205      1.89  6.49e- 2
#>  7 versicolor Petal.Length    0.908    0.165      5.49  1.67e- 6
#>  8 versicolor Petal.Width    -0.679    0.435     -1.56  1.26e- 1
#>  9 virginica  (Intercept)     0.700    0.534      1.31  1.96e- 1
#> 10 virginica  Sepal.Width     0.330    0.174      1.89  6.44e- 2
#> 11 virginica  Petal.Length    0.946    0.0907    10.4   1.07e-13
#> 12 virginica  Petal.Width    -0.170    0.198     -0.857 3.96e- 1

^{Created on 2021-07-29 by the reprex package (v1.0.0)}

awprc · July 29, 2021, 7:24pm

I appreciate your help. I know how to use nest and mutate. But I'm mostly interested in getting the int_pctl()call to work in the bootstrap workflow on multiple groups within a dataset. The tutorial in the link uses bootstraps through rsample, but fails to do so on grouped data.

The iris reprex you provided skips any sort of bootstrapping.

arthur.t · July 29, 2021, 7:48pm

Maybe try something like this. Keep everything nested and use a map inside a map?

library(tidyverse)
library(modelr)

iris_mod <- iris %>%
  nest(data = -Species) %>%
  mutate(
    form  = list(Sepal.Length ~ .),
    boot_data = map(data, bootstrap, 10),
    boot_model = map2(form %>% map(list),
                      boot_data %>% map(pluck, "strap"), 
                      map2, 
                      lm),   # this is a little crazy...
    boot_coeff = boot_model %>% map(map_dfr, broom::tidy)
    #boot_coef2 = map2(boot_data, boot_coeff, int_pctl)  # try something like this
  )

iris_mod
#> # A tibble: 3 x 6
#>   Species    data            form     boot_data       boot_model boot_coeff     
#>   <fct>      <list>          <list>   <list>          <list>     <list>         
#> 1 setosa     <tibble [50 x ~ <formul~ <tibble [10 x ~ <list [10~ <tibble [40 x ~
#> 2 versicolor <tibble [50 x ~ <formul~ <tibble [10 x ~ <list [10~ <tibble [40 x ~
#> 3 virginica  <tibble [50 x ~ <formul~ <tibble [10 x ~ <list [10~ <tibble [40 x ~

^{Created on 2021-07-29 by the reprex package (v1.0.0)}

arthur.t · July 29, 2021, 7:50pm

I don't have tidymodels installed currently so I couldn't test it.

nirgrahamuk · July 30, 2021, 11:53am

I think you could do something along these lines :


set.seed(462)

nlin_boot_grouping <- 
  O2Kg %>% 
  group_split(group) %>% set_names(unique(O2Kg$group)) %>% map(
  ~{
    bootstraps(.x, times = 1000, apparent = TRUE) %>%
    mutate(models = map(splits, fit_nls_to_bootstraps)) %>%
    mutate(  coefs = map(models, tidy))
   })

nlin_boot_nested_grouping

ci_grouping <- imap_dfr(nlin_boot_grouping ,
                            ~int_pctl(.x, coefs) %>% 
                                 mutate(group=.y) %>% 
                                         relocate(group))

ci_grouping

awprc · July 30, 2021, 2:08pm

@nirgrahamuk This is very close to what I was looking for. I had not learned yet about the imap_* functions, and so I think your solution is closest to what I had in mind.

Thank you.

system · August 20, 2021, 2:09pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.