How to use rsample for multilevel resampling

In multilevel modeling, observations are nested within grouping variables. For example, the lme4::sleepstudy dataset has 10 observations from each of 18 subjects. When bootstrapping data like this for modeling, it makes sense to resample whole subjects. The best workflow I know of for doing this with rsample is the following:

library(rsample)
library(tidyverse)

lme4::sleepstudy |> 
  # resample unique ids
  distinct(Subject) |> 
  bootstraps(times = 10) |> 
  # attach the original data to the ids
  mutate(
    analysis = lapply(
      splits, 
      function(x) left_join(analysis(x), lme4::sleepstudy, by = "Subject")
    )
  )

Note that this is wasteful: every resample ends up holding its own joined copy of the original rows.

I have tried to make a function that does low-level manipulation of the rset object (replacing the data and in_id fields) but this feels like cheating.
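
For comparison, here is a minimal sketch of a variant that builds the splits through rsample's exported constructors (make_splits() and manual_rset()) instead of editing rset fields directly. The helper name subject_boot_split() and the objects rows_by_subject and boot_manual are made up for illustration and are not part of rsample. Only row indices are stored per split, so rows are materialized when analysis() or assessment() is called.

```r
library(rsample)

dat <- lme4::sleepstudy

# Row indices belonging to each subject, computed once
rows_by_subject <- split(seq_len(nrow(dat)), dat$Subject)

# Hypothetical helper (not an rsample function): draw subjects with
# replacement and turn the draw into a single split of row indices
subject_boot_split <- function(dat, rows_by_subject) {
  ids <- names(rows_by_subject)
  drawn <- sample(ids, length(ids), replace = TRUE)
  in_rows <- unlist(rows_by_subject[drawn], use.names = FALSE)
  # Subjects that were never drawn form the assessment set
  out_rows <- as.integer(unlist(rows_by_subject[setdiff(ids, drawn)], use.names = FALSE))
  make_splits(list(analysis = in_rows, assessment = out_rows), data = dat)
}

set.seed(123)
splits <- lapply(1:10, function(i) subject_boot_split(dat, rows_by_subject))
boot_manual <- manual_rset(splits, ids = sprintf("Bootstrap%02d", 1:10))
```

A subject drawn twice contributes its rows twice but keeps the same Subject value, so a model that needs distinct cluster labels would have to relabel the duplicates.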

Is there a better way to use bootstraps() to bootstrap chunks of data where the units being resampled may represent multiple rows of data?

I don't have a more elegant solution than what you've done. This is basically what I've done in the past when resampling a multi-level data set. I am only chiming in to say that I would love for {rsample} (or an adjacent package) to support multi-level resampling, much as spatial resampling is supported in {spatialsample}.

This type of hierarchical resampling occurs a lot for me, and some tidymodels-friendly functions would be a great addition to the ecosystem. Just throwing in my 2 cents in case @Max appears.

Relevant:

Oh, and my only other contribution is that group_vfold_cv() can moonlight for multi-level loo_cv(), if you group on the multi-level grouping variable. But this doesn't help us for other forms of resampling, such as bootstraps.
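
A rough sketch of that trick on the sleepstudy data, assuming v is left unset so that group_vfold_cv() creates one fold per subject. The names folds and held_out are just for illustration.

```r
library(rsample)

# With `v` left at its default, each fold holds out one whole subject
folds <- group_vfold_cv(lme4::sleepstudy, group = Subject)

# Count how many distinct subjects sit in each assessment set
held_out <- vapply(
  folds$splits,
  function(s) length(unique(assessment(s)$Subject)),
  integer(1)
)
```

Here folds should have one row per subject, and every entry of held_out should be 1.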

I agree that we need more functions like these in rsample. I would go add thumbs up to the GH issues in those repos.

I suspect that group_vfold_cv() is the best that we have at the moment.

@Devin_Pastoor has a nice function here:

It handled IDs/Keys and strata.

It's not tidymodels compatible, but it got the job done.

Hi all!

I just wanted to share that this is now in the development version of rsample:

library(rsample)
library(tidyverse)

set.seed(123)
boot1 <- lme4::sleepstudy |> 
  group_bootstraps(times = 10, Subject)

boot1
#> # Bootstrap sampling 
#> # A tibble: 10 × 2
#>    splits           id         
#>    <list>           <chr>      
#>  1 <split [180/60]> Bootstrap01
#>  2 <split [180/70]> Bootstrap02
#>  3 <split [180/80]> Bootstrap03
#>  4 <split [180/80]> Bootstrap04
#>  5 <split [180/70]> Bootstrap05
#>  6 <split [180/60]> Bootstrap06
#>  7 <split [180/60]> Bootstrap07
#>  8 <split [180/70]> Bootstrap08
#>  9 <split [180/60]> Bootstrap09
#> 10 <split [180/60]> Bootstrap10

unique(analysis(boot1$splits[[1]])$Subject)
#>  [1] 308 309 330 333 335 337 349 350 352 369 370 372
#> 18 Levels: 308 309 310 330 331 332 333 334 335 337 349 350 351 352 369 ... 372
unique(assessment(boot1$splits[[1]])$Subject)
#> [1] 310 331 332 334 351 371
#> 18 Levels: 308 309 310 330 331 332 333 334 335 337 349 350 351 352 369 ... 372

Created on 2022-06-30 by the reprex package (v2.0.1)

This won't be on CRAN for a few months, but is in the GitHub version.
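
In case it helps, here is a small sketch of how the resamples might be used once a version of rsample that exports group_bootstraps() is installed. The plain lm() fit is only a stand-in to keep the example short (it is not a multilevel model), and slopes is an illustrative name.

```r
library(rsample)

set.seed(123)
boot1 <- group_bootstraps(lme4::sleepstudy, group = Subject, times = 10)

# Refit a simple regression on each subject-level resample and keep the slope
slopes <- vapply(
  boot1$splits,
  function(s) coef(lm(Reaction ~ Days, data = analysis(s)))[["Days"]],
  numeric(1)
)

# Rough percentile interval for the slope across resamples
quantile(slopes, c(0.025, 0.975))
```

With only 10 resamples the interval is very rough. A larger value of times would be the usual choice for real work.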
