Multiple estimates/statistics from bootstrapping with rsample?

clausp · March 18, 2020, 5:20pm

I am using rsample to run some bootstraps and am trying to figure out whether there is a way to have it calculate multiple statistics. The following works great if I have one statistic:

suppressMessages(library(tidyverse))
library(rsample)

# Bootstrapping with one statistic
set.seed(2)
bootstraps(mtcars, times = 2000, apparent = TRUE) %>% 
  mutate(ratio = map(splits, ~ {
    df <- analysis(.x)
    tibble(
      term = "ratio",
      estimate  = mean((df$carb < 3)),
      std.error = NA_real_
    )
  }
  )) %>% 
  int_pctl(ratio) 
#> # A tibble: 1 x 6
#>   term  .lower .estimate .upper .alpha .method   
#>   <chr>  <dbl>     <dbl>  <dbl>  <dbl> <chr>     
#> 1 ratio  0.375     0.534  0.719   0.05 percentile

Created on 2020-03-18 by the reprex package (v0.2.1)

If I need two different statistics, I can do it the following way, although that means having to merge in the actual estimates afterward (not done here):

suppressMessages(library(tidyverse))
library(rsample)

set.seed(2)
bootstraps(mtcars, times = 2000) %>% 
  mutate(ratio = map(splits, ~ {
    df <- analysis(.x)
    tibble(
      estimate_1 = mean((df$carb < 3)),
      estimate_2 = mean(df[df$carb > 3, ]$mpg)
    )
  }
  )) %>% 
  unnest(cols = ratio) %>% 
  summarise(
    est_1_lower = quantile(estimate_1, 0.025),
    est_1_upper = quantile(estimate_1, 0.975),
    est_2_lower = quantile(estimate_2, 0.025),
    est_2_upper = quantile(estimate_2, 0.975)
  )
#> # A tibble: 1 x 4
#>   est_1_lower est_1_upper est_2_lower est_2_upper
#>         <dbl>       <dbl>       <dbl>       <dbl>
#> 1       0.375       0.719        13.9        18.1

Created on 2020-03-18 by the reprex package (v0.2.1)
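For completeness, here is a rough sketch of how the merge step that was skipped above might look. It assumes the same two statistics; the helper name `stats_for` and the object names `intervals` and `merged` are made up for illustration and are not part of rsample.

```r
suppressMessages(library(tidyverse))
library(rsample)

# Illustrative helper (not an rsample function): both statistics for one data frame.
stats_for <- function(df) {
  tibble(
    estimate_1 = mean(df$carb < 3),
    estimate_2 = mean(df[df$carb > 3, ]$mpg)
  )
}

set.seed(2)
# Percentile limits computed by hand from the bootstrap resamples.
intervals <- bootstraps(mtcars, times = 2000) %>%
  mutate(stats = map(splits, ~ stats_for(analysis(.x)))) %>%
  unnest(cols = stats) %>%
  summarise(
    est_1_lower = quantile(estimate_1, 0.025),
    est_1_upper = quantile(estimate_1, 0.975),
    est_2_lower = quantile(estimate_2, 0.025),
    est_2_upper = quantile(estimate_2, 0.975)
  )

# Estimates on the original data, bound to the one-row interval tibble.
merged <- bind_cols(stats_for(mtcars), intervals)
merged
```

This works, but every new statistic needs two more hand-written summarise() lines, which is why I would prefer to let int_pctl do it.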

The following is what I would like to be able to do, but I cannot figure out how to get int_pctl to accept something other than estimate as the variable name.

# This is what I would like
set.seed(2)
bootstraps(mtcars, times = 2000, apparent = TRUE) %>% 
  mutate(ratio = map(splits, ~ {
    df <- analysis(.x)
    tibble(
      term_1 = "ratio",
      estimate_1  = mean((df$carb < 3)),
      std.error_1 = NA_real_,
      term_2 = "mean",
      estimate_2 = mean(df[df$carb > 3, ]$mpg),
      std.error_2 = NA_real_
    )
  }
  )) %>% 
  int_pctl(ratio, mean)

Is it possible to get int_pctl to handle multiple names?

technocrat · March 18, 2020, 7:23pm

This is kind of hard to follow without a full reprex with all the functions defined, such as int_pctl.

Conceptually, however, what you should be thinking of is a function f that takes some object as its argument, such as a data frame, and returns a result, which may be an object with multiple variables. Then you bootstrap that object.
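A minimal base R sketch of that idea, assuming the two statistics from the question. The function name `f` and the names of the returned values are illustrative only, and this is a hand-rolled percentile bootstrap rather than anything rsample does internally:

```r
# Hypothetical statistic function: takes a data frame, returns a named vector.
f <- function(df) {
  c(
    low_carb_share = mean(df$carb < 3),
    high_carb_mpg  = mean(df$mpg[df$carb > 3])
  )
}

set.seed(2)
# Resample rows with replacement and apply f to each resample.
boot_stats <- replicate(2000, {
  idx <- sample(nrow(mtcars), replace = TRUE)
  f(mtcars[idx, , drop = FALSE])
})

# boot_stats is a 2 x 2000 matrix with one row per statistic.
# Take the 2.5% and 97.5% quantiles of each row; na.rm guards against
# a resample that happens to contain no rows with carb > 3.
ci <- apply(boot_stats, 1, quantile, probs = c(0.025, 0.975), na.rm = TRUE)
ci
```

Because f returns all statistics at once, each resample is drawn only once and adding a third statistic only means adding one more element to the vector.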

Max · March 18, 2020, 11:29pm

The help page says:

An unquoted column name or dplyr selector that identifies a single column in the data set that contains the individual bootstrap estimates. This can be a list column of tidy tibbles (that contains columns term and estimate) or a simple numeric column. For t-intervals, a standard tidy column (usually called std.err) is required. See the examples below.

Here is an adaptation for your example:

suppressMessages(library(tidyverse))
library(rsample)

compute <- function(split) {
  df <- analysis(split)
  tibble(term = c("low carb", "high carb"),
         estimate = c(mean((df$carb < 3)), mean(df[df$carb > 3,]$mpg)))
}

set.seed(2)
bt <-
  bootstraps(mtcars, times = 2000, apparent = TRUE) %>%
  mutate(ratio = map(splits, ~ compute(.x)))

int_pctl(bt, ratio)
#> # A tibble: 2 x 6
#>   term      .lower .estimate .upper .alpha .method   
#>   <chr>      <dbl>     <dbl>  <dbl>  <dbl> <chr>     
#> 1 high carb 13.9      16.0   18.1     0.05 percentile
#> 2 low carb   0.375     0.534  0.719   0.05 percentile

Created on 2020-03-18 by the reprex package (v0.3.0)

clausp · March 19, 2020, 2:45pm

@Max Thank you so very much! I have read that sentence way too many times, but I always focused on the word "column", and the example shows only one statistic.

I do think the help could be a little clearer. If I write up an example based on what you did and open a pull request, would that be okay, or would it be easier if you added it directly? I could, for example, do one for the iris data using mean and median (silly and simple, but it would serve to illustrate).
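Something along these lines is what I have in mind. This is only a sketch: the choice of Sepal.Length, the helper name `center_stats`, and the column name `stats` are my own assumptions, not anything from the rsample documentation.

```r
suppressMessages(library(tidyverse))
library(rsample)

# Hypothetical helper: one row per statistic, using the term/estimate
# layout described in the help text for list columns.
center_stats <- function(split) {
  df <- analysis(split)
  tibble(
    term     = c("mean", "median"),
    estimate = c(mean(df$Sepal.Length), median(df$Sepal.Length))
  )
}

set.seed(2)
bt_iris <-
  bootstraps(iris, times = 2000, apparent = TRUE) %>%
  mutate(stats = map(splits, center_stats))

# One percentile interval per distinct value of term.
int_pctl(bt_iris, stats)
```

The result should have two rows, one for the mean and one for the median, which is the point the current example does not make obvious.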

Claus

Max · March 19, 2020, 4:11pm

Please submit a PR. We can always make the documentation better.
