Prediction intervals with tidymodels, best practices?

I put the above approach into a couple of rough, quick functions: prep_interval(), which takes a workflow (with a recipe and model specification) plus the training data and outputs a list of the objects needed to produce prediction intervals, and predict_interval(), which takes that output plus new data and produces prediction intervals for it. See the gist referenced below for documentation. The code below should essentially be equivalent to my prior example with rpart...

library(tidyverse)
library(tidymodels)

set.seed(123)

iris <- as_tibble(iris)
split <- initial_split(iris)

train <- training(split)
test <- testing(split)

dt_mod <- parsnip::decision_tree() %>% 
  set_engine("rpart") %>% 
  set_mode("regression")

dt_rec <- recipe(Sepal.Length ~ Sepal.Width, data = train)

dt_wf <- workflows::workflow() %>% 
  add_model(dt_mod) %>% 
  add_recipe(dt_rec)

devtools::source_gist("https://gist.github.com/brshallo/3db2cd25172899f91b196a90d5980690")

# Maybe would be better to allow a more custom resamples object as well...
prepped_for_interval <- prep_interval(dt_wf, train)

prepped_for_interval
#> $model_uncertainty
#> # A tibble: 10 x 2
#>    fit      recipe  
#>    <list>   <list>  
#>  1 <fit[+]> <recipe>
#>  2 <fit[+]> <recipe>
#>  3 <fit[+]> <recipe>
#>  4 <fit[+]> <recipe>
#>  5 <fit[+]> <recipe>
#>  6 <fit[+]> <recipe>
#>  7 <fit[+]> <recipe>
#>  8 <fit[+]> <recipe>
#>  9 <fit[+]> <recipe>
#> 10 <fit[+]> <recipe>
#> 
#> $sample_uncertainty
#> # A tibble: 113 x 1
#>     .resid
#>      <dbl>
#>  1  1.25  
#>  2 -0.0444
#>  3  0.256 
#>  4 -0.100 
#>  5  1.75  
#>  6  0.556 
#>  7 -0.543 
#>  8 -0.453 
#>  9  0.947 
#> 10 -0.443 
#> # ... with 103 more rows

pred_interval <- predict_interval(prepped_for_interval, test, probs = c(0.05, 0.95)) 

pred_interval
#> # A tibble: 37 x 2
#>    probs_0.05 probs_0.95
#>         <dbl>      <dbl>
#>  1       4.26       7.31
#>  2       4.00       7.02
#>  3       3.90       6.82
#>  4       4.40       7.69
#>  5       3.71       6.73
#>  6       4.00       7.01
#>  7       4.26       7.29
#>  8       3.70       6.74
#>  9       4.54       7.88
#> 10       3.91       7.26
#> # ... with 27 more rows

Created on 2021-03-04 by the reprex package (v0.3.0)

@Max the correct approach may be to lean on research in conformal prediction / inference. I pasted a few resources I skimmed below, though I still need to look into them more closely (it seems like much of the research here comes out of either Carnegie Mellon or Royal Holloway, University of London):

The resources suggest that some methods have high computational costs (e.g. jackknife+) while others are cheaper (e.g. split-conformal)... but again, I need to read more closely.
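
For what it's worth, split-conformal looks simple enough to sketch in a few lines. This is only a rough illustration of the general idea, not something taken from the gist or the resources: it assumes absolute residuals as the conformity score and a random three-way split, calls rpart directly instead of going through a workflow, and the object names (calib, scores, q_hat, split_conformal) are made up for the example.

```r
library(rpart)

set.seed(123)

# Three-way split: fit the model, calibrate the interval width, then evaluate.
n <- nrow(iris)
part <- sample(rep(c("train", "calib", "test"), length.out = n))
train <- iris[part == "train", ]
calib <- iris[part == "calib", ]
test  <- iris[part == "test", ]

# The model is fit once, on the training part only.
fit <- rpart(Sepal.Length ~ Sepal.Width, data = train)

# Conformity scores: absolute residuals on the held-out calibration part.
scores <- abs(calib$Sepal.Length - unname(predict(fit, newdata = calib)))

# Finite-sample-adjusted quantile of the scores for a 90% interval.
alpha <- 0.10
n_cal <- length(scores)
k <- ceiling((n_cal + 1) * (1 - alpha))
q_hat <- if (k > n_cal) Inf else sort(scores)[k]

# Every interval is the point prediction plus/minus the same half-width.
pred <- unname(predict(fit, newdata = test))
split_conformal <- data.frame(
  .pred       = pred,
  .pred_lower = pred - q_hat,
  .pred_upper = pred + q_hat
)

# Empirical coverage on the test part (should land near 1 - alpha on average).
coverage <- mean(
  test$Sepal.Length >= split_conformal$.pred_lower &
    test$Sepal.Length <= split_conformal$.pred_upper
)

head(split_conformal)
coverage
```

The appeal is that it needs a single model fit, at the price of holding data out for calibration and getting intervals of constant width. As I understand it, jackknife+ avoids the held-out set but requires a leave-one-out refit per training row, which is where the extra computation comes from.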
