Prediction intervals with tidymodels, best practices?

brshallo · March 4, 2021, 1:46pm

I put the above approach into a couple of rough, quick functions. prep_interval() takes a workflow (with a recipe and model specification) plus training data and returns a list of the objects needed to produce prediction intervals. predict_interval() takes that list plus new data and returns prediction intervals for the new data. See the gist referenced below for documentation. The code below should be essentially equivalent to my prior example with rpart...

library(tidyverse)
library(tidymodels)

set.seed(123)

iris <- as_tibble(iris)
split <- initial_split(iris)

train <- training(split)
test <- testing(split)

dt_mod <- parsnip::decision_tree() %>% 
  set_engine("rpart") %>% 
  set_mode("regression")

dt_rec <- recipe(Sepal.Length ~ Sepal.Width, data = train)

dt_wf <- workflows::workflow() %>% 
  add_model(dt_mod) %>% 
  add_recipe(dt_rec)

devtools::source_gist("https://gist.github.com/brshallo/3db2cd25172899f91b196a90d5980690")

# Maybe would be better to allow a more custom resamples object as well...
prepped_for_interval <- prep_interval(dt_wf, train)

prepped_for_interval
#> $model_uncertainty
#> # A tibble: 10 x 2
#>    fit      recipe  
#>    <list>   <list>  
#>  1 <fit[+]> <recipe>
#>  2 <fit[+]> <recipe>
#>  3 <fit[+]> <recipe>
#>  4 <fit[+]> <recipe>
#>  5 <fit[+]> <recipe>
#>  6 <fit[+]> <recipe>
#>  7 <fit[+]> <recipe>
#>  8 <fit[+]> <recipe>
#>  9 <fit[+]> <recipe>
#> 10 <fit[+]> <recipe>
#> 
#> $sample_uncertainty
#> # A tibble: 113 x 1
#>     .resid
#>      <dbl>
#>  1  1.25  
#>  2 -0.0444
#>  3  0.256 
#>  4 -0.100 
#>  5  1.75  
#>  6  0.556 
#>  7 -0.543 
#>  8 -0.453 
#>  9  0.947 
#> 10 -0.443 
#> # ... with 103 more rows

pred_interval <- predict_interval(prepped_for_interval, test, probs = c(0.05, 0.95)) 

pred_interval
#> # A tibble: 37 x 2
#>    probs_0.05 probs_0.95
#>         <dbl>      <dbl>
#>  1       4.26       7.31
#>  2       4.00       7.02
#>  3       3.90       6.82
#>  4       4.40       7.69
#>  5       3.71       6.73
#>  6       4.00       7.01
#>  7       4.26       7.29
#>  8       3.70       6.74
#>  9       4.54       7.88
#> 10       3.91       7.26
#> # ... with 27 more rows

Created on 2021-03-04 by the reprex package (v0.3.0)
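
A quick sanity check on intervals like these is their empirical coverage: the share of held-out outcomes that actually fall inside their interval. A minimal sketch follows; interval_coverage() is a hypothetical helper (it is not part of the gist or of tidymodels), and the toy vectors are made up for illustration only.

```r
# interval_coverage() is a hypothetical helper, not part of the gist or tidymodels.
# It returns the proportion of true values that fall inside [lower, upper].
interval_coverage <- function(truth, lower, upper) {
  stopifnot(
    length(truth) == length(lower),
    length(truth) == length(upper)
  )
  mean(truth >= lower & truth <= upper)
}

# Toy values for illustration only (not taken from the output above)
truth <- c(5.1, 6.3, 7.9, 4.4)
lower <- c(4.0, 4.0, 4.0, 4.5)
upper <- c(7.0, 7.0, 7.0, 7.5)

interval_coverage(truth, lower, upper)
#> [1] 0.5
```

With the objects from the reprex, the call would be interval_coverage(test$Sepal.Length, pred_interval$probs_0.05, pred_interval$probs_0.95). For a 5% to 95% interval the nominal target is 0.90, though with only 37 test rows the estimate will be noisy.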

@Max the correct approach may be to lean on research in conformal prediction / inference. I pasted a few resources I skimmed below, though I need to look into them more closely (it seems like much of the research here comes out of either Carnegie Mellon or Royal Holloway, University of London):

ryantibs/conformal: GitHub repo with the conformalInference R package and links to relevant articles on distribution-free predictive inference. conformalInference seems to be set up not too dissimilarly from the set-up above (in that it takes a model-generating algorithm as input), so it seems an interface could be set up in a way that is pretty tidy-friendly (e.g. add_conformal() ...)
donlnz/nonconformist: Python package
Conformal Prediction: Link to the Royal Holloway, University of London website by the creators of the method, Vladimir Vovk and Alex Gammerman.
Assumption-free prediction intervals for black-box regression algorithms - Aaditya Ramdas (YouTube): CMU professor giving an overview of the problem, approaches, and current "state of the art"
Tutorial on conformal inference, Dataiku article, Analytics Vidhya article

The resources suggest some methods have high computational costs (e.g. jackknife+) and others less so (e.g. split-conformal)... but again, I need to read more closely.
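
To make the cost difference concrete, here is a minimal base-R sketch of the split-conformal idea as I understand it: fit the model once on one part of the training data, then use absolute residuals from the remaining calibration part to set a single interval half-width. Only one model fit is needed, whereas jackknife+ refits the model once per training row. split_conformal() and its arguments are hypothetical names (not from conformalInference or tidymodels), and lm() stands in for any regression model.

```r
# Split-conformal prediction intervals: a minimal base-R sketch.
# split_conformal() is a hypothetical helper, not a function from
# conformalInference, tidymodels, or the gist above.
split_conformal <- function(formula, data, new_data, alpha = 0.1) {
  n <- nrow(data)

  # Randomly split the data: one part to fit the model, the rest to calibrate.
  fit_idx <- sample(n, size = floor(n / 2))
  fit_set <- data[fit_idx, , drop = FALSE]
  cal_set <- data[-fit_idx, , drop = FALSE]

  # Assumption: lm() is used only to keep the sketch dependency-free.
  fit <- lm(formula, data = fit_set)

  # Conformity scores: absolute residuals on the calibration part.
  outcome <- all.vars(formula)[1]
  scores <- abs(cal_set[[outcome]] - predict(fit, newdata = cal_set))

  # Finite-sample quantile: the ceiling((m + 1) * (1 - alpha))-th smallest score.
  m <- length(scores)
  k <- ceiling((m + 1) * (1 - alpha))
  q_hat <- if (k > m) Inf else sort(scores)[k]

  pred <- unname(predict(fit, newdata = new_data))
  data.frame(
    .pred = pred,
    .pred_lower = pred - q_hat,
    .pred_upper = pred + q_hat
  )
}

set.seed(123)
data(iris)
holdout_idx <- sample(nrow(iris), size = 37)
conf_train <- iris[-holdout_idx, ]
conf_test <- iris[holdout_idx, ]

intervals <- split_conformal(
  Sepal.Length ~ Sepal.Width,
  data = conf_train,
  new_data = conf_test,
  alpha = 0.1
)
head(intervals)

# Empirical coverage on the held-out rows (nominal target is 1 - alpha)
coverage <- mean(
  conf_test$Sepal.Length >= intervals$.pred_lower &
    conf_test$Sepal.Length <= intervals$.pred_upper
)
coverage
```

Note that this basic version gives every prediction the same interval width, since a single quantile of the calibration residuals is added to and subtracted from each point prediction.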