Prediction intervals with tidymodels, best practices?

What is the best practice for producing prediction intervals (not confidence intervals) for predictions using tidymodels? I would prefer a generalizable approach (or at least one that covers more than just linear regression) that uses simulation methods where appropriate.

I read through these: https://github.com/tidymodels/parsnip/search?q=prediction+intervals&type=issues -- but got a little lost. It seemed like this was rolled into parsnip::predict(), but it also doesn't seem to have the full set of features discussed by Alex Hayes and Max in the initial discussion. I was unclear whether this exists somewhere already or will exist in the future.

I'm also interested in whether there are any good tidy tutorials on prediction intervals that use existing tools (e.g., that take advantage of rsample, broom, etc.).

(Link to a helpful blog post that breaks down the sources of noise for bootstrapped prediction intervals: https://saattrupdan.github.io/2020-03-01-bootstrap-prediction/ -- the code examples are in Python, though.)

This question was posted previously on the R4DS Online Learning Community Slack channel: https://rfordatascience.slack.com/archives/C8JSHANJY/p1601014414012500


Have you looked at tidypredict?
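If I remember correctly, for lm/glm fits tidypredict_interval() returns the interval formula and tidypredict_to_column() can append the bounds directly. Roughly (an untested sketch):

library(tidypredict)

mod <- lm(Sepal.Length ~ Sepal.Width, data = iris)

# Tidy Eval formula for the prediction-interval calculation
tidypredict_interval(mod, interval = 0.90)

# append fit/upper/lower columns to a data frame
tidypredict_to_column(iris, mod, add_interval = TRUE)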


parsnip can produce them for model types that naturally make them.
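For example, a minimal sketch with the lm engine (any engine that registers a "pred_int" module works the same way):

library(tidymodels)

lm_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(Sepal.Length ~ Sepal.Width, data = iris)

# returns a tibble with .pred_lower / .pred_upper columns
predict(lm_fit, new_data = head(iris), type = "pred_int", level = 0.90)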

Otherwise, the approach in the paper cited in the blog post (and its references) is doable. I would not use the 632 method here, since for some models the apparent error rate is 0. I would also use 10-fold CV to get the residuals (instead of re-predicting the training set).

I have some old caret code lying around to do this (only for regression models). However, as the blog shows, you need a large number of bootstrap fits of the model to get good coverage and stability.

A long while back I tried this out at my previous job. There were some issues with the generality of the resampling approach. Imagine a CART tree fit: outside of the data range, the predictions are flat on either side of the distribution of x, and intervals can be really misleading there.
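You can see the flat extrapolation directly (an illustrative snippet, not the old code I mentioned):

library(rpart)

cart <- rpart(Sepal.Length ~ Sepal.Width, data = iris)

# Sepal.Width spans 2.0-4.4 in iris; values outside that range get the same
# prediction as the nearest edge of the data
predict(cart, data.frame(Sepal.Width = c(0, 2, 4.4, 10)))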

I've got a lot going on currently but I'll try to create a gist that does some of this with rsample and parsnip. I won't support it so use it at your own risk. I'll give a link here if/when I have that working.


Thank you!

parsnip can produce them for model types that naturally make them

I am assuming the following is a way to check which parsnip-supported models naturally make prediction intervals:

library(tidyverse)

# parsnip registers its model/engine/mode prediction modules in an internal
# environment; list the "*_predict" entries that include a "pred_int" type
envir <- parsnip::get_model_env()

ls(envir) %>% 
  tibble(name = .) %>% 
  filter(str_detect(name, "_predict")) %>% 
  mutate(prediction_modules = map(name, parsnip::get_from_env)) %>% 
  unnest(prediction_modules) %>% 
  filter(str_detect(type, "pred_int"))
#> # A tibble: 3 x 5
#>   name                 engine mode           type     value           
#>   <chr>                <chr>  <chr>          <chr>    <list>          
#> 1 linear_reg_predict   lm     regression     pred_int <named list [4]>
#> 2 linear_reg_predict   stan   regression     pred_int <named list [4]>
#> 3 logistic_reg_predict stan   classification pred_int <named list [4]>

Created on 2020-09-29 by the reprex package (v0.3.0)

The code above is slightly modified from this GitHub comment.

Yes, that works fine.
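By the way, a quicker check for a single model is parsnip::show_model_info(), which prints the registered fit and prediction modules for each engine:

parsnip::show_model_info("linear_reg")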

I have managed to get a bootstrap working for logistic fits if you are interested.


Yes, please share the link.

The function fit_logistic in https://github.com/alankjackson/Covid/blob/master/app.R is where I do the bootstrap. Hopefully it will make sense. Take a look and let me know if you have questions. It's been a few months, but I remember it being relatively painful to get it working well. Making it robust against failure was the real challenge.

I walked through fit_logistic() in your example. It looks like you are returning bootstrapped confidence intervals, whereas I am interested in prediction intervals.

@Max
Did you have a reference (e.g., a paper) for the topic below?

Outside of the data range, the predictions are flat on either side of the distribution of x and intervals can be really misleading there.

No; it's just the nature of that type of model. In 1D, it is a set of step functions.

@Max I got a little lost trying to work through the example in the paper / blog post I'd shared :sweat_smile: (https://saattrupdan.github.io/2020-03-01-bootstrap-prediction/ ) and may have messed up trying to encode it... :-/

  • I stuck with bootstrapping for the model-variance component
  • I used cross-validation for the variance of the residuals (per your suggestion), so I skipped the 632-style adjustment between training and validation error

Here is a quick (attempted) example with decision trees and a 90% prediction interval:

library(tidyverse)
library(tidymodels)

set.seed(123)

iris <- as_tibble(iris)
split <- initial_split(iris)

train <- training(split)
test <- testing(split)

# model variance: refit on bootstrap resamples and track how far each test
# row's prediction moves from its mean across resamples (note: sqrt(nrow(test))
# is only ~6 resamples; see the discussion below about needing far more)
model_variance <- rsample::bootstraps(train, times = round(sqrt(nrow(test)))) %>%
  mutate(mod = map(splits, ~ rpart::rpart(Sepal.Length ~ Sepal.Width, data = analysis(.x))),
         test_preds = map(mod, ~ tibble(
           pred = predict(.x, test),
           index = seq_len(nrow(test))
         ))) %>%
  unnest(test_preds) %>%
  group_by(index) %>%
  mutate(m = pred - mean(pred)) %>%
  select(index, m, pred)

# residual variance: collect out-of-fold residuals from 10-fold cross-validation
sampling_variance <- rsample::vfold_cv(train, v = 10) %>%
  mutate(
    mod = map(splits, ~ rpart::rpart(Sepal.Length ~ Sepal.Width, data = analysis(.x))),
    resids = map2(
      mod,
      splits,
      ~ predict(.x, assessment(.y)) - assessment(.y)$Sepal.Length
    )
  ) %>%
  select(resids) %>%
  unnest(resids) 

# cross every bootstrap deviation (m) with every CV residual, add both back to
# the point prediction, and take the 5th / 95th percentiles per test row
tidyr::crossing(model_variance, sampling_variance) %>% 
  mutate(c = m + resids + pred) %>% 
  group_by(index) %>% 
  summarise(q_05 = quantile(c, 0.05),
            q_95 = quantile(c, 0.95)) %>% 
  bind_cols(test, rpart::rpart(Sepal.Length ~ Sepal.Width, data = train) %>% predict(test) %>% tibble(.pred = .)) %>% 
  ggplot(aes(x = Sepal.Width))+
  geom_point(aes(y = Sepal.Length))+
  geom_line(aes(y = q_05), colour = "red")+
  geom_line(aes(y = q_95), colour = "red")+
  geom_line(aes(y = .pred), colour = "blue")+
  ylim(c(3, 9))

Created on 2021-02-12 by the reprex package (v0.3.0)

I could functionalize this and edit it so it could take in a parsnip model specification and a recipe (but I just wanted to get down a first attempt, as I wasn't sure if I was doing it correctly... :face_with_head_bandage:).

Another resource I found is this simpler approach, which essentially just uses randomized residuals (I imagine you would recommend an alternative method for sampling the residuals here, instead of pulling them from the model's training/analysis set) -- though I'm not sure if this method is only appropriate for linear models...? There is also a similar approach on Cross Validated that uses variance-adjusted residuals (again, I assume it's confined to linear models, and in this case it relies on a measure of influence).
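To make sure I'm describing that first idea correctly, here is a minimal sketch of the randomized-residuals approach (reusing train / test from my reprex above; the 2000 draws and object names are just for illustration):

simple_mod <- lm(Sepal.Length ~ Sepal.Width, data = train)
point_preds <- predict(simple_mod, test)

# simulate outcomes by adding resampled in-sample residuals to each point
# prediction, then take quantiles per test row
map_dfr(seq_len(2000),
        ~ tibble(index = seq_along(point_preds),
                 sim = point_preds + sample(residuals(simple_mod),
                                            length(point_preds),
                                            replace = TRUE))) %>%
  group_by(index) %>%
  summarise(q_05 = quantile(sim, 0.05),
            q_95 = quantile(sim, 0.95))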

@brshallo I'm conflicted because, TBH, I'm not at all sold on this method. I definitely think that you need thousands of resamples to do this. I have some results below showing that the bands seem to be narrower than the ones we can compute analytically.

I do have some code for regression models but it's still not ready to be public. I'll try to put it in a gist soon.

Example:

library(tidyverse)
library(tidymodels)

set.seed(123)

iris <- as_tibble(iris)
split <- initial_split(iris)

train <- training(split)
test <- testing(split)

model_variance <- rsample::bootstraps(train, times = 5000) %>%
  mutate(mod = map(splits, ~ lm(Sepal.Length ~ Sepal.Width, data = analysis(.x))),
         test_preds = map(mod, ~ tibble(
           pred = predict(.x, test),
           index = seq_len(nrow(test))
         ))) %>%
  unnest(test_preds) %>%
  group_by(index) %>%
  mutate(m = pred - mean(pred)) %>%
  select(index, m, pred)

sampling_variance <- rsample::vfold_cv(train, v = 10) %>%
  mutate(
    mod = map(splits, ~ lm(Sepal.Length ~ Sepal.Width, data = analysis(.x))),
    resids = map2(
      mod,
      splits,
      ~ predict(.x, assessment(.y)) - assessment(.y)$Sepal.Length
    )
  ) %>%
  select(resids) %>%
  unnest(resids)

lm_mod <- lm(Sepal.Length ~ Sepal.Width, data = iris)
analytical <- 
  predict(lm_mod, newdata = test, interval = "prediction") %>% 
  tibble::as_tibble() %>% 
  bind_cols(test)

tidyr::crossing(model_variance, sampling_variance) %>% 
  mutate(c = m + resids + pred) %>% 
  group_by(index) %>% 
  summarise(q_05 = quantile(c, 0.05),
            q_95 = quantile(c, 0.95)) %>% 
  bind_cols(test, lm(Sepal.Length ~ Sepal.Width, data = train) %>% predict(test) %>% tibble(.pred = .)) %>% 
  ggplot(aes(x = Sepal.Width)) +
  geom_point(aes(y = Sepal.Length)) +
  geom_line(aes(y = q_05), colour = "red") +
  geom_line(aes(y = q_95), colour = "red") +
  geom_line(aes(y = .pred), colour = "blue") + 
  geom_line(data = analytical, aes(y = lwr), col = "red", lty = 2) + 
  geom_line(data = analytical, aes(y = upr), col = "red", lty = 2) + 
  ylim(c(3, 9))

Created on 2021-02-12 by the reprex package (v1.0.0.9000)

I'm interested to see what you put together! Also, I just wanted to point out that some of the difference in your example is due to me using a 90% prediction interval rather than 95%, the default level for predict() (sorry for using a weird alpha :sweat_smile:).

If you change the quantile() arguments to 0.025 and 0.975 respectively, you'll get:
[plot: the 95% bootstrapped intervals overlaid on the analytical lm() intervals]

Ah ok. My bad. Looks better

Thanks for the reply and the link! I did check there, but it seemed to cover only a parametric approach, and only for linear models:


[screenshot of the bottom of the linked page]

I would be interested in an example that uses a simulation-based approach, as well as an example for a non-linear model.
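(For the simulation-based part, I suppose the "stan" engine rows in the pred_int table earlier already qualify -- I believe parsnip derives those intervals from rstanarm's posterior predictive draws. For example, assuming rstanarm is installed:

library(tidymodels)

stan_fit <- linear_reg() %>%
  set_engine("stan") %>%
  fit(Sepal.Length ~ Sepal.Width, data = iris)

predict(stan_fit, new_data = head(iris), type = "pred_int", level = 0.90)

Though that still doesn't cover arbitrary non-linear models.)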

Did you have a reference for this topic?