I've got a scenario where I've trained a glmnet model, but have a stakeholder wanting to see the confidence interval around the prediction. glmnet doesn't provide this, and there are a few good stack exchange & github threads about why this is the case for regularized models. I understand why this isn't available for an individual model, but this brings up a few questions:
Would it be a valid approach to train an ensemble of models (whose only difference is the seed for random initialization) & report the mean of the models' predictions as the predicted value (my thought is that this is effectively similar to a random forest, but with a forest of regularized linear models rather than decision trees).
If so, is it valid to use the predictions as a distribution that I can calculate statistical properties (e.g., sd, p < 0.025, p > 0.975)?
Finally, if all the above is true, is this method extensible to other model types that similarly don't offer intervals around the predictions?
*I should also note that (if implemented), the ensemble would be trained on bootstrapped datasets.
Here's an example on a toy dataset:
# libraries ----
library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#> method from
#> required_pkgs.model_spec parsnip
library(pins)
#> Warning: package 'pins' was built under R version 4.1.1
# register board ----
board <- board_temp()
# model setup ----
cars_model <- mtcars %>% as_tibble()
cars_model <-
cars_model %>%
mutate(across(.cols = c(cyl, vs, am, gear, carb), as.character))
# splits
cars_split <- initial_split(cars_model)
cars_train <- training(cars_split)
cars_test <- testing(cars_split)
# basic preprocessing
cars_rec <-
recipe(qsec ~ ., data = cars_train) %>%
step_other(all_nominal_predictors()) %>%
step_dummy(all_nominal_predictors())
# utilize a mixed ridge/lasso model
cars_spec <-
linear_reg(penalty = 0.01,
mixture = 0.5) %>%
set_engine("glmnet")
# setup workflow for fitting
cars_wf <-
workflow() %>%
add_recipe(cars_rec) %>%
add_model(cars_spec)
# setup bootstraps for fitting
set.seed(123)
cars_boot <- bootstraps(cars_train)
# save models to board ----
# pin 25 models to the board
for (i in 1:25) {
# get training data from the bootstrap
temp_train <- analysis(cars_boot$splits[[i]])
# add new model to board
board %>%
pin_write(cars_wf %>% fit(temp_train),
paste0("model", i))
}
#> Guessing `type = 'rds'`
#> Creating new version '20211209T234500Z-a57ba'
#> Writing to pin 'model1'
#> Guessing `type = 'rds'`
#> Creating new version '20211209T234500Z-3356f'
#> Writing to pin 'model2'
#> Guessing `type = 'rds'`
#> Creating new version '20211209T234500Z-28ce0'
#> Writing to pin 'model3'
#> Guessing `type = 'rds'`
#> Creating new version '20211209T234501Z-e8ece'
#> Writing to pin 'model4'
#> Guessing `type = 'rds'`
#> Creating new version '20211209T234501Z-36134'
#> Writing to pin 'model5'
#> Guessing `type = 'rds'`
#> Creating new version '20211209T234501Z-d0265'
#> Writing to pin 'model6'
#> Guessing `type = 'rds'`
#> Creating new version '20211209T234501Z-9f651'
#> Writing to pin 'model7'
#> Guessing `type = 'rds'`
#> Creating new version '20211209T234502Z-cdc60'
#> Writing to pin 'model8'
#> Guessing `type = 'rds'`
#> Creating new version '20211209T234502Z-f8234'
#> Writing to pin 'model9'
#> Guessing `type = 'rds'`
#> Creating new version '20211209T234502Z-92cd9'
#> Writing to pin 'model10'
#> Guessing `type = 'rds'`
#> Creating new version '20211209T234502Z-9f35d'
#> Writing to pin 'model11'
#> Guessing `type = 'rds'`
#> Creating new version '20211209T234503Z-1119f'
#> Writing to pin 'model12'
#> Guessing `type = 'rds'`
#> Creating new version '20211209T234503Z-b6fda'
#> Writing to pin 'model13'
#> Guessing `type = 'rds'`
#> Creating new version '20211209T234503Z-1b5e1'
#> Writing to pin 'model14'
#> Guessing `type = 'rds'`
#> Creating new version '20211209T234503Z-25349'
#> Writing to pin 'model15'
#> Guessing `type = 'rds'`
#> Creating new version '20211209T234504Z-f0845'
#> Writing to pin 'model16'
#> Guessing `type = 'rds'`
#> Creating new version '20211209T234504Z-97654'
#> Writing to pin 'model17'
#> Guessing `type = 'rds'`
#> Creating new version '20211209T234504Z-c8440'
#> Writing to pin 'model18'
#> Guessing `type = 'rds'`
#> Creating new version '20211209T234504Z-7a9b6'
#> Writing to pin 'model19'
#> Guessing `type = 'rds'`
#> Creating new version '20211209T234505Z-ef3fe'
#> Writing to pin 'model20'
#> Guessing `type = 'rds'`
#> Creating new version '20211209T234505Z-1e960'
#> Writing to pin 'model21'
#> Guessing `type = 'rds'`
#> Creating new version '20211209T234505Z-fe942'
#> Writing to pin 'model22'
#> Guessing `type = 'rds'`
#> Creating new version '20211209T234505Z-c49d4'
#> Writing to pin 'model23'
#> Guessing `type = 'rds'`
#> Creating new version '20211209T234506Z-96c10'
#> Writing to pin 'model24'
#> Guessing `type = 'rds'`
#> Creating new version '20211209T234506Z-07f96'
#> Writing to pin 'model25'
# read & bind predictions ----
# predict on test dataset from 25 pinned models
for (i in 1:25) {
# get results from i-th model
temp_preds <-
board %>%
pin_read(paste0("model", i)) %>%
predict(new_data = cars_test)
# add predictions to wide-format col
if (i == 1) {
model_preds <-
temp_preds
} else {
model_preds <-
model_preds %>%
bind_cols(temp_preds)
}
}
#> New names:
#> * .pred -> .pred...1
#> * .pred -> .pred...2
#> New names:
#> * .pred -> .pred...3
#> New names:
#> * .pred -> .pred...4
#> New names:
#> * .pred -> .pred...5
#> New names:
#> * .pred -> .pred...6
#> New names:
#> * .pred -> .pred...7
#> New names:
#> * .pred -> .pred...8
#> New names:
#> * .pred -> .pred...9
#> New names:
#> * .pred -> .pred...10
#> New names:
#> * .pred -> .pred...11
#> New names:
#> * .pred -> .pred...12
#> New names:
#> * .pred -> .pred...13
#> Warning: There are new levels in a factor: 6
#> New names:
#> * .pred -> .pred...14
#> New names:
#> * .pred -> .pred...15
#> New names:
#> * .pred -> .pred...16
#> New names:
#> * .pred -> .pred...17
#> New names:
#> * .pred -> .pred...18
#> New names:
#> * .pred -> .pred...19
#> New names:
#> * .pred -> .pred...20
#> New names:
#> * .pred -> .pred...21
#> New names:
#> * .pred -> .pred...22
#> New names:
#> * .pred -> .pred...23
#> New names:
#> * .pred -> .pred...24
#> New names:
#> * .pred -> .pred...25
# plot range against truth
cars_test %>%
rowid_to_column() %>%
bind_cols(model_preds) %>%
drop_na() %>%
select(rowid, qsec, starts_with(".pred")) %>%
pivot_longer(starts_with(".pred"),
names_to = "model",
values_to = "prediction") %>%
group_by(rowid) %>%
summarise(.estimate = mean(prediction),
std_dev = sd(prediction)) %>%
riekelib::normal_interval(.estimate, std_dev) %>%
left_join(cars_test %>% rowid_to_column(), by = "rowid") %>%
select(rowid, .estimate, ci_lower, ci_upper, qsec) %>%
rename(.truth = qsec) %>%
pivot_longer(c(.estimate, .truth),
names_to = "metric",
values_to = "value") %>%
ggplot(aes(x = rowid,
y = value)) +
geom_errorbar(aes(ymin = ci_lower,
ymax = ci_upper),
color = "red",
alpha = 0.1,
size = 1) +
geom_point(aes(color = metric),
size = 3,
alpha = 0.5)
#> New names:
#> * .pred...1 -> .pred...13
#> * .pred...2 -> .pred...14
#> * .pred...3 -> .pred...15
#> * .pred...4 -> .pred...16
#> * .pred...5 -> .pred...17
#> * ...
Created on 2021-12-09 by the reprex package (v2.0.1)
system
Closed
December 30, 2021, 10:33pm
3
This topic was automatically closed 21 days after the last reply. New replies are no longer allowed. If you have a query related to it or one of the replies, start a new topic and refer back with a link.