How to create persistent createMultiFolds indices (for repeated resampling) with case weights in caret/recipes

Hello!

I've been trying to create a persistent multi-fold object for repeated CV (built with createMultiFolds) that I can save and come back to later if I want to repeat the entire analysis. I basically want to avoid relying on random seeds and have the folds set in stone. That part isn't hard in itself; the problem starts when I try to use the folds together with case weights.
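For the persistence part on its own, this is roughly what I have in mind. It is only a minimal sketch, assuming that base R's saveRDS()/readRDS() are enough to store the index list; the file name and the use of tempdir() are purely illustrative.

```r
library(caret)

# Two-class subset of iris, as in the full example further down.
df_all <- droplevels(subset(iris, Species != "setosa"))

# Draw the resampling indices once...
set.seed(42)
folds <- createMultiFolds(df_all$Species, k = 5, times = 5)

# ...and store them on disk. "folds.rds" is a hypothetical file name.
folds_path <- file.path(tempdir(), "folds.rds")
saveRDS(folds, folds_path)

# In a later session, read the indices back instead of regenerating them.
folds_restored <- readRDS(folds_path)

# Check that the round trip returned the same object.
identical(folds, folds_restored)
```

The restored list can then be passed to trainControl(index = ...) exactly like a freshly generated one.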

I've experimented with several approaches, including creating a 'case weight' variable role in recipes, but none of them works the way I want. I don't want to use sampling instead, so that isn't an option here.

In the end I thought that, when using multi folds, I might also need to supply the weights as a list with the same structure as the createMultiFolds output, but apparently that's not the case (as you can see in my example).

Please find a reproducible example below. It doesn't work as of now; the error thrown is:

Warning messages:
1: model fit failed for Fold1.Rep1: alpha=0.00, lambda=1 Error in (function (x, y, family = c("gaussian", "binomial", "poisson",  : 
  number of elements in weights (25) not equal to the number of rows of x (80)

Full code below:

set.seed(42)

# Loading libraries -------------------------------------------------------

library(magrittr)
library(tidyverse)
library(tidymodels)
library(caret)  # provides createMultiFolds(), trainControl() and train()
library(dials)
library(furrr)

# Loading input dataset ---------------------------------------------------

df_all <- iris %>% 
  filter(Species != "setosa") %>% 
  mutate(Species = factor(Species, levels = c("versicolor", "virginica")))

# Preparing the recipes ----------------------------------------------------

# I need to add a custom step over here on the missing patterns

en_rec <- df_all %>% 
  recipe(Species ~ .) %>% 
  step_pca(all_predictors(), num_comp = 2)

# Training models ---------------------------------------------------------

folds <- createMultiFolds(df_all$Species, k = 5, times = 5)

ctrl <- trainControl(
  method = "repeatedcv",
  number = 5,
  repeats = 5,
  index = folds,
  verboseIter = TRUE,
  summaryFunction = defaultSummary,
  returnResamp = "final",
  savePredictions = "final"
)

en_grid <- expand.grid(
  alpha = c(0, .25, .50, .75, 1),
  lambda = 10 ^ seq(-4, 0, length = 30)
)

en_model <- train(
  en_rec,
  data = df_all,
  method = "glmnet",
  trControl = ctrl,
  tuneGrid = en_grid,
  weights = map(folds, ~if_else(df_all$Species[.x] == "versicolor", 10, 1))
)
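Looking at the shapes involved makes the numbers in the error message easier to read. The sketch below only inspects the objects from my example. That the "25" in the error is the length of the list I passed as weights is my own reading of the message, not something I have confirmed in caret's source.

```r
library(caret)
library(purrr)
library(dplyr)

# Same two-class data and resampling setup as in the example above.
df_all <- droplevels(subset(iris, Species != "setosa"))
set.seed(42)
folds <- createMultiFolds(df_all$Species, k = 5, times = 5)

# What I passed to train(): one weight vector per resample, collected in a list.
w_list <- map(folds, ~ if_else(df_all$Species[.x] == "versicolor", 10, 1))

# Number of list elements = number of resamples (k * times).
length(w_list)

# Length of each element = number of rows in one training fold.
unique(lengths(w_list))

# For comparison: a plain vector with one weight per row of the full data set.
w_full <- if_else(df_all$Species == "versicolor", 10, 1)
length(w_full) == nrow(df_all)
```

If that reading is right, the fitting function receives the list itself as the weights, so its length is the number of resamples rather than the number of rows in a training fold.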

I was able to get it to work by applying the following modification to the train function:

en_model <- train(
  Species ~ .,
  data = juice(prep(en_rec, retain = TRUE)),
  method = "glmnet",
  trControl = ctrl,
  tuneGrid = en_grid,
  weights = if_else(df_all$Species == "versicolor", 10, 1)
)

So I need to pass a formula together with the already processed training set (prepped and juiced), and then a plain weights vector works as expected. Isn't that a bit of a 'hacky' way of doing it?

It also means I can't take full advantage of recipes, for example by keeping columns with other roles that are not used in training. I have to exclude those columns explicitly beforehand.