Thanks Max. Another follow-up if you don't mind:
I'm trying to use workflow_set() to create a set of workflows and workflow_map() to run resampling on each of them. I have the following code:
library(tidymodels)
library(ranger)
# Split into training and test sets
set.seed(123)
data.split <- initial_split(mtcars, prop=.75)
train_data <- training(data.split)
test_data <- testing(data.split)
# Create a linear model
lm_model <- linear_reg() %>% set_engine("lm")
# Create a random forest model
rf_model <- rand_forest(trees = 1000) %>% set_engine("ranger") %>% set_mode("regression")
# Create recipes
recipe1 <- recipe(mpg ~ ., data = train_data) %>% step_dummy(all_nominal_predictors())
recipe2 <- recipe(mpg ~ wt, data = train_data) %>% step_dummy(all_nominal_predictors())
recipe3 <- recipe(mpg ~ wt + hp, data = train_data) %>% step_dummy(all_nominal_predictors())
# Create lists of models and recipes
model_list <- list(rand_forest = rf_model, lm = lm_model)
recipe_list <- list(all = recipe1, wt_only = recipe2, wt_hp = recipe3)
# Create workflows
workflows_combo <- workflow_set(preproc = recipe_list, models = model_list, cross = TRUE)
# Creating folds using original, unprocessed training data
cv_data <- vfold_cv(train_data, v = 10)
# Resample on each workflow object
resampling_result <- workflows_combo %>%
  workflow_map("fit_resamples", seed = 1101, verbose = TRUE, resamples = cv_data)
collect_metrics(resampling_result) %>% filter(.metric == "rsq")
collect_metrics(resampling_result) %>% filter(.metric == "rmse")
As you can see, I created the folds near the end, after building the workflows with workflow_set() and just before running workflow_map(). When I ran workflow_map(), I got this message: "Fold05: internal: A correlation computation is required, but estimate is constant and has 0 standard deviation, resulting in a divide by 0 error. NA will be returned." I Googled the warning and found a GitHub discussion involving you and others, but I'm not sure I understood it.
That said, when I ran the same code with one small change, creating the folds right after splitting the data into training and test sets and before everything else, the warning no longer appeared.
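For reference, here is a minimal sketch of that reordered version. It assumes the only change is moving the vfold_cv() call up to just after the split; the model, recipe, workflow_set() and workflow_map() steps are the same as in the original code and are left out here.

```r
library(tidymodels)

# Split into training and test sets (same seed and proportion as before)
set.seed(123)
data.split <- initial_split(mtcars, prop = .75)
train_data <- training(data.split)
test_data <- testing(data.split)

# Create the folds right after the split, before the models,
# recipes, and workflows are defined
cv_data <- vfold_cv(train_data, v = 10)

# ... model specs, recipes, workflow_set(), and workflow_map()
# follow here, unchanged from the original code ...
```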
Would you mind explaining why it matters when the folds are created? And also what is the "correct" procedure that one should follow to get accurate results? Thanks so much!