I am trying to make sure I understand cross-validation correctly and know what goes on under the hood when running the fit_resamples function in R. Here are my questions:
Is it correct that once the data have been processed in some way (normalized, standardized, log-transformed, etc.), the cross-validation folds should be created from the processed data rather than from the original data?
I have the following code:
library(tidymodels)

main_data <- mtcars

# Split into training and test sets
set.seed(123)
data.split <- initial_split(main_data, prop = .75)
train_data <- training(data.split)
test_data <- testing(data.split)

# Create a linear model
lm_model <- linear_reg() %>%
  set_engine("lm")

# Create a recipe from original data
main_recipe <- recipe(mpg ~ ., data = train_data) %>%
  step_normalize(all_numeric()) %>%
  step_dummy(all_nominal_predictors())

# Putting everything into a workflow
main_workflow <- workflow() %>%
  add_model(lm_model) %>%
  add_recipe(main_recipe)

# Create processed training data
processed_data <- main_recipe %>%
  prep() %>%
  juice()

# Create the folds
cv_data <- vfold_cv(processed_data, v = 10)

# Operation 1
resample_1 <- main_workflow %>%
  fit_resamples(cv_data)

# Operation 2
resample_2 <- lm_model %>%
  fit_resamples(main_recipe, cv_data)

# Operation 3
resample_3 <- lm_model %>%
  fit_resamples(mpg ~ ., cv_data)

collect_metrics(resample_1)
collect_metrics(resample_2)
collect_metrics(resample_3)

# Alternatively: creating folds using original, unprocessed training data
cv_data2 <- vfold_cv(train_data, v = 10)

# Operation 4
resample_4 <- lm_model %>%
  fit_resamples(mpg ~ ., cv_data2)

collect_metrics(resample_4)
In this code, I created the folds from the processed data (processed_data) and stored them in an object named cv_data. When I looked at the metrics, the three operations gave me exactly the same numbers. Are any of those three operations valid for evaluating the model's performance using resampling? Is there a way to do it without having to prep and bake (or juice) the original training data first?
In Operation 4, I used the original, unprocessed data to create the folds and got different results from the first three. I include it for illustration only.
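To make the last question concrete, here is a minimal sketch of the variant I have in mind: the folds are built from the unprocessed training data, and the recipe stays inside the workflow instead of being prepped beforehand. The object names raw_recipe, raw_workflow, cv_raw and resample_raw are my own and only for illustration. My understanding, which I would like confirmed, is that fit_resamples then estimates the recipe on the analysis portion of each fold and applies it to the held-out portion.

```r
library(tidymodels)

# Same split as in the code above
set.seed(123)
data_split <- initial_split(mtcars, prop = .75)
train_data <- training(data_split)

# Same model and recipe as above, but the recipe is never prepped by hand
lm_model <- linear_reg() %>%
  set_engine("lm")

raw_recipe <- recipe(mpg ~ ., data = train_data) %>%
  step_normalize(all_numeric()) %>%
  step_dummy(all_nominal_predictors())

raw_workflow <- workflow() %>%
  add_model(lm_model) %>%
  add_recipe(raw_recipe)

# Folds come from the original, unprocessed training data
set.seed(123)
cv_raw <- vfold_cv(train_data, v = 10)

# The workflow carries the recipe, so no prep()/bake()/juice() call is needed here
resample_raw <- raw_workflow %>%
  fit_resamples(resamples = cv_raw)

collect_metrics(resample_raw)
```

This differs from Operation 4 only in that the recipe is applied within each resample rather than skipped altogether.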