How do I go about re-training my model to the entire dataset?

I'm a little confused about where exactly re-training my model comes in (a bit new at this).

I have tuning, workflow, and cross-validation steps set up (reprex below, adapted from a tutorial by Julia Silge).

Suppose that after comparing observed values to the predictions from my tuned model, I'm happy with the results. How do I then go back and re-train the model on all the data so it's ready to be saved/deployed?

library(tidymodels)
library(tidypredict)

tidymodels_prefer()

#Initial split, generate training and testing
mysplit <- initial_split(iris %>% select(-Species), strata=Petal.Width)
training_set <- training(mysplit)
test_set <- testing(mysplit)


#Set up the model specification
#The hyperparameters will be tuned
xgb_spec <- boost_tree(
  trees = 1000,
  tree_depth = tune(),
  min_n = tune(),
  loss_reduction = tune(),                    
  sample_size = tune(),
  mtry = tune(),   
  learn_rate = tune()                          
) %>%
  set_engine("xgboost") %>%
  set_mode("regression")


#Set up a space-filling grid design to cover the hyperparameter space as well as possible
xgb_grid <- grid_latin_hypercube(
  tree_depth(),
  min_n(),
  loss_reduction(),
  sample_size = sample_prop(),
  finalize(mtry(), training_set), #gets treated differently b/c it depends on actual # of predictors in data
  learn_rate(),
  size = 30
)

#Put the model specification into a workflow
xgb_wf <- workflow() %>%
  add_formula(Petal.Width ~.) %>% 
  add_model(xgb_spec)


#Create cross-validation resamples for tuning the model
input_folds <- vfold_cv(training_set, strata=Petal.Width)


#Use tunable workflow to tune
doParallel::registerDoParallel()
xgb_res <- tune_grid(
  xgb_wf,
  resamples = input_folds,
  grid = xgb_grid,
  control = control_grid(save_pred = TRUE)
)


#Select the best parameters based on RMSE
best_rmse <- select_best(xgb_res, metric = "rmse")


#Finalize the tuneable workflow using the best parameters
final_xgb <- finalize_workflow(
  xgb_wf,
  best_rmse
)

#############
#Fit the final best model to training set and evaluate the test set
final_res <- last_fit(final_xgb, mysplit)
#############


#Get the model-predicted values of the test set
pred_df <- 
  final_res %>%
  collect_predictions() %>%
  as.data.frame()

My understanding is that a model trained on the whole dataset risks overfitting at any stage.

Do you mean on the entire training set (as opposed to the resampled models) or on the training and testing sets combined?

For the former, it's in final_res and you can get it with extract_workflow(final_res).
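For example (a minimal sketch using the reprex objects above; the saveRDS() path is just a placeholder):

#the workflow that last_fit() trained on the training set
fitted_wf <- extract_workflow(final_res)

#it can be used for prediction or saved for deployment
predict(fitted_wf, new_data = head(test_set))
saveRDS(fitted_wf, "final_xgb_workflow.rds")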

For the latter, I'd advise not doing that, since all of your performance metrics were estimated for the model fit on the training set. You could, but there is a risk of introducing bias into your model. If you do, then finalize the model (or workflow) with the tuning parameter values (see the finalize_*() functions) and then run fit() on a data frame with all of the data.
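If you go that route, a rough sketch (reusing the reprex objects, and assuming the full data frame is iris without Species as above) might look like:

#final_xgb already carries the tuned parameter values,
#so re-training on every row is just a fit() call on the full data frame
full_data <- iris %>% select(-Species)
whole_data_fit <- fit(final_xgb, data = full_data)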

Hi Max,

Thanks for responding here. My understanding is that in the field, a (common? standard?) practice is to train your final model on all your data (like point #5 in the top comment of this post).

We'd be estimating the performance of the final model from the building stage that used the training and test sets, since we couldn't measure its actual performance: once the model has been trained on everything, there is no held-out ground-truth data left.

It seems as though the potential bias is a commonly accepted trade-off for the benefit of training the model on substantially more data (with the tidymodels default split, the test set is 25% of the data, so refitting on everything gives you that 25% back).
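As a quick sanity check on that split (using the reprex objects; initial_split() defaults to prop = 3/4):

nrow(training_set) / nrow(iris)  #~0.75
nrow(test_set) / nrow(iris)      #~0.25 -- the rows you'd gain back by refitting on everything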

You're the expert here, so I may be totally wrong about all this! I'm just trying to convey my interpretation from reading various sources.

In the case of this reprex, the initial split was done like this:
mysplit <- initial_split(iris %>% select(-Species), strata=Petal.Width)

And the final best model was fit with:
final_res <- last_fit(final_xgb, mysplit)

...So if it actually is a good idea to re-train on everything, I'm not sure whether I would just need to do this:

final_res2 <- last_fit(final_xgb, iris %>% select(-Species))

or instead do something using the finalize_*() functions.
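If the finalize_*() route is the right one, my guess at what it would look like (reusing the reprex objects, so xgb_wf and best_rmse are assumed) is:

full_data <- iris %>% select(-Species)
final_wf  <- finalize_workflow(xgb_wf, best_rmse)  #same object as final_xgb above
whole_fit <- fit(final_wf, data = full_data)       #re-train on every row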

Edit: if the best practice actually is to just use the final model here, it would certainly be easiest to leave things as-is, and I'm happy to do that! I'm learning a lot using tidymodels. Tagging @julia in case she wants to provide insight.
