I'm a little confused about where exactly re-training my model comes in (I'm a bit new at this).
I have tuning, workflow, and cross-validation steps (reprex below, adapted from a tutorial by Julia Silge).
Once I've compared the observed values to the predictions from my tuned model, suppose I'm happy with the results. How do I then go back and re-train the model on all the data so it's ready to be saved/deployed?
library(tidymodels)
library(tidypredict)
tidymodels_prefer()
#Initial split, generate training and testing
mysplit <- initial_split(iris %>% select(-Species), strata=Petal.Width)
training_set <- training(mysplit)
test_set <- testing(mysplit)
#Set up the model specification
#The hyperparameters will be tuned
xgb_spec <- boost_tree(
  trees = 1000,
  tree_depth = tune(),
  min_n = tune(),
  loss_reduction = tune(),
  sample_size = tune(),
  mtry = tune(),
  learn_rate = tune()
) %>%
  set_engine("xgboost") %>%
  set_mode("regression")
#Set up a space-filling grid design to cover the hyperparameter space as well as possible
xgb_grid <- grid_latin_hypercube(
  tree_depth(),
  min_n(),
  loss_reduction(),
  sample_size = sample_prop(),
  finalize(mtry(), training_set), #gets treated differently b/c it depends on actual # of predictors in data
  learn_rate(),
  size = 30
)
#Put the model specification into a workflow
xgb_wf <- workflow() %>%
  add_formula(Petal.Width ~ .) %>%
  add_model(xgb_spec)
#Create cross-validation resamples for tuning the model
input_folds <- vfold_cv(training_set, strata=Petal.Width)
#Use tunable workflow to tune
doParallel::registerDoParallel()
xgb_res <- tune_grid(
  xgb_wf,
  resamples = input_folds,
  grid = xgb_grid,
  control = control_grid(save_pred = TRUE)
)
#Select the best parameters based on RMSE
#(recent versions of tune require `metric` to be passed by name)
best_rmse <- select_best(xgb_res, metric = "rmse")
#Finalize the tuneable workflow using the best parameters
final_xgb <- finalize_workflow(
  xgb_wf,
  best_rmse
)
#############
#Fit the final best model to training set and evaluate the test set
final_res <- last_fit(final_xgb, mysplit)
#############
#Get the model-predicted values of the test set
pred_df <-
  final_res %>%
  collect_predictions() %>%
  as.data.frame()
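
To make the question concrete, here is a minimal sketch of the step I think I'm missing, assuming that calling `fit()` on the finalized workflow with the full data set is the right way to refit. So that the snippet runs on its own, `final_wf` is a stand-in for `final_xgb` from the reprex, and its hyperparameter values are made up for illustration, not tuned results. The names `full_data`, `final_wf`, `deploy_fit`, and `new_preds` are my own.

```r
library(tidymodels)

# All rows, with the same columns the reprex uses (Species dropped)
full_data <- iris %>% select(-Species)

# Stand-in for `final_xgb`: in the reprex this comes from finalize_workflow().
# The hyperparameter values here are illustrative assumptions, not tuned values.
final_wf <- workflow() %>%
  add_formula(Petal.Width ~ .) %>%
  add_model(
    boost_tree(trees = 100, tree_depth = 3, learn_rate = 0.1) %>%
      set_engine("xgboost") %>%
      set_mode("regression")
  )

# Refit the finalized workflow on every available row (training + test)
deploy_fit <- fit(final_wf, data = full_data)

# The fitted workflow can predict directly on a new data frame
new_preds <- predict(deploy_fit, new_data = head(full_data))
```

In the reprex itself this would presumably reduce to `fit(final_xgb, data = iris %>% select(-Species))`. Note that after this refit there is no held-out data left, so the test-set results from `last_fit()` would be the only performance estimate for the deployed model. I have not confirmed what the safest way to save the resulting xgboost fit to disk is.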