Are `prep()`, `bake()`, and `juice()` needed to train/evaluate a model?

I'm confused about when (if at all) I need to use prep(), bake(), and juice(). The case study currently on the main tidymodels page does not use them, but these functions appear in older tidymodels tutorials you can find online.

I'm following the tidymodels case study (condensed here), and it does not use these functions.

# Split the data into training and test sets, stratified on the outcome
splits      <- initial_split(hotels, strata = children)
hotel_other <- training(splits)
hotel_test  <- testing(splits)

# Carve a single validation set out of the training data
val_set <- validation_split(hotel_other, 
                            strata = children, 
                            prop = 0.80)

cores <- parallel::detectCores()

# Random forest specification with tunable hyperparameters
rf_mod <- 
  rand_forest(mtry = tune(), min_n = tune(), trees = 1000) %>% 
  set_engine("ranger", num.threads = cores) %>% 
  set_mode("classification")

# Recipe: derive date and holiday features, then drop the raw date column
rf_recipe <- 
  recipe(children ~ ., data = hotel_other) %>% 
  step_date(arrival_date) %>% 
  step_holiday(arrival_date) %>% 
  step_rm(arrival_date) 

# Bundle the model and recipe into a workflow
rf_workflow <- 
  workflow() %>% 
  add_model(rf_mod) %>% 
  add_recipe(rf_recipe)

# Tune over 25 candidate parameter combinations on the validation set
set.seed(345)
rf_res <- 
  rf_workflow %>% 
  tune_grid(val_set,
            grid = 25,
            control = control_grid(save_pred = TRUE),
            metrics = metric_set(roc_auc))

rf_best <- 
  rf_res %>% 
  select_best(metric = "roc_auc")

# Final model specification with the selected hyperparameters
last_rf_mod <- 
  rand_forest(mtry = 8, min_n = 7, trees = 1000) %>% 
  set_engine("ranger", num.threads = cores, importance = "impurity") %>% 
  set_mode("classification")

last_rf_workflow <- 
  rf_workflow %>% 
  update_model(last_rf_mod)

# Fit on the full training set and evaluate once on the test set
last_rf_fit <- 
  last_rf_workflow %>% 
  last_fit(splits)

Hi @ericpgreen,

prep(), bake(), and juice() are only necessary when you are using recipes to pre-process your data. Even then, the tidymodels/workflows framework calls these functions internally when needed, so you rarely need to call them manually.

In the example you shared, when rf_workflow is fit using tune_grid(...), the rf_recipe is internally prep()'d and juice()'d on the analysis portion of val_set (80% of hotel_other), the model is fit on that processed data, and the trained recipe is then bake()'d on the assessment set before performance metrics are computed. This is all done for you when using workflows.

If you aren't using a recipe for preprocessing (for example, because you've done your preprocessing manually), you can just call fit() on a parsnip model or workflow object and bypass recipe preprocessing altogether.
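For instance, here is a minimal sketch of that approach, fitting a parsnip model directly on the hotel_other data from your example with a formula and no recipe (assuming the tidymodels packages are loaded):

```r
library(parsnip)

# Fit a random forest directly with fit(); no recipe is involved,
# so prep()/bake()/juice() never come into play.
rf_fit <- 
  rand_forest(trees = 1000) %>% 
  set_engine("ranger") %>% 
  set_mode("classification") %>% 
  fit(children ~ ., data = hotel_other)
```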

Optionally, outside of the parsnip/workflows framework, you can still take advantage of recipes by calling prep(), juice(), and bake() manually to pre-process your data, and then fit models using any package/approach you like.
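As a sketch of what that manual route looks like (using the rf_recipe, hotel_other, and hotel_test objects from your example): prep() estimates anything the steps need from the training data, juice() returns the processed training set, and bake() applies the same trained steps to new data.

```r
library(recipes)

# Train the recipe steps on the training data
trained_rec <- prep(rf_recipe, training = hotel_other)

# Processed training data (in newer recipes versions this is
# equivalent to bake(trained_rec, new_data = NULL))
train_processed <- juice(trained_rec)

# Apply the same trained steps to the test data
test_processed <- bake(trained_rec, new_data = hotel_test)
```

The resulting data frames can then be passed to any modeling function, with or without tidymodels.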

I hope this was helpful.


Very helpful, @mattwarkentin. Thanks for taking the time to explain it.


If you would like to read a bit more detail on what these functions do, you can check out this chapter of the Tidy Modeling with R book. I think the schematic in that chapter is helpful.

@mattwarkentin described what happens when you fit or tune a workflow really nicely; like he said, it is handling that preprocessing under the hood for you. If you want to read more about that, it's described here.

It is probably helpful to know how to prep() and bake() if you are going to be a tidymodels user, because you will often want to dig into the internals or troubleshoot problems with recipe preprocessing.
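For example, one way to troubleshoot is to prep() the recipe yourself and inspect what it actually produces (a sketch using the rf_recipe and hotel_other objects defined earlier in this thread):

```r
library(recipes)

# Train the recipe steps so we can look inside
trained_rec <- prep(rf_recipe, training = hotel_other)

# Summarize the trained steps (which columns each step created/removed)
tidy(trained_rec)

# Look at the processed training data the model would actually see
bake(trained_rec, new_data = NULL)
```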


This is great @julia. Thank you. I was not making the connection to what workflow was doing behind the scenes. This clears it up.

1 Like