fit_resamples() prep and outcome transform behavior

I have a couple of clarification questions on how fit_resamples() works.

First, if I pass a workflow with a recipe to fit_resamples(), will it prep the recipe for each fold, or will it prep using the whole training dataset used to define the recipe? I am trying to avoid data leakage, as described here: https://rsample.tidymodels.org/articles/Applications/Recipes_and_rsample.html

Second, if I have an outcome transform, how can I get fit_resamples() to calculate metrics against this transformation? For example, I have a recipe that takes the log of the outcome variable for training, but when fit_resamples() runs predictions, it predicts the log outcome and compares it to the untransformed outcome, so the metrics are not correct.

First, if I pass a workflow with a recipe to fit_resamples(), will it prep the recipe for each fold, or will it prep using the whole training dataset used to define the recipe? I am trying to avoid data leakage, as described here: Recipes with rsample • rsample

The way fit_resamples() works is that, for each resample, it will prep() the recipe using the training/analysis portion of the resample and then use this prepped recipe to bake() the testing/assessment portion of the resample. There won't be any data leakage (i.e., no test data will be used during feature engineering).
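
If it helps to see that behavior outside of fit_resamples(), here is a minimal sketch of the equivalent manual steps for a single resample (the recipe and resamples here are only illustrative):

library(tidymodels)

# a rough manual sketch of what happens for one resample: the recipe is
# estimated (prepped) on the analysis set only, then applied to the
# held-out assessment set
rs  <- vfold_cv(mtcars, v = 5)
rec <- recipe(mpg ~ hp, data = mtcars) %>% step_normalize(hp)

split   <- rs$splits[[1]]
prepped <- prep(rec, training = analysis(split))        # estimated on the analysis portion
baked   <- bake(prepped, new_data = assessment(split))  # applied to the assessment portion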


Second, if I have an outcome transform, how can I get fit_resamples() to calculate metrics against this transformation? For example, I have a recipe that takes the log of the outcome variable for training, but when fit_resamples() runs predictions, it predicts the log outcome and compares it to the untransformed outcome, so the metrics are not correct.

fit_resamples() will bake() the test data before making predictions and computing metrics, so in your example the metrics are based on a comparison between the predictions and the log-transformed outcome. See this example for verification (note that mpg is log-transformed when we glimpse() the data in the final code chunk):

library(tidyverse)
library(tidymodels)

rs <- vfold_cv(mtcars, v = 5)
rs
#> #  5-fold cross-validation 
#> # A tibble: 5 x 2
#>   splits         id   
#>   <list>         <chr>
#> 1 <split [25/7]> Fold1
#> 2 <split [25/7]> Fold2
#> 3 <split [26/6]> Fold3
#> 4 <split [26/6]> Fold4
#> 5 <split [26/6]> Fold5

rec <-
  recipe(mpg ~ hp, data = mtcars) %>% 
  step_log(mpg)  # log-transform the outcome inside the recipe

mm <- 
  linear_reg() %>% 
  set_engine("lm")

fit <- 
  workflow() %>% 
  add_model(mm) %>% 
  add_recipe(rec) %>% 
  fit_resamples(
    resamples = rs,
    control = control_resamples(save_pred = TRUE)  # keep the held-out predictions
  ) 

collect_metrics(fit)
#> # A tibble: 2 x 5
#>   .metric .estimator  mean     n std_err
#>   <chr>   <chr>      <dbl> <int>   <dbl>
#> 1 rmse    standard   0.192     5  0.0351
#> 2 rsq     standard   0.671     5  0.106

collect_predictions(fit) %>% 
  glimpse() %>% 
  group_by(id) %>% 
  rmse(mpg, .pred) %>% 
  summarise(rmse = mean(.estimate))
#> Rows: 32
#> Columns: 4
#> $ id    <chr> "Fold1", "Fold1", "Fold1", "Fold1", "Fold1", "Fold1", "Fold1", …
#> $ .pred <dbl> 2.888880, 2.818527, 2.790386, 3.209690, 3.249087, 2.973304, 2.6…
#> $ .row  <int> 14, 15, 16, 18, 19, 23, 29, 1, 2, 4, 9, 11, 26, 28, 3, 8, 10, 1…
#> $ mpg   <dbl> 2.721295, 2.341806, 2.341806, 3.478158, 3.414443, 2.721295, 2.7…
#> # A tibble: 1 x 1
#>    rmse
#>   <dbl>
#> 1 0.192

Thanks! I encountered a problem when using skip = TRUE, e.g. step_log(all_outcomes(), skip = TRUE), because, as you said, fit_resamples() will bake() the recipe for the assessment set. I could just set skip = FALSE, but I am hoping to keep skip = TRUE so I can apply the same recipe to a test dataset without labels. Is there an elegant way to do this?

You'll want skip = FALSE (the default for step_log()) when doing a resampling analysis; otherwise the metrics will be wrong because predictions on the log scale get compared to natural-scale truths. I almost think that bake() should have an argument that lets it work while ignoring missing outcomes. That would be relevant for a classic train-test-validation setup.
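
A minimal illustration of that mismatch (a hypothetical sketch reusing the mtcars example):

library(tidymodels)

# with skip = TRUE, the log step is not applied when the assessment data are
# baked, so mpg stays on its raw scale there
rec_skip <- recipe(mpg ~ hp, data = mtcars) %>% 
  step_log(mpg, skip = TRUE)

split   <- initial_split(mtcars)
prepped <- prep(rec_skip, training = training(split))

bake(prepped, new_data = testing(split)) %>% 
  select(mpg) %>% 
  head()
# mpg here is still on the original miles-per-gallon scale, while the model's
# predictions would be on the log scale, so any metrics would be distorted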

Perhaps @Max can weigh in on why bake() has to fail when the outcome isn't present in new_data. There may be something I'm overlooking.

Actually, we strongly suggest doing any calculations on the outcome prior to the recipe. There is a good explanation of this in the book.

We follow good practice and isolate the data being predicted from the outcome (if it exists), so bake() is never given the outcome data.

With tune, you can skip the step without error, but then the raw outcome data are not on the same scale as the predictions when performance is evaluated.

It's better just to do a mutate() up-front on the outcome data.
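
For example, something along these lines (a sketch based on the mtcars example above):

library(tidymodels)

# transform the outcome up front, before any resampling or recipe work
cars_logged <- mtcars %>% mutate(mpg = log(mpg))

rs  <- vfold_cv(cars_logged, v = 5)
rec <- recipe(mpg ~ hp, data = cars_logged)  # no outcome-transforming step needed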


Thanks for the clarification, Max. Out of curiosity, are you considering eventually deprecating all_outcomes() and/or throwing an error when LHS variables are included in steps?

No. I don't think that we'll do that.
