fit_resamples() prep and outcome transform behavior

mattwarkentin · October 8, 2020, 5:58pm

First, if I pass a workflow with a recipe to fit_resamples(), will it prep the recipe separately for each fold, or will it prep once using the whole training dataset used to define the recipe? I am trying to avoid data leakage, as described in the rsample article "Recipes with rsample".

For each resample, fit_resamples() will prep() the recipe using the training/analysis portion of that resample, and will then use this prepped recipe to bake() the testing/assessment portion. There won't be any data leakage (i.e., no test data will be used during feature engineering).
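To make the per-resample behavior concrete, here is a minimal sketch of what the equivalent manual steps might look like for a single fold. This is not the internal code of fit_resamples(), just an illustration built from prep(), bake(), analysis(), and assessment(). The step_normalize() step, the seed, and the object names are my own choices for illustration.

```r
library(tidymodels)

set.seed(123)  # arbitrary seed, only so the folds are reproducible
rs <- vfold_cv(mtcars, v = 5)

# Illustrative recipe: normalizing hp requires estimating a mean and sd
rec <-
  recipe(mpg ~ hp, data = mtcars) %>%
  step_normalize(hp)

# Take the first resample and estimate the recipe on its analysis rows only
split <- rs$splits[[1]]
prepped <- prep(rec, training = analysis(split))

# Apply the already-estimated recipe to the held-out assessment rows
baked_assessment <- bake(prepped, new_data = assessment(split))

# The stored mean should match the analysis-set mean, not the mean of all of mtcars
tidy(prepped, number = 1)
mean(analysis(split)$hp)
mean(mtcars$hp)
```

Because the statistics are estimated from analysis(split) alone, the assessment rows never influence the preprocessing.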

Second, if I have an outcome transform, how can I get fit_resamples() to calculate metrics against the transformed outcome? For example, I have a recipe that takes the log of the outcome variable for training, but when fit_resamples() runs predictions, it predicts the log outcome and compares it to the untransformed outcome, so the metrics are not correct.

fit_resamples() will bake() the test data before making predictions and computing metrics. So, in your example, the metrics compare the predictions with the log-transformed outcome. See this example for verification (note that mpg is log-transformed when we glimpse() the data in the final code chunk):

library(tidyverse)
library(tidymodels)

rs <- vfold_cv(mtcars, v = 5)
rs
#> #  5-fold cross-validation 
#> # A tibble: 5 x 2
#>   splits         id   
#>   <list>         <chr>
#> 1 <split [25/7]> Fold1
#> 2 <split [25/7]> Fold2
#> 3 <split [26/6]> Fold3
#> 4 <split [26/6]> Fold4
#> 5 <split [26/6]> Fold5

rec <-
  recipe(mpg ~ hp, data = mtcars) %>% 
  step_log(mpg)

mm <- 
  linear_reg() %>% 
  set_engine("lm")

fit <- 
  workflow() %>% 
  add_model(mm) %>% 
  add_recipe(rec) %>% 
  fit_resamples(
    resamples = rs,
    control = control_resamples(save_pred = TRUE)
  ) 

collect_metrics(fit)
#> # A tibble: 2 x 5
#>   .metric .estimator  mean     n std_err
#>   <chr>   <chr>      <dbl> <int>   <dbl>
#> 1 rmse    standard   0.192     5  0.0351
#> 2 rsq     standard   0.671     5  0.106

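# Recompute RMSE per fold from the saved predictions, then average across folds;
# mpg in these predictions is already on the log scale
collect_predictions(fit) %>% 
  glimpse() %>% 
  group_by(id) %>% 
  rmse(mpg, .pred) %>% 
  summarise(rmse = mean(.estimate))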
#> Rows: 32
#> Columns: 4
#> $ id    <chr> "Fold1", "Fold1", "Fold1", "Fold1", "Fold1", "Fold1", "Fold1", …
#> $ .pred <dbl> 2.888880, 2.818527, 2.790386, 3.209690, 3.249087, 2.973304, 2.6…
#> $ .row  <int> 14, 15, 16, 18, 19, 23, 29, 1, 2, 4, 9, 11, 26, 28, 3, 8, 10, 1…
#> $ mpg   <dbl> 2.721295, 2.341806, 2.341806, 3.478158, 3.414443, 2.721295, 2.7…
#> # A tibble: 1 x 1
#>    rmse
#>   <dbl>
#> 1 0.192
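If you would rather report metrics on the original mpg scale, one possible approach is to back-transform the saved predictions yourself and recompute the metric. The sketch below assumes the same recipe and model as in the example above. The seed and the names res, mpg_orig, .pred_orig, and rmse_orig are my own for illustration. Note that exp() of a prediction made on the log scale is not the same as a prediction of the mean on the original scale, so treat this as a rough check.

```r
library(tidymodels)

set.seed(123)  # arbitrary seed, only so the folds are reproducible
rs <- vfold_cv(mtcars, v = 5)

rec <-
  recipe(mpg ~ hp, data = mtcars) %>%
  step_log(mpg)

res <-
  workflow() %>%
  add_model(linear_reg() %>% set_engine("lm")) %>%
  add_recipe(rec) %>%
  fit_resamples(
    resamples = rs,
    control = control_resamples(save_pred = TRUE)
  )

# Undo the log on both the truth and the predictions,
# then compute RMSE per fold and average across folds
rmse_orig <-
  collect_predictions(res) %>%
  mutate(
    mpg_orig = exp(mpg),
    .pred_orig = exp(.pred)
  ) %>%
  group_by(id) %>%
  rmse(truth = mpg_orig, estimate = .pred_orig) %>%
  summarise(rmse = mean(.estimate))

rmse_orig
```

The resulting RMSE is in miles per gallon rather than log units, so it will not be directly comparable to the value from collect_metrics().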