First, if I pass a workflow with a recipe to fit_resamples(), will it prep the recipe for each fold, or will it prep using the whole training dataset used to define the recipe? I am trying to avoid data leakage, as described in the rsample article "Recipes with rsample".
The way fit_resamples() works is that for each resample it will prep() the recipe using the training/analysis portion of the resample, and will use this prep'd recipe to bake() the testing/assessment portion of the resample. There won't be any data leakage (i.e. no test data will be used during feature engineering).
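As a rough illustration of the per-resample behaviour described above, the same prep()/bake() sequence can be run by hand on a single split. This is only a sketch, not what fit_resamples() literally executes internally; the seed, the object names, and the use of step_normalize() (chosen because its estimated mean is easy to inspect) are assumptions made for the example.

```r
library(tidymodels)

set.seed(123)  # assumption: any seed works, it only fixes the fold assignment
folds <- vfold_cv(mtcars, v = 5)

# A recipe with a step that has to estimate something from data
rec <-
  recipe(mpg ~ hp, data = mtcars) %>%
  step_normalize(hp)

# Take the first resample as an example
split <- folds$splits[[1]]

# Estimate the step using the analysis rows only
prepped <- prep(rec, training = analysis(split))

# Apply the already-estimated step to the assessment rows
baked_assessment <- bake(prepped, new_data = assessment(split))

# The centering value stored in the recipe matches the analysis-set mean,
# so the assessment rows played no part in estimating it
hp_stats <- tidy(prepped, number = 1)
hp_mean_from_recipe <- hp_stats$value[hp_stats$statistic == "mean"]
all.equal(hp_mean_from_recipe, mean(analysis(split)$hp))
```

Repeating this for every element of folds$splits gives one prepped recipe per fold, each estimated without its own assessment rows.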
Second, if I have an outcome transformation, how can I get fit_resamples() to calculate metrics against the transformed outcome? For example, I have a recipe that takes the log of the outcome variable for training. When fit_resamples() runs predictions, it seems to predict the log outcome and compare it to the untransformed outcome, so the metrics are not correct.
fit_resamples() will bake() the test data before making predictions and computing metrics. So the metrics are based on a comparison between the predictions and log-transformed outcome, in your example. See this example for verification (note that mpg is log-transformed when we glimpse() the data in the final code chunk):
library(tidyverse)
library(tidymodels)
rs <- vfold_cv(mtcars, v = 5)
rs
#> # 5-fold cross-validation
#> # A tibble: 5 x 2
#>   splits         id
#>   <list>         <chr>
#> 1 <split [25/7]> Fold1
#> 2 <split [25/7]> Fold2
#> 3 <split [26/6]> Fold3
#> 4 <split [26/6]> Fold4
#> 5 <split [26/6]> Fold5
rec <-
  recipe(mpg ~ hp, data = mtcars) %>%
  step_log(mpg)
mm <-
  linear_reg() %>%
  set_engine("lm")
fit <-
  workflow() %>%
  add_model(mm) %>%
  add_recipe(rec) %>%
  fit_resamples(
    resamples = rs,
    control = control_resamples(save_pred = TRUE)
  )
collect_metrics(fit)
#> # A tibble: 2 x 5
#>   .metric .estimator  mean     n std_err
#>   <chr>   <chr>      <dbl> <int>   <dbl>
#> 1 rmse    standard   0.192     5  0.0351
#> 2 rsq     standard   0.671     5  0.106
collect_predictions(fit) %>%
  glimpse() %>%
  group_by(id) %>%
  rmse(mpg, .pred) %>%
  summarise(rmse = mean(.estimate))
#> Rows: 32
#> Columns: 4
#> $ id    <chr> "Fold1", "Fold1", "Fold1", "Fold1", "Fold1", "Fold1", "Fold1", …
#> $ .pred <dbl> 2.888880, 2.818527, 2.790386, 3.209690, 3.249087, 2.973304, 2.6…
#> $ .row  <int> 14, 15, 16, 18, 19, 23, 29, 1, 2, 4, 9, 11, 26, 28, 3, 8, 10, 1…
#> $ mpg   <dbl> 2.721295, 2.341806, 2.341806, 3.478158, 3.414443, 2.721295, 2.7…
#> # A tibble: 1 x 1
#>    rmse
#>   <dbl>
#> 1 0.192
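If metrics on the original mpg scale are wanted instead, one possible approach (a sketch, not a built-in fit_resamples() option) is to save the predictions and back-transform both columns with exp() before calling rmse(). The seed and the names wf, res, and orig_scale are assumptions for illustration, and the resulting numbers will depend on the folds drawn.

```r
library(tidymodels)

set.seed(2020)  # assumption: arbitrary seed, only fixes the folds
rs <- vfold_cv(mtcars, v = 5)

# Same model and recipe as in the example above
wf <-
  workflow() %>%
  add_model(linear_reg() %>% set_engine("lm")) %>%
  add_recipe(recipe(mpg ~ hp, data = mtcars) %>% step_log(mpg))

res <-
  fit_resamples(
    wf,
    resamples = rs,
    control = control_resamples(save_pred = TRUE)
  )

# Both mpg and .pred come back on the log scale, so undo the log on each
# before computing the per-fold RMSE
orig_scale <-
  collect_predictions(res) %>%
  mutate(mpg = exp(mpg), .pred = exp(.pred)) %>%
  group_by(id) %>%
  rmse(truth = mpg, estimate = .pred)

# Average over folds, mirroring what collect_metrics() reports
summarise(orig_scale, rmse = mean(.estimate))
```

Note that exponentiating a prediction made on the log scale does not give the conditional mean on the original scale, so this RMSE describes the back-transformed predictions as they are, without any bias correction.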