Resampling, Recipes, and Tidymodels

mharbur · March 3, 2021, 3:52pm

I have a general question about Tidymodels that I have not found an explicit answer to in the online texts.

Recipes can be used to engineer data.
Models can be used to fit the data.
Workflows can be used to link Recipes and Models and execute the fit.
Resampling can be used to rerun and cross-validate models.

My concern is that, when resampling is used to validate models, the data should be engineered specifically for the training data within each iteration. To engineer the data only once, prior to resampling, would provide the iteration with information about data structures it might not have in a truly predictive scenario, and the model may be overfit as a result.

My question is this: when you conduct a k-fold or other resampling exercise in Tidymodels, does it re-engineer the data within each iteration? Or is the training dataset first engineered in entirety, and then divided into training and test groups for each iteration?

Thank you!

mattwarkentin · March 3, 2021, 6:40pm

Hi @mharbur,

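When you pass a recipe and model (for example, bundled in a workflow) together with a resampling object to a function such as fit_resamples() or tune_grid(), the recipe is always prep'd on the "training" (analysis) split, and bake'd onto both the "training" and "testing" (assessment) splits.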
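A minimal sketch of that setup, for illustration only. The mtcars data, the linear model, and the object names (folds, rec, wf, res) are assumptions chosen for the example and are not from this thread:

```r
library(tidymodels)

set.seed(123)

# 5-fold cross-validation on an illustrative built-in dataset
folds <- vfold_cv(mtcars, v = 5)

# The recipe is only a specification here; nothing is estimated yet
rec <- recipe(mpg ~ ., data = mtcars) %>%
  step_center(all_numeric_predictors())

# Bundle the unprepped recipe with a model specification
wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(linear_reg() %>% set_engine("lm"))

# For each fold, the recipe is prepped on the analysis set,
# then applied to the assessment set before predicting
res <- fit_resamples(wf, resamples = folds)

collect_metrics(res)
```

The key point is that the recipe is handed over unprepped, so the per-fold estimation is left to fit_resamples() rather than done once up front.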

As an example, let's say you are doing 5-fold CV, and one of your recipe steps is centering your predictors (i.e. step_center()): for each of the 5 iterations, the mean value used for centering will be learned from the analysis split, and this mean value will be used when processing the assessment data for downstream predictions/metrics. Five different means will be learned, one for each of the five analysis splits, and each mean will only be used when processing its own assessment counterpart.
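To see the per-fold means directly, the same steps can be reproduced by hand. This is a rough sketch of the idea rather than what the resampling functions do internally; the mtcars data, the choice of disp and hp as predictors, and the name centering_means are assumptions for illustration:

```r
library(tidymodels)

set.seed(123)
folds <- vfold_cv(mtcars, v = 5)

rec <- recipe(mpg ~ disp + hp, data = mtcars) %>%
  step_center(all_predictors())

# For each fold, estimate the centering means on the analysis set only
centering_means <- sapply(folds$splits, function(split) {
  prepped <- prep(rec, training = analysis(split))

  # The assessment set is centered with the analysis-set means
  baked_assessment <- bake(prepped, new_data = assessment(split))

  # Pull out the mean that step_center() learned for `disp`
  learned <- tidy(prepped, number = 1)
  learned$value[learned$terms == "disp"]
})

# One learned mean per fold
centering_means
```

Each value in centering_means should match the mean of disp in the corresponding analysis set, not the mean over the full dataset.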

Hope that's helpful.

mharbur · March 3, 2021, 7:41pm

Hi @mattwarkentin,

That is what I had hoped!

Just to rephrase what you said: with the 5-fold CV, within each iteration, the recipe will be prepped on the "training" split, and then baked onto both the "training" and "testing" splits. So each iteration is treated as a unique pair of "training" and "testing" datasets, and uniquely engineered with the steps of the given recipe. Is that description correct?

Thank you!

mattwarkentin · March 3, 2021, 7:53pm

Yes, that's exactly right.

mharbur · March 3, 2021, 10:15pm

Perfect! Thank you again.
