Calculations to include in resampling

Hi,

I am currently reading the draft version of the "Feature Engineering and Selection" book by Dr Max Kuhn and Dr Kjell Johnson, specifically the resampling chapter's discussion of what you should and should not calculate within the resamples. I have two questions about it that I was hoping people could help with.

Assume 5-fold cross-validation, where each fold consists of a training set (referred to as an analysis set) and a test set (referred to as an assessment set):

  • Using class imbalance as an example, the book suggests that when down-sampling is used as a pre-processing step, it should be applied to the analysis set of every cross-validation fold. When training the model, is it poor practice to also apply it to the assessment set in each fold?

  • The rsample library allows you to create an X-fold cross-validation split of your dataset and then apply custom calculations within each fold. Previously I had a problem that involved computing means within groups in a way that avoids data leakage from the training fold to the test fold. The end result produced different per-group means in the analysis set and the assessment set of each fold. The recipes library allows you to create a recipe of transformations you wish to apply to your dataset before modelling. For example, if we center the data using step_center, is the centering computed on the whole training set, or on just the analysis/assessment portion of the data?

Thank you for your time

About to get on a plane so have to keep it quick. The short answer is that you should include everything in the resampling step. I recommend following this rsample vignette.

You generally want your assessment set (as well as the test set) to have the same distributions as data that you would see "in the wild". I would not down-sample those data within resampling.
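To make the point about down-sampling concrete, here is a minimal sketch in plain Python (standard library only, not the rsample or recipes API). The toy data, the `downsample` helper, and the fold construction are all hypothetical and only illustrate the idea: the analysis set of each fold is balanced before a model would be fit, while the assessment set keeps its original class frequencies.

```python
import random
from collections import Counter


def downsample(rows, label_key, rng):
    """Randomly drop rows so every class has as many rows as the rarest class."""
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)
    n_min = min(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, n_min))
    return balanced


rng = random.Random(42)

# Hypothetical imbalanced data: 90 "no" rows and 10 "yes" rows.
data = [{"x": i, "y": "yes" if i % 10 == 0 else "no"} for i in range(100)]
rng.shuffle(data)

k = 5
folds = [data[i::k] for i in range(k)]  # five disjoint folds of 20 rows each

fold_summaries = []
for i in range(k):
    assessment = folds[i]  # left untouched
    analysis = [row for j, fold in enumerate(folds) if j != i for row in fold]
    analysis_balanced = downsample(analysis, "y", rng)  # analysis set only

    # A model would be fit on analysis_balanced and scored on assessment here.
    fold_summaries.append({
        "analysis_counts": Counter(row["y"] for row in analysis_balanced),
        "assessment_counts": Counter(row["y"] for row in assessment),
        "assessment_size": len(assessment),
    })

for summary in fold_summaries:
    print(summary)
```

The performance estimate from each fold then reflects the class frequencies the model would actually face, because only the data used for fitting was rebalanced.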

Yes to that. This is how resampling estimates the variation in your modeling process. There will most likely be different results within resamples.

During resampling, the recipe should be executed (=trained) on every analysis set separately. Within each fold, the trained recipe is then applied to both that fold's analysis set and its assessment set.
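As a rough illustration of that train-then-apply pattern, the sketch below uses plain Python rather than recipes itself. The helper names `train_center` and `apply_center` and the toy values are assumptions made for the example. The centering value is estimated from the analysis set of each fold and then reused, unchanged, on that fold's assessment set.

```python
from statistics import mean


def train_center(values):
    """Estimate the centering value from the analysis set only."""
    return mean(values)


def apply_center(values, center):
    """Subtract an already-estimated center from any data set."""
    return [v - center for v in values]


# Hypothetical single numeric predictor with values 1..10.
values = [float(v) for v in range(1, 11)]

k = 5
folds = [values[i::k] for i in range(k)]

results = []
for i in range(k):
    assessment = folds[i]
    analysis = [v for j, fold in enumerate(folds) if j != i for v in fold]

    center = train_center(analysis)  # trained on the analysis set only
    results.append({
        "center": center,
        "analysis": apply_center(analysis, center),
        "assessment": apply_center(assessment, center),  # same center reused
    })

for r in results:
    print(r["center"], r["assessment"])
```

Each fold gets its own center because each analysis set is different, which is the fold-to-fold variation that resampling is meant to capture.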

Outside of resampling, once you've settled on a final series of steps, the recipe is trained on the entire training set and then applied (=baked) to the training set (that will be used to build your final model) and to any other data set where predictions are needed.
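The final step can be sketched the same way, again in plain Python with made-up numbers rather than the recipes API. The center is estimated once from the whole training set and then applied both to that training set and to any new data.

```python
from statistics import mean

# Hypothetical values of one numeric predictor.
training = [2.0, 4.0, 6.0, 8.0]
new_data = [5.0, 11.0]

# Estimate the centering value once, from the entire training set.
center = mean(training)

# Apply the same estimate to the training set (used to fit the final model)
# and to any data set that needs predictions.
training_centered = [v - center for v in training]
new_centered = [v - center for v in new_data]

print(center, training_centered, new_centered)
```

Nothing is re-estimated from the new data, so the new rows are transformed exactly as the final model expects.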


Thank you very much @Max, your explanation is very clear.