Hi,
I am currently reading the draft version of the "Feature Engineering and Selection" book by Dr Max Kuhn and Dr Kjell Johnson, specifically the chapter on resampling and what you should and should not calculate within the resamples. I have two questions on it that I was hoping people could help with.
Assume a 5-fold cross-validation where each fold consists of a training set (referred to as an analysis set) and a test set (referred to as an assessment set):
- Using class imbalance as an example, the book suggests that when analysts use down-sampling as a pre-processing step, they should apply it to the analysis set of every cross-validation fold. When training the model, is it poor practice to also apply it to the assessment set in each fold?
- The `rsample` library allows you to create an X-fold cross-validation split of your dataset and then apply custom calculations within each fold. Previously I had a problem with computing means within groups in a way that avoided data leakage from the training fold to the test fold. The end result produced different means per group within the analysis set and the assessment set of each fold. The `recipes` library allows you to create a recipe of transformations you wish to apply to your dataset before modelling. For example, if we center the data using `step_center`, is this done using the whole training set, or is it centered on just the analysis/assessment portion of the data?
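To make both questions concrete, here is a minimal sketch of the setup I have in mind. The toy data, the column names (`x1`, `x2`, `class`) and the `downsample()` helper are made up for illustration and are not from the book. The sketch assumes the recipe is prepped separately on each fold's (down-sampled) analysis set and then applied to the untouched assessment set; whether that is the right practice is exactly what I am asking.

```r
library(rsample)
library(recipes)

set.seed(123)

# Hypothetical imbalanced toy data: 160 "no" vs 40 "yes"
dat <- data.frame(
  x1    = rnorm(200),
  x2    = rnorm(200, mean = 5),
  class = factor(rep(c("no", "yes"), times = c(160, 40)))
)

# 5-fold cross-validation, stratified on the outcome
folds <- vfold_cv(dat, v = 5, strata = class)

# Illustrative helper: randomly down-sample every class to the size
# of the smallest class
downsample <- function(df) {
  n_min <- min(table(df$class))
  rows_by_class <- split(seq_len(nrow(df)), df$class)
  keep <- unlist(lapply(rows_by_class, function(i) {
    i[sample.int(length(i), n_min)]
  }))
  df[keep, , drop = FALSE]
}

results <- lapply(folds$splits, function(split) {
  analysis_ds <- downsample(analysis(split))  # down-sampled analysis set
  assess      <- assessment(split)            # assessment set left as-is

  # Centering statistics are estimated from the data given to prep()
  rec <- recipe(class ~ ., data = analysis_ds)
  rec <- step_center(rec, all_numeric_predictors())
  prepped <- prep(rec, training = analysis_ds)

  list(
    n_per_class    = table(analysis_ds$class),
    centers        = tidy(prepped, number = 1),
    analysis_means = colMeans(analysis_ds[, c("x1", "x2")]),
    # The assessment set is centered with the analysis-set means
    assessed       = bake(prepped, new_data = assess)
  )
})
```

In this sketch the centering values reported by `tidy()` are the column means of whatever data was passed to `prep()`, and the assessment set is only ever passed to `bake()`, so it is neither down-sampled nor used to estimate the means.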
Thank you for your time