I'm trying to make recipes a part of my workflow and have two questions I was hoping to get help with. One is more practical and the other more theoretical.
Taking the example from the main website, how do I actually see the new dataset after the transformation defined by my recipe? This is mostly for my peace of mind, so I can explore the result afterwards.
    library(recipes)
    library(mlbench)
    data(Sonar)
    sonar_rec <- recipe(Class ~ ., data = Sonar) %>%
      step_center(all_predictors()) %>%
      step_scale(all_predictors())
The second question is a bit more theory related and might come from my lack of understanding of how model predictors are built.
Let's imagine I have a dataset which I split into a training and a test set. My understanding of centering, at a very high level, is that we subtract the variable mean from each of the scores to produce the new variable score (I know there is a bit more to it).
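To make sure I'm describing centering correctly, here is a minimal base R sketch of what I mean. The `scores` vector is made up purely for illustration, and this is only my understanding of the idea, not necessarily how recipes implements it internally.

```r
# Hypothetical scores, made up purely for illustration
scores <- c(2, 4, 6, 8)

# Centering: subtract the variable mean from each score
centered <- scores - mean(scores)

# Scaling: additionally divide by the standard deviation
scaled <- centered / sd(scores)

print(centered)        # deviations from the mean
print(mean(centered))  # the centered variable has mean zero
print(sd(scaled))      # the scaled variable has unit standard deviation
```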
If we use the entire dataset to do this, it will produce one set of centered and scaled variables. We then split the data into training and test sets, train the model on the training data, and check its validity on the test set. Does this inadvertently leak information to my model, because the overall mean (which includes the test rows) is used in scaling the numeric features?
If we go the other way and apply the centering and scaling at the group level (in this case using the training-set mean and the test-set mean separately), the group means can differ. If I take this one step further and use cross-validation, there could be up to 2k different means, depending on the number of folds (10-fold cross-validation would have ten training sets and ten test sets, each with its own mean).
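To make the two options concrete, here is a rough base R sketch with a single simulated feature. The data, the seed, and the 80/20 split are all assumptions made up for illustration; the point is only that the overall, training, and test means generally differ, so the centered test values depend on which mean is used.

```r
set.seed(1)  # arbitrary seed so the split is reproducible

# Hypothetical single numeric feature, simulated for illustration only
x <- rnorm(100, mean = 50, sd = 10)

# Random 80/20 split into training and test rows
train_idx <- sample(seq_along(x), size = 80)
x_train <- x[train_idx]
x_test  <- x[-train_idx]

overall_mean <- mean(x)
train_mean   <- mean(x_train)
test_mean    <- mean(x_test)

# Option A: center with the overall mean (computed using the test rows too)
test_centered_overall <- x_test - overall_mean

# Option B: center each group with its own mean
test_centered_own <- x_test - test_mean

# The three means are generally not identical
print(c(overall = overall_mean, train = train_mean, test = test_mean))
```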
I guess my question after all that is: what steps should be used, and is the above anything to worry about at all, or am I just over-caffeinated?
Thank you all for your help