Hi all,
I’m confused about what prep() and bake() do. I read the answer to this Stack Overflow question (tidyverse - What is the difference among prep/bake/juice in the R package "recipes"? - Stack Overflow), and my understanding is as follows:
- When prep() is run, it takes the data provided to it (the training data) and estimates every quantity the recipe's steps need, such as column means, from that data.
- Then, when bake() is run on the test data, it takes the quantities estimated in the previous step and applies them to the test data. The relevant text is, “Or it could be the testing data. In this case, the column means from the training data are applied to the testing data, because that is what happens IRL in a modeling workflow. To do otherwise is data leakage.”
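To make my understanding concrete, here is a minimal sketch of what I think is happening. The mtcars split and the choice of step_center() are just my own assumptions for illustration; they are not taken from the linked answer.

```r
library(recipes)

# Illustrative train/test split of a built-in data set (my own choice)
train <- mtcars[1:20, ]
test  <- mtcars[21:32, ]

# Define the recipe: center every numeric predictor
rec <- recipe(mpg ~ ., data = train) %>%
  step_center(all_numeric_predictors())

# prep() estimates the column means from the training data only
prepped <- prep(rec, training = train)

# bake() applies those stored training means to whatever data it is given
train_baked <- bake(prepped, new_data = train)
test_baked  <- bake(prepped, new_data = test)

# Check: the centered test column is the raw test values
# minus the *training* mean, not the test mean
all.equal(test_baked$disp, test$disp - mean(train$disp))
```

If this is right, the centered test columns will generally not have a mean of exactly zero, since the means that were subtracted came from the training rows.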
I have two questions:
- First, when we prep() the data, is it ALWAYS the training data that we have to provide to the function?
- Second, when we bake() the test data, why would we want to use quantities computed from the training data? (This assumes my understanding of what bake() does is correct. If it’s not, please help me understand the functions better.) Why not use the quantities from the test data itself?
Thanks so much!