Yet another question about prep() and bake()

Hi all,

I’m confused about what prep() and bake() do. I read the response to this question here (tidyverse - What is the difference among prep/bake/juice in the R package "recipes"? - Stack Overflow) and my understanding is as follows:

  • when prep() is run, it basically takes the data provided to it (the training data) and computes all the necessary quantities using the training data to process the recipe.
  • then when bake() is run on test data, it takes all the quantities from the previous step and applies them to the test data. The relevant text is, “Or it could be the testing data. In this case, the column means from the training data are applied to the testing data, because that is what happens IRL in a modeling workflow. To do otherwise is data leakage.”

I have two questions:

  • First, when we prep() the data, is it ALWAYS the training data that we have to provide to the function?
  • Second, when we bake() the test data, why would we want to use quantities computed from training data? (This is assuming my understanding of what bake() does above is correct. If it’s not, please help me understand the functions better). Why not use the quantities from the test data itself?

Thanks so much!

This is a pretty common question.

First, when we prep() the data, is it ALWAYS the training data that we have to provide to the function?

Yes, it should always use the training set (just like the model does).

Second, when we bake() the test data, why would we want to use quantities computed from training data? (This is assuming my understanding of what bake() does above is correct. If it’s not, please help me understand the functions better). Why not use the quantities from the test data itself?

You want the data being predicted to be normalized/on the same scale as the data that were used to build the model. Otherwise, the model is getting data that isn't quite right.
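To make that concrete, here is a minimal sketch with the recipes package. The data frames, the `gpa` predictor and the `y` outcome are all made up for illustration; only `recipe()`, `step_normalize()`, `prep()`, `bake()` and `tidy()` are real recipes functions.

```r
library(recipes)

set.seed(1)
# Hypothetical data: gpa is a made-up predictor, y a made-up outcome
train <- data.frame(gpa = rnorm(100, mean = 2.5, sd = 0.5), y = rnorm(100))
test  <- data.frame(gpa = rnorm(20,  mean = 3.2, sd = 0.5), y = rnorm(20))

rec <- recipe(y ~ gpa, data = train)
rec <- step_normalize(rec, all_numeric_predictors())

# prep() estimates the mean and sd of gpa from the training set only
prepped <- prep(rec, training = train)

# bake() applies those stored training statistics to whatever data it is given
train_baked <- bake(prepped, new_data = train)
test_baked  <- bake(prepped, new_data = test)

# The stored means and sds can be inspected
tidy(prepped, number = 1)
```

The baked test column should match `(test$gpa - mean(train$gpa)) / sd(train$gpa)`, so nothing is re-estimated from the test set.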

Here are a few good examples (I think) related to centering/scaling...

First, imagine that you are predicting a single sample. You can't compute the standard deviation from that and would be unable to make predictions.
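A small base R sketch of the single-sample case. The training mean and standard deviation below are made-up numbers standing in for whatever prep() would have stored.

```r
# Hypothetical statistics estimated from a training set
train_mean <- 10
train_sd   <- 2

new_sample <- 13  # a single new observation to predict

# The standard deviation of one value is undefined, so R returns NA
sd(new_sample)

# Standardizing with the sample's own statistics therefore gives NA
own_scaled <- (new_sample - mean(new_sample)) / sd(new_sample)

# Standardizing with the stored training statistics works
train_scaled <- (new_sample - train_mean) / train_sd
train_scaled  # 1.5
```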

Second, consider the case where the new samples being predicted come from the edges of the population. For example, the Chicago train data is used to predict how many people take the train each day. With Covid, the ridership is about 1/20th of what it used to be.

If we were to have a model built with all data and predict on any pandemic data, the means and standard deviations would be really different from those used to build the model. The predictor values given to the model would not be from the same distribution as the training set was.
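Here is a rough sketch of that effect in base R. The ridership numbers are invented for illustration and are not taken from the real Chicago data.

```r
# Made-up ridership values (thousands of rides per day), illustration only
train_rides    <- c(18, 20, 22, 19, 21)
pandemic_rides <- train_rides / 20  # roughly 1/20th of normal ridership

train_mean <- mean(train_rides)
train_sd   <- sd(train_rides)

# Re-estimating from the new data: pandemic days look like ordinary days
own_scaled <- (pandemic_rides - mean(pandemic_rides)) / sd(pandemic_rides)

# Using the training statistics: pandemic days show up as extreme low values
train_scaled <- (pandemic_rides - train_mean) / train_sd

mean(own_scaled)    # essentially 0
mean(train_scaled)  # strongly negative
```

With re-estimated statistics the pandemic days come out identical to the standardized training days, so the model would have no way to see that ridership collapsed.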

Finally, think of the recipe as being very similar to the model: we are estimating stuff. We wouldn't think that it is a good idea to re-calculate the model parameters each time predict() is invoked.

Hi Max,

Thanks for the response. I'm still a bit confused, and I think this is related to the 2nd point you raised above. Let me try my best to describe it:

So the idea is I want to fit a model using training data and then use that same model on test data to see how well it does in terms of out of sample predictions. Let's say a variable in my model is the standardized value of, say, GPA. Let's assume further that in the training sample, its mean is 2.5, while in the test sample it's 3.2 (I'm totally making up numbers here) and that the stdevs are similar across the two samples. If we use the training mean on test data, aren't we artificially inflating the value of the standardized GPA when attempting to evaluate the model on test data? Don't we want the model to perform well using test data on its own scale?

Thanks for the time. Julia mentioned in her Stack Overflow response that this is how modeling is done in real life, so I guess it's common knowledge. If you could suggest something I can read on this topic, I would really appreciate it.

No. Using the training mean and standard deviation makes sure that the new value gets the same standardized value it would have had if it had been in the training set.
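Working through the GPA numbers from the question as a sketch. The means of 2.5 and 3.2 come from the question; the shared standard deviation of 0.5 is an assumption added here, since the question only says the two are similar.

```r
train_mean <- 2.5  # training mean from the question
test_mean  <- 3.2  # test mean from the question
common_sd  <- 0.5  # assumed value, not given in the question

gpa <- 3.2  # a test-set student with exactly the test-set average GPA

# Standardized with training statistics: the model sees an above-average student
z_train <- (gpa - train_mean) / common_sd  # about 1.4

# Standardized with test statistics: the model sees a perfectly average student
z_test <- (gpa - test_mean) / common_sd    # 0

# The raw GPA the model would implicitly associate with z_test,
# since it learned its coefficients on the training scale
implied_gpa <- z_test * common_sd + train_mean  # 2.5
```

So with test-set statistics, a 3.2 student would be treated exactly like a 2.5 student from the training data. The value is not inflated by using the training mean; it is placed where the model expects it.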

It is a common question.
