recipe step vs modifying data

I am confused about the advantages of using recipe steps for data transformations as opposed to modifying the data itself.

For example, if I have a process like:

  1. Get data
  2. Simple cleaning
  3. Split
  4. Explore training data

If this process leads me to believe that I want to log-transform my dependent variable, what is the advantage of adding `step_log(y)` to a recipe, as compared to adding `mutate(y = log(y))` to my simple cleaning step above and then rerunning the split?
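For concreteness, here is a minimal sketch of the two options being compared, assuming a hypothetical data frame `dat` with outcome `y`:

```r
library(dplyr)
library(rsample)
library(recipes)

# Option A: modify the data itself, then split
dat_logged <- dat %>% mutate(y = log(y))   # `dat` is a placeholder data frame
split_a    <- initial_split(dat_logged)

# Option B: split first, then express the transformation as a recipe step
split_b <- initial_split(dat)
rec <- recipe(y ~ ., data = training(split_b)) %>%
  step_log(y)
```

Both produce a log-transformed outcome; the question is which place the transformation should live.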

I think it is easier to make sure things are going as intended if you modify the actual data. I do see that there are some very handy recipe steps, so that is one advantage; are there others? A disadvantage is that it is harder to evaluate choices (e.g. picking parameters for `step_other`).

Thanks for your help,

David

If you are doing any deterministic, non-estimation type work, then it makes sense to do those parts up-front as you suggest.

Otherwise, it's best to put the transformation into a recipe so that your performance statistics are appropriate. Using a recipe also has the side benefit that you do not have to write any special code when new data arrive.

Some examples that are good for up-front work:

  • computing features from dates (e.g. month, day of the week, etc).
  • log transformations
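These up-front transformations are deterministic (they estimate nothing from the data), so they can safely go into a `mutate()` before the split. A sketch, again assuming a hypothetical `dat` with a `date` column and outcome `y`:

```r
library(dplyr)
library(lubridate)

dat_clean <- dat %>%   # `dat`, `date`, and `y` are placeholder names
  mutate(
    month = month(date),               # month of the year
    dow   = wday(date, label = TRUE),  # day of the week
    y     = log(y)                     # fixed, non-estimated transformation
  )
```

Because these computations are row-wise and involve no fitted statistics, applying them before resampling cannot leak information from the test set into training.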

Things that should really go into a recipe:

  • PCA
  • centering, scaling, Box-Cox transformations (all use statistical estimates)
  • feature selection
  • imputation
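What these steps have in common is that they estimate quantities (means, loadings, imputation values) from the data, so they belong inside a recipe where the estimation happens on training data only. A sketch under assumed data frames `training_data` and `new_data`:

```r
library(recipes)

rec <- recipe(y ~ ., data = training_data) %>%       # `training_data` is a placeholder
  step_impute_median(all_numeric_predictors()) %>%   # imputation: estimates medians
  step_normalize(all_numeric_predictors()) %>%       # centering & scaling: estimates means/SDs
  step_pca(all_numeric_predictors(), num_comp = 5)   # PCA: estimates loadings

# prep() estimates those statistics from the training data only;
# bake() then applies the frozen estimates to any data set
prepped <- prep(rec, training = training_data)
scored  <- bake(prepped, new_data = new_data)        # `new_data` is a placeholder
```

Inside resampling, this prep/bake cycle is redone within each resample, which is why the resulting performance statistics are honest.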

Some of this is a bit philosophical. Take a look at this video, where we discuss these ideas and the reasoning behind our recommendations.