recipe step vs modifying data

I am confused about the advantages of using recipe steps for data transformations as opposed to modifying the data itself.

For example, if I have a process like:

  1. Get data
  2. Simple cleaning
  3. Split
  4. Explore training data

If this process leads me to believe that I want to log-transform my dependent variable, what is the advantage of adding `step_log(y)` to a recipe, as compared to adding `mutate(y = log(y))` to my simple cleaning step above and then rerunning the split?
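For concreteness, here is a minimal sketch of the two options being compared, assuming a hypothetical data frame `dat` with outcome `y`:

```r
library(dplyr)
library(rsample)
library(recipes)

# Option A: modify the data itself, then split
dat_logged <- dat %>% mutate(y = log(y))   # `dat` is a placeholder data frame
split_a    <- initial_split(dat_logged)

# Option B: split first, then express the transformation as a recipe step
split_b <- initial_split(dat)
rec <- recipe(y ~ ., data = training(split_b)) %>%
  step_log(y)
```

Both produce a log-transformed outcome; the question is which place the transformation should live.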

I think it is easier to make sure things are going as intended if you modify the actual data. I do see that there are some very handy recipe steps, so that is one advantage; are there others? A disadvantage is that it is harder to evaluate choices (e.g. picking parameters for `step_other`).

Thanks for your help,

David

If you are doing any deterministic, non-estimation type work, then it makes sense to do those parts up-front as you suggest.

Otherwise, it's best to put the transformation into a recipe so that your performance statistics are appropriate. Using a recipe also has the side benefit that you do not have to write any special code when new data arrive.

Some examples that are good for up-front work:

  • computing features from dates (e.g. month, day of the week, etc).
  • log transformations
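These up-front transformations are deterministic (they estimate nothing from the data), so they can safely go into a `mutate()` before the split. A sketch, again assuming a hypothetical `dat` with a `date` column and outcome `y`:

```r
library(dplyr)
library(lubridate)

dat_clean <- dat %>%   # `dat`, `date`, and `y` are placeholder names
  mutate(
    month = month(date),               # month of the year
    dow   = wday(date, label = TRUE),  # day of the week
    y     = log(y)                     # fixed, non-estimated transformation
  )
```

Because these computations are row-wise and involve no fitted statistics, applying them before resampling cannot leak information from the test set into training.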

Things that should really go into a recipe:

  • PCA
  • centering, scaling, Box-Cox transformations (all use statistical estimates)
  • feature selection
  • imputation
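What these steps have in common is that they estimate quantities (means, loadings, imputation values) from the data, so they belong inside a recipe where the estimation happens on training data only. A sketch under assumed data frames `training_data` and `new_data`:

```r
library(recipes)

rec <- recipe(y ~ ., data = training_data) %>%       # `training_data` is a placeholder
  step_impute_median(all_numeric_predictors()) %>%   # imputation: estimates medians
  step_normalize(all_numeric_predictors()) %>%       # centering & scaling: estimates means/SDs
  step_pca(all_numeric_predictors(), num_comp = 5)   # PCA: estimates loadings

# prep() estimates those statistics from the training data only;
# bake() then applies the frozen estimates to any data set
prepped <- prep(rec, training = training_data)
scored  <- bake(prepped, new_data = new_data)        # `new_data` is a placeholder
```

Inside resampling, this prep/bake cycle is redone within each resample, which is why the resulting performance statistics are honest.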

Some of this is a bit philosophical. Take a look at this video, where we discuss these ideas and the reasoning behind our recommendations.