What is the added value of using `step_*` verbs from `recipes` rather than `dplyr`/`tidyr`/`purrr` for feature engineering?

As a newcomer to tidymodels, I have a fundamental question regarding feature engineering: what benefit does the recipes package have over dplyr tools? In other words, why would I "bother" working with the step_* verbs if I can do all of the feature engineering work with dplyr? It seems obvious to me that dplyr is much more powerful and versatile than the step_* verbs. If I need to take the log() of a variable, or otherwise clean up or do math on a feature column, dplyr seems like the most straightforward go-to. And for advanced feature engineering I would use dplyr in combination with tidyr and purrr.

But I'm afraid I could be missing something. Does recipes have any particular advantage over "regular" wrangling with dplyr/tidyr/purrr?

Thanks.

Hi @emman,

dplyr et al. have an immediate effect on your data. In other words, when you create a new variable or manipulate the data in some way, the change is applied right away.

The critical difference is that the step_* functions in the recipes package specify a definition of a data manipulation step without actually performing it. Feature engineering and data preprocessing are part of model building and should be estimated using ONLY the training/analysis data and then applied to the testing/assessment data. For simple steps, like taking the log of a variable, it doesn't really matter because there are no learnable parameters. But even for a simple transformation like centering and scaling the data, the mean and the SD are learned from the data. These learnable parameters should be estimated from the training data only. If you center and scale the ENTIRE data set with dplyr before model fitting, there is technically some data leakage that may bias your model performance upward, even if only slightly.

With recipes, you just define the steps that you plan to apply during the training process, but you don't immediately apply them. When you use the tidymodels infrastructure, you pass the recipe definition and the resamples and "behind the scenes" tidymodels trains the recipe on the training data ONLY and then applies the trained recipe to the training and testing data, as required.

Of course, you can always use recipes as a standalone package outside of the bigger tidymodels world, but the idea is still that you should not use the test/validation data to contribute to the data preprocessing.
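Here is a minimal sketch of what that looks like in practice (the diamonds data and the single predictor are just for illustration):

library(rsample)
library(recipes)

set.seed(1)
split <- initial_split(ggplot2::diamonds[, c("carat", "price")], prop = 0.8)
train <- training(split)
test  <- testing(split)

rec <- recipe(price ~ carat, data = train) %>%
  step_normalize(carat)                     # just a definition; nothing is computed yet

rec_trained <- prep(rec, training = train)  # the mean/SD of carat are estimated here, from `train` only

bake(rec_trained, new_data = train)         # training data scaled with the training statistics
bake(rec_trained, new_data = test)          # test data scaled with the SAME training statistics
tidy(rec_trained, number = 1)               # shows the estimated mean and SD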

Thanks @mattwarkentin. The topic of data leakage is new to me. In addition, learnable parameters are something I wasn't aware of when thinking about feature engineering. To make this discussion more concrete, I'd like to use an example. Below I provide some code for preprocessing with recipes and then the dplyr et al. equivalent. I would be grateful if you (or any other member here) could point out what data leakage would look like in the following code, and what learnable parameters we have here that recipes handles but dplyr does not.

The following code is adapted from Hansjörg Plieninger's blog post, where he gives a tidymodels walkthrough.

We use the diamonds data from ggplot2.
First, let's show how we would build a recipes specification.

library(rsample)
library(recipes)
library(ggplot2) # for diamonds data

set.seed(123)
# step 1: split data into training and testing sets
my_diamonds <- diamonds[, c("carat", "cut", "price")]
init_split  <- initial_split(my_diamonds, prop = .1)
d_training  <- training(init_split)

# specify recipe
diamonds_recipe <-
  recipe(formula = price ~ ., data = d_training) %>%
  step_log(price) %>%
  step_normalize(all_predictors(), -all_nominal()) %>%
  step_dummy(cut) %>% 
  step_poly(carat, degree = 2) %>%
  prep()

# save the wrangled training data (that was wrangled according to recipe) to object
d_training_preprocessed_by_recipe <-
  diamonds_recipe %>%
  bake(new_data = NULL)

d_training_preprocessed_by_recipe
#> # A tibble: 5,394 x 7
#>    price cut_1  cut_2     cut_3  cut_4 carat_poly_1 carat_poly_2
#>    <dbl> <dbl>  <dbl>     <dbl>  <dbl>        <dbl>        <dbl>
#>  1  6.22 0.632  0.535  3.16e- 1  0.120     -0.0137      0.0110  
#>  2  9.57 0     -0.535 -4.10e-16  0.717      0.0198     -0.00484 
#>  3  6.88 0.316 -0.267 -6.32e- 1 -0.478     -0.00750    -0.000876
#>  4  8.80 0.632  0.535  3.16e- 1  0.120      0.00599    -0.0127  
#>  5  7.27 0.632  0.535  3.16e- 1  0.120     -0.00778    -0.000425
#>  6  9.63 0.316 -0.267 -6.32e- 1 -0.478      0.0389      0.0393  
#>  7  8.29 0.316 -0.267 -6.32e- 1 -0.478      0.00571    -0.0126  
#>  8  8.57 0     -0.535 -4.10e-16  0.717      0.0113     -0.0120  
#>  9  7.41 0.632  0.535  3.16e- 1  0.120     -0.00525    -0.00418 
#> 10  8.07 0.632  0.535  3.16e- 1  0.120      0.00599    -0.0127  
#> # ... with 5,384 more rows

Now let's assume that we want to take the dplyr path instead. This means that we will not use recipes at all. However, splitting into training and testing data is still relevant after we wrangle the data.

library(dplyr)
library(tibble)

# equivalent to `step_dummy()`
mutate_dummy_contr.poly <- function(.dat, colname) {
  colname    <- deparse(substitute(colname))
  col_as_vec <- .dat[[colname]]
  stopifnot(is.factor(col_as_vec))
  
  factor_levels <- levels(col_as_vec)
  
  # polynomial contrasts: one column per factor level minus one
  contr.poly(factor_levels) %>%
    as_tibble() %>%
    setNames(paste(colname, seq_len(length(factor_levels) - 1), sep = "_")) %>%
    add_column("{colname}" := factor_levels, .before = 1) %>%
    left_join(.dat, ., by = colname) %>%
    select(-all_of(colname))
}

# equivalent to `step_poly()`
mutate_poly_coefs <- function(.dat, colname, deg) {
  colname <- deparse(substitute(colname))
  poly(x = .dat[[colname]], degree = deg) %>%
    as_tibble() %>%
    setNames(paste(as.character(colname), as.character(1:deg), sep = "_")) %>%
    bind_cols(.dat, .)
}

my_diamonds_preproc_with_dplyr <-
  my_diamonds %>% 
  mutate(across(price, log)) %>%
  mutate(across(carat, ~as.numeric(scale(.)))) %>%
  mutate_dummy_contr.poly(cut) %>%
  mutate_poly_coefs(carat, deg = 2)

init_split_dplyr  <- initial_split(my_diamonds_preproc_with_dplyr, prop = .1)
d_training_dplyr  <- training(init_split_dplyr)
d_testing_dplyr   <- testing(init_split_dplyr)

In summary, the second approach first wrangles the entire my_diamonds data using dplyr and then splits it into testing and training sets. What data leakage could possibly happen here, and what learnable parameters am I failing to learn this way?

Otherwise, I prefer writing this wrangling/feature engineering code explicitly, so I can clearly see what operations are carried out on the data, instead of using "black box" wrappers such as the step_*() functions from recipes.

I would be happy for anyone who can chime in and add their 2 cents. I could not find any intelligent discussion about this topic elsewhere.

Thanks!

One issue is the use of scale(). You should center and/or scale your data based on the training set and then apply those same statistics to new data. You shouldn't do any estimation before splitting (or separately on the test set). Some base R functions can handle this correctly, but only when they are used in the right context (e.g., via the formula method and model.matrix(), not as used above).

It is important to make sure that the estimation is done properly to avoid overfitting/information leakage. This is especially important for imputation, feature selection, and other operations using statistical estimates (scale() is pretty benign).
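A leak-free version of the dplyr approach would split first and then estimate; something along these lines (the object names here are hypothetical):

set.seed(123)
init_split <- initial_split(my_diamonds, prop = .1)
d_train    <- training(init_split)
d_test     <- testing(init_split)

# estimate the statistics on the training data only...
carat_mean <- mean(d_train$carat)
carat_sd   <- sd(d_train$carat)

# ...then apply those same numbers to both data sets
d_train <- dplyr::mutate(d_train, carat = (carat - carat_mean) / carat_sd)
d_test  <- dplyr::mutate(d_test,  carat = (carat - carat_mean) / carat_sd)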

Also, there are some subtle nuances to things like making dummy variables using base R tools. There are situations where you may not get the same encodings when applying them to new data. This somewhat depends on the format of the data going in, but we've solved a lot of little issues that users have encountered over the last two decades. These problems may have a low probability of happening, but when they do, they take a while to debug and become a memorable experience.
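For example (an illustrative sketch, not taken from your code): if a factor level doesn't show up in a batch of new data and the dummy variables are re-made from scratch, the columns no longer line up:

train_cut <- factor(c("Fair", "Good", "Ideal"))
new_cut   <- factor(c("Good", "Ideal"))    # "Fair" is absent from the new batch

ncol(model.matrix(~ cut, data.frame(cut = train_cut)))  # 3 columns
ncol(model.matrix(~ cut, data.frame(cut = new_cut)))    # 2 columns; the encodings no longer match

# step_dummy() uses the factor levels recorded when the recipe was prep()-ed,
# so new data always gets the same columns as the training data.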

However, we advise doing operations on the outcomes outside of the recipe. The recipe, when used in other tidymodels functions, firewalls the outcome data when predicting new data (just to make sure that it is not used in any other way). For example, when resampling or tuning, the step_log() in your recipe will fail since price will not be exposed to the recipe.
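For the diamonds example above, that would look roughly like this (a sketch that reuses the object names from your post):

set.seed(123)
# transform the outcome with dplyr BEFORE the split and the recipe
my_diamonds <- dplyr::mutate(diamonds[, c("carat", "cut", "price")], price = log(price))
init_split  <- initial_split(my_diamonds, prop = .1)
d_training  <- training(init_split)

# the recipe now contains predictor steps only (no step_log(price))
diamonds_recipe <-
  recipe(price ~ ., data = d_training) %>%
  step_normalize(all_predictors(), -all_nominal()) %>%
  step_dummy(cut) %>%
  step_poly(carat, degree = 2)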

Finally, one nice thing about a recipe is that, for the most part, all of your preprocessing is contained (and documented) in a single object. That helps with traceability and general code cleanliness. A lot of people otherwise end up with preprocessing scripts or functions scattered across different files.

Thanks @Max for the detailed response. As a tidymodels (and ML) newbie, I'd like to build on your answer to learn more about the concepts underlying ML work with tidymodels (and ML in general).

Is data normalization (e.g., using scale()) considered a type of estimation?

In what sense is scale() a form of estimation? Coming from regular statistics, data normalization is something I do when I want to standardize different variables so they are in the same units.


Do you happen to have some pointers to this? I mean, any code example, any discussion, Stack Overflow/blog post, book chapter, etc.


Is there an example of what data leakage looks like?

For me at least, learning from an example is really helpful. Do you happen to have any reference to a data leakage example (e.g., reproducible analysis code) and, ideally, how it is avoided using tidymodels tools? Right now the concept of data leakage remains theoretical to me.


How can we learn about the nuances tidymodels solves?

I agree that relying on memorable experience is a problem, because it leads to variance in knowledge and rigor depending on the particular situations one has encountered over time. A unifying framework such as tidymodels therefore seems like a good solution. However, I'd like to stress the educational side of tidymodels and argue that it would be immensely valuable to understand what those nuances are that tidymodels solves. Otherwise, users' understanding (like my own) will be limited to "this is the way tidymodels does it, and it's a good standard to rely on". While that's true, I do want to deepen my understanding.

So my practical question is: is there somewhere I can read about the problems/nuances that tidymodels streamlines?


Tidymodels provides a clean and standard interface

Absolutely. The standard code tidymodels provides is a great strength.


In summary

I really need to ground my learning in actual examples :slight_smile: . Right now, data leakage seems like the major concept I need to understand.

It estimates the standard deviation and mean of the data.

Base R has never done a good job of documenting what terms() does. There is some written about it in this blog post (see the section "The Predictive Nature of the terms").

In a nutshell: a function like scale() stores the estimates as attributes:

> scale(1:3)
     [,1]
[1,]   -1
[2,]    0
[3,]    1
attr(,"scaled:center")
[1] 2
attr(,"scaled:scale")
[1] 1

That, in itself, isn't helpful. But when scale() is used in a formula (e.g. y ~ scale(x)) and processed via the whole R terms machinery, the estimates are stored in a way that applies the same scaling to new data (e.g. when predict() is called).
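For example (a small illustration; the data are made up):

train_df <- data.frame(x = 1:10, y = 2 * (1:10))
new_df   <- data.frame(x = 11:12)

fit <- lm(y ~ scale(x), data = train_df)
predict(fit, new_df)  # x in new_df is centered/scaled with the *training* mean and SD,
                      # because the terms object remembered them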

There are tons of examples (usually under "information leakage"). The best examples are those related to feature selection and how the data are used there. A good one is Selection bias in gene extraction on the basis of microarray gene-expression data. We still see this error in the literature constantly; it is not obvious and can lead to huge issues.

There is some written about it here:

To provide a solid methodology, one should constrain themselves to developing the list of preprocessing techniques, estimate them only in the presence of the training data points, and then apply the techniques to future data (including the test set). Arguably, the moving average issue cited above is most likely minor in terms of consequences, but illustrates how easily the test data can creep into the modeling process. The approach to applying this preprocessing technique would be to split the data then apply the moving average smoothers to the training and test sets independently.

Another, more overt path to information leakage, can sometimes be seen in machine learning competitions where the training and test set data are given at the same time. While the test set data often have the outcome data blinded, it is possible to “train to the test” by only using the training set samples that are most similar to the test set data. This may very well improve the model’s performance scores for this particular test set but might ruin the model for predicting on a broader data set.

as well as Section 10.4.

One of the most pervasive issues is that teaching materials usually fit a model and then re-predict the modeling data to measure performance. That's fine for simple linear models, but in most other cases it is known to produce vastly over-optimistic model performance statistics (a basic example is here).
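A quick way to see that optimism (a rough sketch; rpart is used here just as a conveniently flexible learner):

library(rsample)
library(rpart)

set.seed(1)
split <- initial_split(ggplot2::diamonds, prop = 0.8)
fit   <- rpart(price ~ carat + cut, data = training(split), cp = 0)  # deliberately overfit

rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
rmse(training(split)$price, predict(fit, training(split)))  # re-predicting the training data: too rosy
rmse(testing(split)$price,  predict(fit, testing(split)))   # held-out data: a more honest estimate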

There's a book about it. We often don't write these things down in detail because a lot of it is just good practice (e.g., not needing the training set or the outcome data to make predictions). As the references above show, much of it comes down to only exposing the appropriate data to the code at the appropriate time.

Wonderful, thank you!
