Building branches of data preprocessing within recipes

Hello!

I'm faced with the following challenge and wanted to get your opinion: let's say I would like to fit an XGBoost model on a relatively small dataset with quite a bit of variation. I'm preparing a fairly common recipe for tree-based models, such as:

df %>% 
    recipe(target ~ .) %>% 
    
    # Nominal variables sanity check
    step_other(all_nominal(), -has_role("outcome"), other = "infrequent_combined", threshold = 0.025) %>% 
    step_novel(all_nominal(), -has_role("outcome"), new_level = "unrecorded_observation") %>% 
    
    # Numerical variables preprocessing
    step_medianimpute(all_numeric()) %>%
    
    # Nominal variables preprocessing
    step_unknown(all_nominal(), -has_role("outcome")) %>%
    step_integer(all_nominal(), -has_role("outcome")) %>% 
    
    # Final checks
    step_nzv(all_predictors()) %>% 
    check_missing(all_predictors())

What I would like to do on top of that is add another 'branch' within my recipe that does something similar, with a few changes:

  1. Dummy code all nominal variables instead of assigning integers
  2. Normalize all numerical variables
  3. Apply PCA on top of all of those features and join the components back to the main recipe
  4. The final set I would like to use for modelling consists of all the basic features from the code excerpt above plus my PCA components

The reason is that such a setup might make the final solution more stable and improve model performance, but recipes doesn't seem to make this easy right now.

  1. step_pca currently replaces all of its input features. An option to retain them could in theory be added, but it wouldn't solve the issue entirely
  2. Is there any way of building branches like the ones I described, so that a single recipe could be prepared within a single prep call instead of having two completely separate recipes?

There is an issue open to make this happen and it should be fairly simple. Just haven't gotten to it yet. Also, I've been experimenting with letting num_comp = 0 as a way of skipping PCA but that's fairly dangerous (but let's keep that between us :grinning:)
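For later readers, here is a rough sketch of what retaining the inputs can look like. It assumes a recipes version in which step_pca has a keep_original_cols argument, which was not available when this reply was written. The toy df and num_comp = 2 are made up for illustration.

```r
library(recipes)

# Hypothetical toy data standing in for `df`
set.seed(1)
df <- data.frame(
  target = rnorm(50),
  x1 = rnorm(50),
  x2 = rnorm(50),
  x3 = rnorm(50)
)

rec <- recipe(target ~ ., data = df) %>%
  step_normalize(all_predictors()) %>%
  # Assumption: keep_original_cols exists in the installed recipes version
  step_pca(all_predictors(), num_comp = 2, keep_original_cols = TRUE)

# new_data = NULL returns the processed training set
baked <- bake(prep(rec, training = df), new_data = NULL)
names(baked)
```

Note that the retained columns are the normalized inputs to the PCA step, not the raw ones, so this alone does not give the integer-coded branch from the question.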

No, and that's a weakness with recipes. You'd have to have two recipes.
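A minimal sketch of the two-recipe route, in case it helps. This is not an official pattern: the toy df, the names rec_tree and rec_pca, and num_comp = 2 are made up for illustration, and the tree branch is cut down to a single step.

```r
library(recipes)

# Hypothetical toy data standing in for `df`
set.seed(1)
df <- data.frame(
  target = rnorm(100),
  x1 = rnorm(100),
  x2 = rnorm(100),
  grp = factor(sample(c("a", "b", "c"), 100, replace = TRUE))
)

# Branch 1: tree-friendly features (integer-coded nominals)
rec_tree <- recipe(target ~ ., data = df) %>%
  step_integer(all_nominal(), -all_outcomes())

# Branch 2: dummy-code, normalize, then PCA
rec_pca <- recipe(target ~ ., data = df) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_normalize(all_numeric(), -all_outcomes()) %>%
  step_pca(all_predictors(), num_comp = 2)

# Each recipe is prepped separately on the same training data
prep_tree <- prep(rec_tree, training = df)
prep_pca <- prep(rec_pca, training = df)

# Bake both, keep only the PCA components from branch 2, and bind column-wise
baked_tree <- bake(prep_tree, new_data = df)
baked_pca <- bake(prep_pca, new_data = df, all_predictors())
features <- dplyr::bind_cols(baked_tree, baked_pca)
```

The same two prepped recipes have to be applied to any new data, in the same way, before predicting.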


Thanks for a prompt reply!

There is an issue open to make this happen and it should be fairly simple. Just haven't gotten to it yet. Also, I've been experimenting with letting num_comp = 0 as a way of skipping PCA but that's fairly dangerous (but let's keep that between us :grinning:)

It would probably be a good idea for most multivariate preprocessing steps (and maybe some others) to have an option to retain the original features. That would make it much easier to build more complex recipes :slight_smile:

No, and that's a weakness with recipes. You'd have to have two recipes.

I see. Would you say that's by design, or do you see a possibility of 'stacking' recipes together within recipes at some point?
