I'm faced with the following challenge and wanted to get your opinion: let's say I would like to fit an XGBoost model on a relatively small dataset with quite a bit of variation. I'm preparing a fairly standard recipe for tree-based models, such as:
```r
library(recipes)

rec_base <- df %>%
  recipe(target ~ .) %>%
  # Nominal variables sanity check
  step_other(all_nominal(), -has_role("outcome"),
             other = "infrequent_combined", threshold = 0.025) %>%
  step_novel(all_nominal(), -has_role("outcome"),
             new_level = "unrecorded_observation") %>%
  # Numerical variables preprocessing
  step_medianimpute(all_numeric()) %>%
  # Nominal variables preprocessing
  step_unknown(all_nominal(), -has_role("outcome")) %>%
  step_integer(all_nominal(), -has_role("outcome")) %>%
  # Final checks
  step_nzv(all_predictors()) %>%
  check_missing(all_predictors())
```
What I would like to do on top of that is add another 'branch' within my recipe where I would do something similar, with a few additions:
- Dummy code all nominal variables instead of assigning integers
- Normalize all numerical variables
- Apply PCA on top of all of those features and join the result back to the main recipe
- The final set I would like to use for modelling consists of all the basic features from the code excerpt above plus the PCA components (see the sketch after this list)
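To make the intent concrete, here is a minimal sketch of what that branch would look like if it were written as a standalone recipe. The object name `rec_pca_branch` and `num_comp = 5` are purely illustrative placeholders, not part of my actual pipeline:

```r
# Sketch of the PCA 'branch' as if it were its own recipe -- this is the
# preprocessing I'd like applied to a *copy* of the features, not to the
# plain features the model sees directly.
rec_pca_branch <- df %>%
  recipe(target ~ .) %>%
  step_medianimpute(all_numeric()) %>%
  step_unknown(all_nominal(), -has_role("outcome")) %>%
  # Differences from the base recipe start here:
  step_dummy(all_nominal(), -has_role("outcome"), one_hot = TRUE) %>%  # dummies instead of step_integer()
  step_normalize(all_numeric(), -has_role("outcome")) %>%
  step_pca(all_predictors(), num_comp = 5)  # num_comp is an arbitrary placeholder
```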
The reason for this is that such a framework might increase the stability of the final solution and improve model performance. The problem is that recipes doesn't seem to make such a framework easy right now:
- `step_pca` currently replaces all of its input features. Retaining them could theoretically be an option, but it wouldn't solve the issue entirely, because the dummy coding and normalization would still apply to the features the model should see in their plain, integer-coded form (see the sketch below).
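For what it's worth, newer versions of recipes appear to expose a `keep_original_cols` argument on `step_pca()` (worth checking against your installed version). Even assuming it is available, a single-recipe attempt like the following retains the PCA inputs but still lets the dummy coding and normalization leak into the 'plain' feature set:

```r
# Hypothetical single-recipe attempt, assuming a recipes version where
# step_pca() accepts keep_original_cols. Even then, step_dummy() and
# step_normalize() have already transformed the columns that the base
# recipe was supposed to leave as integer-coded, unnormalized features.
df %>%
  recipe(target ~ .) %>%
  step_dummy(all_nominal(), -has_role("outcome"), one_hot = TRUE) %>%
  step_normalize(all_numeric(), -has_role("outcome")) %>%
  step_pca(all_predictors(), num_comp = 5,
           keep_original_cols = TRUE)  # retains inputs, but they're already dummy-coded/normalized
```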
- Is there any way of creating those branches in the way I described, so that a single recipe could be prepared with one `prep()` call instead of having two completely separate recipes (as sketched at the end for reference)?
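For completeness, here is a minimal sketch of the two-recipe fallback I would like to avoid, reusing the illustrative `rec_base` and `rec_pca_branch` objects from above:

```r
# Two-recipe fallback: prep/bake each recipe separately and bind the results.
# This works, but the preprocessing no longer lives in a single object,
# which complicates resampling/tuning with a single workflow.
baked_base <- rec_base %>% prep(training = df) %>% bake(new_data = NULL)
baked_pca  <- rec_pca_branch %>% prep(training = df) %>% bake(new_data = NULL)

model_data <- dplyr::bind_cols(
  baked_base,
  dplyr::select(baked_pca, dplyr::starts_with("PC"))  # keep only the PCA components
)
```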