I'm faced with the following challenge and wanted to get your opinion: let's say I would like to fit an XGBoost model on a relatively small dataset with quite a bit of variation. I'm preparing a fairly standard recipe for tree-based models, such as:
```r
library(recipes)

rec_base <- df %>%
  recipe(target ~ .) %>%
  # Nominal variables sanity check
  step_other(all_nominal(), -has_role("outcome"),
             other = "infrequent_combined", threshold = 0.025) %>%
  step_novel(all_nominal(), -has_role("outcome"),
             new_level = "unrecorded_observation") %>%
  # Numerical variables preprocessing
  step_medianimpute(all_numeric()) %>%
  # Nominal variables preprocessing
  step_unknown(all_nominal(), -has_role("outcome")) %>%
  step_integer(all_nominal(), -has_role("outcome")) %>%
  # Final checks
  step_nzv(all_predictors()) %>%
  check_missing(all_predictors())
```
What I would like to do on top of that is add another 'branch' within my recipe where I would do something similar, with a few additions:
- Dummy code all nominal variables instead of assigning integers
- Normalize all numerical variables
- Apply PCA on top of all of those features and join the result back to the main recipe
- The final set I would like to use for modelling consists of all the basic features from the code excerpt above plus the PCA components (see the sketch after this list)
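To make the intent concrete, here is a minimal sketch of what that branch would look like if it were written as a standalone recipe. The object name `rec_pca_branch` and `num_comp = 5` are purely illustrative placeholders, not part of my actual pipeline:

```r
# Sketch of the PCA 'branch' as if it were its own recipe -- this is the
# preprocessing I'd like applied to a *copy* of the features, not to the
# plain features the model sees directly.
rec_pca_branch <- df %>%
  recipe(target ~ .) %>%
  step_medianimpute(all_numeric()) %>%
  step_unknown(all_nominal(), -has_role("outcome")) %>%
  # Differences from the base recipe start here:
  step_dummy(all_nominal(), -has_role("outcome"), one_hot = TRUE) %>%  # dummies instead of step_integer()
  step_normalize(all_numeric(), -has_role("outcome")) %>%
  step_pca(all_predictors(), num_comp = 5)  # num_comp is an arbitrary placeholder
```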
The reason for this is that such a framework might increase the stability of the final solution and improve model performance. The problem is that recipes doesn't seem to make such a framework easy right now:
- `step_pca` currently replaces all of its input features. Retaining them could theoretically be an option, but it wouldn't solve the issue entirely, because the dummy coding and normalization would still apply to the features the model should see in their plain, integer-coded form (see the sketch below).
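For what it's worth, newer versions of recipes appear to expose a `keep_original_cols` argument on `step_pca()` (worth checking against your installed version). Even assuming it is available, a single-recipe attempt like the following retains the PCA inputs but still lets the dummy coding and normalization leak into the 'plain' feature set:

```r
# Hypothetical single-recipe attempt, assuming a recipes version where
# step_pca() accepts keep_original_cols. Even then, step_dummy() and
# step_normalize() have already transformed the columns that the base
# recipe was supposed to leave as integer-coded, unnormalized features.
df %>%
  recipe(target ~ .) %>%
  step_dummy(all_nominal(), -has_role("outcome"), one_hot = TRUE) %>%
  step_normalize(all_numeric(), -has_role("outcome")) %>%
  step_pca(all_predictors(), num_comp = 5,
           keep_original_cols = TRUE)  # retains inputs, but they're already dummy-coded/normalized
```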
- Is there any way of creating those branches in the way I described, so that a single recipe could be prepared with one `prep()` call instead of having two completely separate recipes (as sketched at the end for reference)?
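For completeness, here is a minimal sketch of the two-recipe fallback I would like to avoid, reusing the illustrative `rec_base` and `rec_pca_branch` objects from above:

```r
# Two-recipe fallback: prep/bake each recipe separately and bind the results.
# This works, but the preprocessing no longer lives in a single object,
# which complicates resampling/tuning with a single workflow.
baked_base <- rec_base %>% prep(training = df) %>% bake(new_data = NULL)
baked_pca  <- rec_pca_branch %>% prep(training = df) %>% bake(new_data = NULL)

model_data <- dplyr::bind_cols(
  baked_base,
  dplyr::select(baked_pca, dplyr::starts_with("PC"))  # keep only the PCA components
)
```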