I am trying to learn how to use recipes to do an initial set of preprocessing steps, but I'm having a hard time figuring out how to define an 'id' (character) column which should NOT be processed or changed in any way.
Currently, I can prep a recipe using a training dataset, and the id column is changed from character to factor (undesired behavior, but not terrible). However, when I bake new datasets (like a validation or testing dataset), the ids all get converted to NA.
Sorry for not having a reproducible example, but here's the relevant code.
Is there a way to just have id be completely unaffected by the recipe, so that it is not changed from character to factor and is not touched when new datasets are baked?
    library(recipes)  # provides recipe(), update_role(), step_*(), prep(), bake() and %>%

    rec_obj <- recipe(x = df_train) %>%
      update_role(next_result, new_role = 'outcome') %>%  # set the outcome variable
      update_role(id, new_role = "id variable") %>%  # id is NOT a predictor, and should NOT be touched
      update_role(time_step, new_role = "timestep variable") %>%  # time_step is NOT a predictor, and should NOT be touched
      update_role(-next_result, -id, -time_step, new_role = 'predictor') %>%  # everything else is a predictor
      step_dummy(enchosp) %>%  # this predictor is a factor and should be encoded with dummy variables
      step_center(pred1, pred2, pred3, pred4, pred5) %>%  # center + scale the numeric predictors
      step_scale(pred1, pred2, pred3, pred4, pred5) %>%
      step_medianimpute(all_numeric())  # median impute missing numbers

    rec_trained <- prep(rec_obj, training = df_train)

    train_data    <- bake(rec_trained, new_data = df_train)
    validate_data <- bake(rec_trained, new_data = df_validate)
    test_data     <- bake(rec_trained, new_data = df_test)
Incidentally, the reason I want the ids to remain after preprocessing is that I subsequently need to do some heavy processing on the datasets, generating padded and windowed time-series data from each id and its own time steps. Only AFTER that time-series processing has occurred will I remove the id, before feeding the data into an LSTM neural network for modeling / testing.
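For illustration only, the per-id padding and windowing described above could look roughly like the sketch below. It uses base R only. The toy data frame, the helper name make_windows, the window width of 3 and the zero padding are all assumptions made up for the example, not part of the real pipeline. The point is that the split relies on id still being an intact character column after baking.

```r
# Hypothetical toy data: two ids with different numbers of time steps.
toy <- data.frame(
  id        = c("a", "a", "a", "b", "b"),
  time_step = c(1, 2, 3, 1, 2),
  pred1     = c(0.1, 0.2, 0.3, 1.1, 1.2),
  stringsAsFactors = FALSE
)

# Hypothetical helper: build fixed-length windows for one id's series,
# left-padding with `pad` so every time step gets exactly `width` values.
make_windows <- function(x, width, pad = 0) {
  padded <- c(rep(pad, width - 1), x)
  # embed() returns each window in reverse time order, so flip the columns.
  embed(padded, width)[, width:1, drop = FALSE]
}

# Split by id, order each group by time_step, then window one predictor.
windows_by_id <- lapply(split(toy, toy$id), function(d) {
  d <- d[order(d$time_step), ]
  make_windows(d$pred1, width = 3)
})
```

Each element of windows_by_id is a matrix with one row per time step of that id and one column per position in the window. If the ids come back as NA from bake, the split has nothing to group on and this step cannot work.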