I am trying to learn how to use recipes to do an initial set of preprocessing steps, but I'm having a hard time figuring out how to define an 'id' (character) column that should NOT be processed or changed in any way.
Currently, I can prep a recipe using a training dataset, and the id column is changed from character to factor (undesired behavior, but not terrible). However, when I bake new datasets (like a validation or testing dataset), the id values all get converted to NA.
Is there a way to have id be completely unaffected by the recipe, so that it is not changed from character to factor and is not touched when new datasets are baked?
Sorry for not having a reproducible example, but here's the relevant code:
    library(recipes)  # provides recipe(), prep(), bake(), and the step_* functions

    rec_obj <- recipe(x = df_train) %>%
      update_role(next_result, new_role = 'outcome') %>%  # set the outcome variable
      update_role(id, new_role = "id variable") %>%  # id is NOT a predictor, and should NOT be touched
      update_role(time_step, new_role = "timestep variable") %>%  # time_step is NOT a predictor, and should NOT be touched
      update_role(-next_result, -id, -time_step, new_role = 'predictor') %>%  # everything else is a predictor
      step_dummy(enchosp) %>%  # this predictor is a factor and should be encoded with dummy variables
      step_center(pred1, pred2, pred3, pred4, pred5) %>%  # center + scale the numeric predictors
      step_scale(pred1, pred2, pred3, pred4, pred5) %>%
      step_medianimpute(all_numeric())  # median impute missing numbers

    # estimate the preprocessing parameters from the training data only
    rec_trained <- prep(rec_obj, training = df_train)

    # apply the trained recipe to each dataset
    train_data    <- bake(rec_trained, new_data = df_train)
    validate_data <- bake(rec_trained, new_data = df_validate)
    test_data     <- bake(rec_trained, new_data = df_test)
Incidentally, the reason I want the ids to remain after preprocessing is that I subsequently need to do some heavy processing on the datasets, generating padded and windowed time-series data for each id and its own time steps. Only AFTER that time-series processing has occurred will I remove the id, before feeding the data into an LSTM neural network for modeling / testing.