I'm not doing formal machine learning, just something more descriptive, but I'm having trouble scaling up `step_impute_knn()`.

For now, I am using:
```r
library(recipes)

f_wide_form_knn_imputation <-
  function(wide_form,
           nthread = parallelly::availableCores(omit = 1),
           ...) {
    impute_rec_bps <-
      recipe(
        x = wide_form,
        ...
      ) %>%
      step_impute_knn(
        all_predictors(),
        options = list(nthread = nthread)
      )
    wide_form_imputed <-
      prep(impute_rec_bps) %>% juice()
    return(wide_form_imputed)
  }
```
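For concreteness, here is a sketch of the train-once / apply-many pattern I think I need. The data frames `train_rows` and `big_data` are toy stand-ins for my real data, and `neighbors`/`nthread` values are just placeholders:

```r
library(recipes)

# Toy stand-ins for the real training subset and the larger data set
train_rows <- data.frame(x = c(1, 2, 3, NA, 5), y = c(2, NA, 4, 5, 6))
big_data   <- data.frame(x = c(NA, 7), y = c(8, NA))

# Train the imputation once, on the smaller data
trained_rec <-
  recipe(~ ., data = train_rows) %>%
  step_impute_knn(all_predictors(), neighbors = 3,
                  options = list(nthread = 2)) %>%
  prep(training = train_rows)

# Then apply the already-trained recipe to new rows
imputed_big <- bake(trained_rec, new_data = big_data)
```

If I understand the design, the expensive part should happen once in `prep()`, and `bake()` should then impute any new data using the stored training set.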
And given my data set (I have to fill in NAs across a couple of hundred columns), the imputation takes about 8 hours on our hardware. We're about to scale the data set up roughly 30x, and I can't figure out how to first train `step_impute_knn()` and then apply it to the larger data set once trained. There is something I'm just not understanding in the recipes documentation.
I have tried `prep() %>% bake(new_data = NULL)` and `prep() %>% bake(new_data = head(wide_form))`, but the NAs are not filled in the way they are with `prep() %>% juice()`.
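A minimal toy version of what I'm comparing (the data here is made up, and `neighbors = 2` just keeps the example small):

```r
library(recipes)

df <- data.frame(a = c(1, 2, NA, 4, 5),
                 b = c(2, NA, 3, 4, 5))

prepped <-
  recipe(~ ., data = df) %>%
  step_impute_knn(all_predictors(), neighbors = 2) %>%
  prep()

juice(prepped)                # NAs filled in, as expected
bake(prepped, new_data = df)  # on my real data, NAs remain here
```

My expectation was that both calls would return imputed data, so either my real pipeline differs from this sketch in some way I'm not seeing, or I'm misreading what `bake()` is supposed to do.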