step_impute_knn() - how to train and then apply for scaling up?

Shorthand · May 10, 2021, 5:44am

I'm not doing formal machine learning, just something more descriptive, but I am having issues scaling up step_impute_knn().

For now, I am using:

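library(recipes)

# Impute missing values in a wide-form data frame with k-nearest neighbors.
f_wide_form_knn_imputation <-
    function(wide_form,
             nthread = parallelly::availableCores(omit = 1),
             ...) {

        # Extra arguments in `...` are forwarded to recipe()
        impute_rec_bps <-
            recipe(
                x = wide_form,
                ...
            ) %>%
            step_impute_knn(
                all_predictors(),
                options = list(nthread = nthread)
            )

        # Estimate the step on wide_form and return that same data, imputed
        wide_form_imputed <-
            prep(impute_rec_bps) %>% juice()

        return(wide_form_imputed)
    }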

Given my data set (I have to fill in NAs in a couple of hundred columns), the imputation takes about 8 hours on our hardware. We're about to scale up our data set about 30x, and I can't figure out how to first train step_impute_knn() and then apply it to the larger data set once trained. There is something I'm just not understanding in the recipes documentation.

I have tried prep() %>% bake(new_data = NULL) and prep() %>% bake(new_data = head(wide_form)), and the NAs are not filled in like they are with prep() %>% juice().
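
For reference, the usual train-then-apply pattern in recipes is to prep() the recipe on one data set and then bake() it on another. The following is a minimal sketch on simulated data, not the poster's actual setup: the data frame `full`, its column names, and the 100-row training subset are hypothetical stand-ins for wide_form.

```r
library(recipes)

set.seed(1)

# Hypothetical stand-in for wide_form: numeric columns with scattered NAs
n <- 500
full <- data.frame(a = rnorm(n), b = rnorm(n), c = rnorm(n))
full$b[sample(n, 50)] <- NA
full$c[sample(n, 50)] <- NA

# Assumption: a random subset is representative enough to train on
train <- full[sample(n, 100), ]

# The formula interface gives every column the "predictor" role
impute_rec <- recipe(~ ., data = train) %>%
  step_impute_knn(all_predictors(), neighbors = 5)

# prep() estimates the step from the training subset only
trained_rec <- prep(impute_rec, training = train)

# bake() applies the trained step to any data with the same columns
full_imputed <- bake(trained_rec, new_data = full)

anyNA(full_imputed)
```

In this sketch the trained step looks for neighbors among the training rows, so each baked row is compared against the 100-row subset rather than against all of `full`.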

Max · May 10, 2021, 4:38pm

How big are your data (in terms of rows)? At some point searching for nearest neighbors is not efficient.

Shorthand · May 10, 2021, 10:44pm

We're scaling from about 8k rows to about 300k. I'm going to try step_impute_bag() and see how it does ... but all of the other imputations cause issues ... step_impute_linear() doesn't work due to sparseness, and the others wreak havoc with the distributions.

Max · May 10, 2021, 10:57pm

I think that you have reached "at some point". Depending on how many variables are being imputed, step_impute_bag() will also take a while.
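
A rough back-of-the-envelope estimate shows why. This sketch assumes a brute-force neighbor search, so the cost grows with the number of rows to impute times the number of reference rows, and that the share of rows with missing values stays the same. Both are assumptions, not measurements.

```r
rows_now  <- 8e3     # current data set, from the thread
rows_new  <- 300e3   # planned data set, from the thread
hours_now <- 8       # reported run time at the current size

row_ratio  <- rows_new / rows_now   # growth in rows
pair_ratio <- row_ratio^2           # growth in row-to-row comparisons

# Every row searched against the full new data set
hours_all_pairs <- hours_now * pair_ratio

# New rows searched against a reference set kept at the current size
hours_fixed_ref <- hours_now * row_ratio

c(all_pairs = hours_all_pairs, fixed_ref = hours_fixed_ref)
```

Under these assumptions, even the cheaper case grows linearly with the number of rows being imputed.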

Shorthand · May 10, 2021, 11:16pm

I am starting to realize that ... if you have any ideas or directions to punt me in, that would be great.

Max · May 10, 2021, 11:39pm

With that much data, the model variance of an individual unpruned tree should be pretty low. Maybe try using 10ish trees when bagging.
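
As a sketch of that suggestion, the number of bagged trees can be set through the `trees` argument of step_impute_bag(), here lowered from the default of 25 to 10. The data below are simulated and the column names are hypothetical.

```r
library(recipes)

set.seed(2)

# Hypothetical data: c is correlated with a, with NAs scattered in b and c
n <- 300
dat <- data.frame(a = rnorm(n), b = rnorm(n))
dat$c <- dat$a + rnorm(n, sd = 0.1)
dat$b[sample(n, 30)] <- NA
dat$c[sample(n, 30)] <- NA

# Fewer trees than the default, trading some stability for speed
bag_rec <- recipe(~ ., data = dat) %>%
  step_impute_bag(all_predictors(), trees = 10)

bag_imputed <- prep(bag_rec, training = dat) %>%
  bake(new_data = NULL)

anyNA(bag_imputed)
```

One bagged model is fit per imputed column, so the run time still depends on how many columns contain missing values.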
