Why is my tune_grid so slow compared to a direct use of rpart?

Fabrice · November 21, 2021, 9:31am

Hi,
I'm trying to understand what I'm doing wrong in a simple v-fold cross validation of a tree model with tune_grid. My problem is the very long running time of the grid search compared to what I can get with a direct call to rpart (50 seconds compared to 1 second, roughly).

Here is a simple example:

library(tidymodels)
library(mlbench)
data(PimaIndiansDiabetes)

my_grid <- expand.grid(min_n=2:50)
cv_folds <- vfold_cv(PimaIndiansDiabetes, v = 5, strata="diabetes")
my_model <- decision_tree(cost_complexity=0, min_n=tune()) %>% 
    set_engine("rpart",xval=0) %>% set_mode("classification")

## 51 seconds on my hardware
tune_results <-  my_model %>% tune_grid(diabetes~.,
                                        resamples=cv_folds,
                                        grid=my_grid,
                                        metrics=metric_set(accuracy))

library(rpart)
## 1 second on my hardware
accuracies <- matrix(NA,ncol=length(my_grid$min_n),nrow=5)
for(k in 1:5) {
    train <-  analysis(cv_folds$splits[[k]])
    test <- assessment(cv_folds$splits[[k]])
    for(mi in seq_along(my_grid$min_n)) {
        dt <- rpart(diabetes~.,data=train,control=rpart.control(xval=0,minsplit=my_grid$min_n[mi],cp=0))
        pred <- predict(dt, test, type="class")
        accuracies[k,mi] <- accuracy_vec(test %>% pull(diabetes),pred)
    }
}

I know, of course, that my direct call to rpart does far less than tune_grid, but the difference in run time is very large, and since the results are the same, I have the impression that I am missing something. I'm new to tidymodels, so it could be something obvious.

I've tested with and without setting xval to 0 in set_engine("rpart",xval=0) in the tidymodels part (and similarly in the direct call). In both cases, leaving xval at its default increases the total running time by roughly 2 seconds on my computer. This suggests to me that tune_grid is spending most of its time doing something other than fitting the model, which again tends to point to a mistake on my part.

Thanks for any advice!

Fabrice · November 21, 2021, 1:55pm

After some investigation with the profiler, it turns out that tune_grid spends a lot of time in the garbage collector. I have tracked this down to the non-exported form_form function in parsnip (and to its sister function xy_xy). Both functions call system.time to record the elapsed time. As it is called with its default parameter gcFirst=TRUE, each call triggers a gc. Setting gcFirst to FALSE reduces the grid search time from 51 seconds to 18 seconds on this example! I'm going to file a bug report for parsnip.
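
For readers who want to see the effect in isolation, here is a minimal base R sketch of the overhead of gcFirst=TRUE. It does not touch parsnip at all: the trivial expression sqrt(2) and the repetition count n_calls are assumptions chosen purely for illustration, and the actual timings will depend on the hardware and on how much memory the session holds.

```r
# Number of repeated timing calls; an arbitrary value for this illustration.
n_calls <- 50

# Each inner system.time() call runs a garbage collection first (the default).
with_gc <- system.time(
  for (i in seq_len(n_calls)) system.time(sqrt(2), gcFirst = TRUE)
)

# Same loop, but the inner calls skip the garbage collection.
without_gc <- system.time(
  for (i in seq_len(n_calls)) system.time(sqrt(2), gcFirst = FALSE)
)

# Compare the elapsed times of the two loops.
print(c(with_gc = with_gc[["elapsed"]], without_gc = without_gc[["elapsed"]]))
```

When a timed expression is very cheap, as sqrt(2) is here, the forced collection should account for most of the elapsed time of the first loop, which is in line with the profiler observation above.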

According to the profile output, there is still a lot of overhead in some data frame manipulations, but I guess this is the price to pay for all the nice things that tidymodels brings. That said, I'm still not sure I'm using the framework correctly, and I'd be happy to receive feedback about my demo code.
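
For anyone wanting to repeat this kind of investigation, here is a rough sketch of profiling with base R's Rprof, with garbage collection recorded separately. The function toy_workload is a hypothetical stand-in for the tune_grid call (it only does repeated small data frame manipulations), and the 0.01 second sampling interval is an arbitrary choice.

```r
# Hypothetical workload standing in for the tune_grid() call:
# many small data frame constructions and group-wise summaries.
toy_workload <- function(n_iter = 3000) {
  out <- numeric(n_iter)
  for (i in seq_len(n_iter)) {
    df <- data.frame(x = runif(200),
                     g = sample(letters[1:4], 200, replace = TRUE))
    out[i] <- mean(tapply(df$x, df$g, mean))
  }
  out
}

prof_file <- tempfile(fileext = ".out")

# Start the sampling profiler; gc.profiling = TRUE also records time spent in gc.
Rprof(prof_file, interval = 0.01, gc.profiling = TRUE)
res <- toy_workload()
Rprof(NULL)  # stop profiling

# Summarise the samples; by.self ranks functions by time spent in their own code.
prof <- summaryRprof(prof_file)
print(head(prof$by.self))
```

With gc.profiling enabled, any time spent in the garbage collector should show up as its own entry in the summary, which makes it easier to tell gc time apart from time spent in data frame code.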
