How to make my hyperparameter tuning faster

My tuning runs for so long that it never finishes. I have tried both tune_bayes() and tune_grid().

Here is my script:

library(tidyverse)
library(scales)
library(skimr)
library(tidymodels)
library(bonsai)
library(lightgbm)
library(janitor) 
library(doParallel)
options(width = 120)

# read csv data with readr
train_df <- read_csv("raw_data/train.csv") |> 
  clean_names()
test_df <- read_csv("raw_data/test.csv") |> 
  clean_names()

# skimr
skim(train_df)

skim(train_df) |> 
  as_tibble() |> 
  arrange(complete_rate) |>
  select(skim_variable, complete_rate)

rm_cols_missing <- skim(train_df) |> 
  as_tibble() |> 
  arrange(complete_rate) |>
  select(skim_variable, complete_rate) |>
  filter(complete_rate < 0.5) |>
  pull(skim_variable) 

rm_n_unique <- skim(train_df) |> 
  as_tibble() |> 
  arrange(character.n_unique) |>
  select(skim_variable, character.n_unique) |>
  filter(character.n_unique < 3) |>
  pull(skim_variable)

# ecdf of the target variable vs the normal cdf

# transform target to log
rm_n_unique <- purrr::discard(rm_n_unique, .p = ~stringr::str_detect(.x,"alley"))
target_recipe <- recipe(sale_price ~ ., data = train_df) %>%
  step_rm(id) %>% 
  step_rm(all_of(rm_cols_missing)) %>%
  step_rm(all_of(rm_n_unique)) %>% # maybe don't remove central_air
  step_log(all_numeric(), offset = 1) %>% # log + 1
  step_normalize(all_numeric(),-all_outcomes())  %>%
  step_other(all_nominal(), -all_outcomes(), threshold = 0.03) %>% # rare levels to other
  step_novel(all_predictors(), -all_numeric()) %>% # assign a previously unseen factor level to a new value
  step_impute_knn(all_predictors()) %>% # use knn to impute missing values
  step_dummy(all_nominal(), -all_outcomes())  # make dummies for categorical variables

prep(target_recipe, training = train_df) %>% 
  juice() %>% 
  glimpse()

# lgbm params
model_lgbm <- boost_tree(
  trees = tune(), learn_rate = tune(),
  tree_depth = tune(), min_n = tune(),
  loss_reduction = tune(), 
  sample_size = tune(), mtry = tune()
  ) %>% 
  set_mode("regression") %>% 
  set_engine("lightgbm", nthread = 10)

SalePrice_workflow <- workflow() %>% add_recipe(target_recipe)
SalePrice_xgb_workflow <- SalePrice_workflow %>% add_model(model_lgbm)

# hyperparameters
hyperparams_lgbm <- parameters(
  trees(), learn_rate(),
  tree_depth(), min_n(), 
  loss_reduction(),
  sample_size = sample_prop(), finalize(mtry(), train_df)  
)
hyperparams_lgbm <- hyperparams_lgbm %>% update(trees = trees(c(100, 500))) # keep the same name so tune_bayes() below actually uses the narrower range

set.seed(321)
folds_sale_price <- vfold_cv(train_df, v = 5, strata = sale_price)

# increment workflow
workflow_SalePrice_xgb_model <- 
  workflow() %>% 
  add_model(model_lgbm) %>% 
  add_recipe(target_recipe)

set.seed(42)
doParallel::registerDoParallel(10)
lgbm_tune <-
  workflow_SalePrice_xgb_model %>%
  tune_bayes(
    resamples = folds_sale_price,
    param_info = hyperparams_lgbm,
    initial = 10,
    iter = 30, 
    metrics = metric_set(rmse, mape),
    control = control_bayes(no_improve = 5, 
                            save_pred = T, verbose = T)
  )
doParallel::stopImplicitCluster()
show_notes(lgbm_tune)
 
SalePrice_best_model <- select_best(lgbm_tune, metric = "rmse")
print(SalePrice_best_model)

SalePrice_final_model <- finalize_model(model_lgbm, SalePrice_best_model)
SalePrice_workflow    <- workflow_SalePrice_xgb_model %>% update_model(SalePrice_final_model)
SalePrice_xgb_fit     <- fit(SalePrice_workflow, data = train_df)

pred <- predict(SalePrice_xgb_fit, test_df)

readr::write_csv(pred, "pred.csv")

My data come from the Kaggle House Prices: Advanced Regression Techniques competition.

A few things might be slowing it down.

First, do all of the variables actually require imputation? You are embedding a lot of sub-models in the workflow, and while KNN imputation is fast, it is still extra work to do on every resample.
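For example, here is an untested sketch of your recipe (reusing your train_df, rm_cols_missing, and rm_n_unique objects) that swaps the KNN step for cheaper per-column imputation applied to the predictors only:

target_recipe_cheap <- recipe(sale_price ~ ., data = train_df) %>%
  step_rm(id, all_of(rm_cols_missing), all_of(rm_n_unique)) %>%
  step_impute_median(all_numeric_predictors()) %>%  # fill numeric gaps with the median
  step_impute_mode(all_nominal_predictors()) %>%    # fill categorical gaps with the most common level
  step_log(all_numeric(), offset = 1) %>%
  step_normalize(all_numeric(), -all_outcomes()) %>%
  step_other(all_nominal(), -all_outcomes(), threshold = 0.03) %>%
  step_novel(all_predictors(), -all_numeric()) %>%
  step_dummy(all_nominal(), -all_outcomes())

Whether median/mode imputation is good enough for this data is a modeling question, but it removes the KNN sub-model from every resample.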

Second, you are using nested parallelism. I would parallelize either the resampling/tuning or the model fit, not both. It should most likely be the tuning loop (via doParallel). See this post.

Right now you are asking each of the 10 parallel workers to spin up 10 more lightgbm threads of its own, so you are heavily oversubscribing the CPU. That's a problem, especially on Windows.
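Untested, but with your objects the single-layer version would look something like this (lightgbm kept single-threaded, doParallel owning the parallelism):

model_lgbm_1thread <- model_lgbm %>% set_engine("lightgbm", nthread = 1)
wf_1thread <- workflow_SalePrice_xgb_model %>% update_model(model_lgbm_1thread)

cl <- parallel::makePSOCKcluster(5)  # fewer workers than cores, to leave CPU/RAM headroom
doParallel::registerDoParallel(cl)

lgbm_tune <- tune_bayes(
  wf_1thread,
  resamples = folds_sale_price,
  param_info = hyperparams_lgbm,
  initial = 10, iter = 30,
  metrics = metric_set(rmse, mape),
  control = control_bayes(no_improve = 5, save_pred = TRUE, verbose = TRUE)
)

parallel::stopCluster(cl)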

Also, we don't know much about your machine. If the data take X GB in memory, using 10 parallel workers means that you can end up with 10 + 1 copies of the data in memory (roughly 11X GB in total). Is the machine okay with that?
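You can check the size of one in-memory copy with base R, e.g.:

format(object.size(train_df), units = "MB")  # size of one copy of the training data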


Thanks.

Here is my PC info.

user@debian
-----------
OS: Debian GNU/Linux 11 (bullseye) x86_64
Host: B450 I AORUS PRO WIFI
Kernel: 5.10.0-21-amd64
Uptime: 7 days, 5 hours, 21 mins
Packages: 2894 (dpkg), 10 (flatpak)
Shell: bash 5.1.4
Resolution: 2560x1440
DE: GNOME 3.38.6
WM: Mutter
WM Theme: Adwaita
Theme: Everforest-Dark-BL [GTK2/3]
Icons: Adwaita [GTK2/3]
Terminal: gnome-terminal
CPU: AMD Ryzen 5 1600 (12) @ 3.200GHz
GPU: NVIDIA GeForce GTX 1070
Memory: 4426MiB / 16018MiB
  1. I don't think the problem is the KNN imputation, because I juiced the recipe earlier in the script and it did not hang, but maybe that does not capture all of the operations in the workflow (see the timing sketch after this list).

  2. I am guessing you are correct about the nested parallelism. I changed the following line, but the program still hung for many hours and I had to terminate it.

set_engine("lightgbm", nthread = 4) # will change to nthread = 1
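To see where the time actually goes, I will also time a single fit outside of tune_bayes(), with arbitrary fixed values plugged in for the tuned parameters (rough sketch, not tuned values):

single_spec <- boost_tree(
  trees = 300, learn_rate = 0.05, tree_depth = 6, min_n = 10,
  loss_reduction = 0, sample_size = 0.8, mtry = 20
) %>%
  set_mode("regression") %>%
  set_engine("lightgbm", nthread = 1)

# how long does one lightgbm fit of the full workflow (recipe + model) take?
system.time(
  fit(workflow_SalePrice_xgb_model %>% update_model(single_spec), data = train_df)
)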

I will read through your linked post to see what else I can do.
