Tuning mtry with dummy variables

Hi, when tuning a ranger model with tidymodels, mtry() requires the maximum number of columns to be known, as described here https://forum.posit.co/t/problem-with-dial/108934.

Some examples remove the outcome column (the "classifier", here Sale_Price) before finalizing (see https://www.tmwr.org/tuning.html#tuning-params-tidymodels), while others (see https://forum.posit.co/t/problem-with-dial/108934) do not. Whether that column is removed determines the upper limit for mtry. 1. Is there a right or wrong way to do this, please?
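To make the difference concrete, here is a minimal sketch on a hypothetical toy data frame (the names `toy`, `y`, `a`, `b` and `c` are made up for illustration). As far as I can tell from the dials documentation, the `x` argument of finalize() is described as the predictor data, which would suggest dropping the outcome, but I would appreciate confirmation.

```r
library(dials)

# Hypothetical toy data: one outcome (y) and three predictors (a, b, c)
toy <- data.frame(y = rnorm(10), a = rnorm(10), b = rnorm(10), c = rnorm(10))

# Finalizing against the full data frame counts the outcome as a column
with_outcome <- finalize(mtry(), toy)

# Finalizing against the predictors only
predictors_only <- toy[, setdiff(names(toy), "y")]
without_outcome <- finalize(mtry(), predictors_only)

range_get(with_outcome)$upper     # number of columns including y
range_get(without_outcome)$upper  # number of predictor columns
```

The two upper limits differ by exactly one, which is the same gap as between options (i) and (ii) in the code further down.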

In addition, I would like to know why the mtry() upper limit is not affected by step_dummy() in the recipe. Converting nominal predictors to dummy variables increases the number of columns, yet the upper limit for mtry does not change. In contrast, if the variables are transformed before the recipe, the upper limit for mtry does increase. If the data contained only a few factors with many levels, capping mtry at the number of original columns would mean that only a small fraction of the dummy columns could be considered at each split. 2. What is the correct approach, please?

This code is adapted from https://www.tmwr.org/. It shows four options and how the upper limit for mtry changes with each: (i) including the outcome column (Sale_Price), (ii) removing it, (iii) converting to dummy variables via step_dummy(), and (iv) converting to dummy variables before the recipe.

# load the data and packages
library(tidymodels)
data(ames, package = "modeldata")
ames <- ames %>% mutate(Sale_Price = log10(Sale_Price)) %>% select(c(Sale_Price, c(1,2,17)))

# split the data
ames_split <- initial_split(ames, prop = 0.80, strata = Sale_Price)
ames_train <- training(ames_split)
ames_test  <-  testing(ames_split)

# specify the tuning parameters
rf_spec <- 
  rand_forest(mtry = tune()) %>% 
  set_engine("ranger", respect.unordered.factors = TRUE, regularization.factor = tune("regularization")) %>%
  set_mode("regression")

rf_param <- extract_parameter_set_dials(rf_spec)
rf_param  # mtry has no upper bound yet

# original recipe
rf_rec_original <- 
  recipe(Sale_Price ~ ., data = ames_train)

# with the outcome column (Sale_Price) included, the upper limit is 4
updated_param <- 
  workflow() %>% 
  add_model(rf_spec) %>% 
  add_recipe(rf_rec_original) %>% 
  extract_parameter_set_dials() %>% 
  finalize(ames_train)
updated_param %>% extract_parameter_dials("mtry")

# with the outcome column (Sale_Price) removed, the upper limit is 3
updated_param2 <- 
  workflow() %>% 
  add_model(rf_spec) %>% 
  add_recipe(rf_rec_original) %>% 
  extract_parameter_set_dials() %>% 
  finalize(ames_train %>% select(-Sale_Price))
updated_param2 %>% extract_parameter_dials("mtry")

# recipe with dummy transformation
rf_rec_dummy <- 
  recipe(Sale_Price ~ ., data = ames_train) %>% 
  step_dummy(all_nominal_predictors(), keep_original_cols = TRUE) %>% 
  step_zv(all_predictors()) 

# with the outcome column (Sale_Price) removed and step_dummy() in the recipe, the upper limit is still 3
updated_param3 <- 
  workflow() %>% 
  add_model(rf_spec) %>% 
  add_recipe(rf_rec_dummy) %>% 
  extract_parameter_set_dials() %>% 
  finalize(ames_train %>% select(-Sale_Price))
updated_param3 %>% extract_parameter_dials("mtry")

# pre-transforming to dummy variables (requires the varhandle package)
varnames <- ames %>% select(where(is.factor)) %>% colnames()
ames_dummy <- ames %>% select(-any_of(varnames)) %>% bind_cols(
  map_dfc(varnames, function(x) {
    ames %>% select(all_of(x)) %>% varhandle::to.dummy(x) %>% as.data.frame()}) %>%
    mutate(across(everything(), factor)))
ames_dummy_split <- initial_split(ames_dummy, prop = 0.80, strata = Sale_Price)
ames_dummy_train <- training(ames_dummy_split)
ames_dummy_test  <-  testing(ames_dummy_split)

# recipe for the pre-transformed data (this overwrites the earlier rf_rec_dummy)
rf_rec_dummy <- 
  recipe(Sale_Price ~ ., data = ames_dummy_train)

# with the pre-transformed variables and the outcome column (Sale_Price) removed, the upper limit is now 33
updated_param4 <- 
  workflow() %>% 
  add_model(rf_spec) %>% 
  add_recipe(rf_rec_dummy) %>% 
  extract_parameter_set_dials() %>% 
  finalize(ames_dummy_train %>% select(-Sale_Price))
updated_param4 %>% extract_parameter_dials("mtry")
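My understanding, which may be wrong, is that finalize() only looks at the data frame it is given and does not run the recipe, which would explain why step_dummy() has no effect on the limit. Under that assumption, one possible workaround is to prep and bake the recipe first and finalize mtry() against the baked predictors. The sketch below uses a hypothetical toy data frame (`toy`, `y`, `x1`, `f1` and `f2` are made-up names) rather than the ames data.

```r
library(tidymodels)

# Hypothetical toy data: one numeric predictor and two factor predictors
set.seed(1)
toy <- data.frame(
  y  = rnorm(12),
  x1 = rnorm(12),
  f1 = factor(rep(c("a", "b", "c"), times = 4)),
  f2 = factor(rep(c("u", "v", "w", "z"), each = 3))
)

rec <- recipe(y ~ ., data = toy) %>%
  step_dummy(all_nominal_predictors())

# Apply the recipe to its own training data and keep only the predictors
baked_predictors <- rec %>%
  prep() %>%
  bake(new_data = NULL, all_predictors())

# Upper limit from the raw predictors versus the baked (dummy) predictors
raw_limit   <- finalize(mtry(), toy %>% select(-y))
baked_limit <- finalize(mtry(), baked_predictors)

range_get(raw_limit)$upper    # counts the original predictor columns
range_get(baked_limit)$upper  # counts the columns after step_dummy()
```

If this is a sensible approach, the finalized object could presumably replace the default in a parameter set with something like `update(rf_param, mtry = baked_limit)`, but I have not checked whether that is the recommended route.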

Many thanks in advance :slight_smile:
