resampling options in trainControl() and safsControl() with safs()

brg · March 23, 2023, 7:18pm

I am using caret::safs() for supervised feature selection and am trying to better understand how to set the resampling scheme using trainControl() and safsControl(). Both seem to have options to set the resampling method, number, and repeats. I've been reading through the docs and what examples I can find, and I'm not totally clear on whether I need to set the resampling scheme in both or just one.

The caret package book even notes the options are similar between the 2 functions:

Some important options to safsControl are:

method, number, repeats, index, indexOut, etc: options similar to those for train to control resampling.

My questions boil down to the following:

If resampling should be set in both, why?
If resampling only needs to be defined in one of them, which one?

Here's a representative example (not runnable as-is) of the code I'm using to run safs():

#set resampling scheme in trainControl
train_ctrl <- trainControl(method = "repeatedcv", 
                           number = 10, 
                           repeats = 3,
                           classProbs = TRUE,
                           summaryFunction = twoClassSummary,
                           savePredictions = "final",
                           allowParallel = FALSE  #FALSE here but TRUE below so as to not square number of workers
                          )

caretSA$fitness_extern <-  twoClassSummary

# also set it in safsControl - is this needed?
safs_ctrl <- safsControl(functions = caretSA,
                         method = "repeatedcv",
                         number = 10,
                         repeats = 3,
                         metric = c(internal = "ROC", external = "ROC"),
                         maximize = c(internal = TRUE, external = TRUE),
                         allowParallel = TRUE,
                         verbose = TRUE)

sa_results <- safs(my_recipe, 
                   data = training_data,
                   iters = 10, 
                   method = "glm", 

                  # are both of these needed???
                   trControl = train_ctrl,
                   safsControl = safs_ctrl)

Max · March 23, 2023, 8:03pm

I think that, for this analysis, you don't need the extra layer of resampling (for the GLM fit) since you are not tuning it. AFAICR, you can use method = "none" in trainControl().

If you were tuning, then you would need to specify something other than "none".

brg · March 23, 2023, 9:58pm

Thanks @Max. I tried removing it from trainControl() but am getting the error task 1 failed - "replacement has 1 row, data has 0". See the working reprex below, which compares using "cv" vs. "none" in trainControl().

If using trainControl(method = "cv", ... ) is doing a lot of unnecessary work in this case and slowing things down, it would be nice to be able to remove this step. But if it's innocuous, I'm just as happy to leave it in there.

library(tidymodels)
library(caret)

data("credit_data")

set.seed(1)
split <- initial_split(credit_data, strata = Status)
train_set <- training(split)
test_set <- testing(split)

rec <- 
  recipe(Status ~ ., data = train_set) %>% 
  step_nzv(all_numeric_predictors()) %>%
  step_unknown(all_nominal_predictors()) %>%
  step_dummy(all_nominal_predictors())

bal_accuracy <- 
  function(data, lev = NULL, model = NULL) {
    
    sens <- caret::sensitivity(data$pred, data$obs, positive = "bad")
    spec <- caret::specificity(data$pred, data$obs, positive = "bad")
    
    bal_accuracy <- (sens + spec) / 2
    
    c(bal_accuracy = bal_accuracy)
  }

train_ctrl_resampling <-
  trainControl(method = "cv",
               number = 5,
               classProbs = FALSE,
               summaryFunction =  bal_accuracy,
               savePredictions = "final",
               allowParallel = FALSE 
  )

train_ctrl_no_resampling <-
  trainControl(method = "none",
               classProbs = FALSE,
               summaryFunction =  bal_accuracy, #previously: twoClassSummary,
               savePredictions = "final",
               allowParallel = FALSE 
  )


caretSA$fitness_extern <- bal_accuracy #previously: twoClassSummary


safs_ctrl <-
  safsControl(functions = caretSA,
              method = "cv",
              number = 5,
              metric = c(internal = "bal_accuracy", external = "bal_accuracy"),
              maximize = c(internal = TRUE, external = TRUE),
              #improve = 5, #how many to try after new best is found
              allowParallel = FALSE,
              verbose = FALSE)


sa_results_training_resamples <- safs(rec, 
                                      data = train_set,
                                      iters = 5,
                                      method = "glm", 
                                      trControl = train_ctrl_resampling,
                                      safsControl = safs_ctrl)


sa_results_training_resamples
#> 
#> Simulated Annealing Feature Selection
#> 
#> 3340 samples
#> 25 predictors
#> 2 classes: 'bad', 'good' 
#> 
#> Maximum search iterations: 5 
#> 
#> Internal performance value: bal_accuracy
#> Subset selection driven to maximize internal bal_accuracy 
#> 
#> External performance value: bal_accuracy
#> Best iteration chose by maximizing external bal_accuracy 
#> External resampling method: Cross-Validated (5 fold) 
#> 
#> During resampling:
#>   * the top 5 selected variables (out of a possible 25):
#>     Expenses (80%), Job_others (60%), Price (60%), Age (40%), Home_other (40%)
#>   * on average, 7.2 variables were selected (min = 6, max = 10)
#> 
#> In the final search using the entire training set:
#>    * 8 features selected at iteration 5 including:
#>      Seniority, Income, Assets, Home_other, Home_priv ... 
#>    * external performance at this iteration is
#> 
#> bal_accuracy 
#>       0.5607


# fails when using train_ctrl_no_resampling
sa_results_no_training_resamples <- 
  safs(rec, 
       data = train_set,
       iters = 5,
       method = "glm", 
       trControl = train_ctrl_no_resampling,
       safsControl = safs_ctrl)
#> Error in {: task 1 failed - "replacement has 1 row, data has 0"

sa_results_no_training_resamples
#> Error in eval(expr, envir, enclos): object 'sa_results_no_training_resamples' not found

Max · March 23, 2023, 11:31pm

I was off; you do need an internal performance metric for the inner loop. Check out the caret website.

For some models, like random forests, there is a way to get a good performance estimate without more resampling (via the out-of-bag error). Otherwise, you should resample.
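To see where the internal metric comes from, it can help to print the fitness functions that ship with caret. This is a minimal sketch, assuming caret is installed. As far as I know, caretSA$fitness_intern reads the resampled performance stored in the train object, which would be empty with method = "none" and would explain the error in the reprex above, while rfSA$fitness_intern relies on the random forest's out-of-bag statistics. The object name rf_safs_ctrl is just for illustration.

```r
library(caret)

# Internal fitness used with train()-based models (caretSA)
print(caretSA$fitness_intern)

# Internal fitness used with random forests (rfSA)
print(rfSA$fitness_intern)

# rfSA can be passed straight to safsControl(); only the outer
# resampling is specified here (5-fold CV)
rf_safs_ctrl <- safsControl(functions = rfSA, method = "cv", number = 5)
```

With rfSA the model is fit directly rather than through train(), so there should be no trControl to supply.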

The inner resampling directs the search (SA here; the same applies to a GA) toward better subsets. The outer resampling loop tells you when to stop. Neither stage of resampling can do the work of both. That's also discussed in Feature Engineering and Selection (FES) if you need a better explanation.
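As a rough illustration of how the two layers stack up in the reprex above, here is a back-of-the-envelope count of GLM fits. It is a sketch under assumptions: it counts one train() call per search iteration, ignores the evaluation of the initial subset, and assumes train() refits once on all of its data after the inner folds.

```r
outer_folds <- 5  # safsControl(method = "cv", number = 5)
inner_folds <- 5  # trainControl(method = "cv", number = 5)
iters       <- 5  # safs(iters = 5)

# Each train() call: one fit per inner fold plus one final fit
fits_per_train <- inner_folds + 1

# One search per outer fold, plus the final search on the entire training set
searches <- outer_folds + 1

approx_glm_fits <- searches * iters * fits_per_train
approx_glm_fits
#> [1] 180
```

The inner factor (fits_per_train) is the price of giving the search an internal metric; the outer factor (searches) is the price of an honest estimate of when to stop.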

brg · March 24, 2023, 1:26pm

Makes sense, thanks! Can I also ask if you have any advice regarding the model type to use with SAFS or GAFS? I initially picked glm partly because it doesn't do any feature selection (I didn't want to commingle model feature selection with the SAFS feature selection), but I'm wondering if there are better options.

Also, I refer to FES and Applied Predictive Modeling almost daily and wanted to thank you for how awesome they are!

Max · March 24, 2023, 11:54pm

Honestly, if you are sticking with a glm model, the best approach is probably to use a LASSO penalty via glmnet. It will be much faster and give a lot more control over what gets eliminated.
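A minimal sketch of what that could look like, assuming the glmnet package is installed. The data here are simulated purely for illustration (the names x, y, and selected are not from the thread); with real data you would need a numeric predictor matrix with no missing values.

```r
library(glmnet)

set.seed(1)
n <- 200
p <- 10
x <- matrix(rnorm(n * p), nrow = n,
            dimnames = list(NULL, paste0("x", 1:p)))

# Toy outcome: only x1 and x2 carry signal
y <- factor(ifelse(x[, "x1"] - x[, "x2"] + rnorm(n) > 0, "good", "bad"))

# alpha = 1 requests the lasso penalty; lambda is chosen by 5-fold CV
cv_fit <- cv.glmnet(x, y, family = "binomial", alpha = 1, nfolds = 5)

# Predictors with non-zero coefficients at the more conservative lambda
coef_vec <- as.matrix(coef(cv_fit, s = "lambda.1se"))[, 1]
selected <- setdiff(names(coef_vec)[coef_vec != 0], "(Intercept)")
selected
```

Moving between s = "lambda.1se" and s = "lambda.min" is one way to control how aggressively predictors are dropped.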
