Tuning number of variables and/or formulas

This may be a basic question. You can tune the parameters of various models through cross-validation. Is it also possible to tune the number of variables in the model (the "formula") through cross-validation?

The code below is just pseudocode for illustration. For example, with the iris dataset, a formula for a classification (logistic) model would be

Species ~.

predicting outcome Species using all other variables.

Suppose I am interested in understanding the effect of including or excluding different variables, i.e., tuning over

Species ~ Sepal.Length+Sepal.Width
Species ~ Sepal.Length+Petal.Width
Species ~ Sepal.Length+Sepal.Width+Petal.Length

and so on, changing the number and combination of variables, and then getting some sort of summary statistic for each formula evaluated.
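For reference, the full set of candidate formulas can be generated rather than typed by hand. The following is a minimal base R sketch, assuming every non-empty subset of the four iris predictors is of interest; the names `predictors` and `all_formulas` are illustrative only.

```r
# Predictors available in iris (every column except the outcome)
predictors <- setdiff(names(iris), "Species")

# For each subset size m, enumerate every combination of m predictors
# and turn each combination into a formula with Species as the response
all_formulas <- do.call(
  c,
  lapply(seq_along(predictors), function(m) {
    combn(predictors, m, simplify = FALSE, FUN = function(vars) {
      reformulate(vars, response = "Species")
    })
  })
)

# Label each formula by the predictors it contains
names(all_formulas) <- vapply(
  all_formulas,
  function(f) paste(all.vars(f)[-1], collapse = "_"),
  character(1)
)

length(all_formulas)
#> [1] 15
```

With p predictors this produces 2^p - 1 formulas, so exhaustive enumeration is only practical when p is small.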

Or, to look at it another way: if there are too many variables or combinations to traverse, I could supply a manual list of formulas.

formulas_of_interest <- list(
  Species ~ Sepal.Length + Sepal.Width,
  Species ~ Sepal.Length + Petal.Width,
  Species ~ Sepal.Length + Sepal.Width + Petal.Length
)

# run_modelling() is a placeholder for whatever fits and evaluates one formula
for (i in seq_along(formulas_of_interest)) {
  run_modelling(formula = formulas_of_interest[[i]])
}

How would one go about doing something like this using tidymodels?

Over the break I've been working on a package to do this called workflowsets. It can make different combinations of models and formulas (and other things too).

It's past the experimental stage, but the API might still change slightly as people begin to use it:

# will require some devel tidymodels packages:
# remotes::install_github("tidymodels/workflowsets")

library(tidymodels)
#> ── Attaching packages ────────────────────────────────────── tidymodels 0.1.2 ──
#> ✓ broom     0.7.3           ✓ recipes   0.1.15.9000
#> ✓ dials     0.0.9.9000      ✓ rsample   0.0.8      
#> ✓ dplyr     1.0.2           ✓ tibble    3.0.4      
#> ✓ ggplot2   3.3.3           ✓ tidyr     1.1.2      
#> ✓ infer     0.5.3           ✓ tune      0.1.2.9000 
#> ✓ modeldata 0.1.0.9000      ✓ workflows 0.2.1      
#> ✓ parsnip   0.1.4.9000      ✓ yardstick 0.0.7.9000 
#> ✓ purrr     0.3.4
#> ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
#> x purrr::discard() masks scales::discard()
#> x dplyr::filter()  masks stats::filter()
#> x dplyr::lag()     masks stats::lag()
#> x recipes::step()  masks stats::step()
library(workflowsets)
# Define a list of preprocessors, such as formulas or recipes

formulas <- 
   list(
      mod_1 = Species ~ Sepal.Length+Sepal.Width,
      mod_2 = Species ~ Sepal.Length+Petal.Width,
      mod_3 = Species ~ Sepal.Length+Sepal.Width+Petal.Length
   )

# Define a model to use
model_spec <- multinom_reg() %>% set_engine("nnet", trace = 0)

# Combine a list of models and list of preprocessors
iris_set <- workflow_set(formulas, models = list(glm = model_spec))
iris_set
#> # A workflow set/tibble: 3 x 6
#>   wflow_id  preproc model        object     option     result    
#>   <chr>     <chr>   <chr>        <list>     <list>     <list>    
#> 1 mod_1_glm formula multinom_reg <workflow> <list [0]> <list [0]>
#> 2 mod_2_glm formula multinom_reg <workflow> <list [0]> <list [0]>
#> 3 mod_3_glm formula multinom_reg <workflow> <list [0]> <list [0]>
# Evaluate them using the bootstrap:
set.seed(1)
bt <- bootstraps(iris, times = 50)

iris_results <- 
   iris_set %>% 
   workflow_map("fit_resamples", resamples = bt, seed = 2)
#> 
#> Attaching package: 'rlang'
#> The following objects are masked from 'package:purrr':
#> 
#>     %@%, as_function, flatten, flatten_chr, flatten_dbl, flatten_int,
#>     flatten_lgl, flatten_raw, invoke, list_along, modify, prepend,
#>     splice
#> 
#> Attaching package: 'vctrs'
#> The following object is masked from 'package:tibble':
#> 
#>     data_frame
#> The following object is masked from 'package:dplyr':
#> 
#>     data_frame
# Show the results using any of these functions

collect_metrics(iris_results) %>% 
   arrange(.metric)
#> # A tibble: 6 x 9
#>   wflow_id  .config       preproc model   .metric .estimator  mean     n std_err
#>   <chr>     <chr>         <chr>   <chr>   <chr>   <chr>      <dbl> <int>   <dbl>
#> 1 mod_1_glm Preprocessor… formula multin… accura… multiclass 0.784    50 6.23e-3
#> 2 mod_2_glm Preprocessor… formula multin… accura… multiclass 0.956    50 3.45e-3
#> 3 mod_3_glm Preprocessor… formula multin… accura… multiclass 0.954    50 3.68e-3
#> 4 mod_1_glm Preprocessor… formula multin… roc_auc hand_till  0.922    50 3.14e-3
#> 5 mod_2_glm Preprocessor… formula multin… roc_auc hand_till  0.993    50 7.90e-4
#> 6 mod_3_glm Preprocessor… formula multin… roc_auc hand_till  0.995    50 8.43e-4

rank_results(iris_results, rank_metric = "accuracy")
#> # A tibble: 6 x 9
#>   wflow_id  .config       .metric  mean std_err     n model   preprocessor  rank
#>   <chr>     <chr>         <chr>   <dbl>   <dbl> <int> <chr>   <chr>        <int>
#> 1 mod_2_glm Preprocessor… accura… 0.956 3.45e-3    50 multin… formula          1
#> 2 mod_2_glm Preprocessor… roc_auc 0.993 7.90e-4    50 multin… formula          1
#> 3 mod_3_glm Preprocessor… accura… 0.954 3.68e-3    50 multin… formula          2
#> 4 mod_3_glm Preprocessor… roc_auc 0.995 8.43e-4    50 multin… formula          2
#> 5 mod_1_glm Preprocessor… accura… 0.784 6.23e-3    50 multin… formula          3
#> 6 mod_1_glm Preprocessor… roc_auc 0.922 3.14e-3    50 multin… formula          3

Created on 2021-01-07 by the reprex package (v0.3.0)
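
To look specifically at the effect of excluding variables, one option is to build the formulas with `leave_var_out_formulas()` from workflowsets, which returns one formula per predictor left out (plus, by default, the full model). The following is a sketch only, assuming a workflowsets release that exports this function; the object names `loo_formulas`, `loo_set`, and `loo_results` are illustrative, and the 10 bootstrap resamples are just there to keep the run short.

```r
library(tidymodels)
library(workflowsets)

# One formula per predictor left out, plus the full model
loo_formulas <- leave_var_out_formulas(Species ~ ., data = iris)

# Same model specification as in the example above
model_spec <- multinom_reg() %>% set_engine("nnet", trace = 0)

# Cross the formulas with the single model
loo_set <- workflow_set(preproc = loo_formulas, models = list(multinom = model_spec))

# Resample every workflow and rank by accuracy
set.seed(1)
bt <- bootstraps(iris, times = 10)

loo_results <- workflow_map(loo_set, "fit_resamples", resamples = bt, seed = 2)
rank_results(loo_results, rank_metric = "accuracy")
```

Comparing each leave-one-out workflow against the full model gives a rough indication of how much each predictor contributes.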

We're looking for feedback, so please file issues or suggestions at the GitHub site.
