How does `step_corr()` pick which variable to keep?

How does recipes::step_corr() decide which variable to keep? For example, if three variables are all correlated with one another above the threshold value, which one will be kept?

For instance, in the documentation example @Max creates a duplicate-plus-noise copy of carbon. Why does step_corr() keep the duplicate column and drop carbon, when carbon has the slightly stronger relationship to the target variable, HHV? I copied the example below but increased the noise to further highlight the point:

library(recipes)
data(biomass)

set.seed(3535)
biomass$duplicate <- biomass$carbon + rnorm(nrow(biomass), sd = 10)

biomass_tr <- biomass[biomass$dataset == "Training",]
biomass_te <- biomass[biomass$dataset == "Testing",]

rec <- recipe(HHV ~ carbon + hydrogen + oxygen + nitrogen +
                sulfur + duplicate,
              data = biomass_tr)

corr_filter <- rec %>%
  step_corr(all_predictors(), threshold = .5)

filter_obj <- prep(corr_filter, training = biomass_tr)

filtered_te <- bake(filter_obj, biomass_te)
round(abs(cor(biomass_tr[, c(3:9)])), 2)
#>           carbon hydrogen oxygen nitrogen sulfur  HHV duplicate
#> carbon      1.00     0.32   0.63     0.15   0.09 0.92      0.71
#> hydrogen    0.32     1.00   0.54     0.07   0.19 0.23      0.21
#> oxygen      0.63     0.54   1.00     0.18   0.31 0.55      0.45
#> nitrogen    0.15     0.07   0.18     1.00   0.27 0.14      0.08
#> sulfur      0.09     0.19   0.31     0.27   1.00 0.13      0.09
#> HHV         0.92     0.23   0.55     0.14   0.13 1.00      0.68
#> duplicate   0.71     0.21   0.45     0.08   0.09 0.68      1.00


tidy(corr_filter, number = 1)
#> # A tibble: 1 x 2
#>   terms id        
#>   <chr> <chr>     
#> 1 <NA>  corr_dBf6A
tidy(filter_obj, number = 1)
#> # A tibble: 2 x 2
#>   terms  id        
#>   <chr>  <chr>     
#> 1 oxygen corr_dBf6A
#> 2 carbon corr_dBf6A

Created on 2019-12-04 by the reprex package (v0.3.0)

Note on origin of question:

The question comes from Feature Engineering..., section 11.3, figure 11.5, right-hand panel (pertaining to Recursive Feature Elimination with feature importance ranked independently).

I was confused about why the y values (AUC performance) differ so much between the red and blue lines (correlation filter applied vs. not) when the x value (number of variables in the model) is 1.

I was surprised by the substantially worse performance with the correlation filter. Why isn't the model with one variable remaining in RFE (under these specific conditions) essentially the same in both cases? Hence the question above on step_corr(), and my curiosity about the algorithm and the considerations involved in selecting between collinear features.

Figure from book (figure 11.5, right-hand panel):

The correlation filter is unsupervised, so it does not consider the outcome at all. This is the reason that a pre-filter does poorly in the RFE analysis. For example, the one-predictor model with the filter probably does worse because the filter removed a predictor in order to reduce correlation, with no consideration of predictive performance.

In general, the filter tries to prioritize predictors for removal based on their global effect on the overall correlation structure. If you had two identical predictors, there is no real rule on which one to retain (it probably gets rid of the first one or something like that).
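To make the "global effect" idea concrete, here is a minimal base R sketch. It assumes a heuristic like the one documented for caret::findCorrelation(): for each pair of predictors above the threshold, flag the one with the larger mean absolute correlation with all predictors. That this is what step_corr() does internally is an assumption, not something confirmed in this thread. The helper `flag_by_mean_cor()` is hypothetical, and the matrix is typed in from the rounded output in the question with HHV left out, since the outcome plays no part in the filter.

```r
# Absolute predictor correlations, copied from the rounded matrix in the
# question. HHV (the outcome) is deliberately excluded.
vars <- c("carbon", "hydrogen", "oxygen", "nitrogen", "sulfur", "duplicate")
abs_cor <- matrix(
  c(1.00, 0.32, 0.63, 0.15, 0.09, 0.71,
    0.32, 1.00, 0.54, 0.07, 0.19, 0.21,
    0.63, 0.54, 1.00, 0.18, 0.31, 0.45,
    0.15, 0.07, 0.18, 1.00, 0.27, 0.08,
    0.09, 0.19, 0.31, 0.27, 1.00, 0.09,
    0.71, 0.21, 0.45, 0.08, 0.09, 1.00),
  nrow = 6, byrow = TRUE, dimnames = list(vars, vars)
)

# Hypothetical helper, not part of recipes: an assumed mean-absolute-correlation
# rule. For every pair above the threshold, flag the member whose average
# absolute correlation with all predictors is larger.
flag_by_mean_cor <- function(abs_cor, threshold) {
  mean_cor <- colMeans(abs_cor)
  # Visit each pair once: upper triangle only, diagonal excluded.
  pairs <- which(abs_cor > threshold & upper.tri(abs_cor), arr.ind = TRUE)
  flagged <- character(0)
  for (k in seq_len(nrow(pairs))) {
    i <- pairs[k, "row"]
    j <- pairs[k, "col"]
    # On an exact tie the second member of the pair is flagged (arbitrary).
    drop_idx <- if (mean_cor[i] > mean_cor[j]) i else j
    flagged <- union(flagged, colnames(abs_cor)[drop_idx])
  }
  flagged
}

flagged <- flag_by_mean_cor(abs_cor, threshold = 0.5)
print(flagged)
```

On these rounded values the sketch flags oxygen and carbon, the same two terms that tidy(filter_obj, number = 1) reports. Carbon goes rather than duplicate because carbon's average absolute correlation with the other predictors is higher (it is also strongly correlated with oxygen), and its 0.92 correlation with HHV never enters the calculation.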
