How does recipes::step_corr() decide which variable to keep? For example, if three variables are correlated with each other above the threshold value, which one will be kept?

For instance, in the documentation example @Max creates a duplicate column (carbon plus noise). Why does step_corr() keep the duplicate column and drop carbon, when carbon has the slightly stronger relationship to the target variable, HHV? I copied this example below, but increased the noise to further highlight the point:
library(recipes)
data(biomass)
set.seed(3535)
biomass$duplicate <- biomass$carbon + rnorm(nrow(biomass), sd = 10)
biomass_tr <- biomass[biomass$dataset == "Training",]
biomass_te <- biomass[biomass$dataset == "Testing",]
rec <- recipe(HHV ~ carbon + hydrogen + oxygen + nitrogen +
                sulfur + duplicate,
              data = biomass_tr)
corr_filter <- rec %>%
  step_corr(all_predictors(), threshold = .5)
filter_obj <- prep(corr_filter, training = biomass_tr)
filtered_te <- bake(filter_obj, biomass_te)
round(abs(cor(biomass_tr[, c(3:9)])), 2)
#>           carbon hydrogen oxygen nitrogen sulfur  HHV duplicate
#> carbon      1.00     0.32   0.63     0.15   0.09 0.92      0.71
#> hydrogen    0.32     1.00   0.54     0.07   0.19 0.23      0.21
#> oxygen      0.63     0.54   1.00     0.18   0.31 0.55      0.45
#> nitrogen    0.15     0.07   0.18     1.00   0.27 0.14      0.08
#> sulfur      0.09     0.19   0.31     0.27   1.00 0.13      0.09
#> HHV         0.92     0.23   0.55     0.14   0.13 1.00      0.68
#> duplicate   0.71     0.21   0.45     0.08   0.09 0.68      1.00
tidy(corr_filter, number = 1)
#> # A tibble: 1 x 2
#>   terms id
#>   <chr> <chr>
#> 1 <NA>  corr_dBf6A
tidy(filter_obj, number = 1)
#> # A tibble: 2 x 2
#>   terms  id
#>   <chr>  <chr>
#> 1 oxygen corr_dBf6A
#> 2 carbon corr_dBf6A
Created on 2019-12-04 by the reprex package (v0.3.0)
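
To make the question concrete, here is a rough base-R sketch of one rule that could explain the result. It assumes, without confirmation, that step_corr() behaves like the rule documented for caret::findCorrelation(): for a pair correlated above the threshold, drop the variable with the larger mean absolute correlation with the other predictors. The helper greedy_corr_filter() is hypothetical and is not part of recipes. The matrix is copied from the rounded output above, with HHV left out because all_predictors() does not select the outcome.

```r
# Absolute predictor correlations, copied from the rounded output above.
# HHV is excluded: all_predictors() only selects predictor columns.
vars <- c("carbon", "hydrogen", "oxygen", "nitrogen", "sulfur", "duplicate")
abs_cor <- matrix(
  c(1.00, 0.32, 0.63, 0.15, 0.09, 0.71,
    0.32, 1.00, 0.54, 0.07, 0.19, 0.21,
    0.63, 0.54, 1.00, 0.18, 0.31, 0.45,
    0.15, 0.07, 0.18, 1.00, 0.27, 0.08,
    0.09, 0.19, 0.31, 0.27, 1.00, 0.09,
    0.71, 0.21, 0.45, 0.08, 0.09, 1.00),
  nrow = 6, byrow = TRUE, dimnames = list(vars, vars)
)

# Hypothetical helper (an assumption, not the recipes implementation):
# repeatedly take the most correlated remaining pair and drop the member
# with the larger mean absolute correlation with the other variables.
greedy_corr_filter <- function(m, threshold) {
  diag(m) <- 0  # ignore self-correlations
  dropped <- character(0)
  while (nrow(m) > 1 && max(m) > threshold) {
    idx <- which(m == max(m), arr.ind = TRUE)[1, ]
    pair <- rownames(m)[idx]
    # Mean absolute correlation of each pair member with all other variables
    avg <- rowSums(m[pair, , drop = FALSE]) / (ncol(m) - 1)
    loser <- pair[which.max(avg)]
    dropped <- c(dropped, loser)
    keep <- setdiff(rownames(m), loser)
    m <- m[keep, keep, drop = FALSE]
  }
  dropped
}

dropped <- greedy_corr_filter(abs_cor, threshold = 0.5)
print(dropped)
```

On these rounded numbers the sketch flags carbon and oxygen, the same two terms that tidy() reports. If that reading is right, the correlation with HHV never enters the decision, and carbon loses to duplicate only because it is more correlated with the remaining predictors on average.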