I'm trying to run a LASSO regression and use tidymodels to tune the lambda parameter. See my code below. Link to the data here. I'm largely following the code laid out here.
library(tidyverse)
library(tidymodels)
library(glmnet)
dat <- read_csv("all_dat.csv")
wf <- workflow()
# split test and train ---------------------------------------------------------
set.seed(1839)
dat_split <- dat %>%
initial_split(strata = year)
dat_train <- training(dat_split)
dat_test <- testing(dat_split)
# preprocessing ----------------------------------------------------------------
dat_train_prep <- recipe(win_pct ~ ., data = dat_train) %>%
step_normalize(all_numeric(), -all_outcomes()) %>%
update_role(team, year, new_role = "ID")
wf <- wf %>%
add_recipe(dat_train_prep)
# specify model ----------------------------------------------------------------
lasso_spec <- linear_reg(penalty = tune(), mixture = 1) %>%
set_engine("glmnet")
wf <- wf %>%
add_model(lasso_spec)
# tune for lambda --------------------------------------------------------------
lasso_grid <- tune_grid(
wf,
resamples = vfold_cv(dat_train, v = 5, strata = year),
grid = grid_regular(penalty(), levels = 50)
)
lasso_grid %>%
collect_metrics() %>%
ggplot(aes(log(penalty), mean, color = .metric)) +
geom_line() +
facet_wrap(~.metric, scales = "free", nrow = 2) +
theme(legend.position = "none")
best_lambda <- lasso_grid %>%
select_best("rmse")
wf <- finalize_workflow(wf, best_lambda)
# fit model --------------------------------------------------------------------
mod_train <- wf %>%
fit(dat_train)
Note that I get these messages while running tune_grid:
! Fold1: internal: A correlation computation is required, but `estimate` is constant and has 0 standard deviation, resulting in a divide by 0 error. `NA` will be returned.
! Fold2: internal: A correlation computation is required, but `estimate` is constant and has 0 standard deviation, resulting in a divide by 0 error. `NA` will be returned.
! Fold3: internal: A correlation computation is required, but `estimate` is constant and has 0 standard deviation, resulting in a divide by 0 error. `NA` will be returned.
! Fold4: internal: A correlation computation is required, but `estimate` is constant and has 0 standard deviation, resulting in a divide by 0 error. `NA` will be returned.
! Fold5: internal: A correlation computation is required, but `estimate` is constant and has 0 standard deviation, resulting in a divide by 0 error. `NA` will be returned.
I can't find a good explanation what this means, but that is not the issue at hand here.
Now, when I want to look at my model coefficients for the best lambda value, it is not there:
mod_train %>%
pull_workflow_fit() %>%
tidy() %>%
filter(lambda == best_lambda$penalty)
This returns a tibble of no rows.
Why is this the case? Why does it not fit the model with the best lambda found by cross-validation? It seems like this is just running cv.glmnet
, if that's the case, then why use tidymodels here?
It looks like the fit
call ignores everything I've done so far and just runs vanilla cv.glmnet
because my finalized workflow prints:
── Model ─────────────────────────────────────────────────────────────────────────────────────────────────────
Linear Regression Model Specification (regression)
Main Arguments:
penalty = 0.00568986602901831
mixture = 1
Computational engine: glmnet
But then, after I run fit, it returns:
── Model ─────────────────────────────────────────────────────────────────────────────────────────────────────
Call: glmnet::glmnet(x = as.matrix(x), y = y, family = "gaussian", alpha = ~1)
Df %Dev Lambda
1 0 0.00000 0.084000
2 6 0.07433 0.076540
3 9 0.17220 0.069740
4 14 0.27150 0.063540
5 14 0.36340 0.057900
6 15 0.44060 0.052750
7 15 0.50670 0.048070
8 16 0.56370 0.043800
9 16 0.61120 0.039910
10 16 0.65050 0.036360
11 16 0.68320 0.033130
12 16 0.71030 0.030190
13 16 0.73290 0.027510
14 16 0.75160 0.025060
15 16 0.76710 0.022840
16 16 0.78000 0.020810
17 17 0.79090 0.018960
18 17 0.80010 0.017270
19 17 0.80780 0.015740
20 19 0.81480 0.014340
21 21 0.82080 0.013070
22 23 0.82630 0.011910
23 27 0.83140 0.010850
24 32 0.83660 0.009885
25 33 0.84140 0.009007
26 36 0.84580 0.008207
27 40 0.84970 0.007478
28 46 0.85350 0.006813
29 52 0.85710 0.006208
30 61 0.86050 0.005657
31 66 0.86410 0.005154
32 75 0.86760 0.004696
33 82 0.87120 0.004279
34 88 0.87510 0.003899
35 96 0.87920 0.003552
36 104 0.88290 0.003237
37 110 0.88660 0.002949
38 116 0.89010 0.002687
39 123 0.89370 0.002449
40 127 0.89740 0.002231
41 134 0.90120 0.002033
42 140 0.90460 0.001852
43 148 0.90780 0.001688
44 151 0.91090 0.001538
45 161 0.91380 0.001401
46 171 0.91660 0.001277
...
and 54 more lines.
Which is what I would get if I just ran glmnet::cv.glmnet
, so I'm not sure where I'm missing telling it to only fit the model with the best parameter I found from the grid search.
The closest is on step 30 of the fit model where the penalty is 0.005656526. But this is not the penalty I gave it from my own cross-validation using tidymodels, leading me to believe that this ignored that and ran cv.glmnet
under the hood instead.