tidymodels: glmnet fit not returning coefficients with best lambda returned from grid search

markhwhiteii · April 9, 2020, 7:10pm

I'm trying to run a LASSO regression and use tidymodels to tune the lambda parameter. See my code below. Link to the data here. I'm largely following the code laid out here.

library(tidyverse)
library(tidymodels)
library(glmnet)
dat <- read_csv("all_dat.csv")
wf <- workflow()

# split test and train ---------------------------------------------------------
set.seed(1839)
dat_split <- dat %>% 
  initial_split(strata = year)

dat_train <- training(dat_split)
dat_test <- testing(dat_split)

# preprocessing ----------------------------------------------------------------
dat_train_prep <- recipe(win_pct ~ ., data = dat_train) %>% 
  step_normalize(all_numeric(), -all_outcomes()) %>% 
  update_role(team, year, new_role = "ID")
  
wf <- wf %>% 
  add_recipe(dat_train_prep)

# specify model ----------------------------------------------------------------
lasso_spec <- linear_reg(penalty = tune(), mixture = 1) %>% 
  set_engine("glmnet")

wf <- wf %>% 
  add_model(lasso_spec)

# tune for lambda --------------------------------------------------------------
lasso_grid <- tune_grid(
  wf,
  resamples = vfold_cv(dat_train, v = 5, strata = year),
  grid = grid_regular(penalty(), levels = 50)
)

lasso_grid %>%
  collect_metrics() %>%
  ggplot(aes(log(penalty), mean, color = .metric)) +
  geom_line() +
  facet_wrap(~.metric, scales = "free", nrow = 2) +
  theme(legend.position = "none")

best_lambda <- lasso_grid %>%
  select_best("rmse")

wf <- finalize_workflow(wf, best_lambda)

# fit model --------------------------------------------------------------------
mod_train <- wf %>% 
  fit(dat_train)

Note that I get these messages while running tune_grid:

! Fold1: internal: A correlation computation is required, but `estimate` is constant and has 0 standard deviation, resulting in a divide by 0 error. `NA` will be returned.
! Fold2: internal: A correlation computation is required, but `estimate` is constant and has 0 standard deviation, resulting in a divide by 0 error. `NA` will be returned.
! Fold3: internal: A correlation computation is required, but `estimate` is constant and has 0 standard deviation, resulting in a divide by 0 error. `NA` will be returned.
! Fold4: internal: A correlation computation is required, but `estimate` is constant and has 0 standard deviation, resulting in a divide by 0 error. `NA` will be returned.
! Fold5: internal: A correlation computation is required, but `estimate` is constant and has 0 standard deviation, resulting in a divide by 0 error. `NA` will be returned.

I can't find a good explanation what this means, but that is not the issue at hand here.

Now, when I want to look at my model coefficients for the best lambda value, it is not there:

mod_train %>% 
  pull_workflow_fit() %>% 
  tidy() %>% 
  filter(lambda == best_lambda$penalty)

This returns a tibble of no rows.

Why is this the case? Why does it not fit the model with the best lambda found by cross-validation? It seems like this is just running cv.glmnet, if that's the case, then why use tidymodels here?

It looks like the fit call ignores everything I've done so far and just runs vanilla cv.glmnet because my finalized workflow prints:

── Model ─────────────────────────────────────────────────────────────────────────────────────────────────────
Linear Regression Model Specification (regression)

Main Arguments:
       
  penalty = 0.00568986602901831
  mixture = 1

Computational engine: glmnet

But then, after I run fit, it returns:

── Model ─────────────────────────────────────────────────────────────────────────────────────────────────────

Call:  glmnet::glmnet(x = as.matrix(x), y = y, family = "gaussian",      alpha = ~1) 

     Df    %Dev   Lambda
1     0 0.00000 0.084000
2     6 0.07433 0.076540
3     9 0.17220 0.069740
4    14 0.27150 0.063540
5    14 0.36340 0.057900
6    15 0.44060 0.052750
7    15 0.50670 0.048070
8    16 0.56370 0.043800
9    16 0.61120 0.039910
10   16 0.65050 0.036360
11   16 0.68320 0.033130
12   16 0.71030 0.030190
13   16 0.73290 0.027510
14   16 0.75160 0.025060
15   16 0.76710 0.022840
16   16 0.78000 0.020810
17   17 0.79090 0.018960
18   17 0.80010 0.017270
19   17 0.80780 0.015740
20   19 0.81480 0.014340
21   21 0.82080 0.013070
22   23 0.82630 0.011910
23   27 0.83140 0.010850
24   32 0.83660 0.009885
25   33 0.84140 0.009007
26   36 0.84580 0.008207
27   40 0.84970 0.007478
28   46 0.85350 0.006813
29   52 0.85710 0.006208
30   61 0.86050 0.005657
31   66 0.86410 0.005154
32   75 0.86760 0.004696
33   82 0.87120 0.004279
34   88 0.87510 0.003899
35   96 0.87920 0.003552
36  104 0.88290 0.003237
37  110 0.88660 0.002949
38  116 0.89010 0.002687
39  123 0.89370 0.002449
40  127 0.89740 0.002231
41  134 0.90120 0.002033
42  140 0.90460 0.001852
43  148 0.90780 0.001688
44  151 0.91090 0.001538
45  161 0.91380 0.001401
46  171 0.91660 0.001277

...
and 54 more lines.

Which is what I would get if I just ran glmnet::cv.glmnet, so I'm not sure where I'm missing telling it to only fit the model with the best parameter I found from the grid search.

The closest is on step 30 of the fit model where the penalty is 0.005656526. But this is not the penalty I gave it from my own cross-validation using tidymodels, leading me to believe that this ignored that and ran cv.glmnet under the hood instead.

Max · April 10, 2020, 12:57am

See the answer here

system · May 1, 2020, 12:58am

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.