Duplicate glmnet result with tidymodel

library(ggplot2)
library(tidymodels)
library(glmnet)

df <- ggplot2::mpg

x <- model.matrix(cty ~ hwy + displ, df)
y <- df$cty


result1 <- glmnet(x, y, alpha=0, lambda=0)
tidy(result1)

result2 <- linear_reg(mixture=0, penalty = 0) %>%
  set_engine('glmnet') %>%
  fit(cty ~ hwy + displ, data=df)

tidy(result2)

I think result1 and result2 are supposed to be the same and yet R gives very different result. Could someone explain why there is a difference and how to replicate the result from glmnet with the tidymodel way?

This has to do with how we fit the glmnet model in tidymodels. Even when you give a penalty, we let it compete the full path (see the backstory here). We do say in the help file that:

For glmnet models, the full regularization path is always fit regardless of the value given to penalty .

This changes the lambda path in the objects:

> result1$lambda
[1] 0
> str(result2$fit$lambda)
 num [1:100] 4060 3699 3370 3071 2798 ...
> summary(result2$fit$lambda)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
   0.406    4.063   40.640  456.932  406.297 4059.624 

For this type of model, when a penalty value that is not included in the path is used, it does linear interpolation between the values. Since the path values do not bracket your value (zero), it does a poor job at interpolation. I don't think that they expected people to fit an unregularized model with this function.

You can get closer by expanding the range of penalty values using lambda.min.ratio:

library(ggplot2)
library(tidymodels)
#> ── Attaching packages ──────────────────────────────────────────────────────────────── tidymodels 0.1.1 ──
#> ✓ broom     0.7.0           ✓ recipes   0.1.13.9001
#> ✓ dials     0.0.9.9000      ✓ rsample   0.0.8      
#> ✓ dplyr     1.0.2           ✓ tibble    3.0.3      
#> ✓ infer     0.5.2           ✓ tidyr     1.1.2      
#> ✓ modeldata 0.0.2           ✓ tune      0.1.1.9000 
#> ✓ parsnip   0.1.3.9000      ✓ workflows 0.2.1.9000 
#> ✓ purrr     0.3.4           ✓ yardstick 0.0.7
#> ── Conflicts ─────────────────────────────────────────────────────────────────── tidymodels_conflicts() ──
#> x purrr::discard() masks scales::discard()
#> x dplyr::filter()  masks stats::filter()
#> x dplyr::lag()     masks stats::lag()
#> x recipes::step()  masks stats::step()
library(glmnet)
#> Loading required package: Matrix
#> 
#> Attaching package: 'Matrix'
#> The following objects are masked from 'package:tidyr':
#> 
#>     expand, pack, unpack
#> Loaded glmnet 4.0-2

df <- ggplot2::mpg

x <- model.matrix(cty ~ hwy + displ, df)[, -1]
y <- df$cty


result1 <- glmnet(x, y, alpha=0, lambda=0)
tidy(result1)
#> # A tibble: 3 x 5
#>   term         step estimate lambda dev.ratio
#>   <chr>       <dbl>    <dbl>  <dbl>     <dbl>
#> 1 (Intercept)     1    4.73       0     0.924
#> 2 hwy             1    0.596      0     0.924
#> 3 displ           1   -0.528      0     0.924

result2 <- linear_reg(mixture=0, penalty = 0) %>%
  set_engine('glmnet') %>%
  fit(cty ~ hwy + displ, data=df)
summary(result2$fit$lambda)
#>     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
#>    0.406    4.063   40.640  456.932  406.297 4059.624

tidy(result2)
#> # A tibble: 3 x 3
#>   term        estimate penalty
#>   <chr>          <dbl>   <dbl>
#> 1 (Intercept)    7.62        0
#> 2 hwy            0.507       0
#> 3 displ         -0.766       0

result3 <- linear_reg(mixture=0, penalty = 0) %>%
  set_engine('glmnet', lambda.min.ratio = 1e-15) %>%
  fit(cty ~ hwy + displ, data=df)
summary(result3$fit$lambda)
#>     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
#>    0.019    0.406    8.746  350.197  188.431 4059.624

tidy(result3)
#> # A tibble: 3 x 3
#>   term        estimate penalty
#>   <chr>          <dbl>   <dbl>
#> 1 (Intercept)    4.93        0
#> 2 hwy            0.590       0
#> 3 displ         -0.545       0

Created on 2020-10-16 by the reprex package (v0.3.0)

1 Like