LASSO with tidymodels not working

Goal

Use the tidymodels framework to implement LASSO with nested cross-validation (CV). (Alternatively, I'd also be interested in an implementation with caret, but at this point tidymodels is preferred.)

  • Make LASSO work with bootstrap resampling
  • Replace bootstrap resampling with nested CV: an inner CV (or bootstrapping) loop that tunes the hyperparameter over a grid, and an outer CV loop that estimates the validity of the models (see the sketch after this list)
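
For illustration, the nested resamples themselves could be built with rsample::nested_cv(); the outer fold count and inner bootstrap count below are arbitrary choices, and tune_grid() does not consume such an object directly, so a manual inner/outer loop would still be needed.

library(tidymodels)

## Sketch: nested resampling object with outer folds for validation
## and inner bootstraps for tuning the penalty
set.seed(4653)
office_nested <- nested_cv(
  office,
  outside = vfold_cv(v = 5),
  inside  = bootstraps(times = 5)
)
office_nested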

Approach

  1. I tried to use the code from Julia Silge's blog post, which uses bootstrapping rather than nested CV.
  2. Add nested CV functionality by replacing resamples with a nested-CV routine.

Issue

Running the code below throws an error, probably caused by the LASSO failing on the bootstrap resamples. (The German warnings in the output, "Standardabweichung ist Null", translate to "standard deviation is zero".)

Code

# Package imports ------
library(readr)
library(tidymodels)

# Data ------
# Prepared according to the Blog post by Julia Silge
# https://juliasilge.com/blog/lasso-the-office/
urlfile = 'https://raw.githubusercontent.com/shudras/office_data/master/office_data.csv'
office = read_csv(url(urlfile))[-1]
#> Warning: Missing column names filled in: 'X1' [1]
#> Parsed with column specification:
#> cols(
#>   .default = col_double()
#> )
#> See spec(...) for full column specifications.

# Lasso modeling -------
## Recipe and train it 
office_rec <- recipe(imdb_rating ~ ., data = office) %>%
  step_zv(all_numeric(), -all_outcomes()) %>%
  step_normalize(all_numeric(), -all_outcomes()) %>%
  prep(strings_as_factors = FALSE) # Training

## Create workflow 
wf <- workflow() %>%
  add_recipe(office_rec)

## Parameter tuning 
set.seed(4653)
### Bootstrapping data for resampling
office_boot <- bootstraps(office, times = 5, strata = season)

### Create lambda search grid
lambda_grid <- grid_regular(penalty(), levels = 20)

### The model
tune_spec <- linear_reg(penalty = tune(), mixture = 1) %>%
  set_engine("glmnet")

### Apply the workflow
lasso_grid <- tune_grid(
  wf %>% add_model(tune_spec),
  resamples = office_boot,
  grid = lambda_grid
)
#> ! Bootstrap1: internal: Standardabweichung ist Null
#> ! Bootstrap2: internal: Standardabweichung ist Null
#> ! Bootstrap3: internal: Standardabweichung ist Null
#> ! Bootstrap4: internal: Standardabweichung ist Null
#> ! Bootstrap5: internal: Standardabweichung ist Null
#> Error: `x` and `y` must have same types and lengths

Created on 2020-06-10 by the reprex package (v0.3.0)

Edit 1: package versions

# Package imports ------
library(readr)
library(tidymodels)
#> ── Attaching packages ─────────────────────────── tidymodels 0.1.0 ──
#> ✓ broom     0.5.6      ✓ recipes   0.1.12
#> ✓ dials     0.0.6      ✓ rsample   0.0.7 
#> ✓ dplyr     0.8.5      ✓ tibble    3.0.1 
#> ✓ ggplot2   3.3.0      ✓ tune      0.1.0 
#> ✓ infer     0.5.1      ✓ workflows 0.1.1 
#> ✓ parsnip   0.1.1      ✓ yardstick 0.0.6 
#> ✓ purrr     0.3.4
#> ── Conflicts ────────────────────────────── tidymodels_conflicts() ──
#> x purrr::discard()  masks scales::discard()
#> x dplyr::filter()   masks stats::filter()
#> x dplyr::lag()      masks stats::lag()
#> x ggplot2::margin() masks dials::margin()
#> x yardstick::spec() masks readr::spec()
#> x recipes::step()   masks stats::step()

Edit 2: package versions after upgrading dplyr and ggplot2

library(readr)
library(tidymodels)
#> ── Attaching packages ───────── tidymodels 0.1.0 ──
#> ✓ broom     0.5.6      ✓ recipes   0.1.12
#> ✓ dials     0.0.6      ✓ rsample   0.0.7 
#> ✓ dplyr     1.0.0      ✓ tibble    3.0.1 
#> ✓ ggplot2   3.3.1      ✓ tune      0.1.0 
#> ✓ infer     0.5.1      ✓ workflows 0.1.1 
#> ✓ parsnip   0.1.1      ✓ yardstick 0.0.6 
#> ✓ purrr     0.3.4
#> ── Conflicts ──────────── tidymodels_conflicts() ──
#> x purrr::discard()  masks scales::discard()
#> x dplyr::filter()   masks stats::filter()
#> x dplyr::lag()      masks stats::lag()
#> x ggplot2::margin() masks dials::margin()
#> x yardstick::spec() masks readr::spec()
#> x recipes::step()   masks stats::step()
library(reprex)

Can you list the versions that you used? With CRAN versions, I do not get errors. The warning shown below is from very high penalty values that eliminate all of the predictors.

library(readr)
library(tidymodels)
#> ── Attaching packages ────────────────────────────────────────────────────────────────────────────── tidymodels 0.1.0 ──
#> ✓ broom     0.5.6      ✓ recipes   0.1.12
#> ✓ dials     0.0.6      ✓ rsample   0.0.7 
#> ✓ dplyr     1.0.0      ✓ tibble    3.0.1 
#> ✓ ggplot2   3.3.1      ✓ tune      0.1.0 
#> ✓ infer     0.5.1      ✓ workflows 0.1.1 
#> ✓ parsnip   0.1.1      ✓ yardstick 0.0.6 
#> ✓ purrr     0.3.4
#> ── Conflicts ───────────────────────────────────────────────────────────────────────────────── tidymodels_conflicts() ──
#> x purrr::discard()  masks scales::discard()
#> x dplyr::filter()   masks stats::filter()
#> x dplyr::lag()      masks stats::lag()
#> x ggplot2::margin() masks dials::margin()
#> x yardstick::spec() masks readr::spec()
#> x recipes::step()   masks stats::step()

# Data ------
# Prepared according to the Blog post by Julia Silge
# https://juliasilge.com/blog/lasso-the-office/
urlfile = 'https://raw.githubusercontent.com/shudras/office_data/master/office_data.csv'
office = read_csv(url(urlfile))[-1]
#> Warning: Missing column names filled in: 'X1' [1]
#> Parsed with column specification:
#> cols(
#>   .default = col_double()
#> )
#> See spec(...) for full column specifications.


# Lasso modeling -------
## Recipe and train it 
office_rec <- recipe(imdb_rating ~ ., data = office) %>%
  step_zv(all_numeric(), -all_outcomes()) %>%
  step_normalize(all_numeric(), -all_outcomes()) %>%
  prep(strings_as_factors = FALSE) # Training

## Create workflow 
wf <- workflow() %>%
  add_recipe(office_rec)

## Parameter tuning 
set.seed(4653)
### Bootstrapping data for resampling
office_boot <- bootstraps(office, times = 5, strata = season)

### Create lambda search grid
lambda_grid <- grid_regular(penalty(), levels = 20)

### The model
tune_spec <- linear_reg(penalty = tune(), mixture = 1) %>%
  set_engine("glmnet")

### Apply the workflow
lasso_grid <- tune_grid(
  wf %>% add_model(tune_spec),
  resamples = office_boot,
  grid = lambda_grid
)
#> ! Bootstrap1: internal: A correlation computation is required, but `estimate` is const...
#> ! Bootstrap2: internal: A correlation computation is required, but `estimate` is const...
#> ! Bootstrap3: internal: A correlation computation is required, but `estimate` is const...
#> ! Bootstrap4: internal: A correlation computation is required, but `estimate` is const...
#> ! Bootstrap5: internal: A correlation computation is required, but `estimate` is const...

Created on 2020-06-10 by the reprex package (v0.3.0)
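
If the warnings themselves are a nuisance, one option (just a sketch) is to restrict the grid to smaller penalty values so that the largest ones don't zero out every coefficient; penalty() takes a log10 range:

## Search only penalties between 1e-5 and 1e-1 (log10 range is an arbitrary choice)
lambda_grid <- grid_regular(penalty(range = c(-5, -1)), levels = 20)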

We don't currently support nested resampling in tune.
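
A hand-rolled loop along the following lines might serve as a workaround. This is only a sketch: it reuses wf, tune_spec, and lambda_grid from the question, and the fold and bootstrap counts are arbitrary.

library(tidymodels)

## Manual nested resampling: outer folds estimate validity,
## inner bootstraps on each analysis set choose the penalty
set.seed(4653)
outer_folds <- vfold_cv(office, v = 5)

nested_results <- map_dfr(outer_folds$splits, function(outer_split) {
  inner_data  <- analysis(outer_split)
  inner_boots <- bootstraps(inner_data, times = 5)

  ## Tune the penalty on the inner resamples only
  inner_res <- tune_grid(
    wf %>% add_model(tune_spec),
    resamples = inner_boots,
    grid = lambda_grid
  )
  best_penalty <- select_best(inner_res, metric = "rmse")

  ## Refit on the full analysis set with the chosen penalty and
  ## score on the held-out assessment set of the outer fold
  final_fit <- finalize_workflow(wf %>% add_model(tune_spec), best_penalty) %>%
    fit(data = inner_data)

  predict(final_fit, assessment(outer_split)) %>%
    bind_cols(assessment(outer_split) %>% select(imdb_rating)) %>%
    rmse(truth = imdb_rating, estimate = .pred) %>%
    mutate(penalty = best_penalty$penalty)
})

nested_results  ## one outer-fold RMSE (and chosen penalty) per row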

I edited the post and appended the versions of the loaded packages. The versions I'm working with should be relatively recent. I think your dplyr and ggplot2 versions are a bit newer.

I can reproduce it now. It looks like you'll have to downgrade to rsample 0.0.6 or upgrade dplyr to 1.0.0.

I upgraded dplyr to 1.0.0 and ggplot2 as well, so that I now have the same package versions as you do (Edit 2). The error, x and y must have same types and lengths, is gone now, thank you. With bootstrapping, the internal "standard deviation is zero" warnings still appear. When I use a fixed lambda value I obtain results, but not when using lambda_grid or bootstraps.
Using caret, with a grid I defined myself, LASSO works, too.
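
For reference, the caret version I mean looks roughly like this (a sketch: alpha = 1 gives the LASSO, and the lambda grid is one I picked arbitrarily):

library(caret)
library(glmnet)

set.seed(4653)
## LASSO via caret/glmnet: fix alpha = 1 and tune only over lambda
caret_fit <- train(
  imdb_rating ~ .,
  data = office,
  method = "glmnet",
  preProcess = c("center", "scale"),
  tuneGrid = expand.grid(alpha = 1, lambda = 10^seq(-5, -1, length.out = 20)),
  trControl = trainControl(method = "cv", number = 5)
)
caret_fit$bestTune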

So I read, but I was hoping to find a workaround, or actually a way to compare the individual coefficients of the variables at the different lambda values instead of solely the RMSE of the lambda values.
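
For what it's worth, here is a sketch of how those coefficients could be inspected along the lambda path, reusing lasso_grid, wf, and tune_spec from the question and broom's tidy() method for glmnet fits:

library(tidymodels)

## Pick the best penalty from the tuning results, finalize the workflow,
## and fit it once on the full data set
best_penalty <- select_best(lasso_grid, metric = "rmse")

final_fit <- finalize_workflow(wf %>% add_model(tune_spec), best_penalty) %>%
  fit(data = office)

## The underlying glmnet object stores coefficients for the whole lambda
## path; broom::tidy() returns one row per term and lambda value
lasso_engine <- pull_workflow_fit(final_fit)  ## parsnip fit inside the workflow
tidy(lasso_engine$fit) %>%
  arrange(desc(lambda), term)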

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.