Goal
Use the tidymodels framework to implement LASSO with nested cross-validation (CV). (I'd also be interested in an implementation with caret, but at this point tidymodels is preferred.)
- Make LASSO work with bootstrap resampling
- Replace bootstrap resampling with nested CV: an inner CV (or bootstrap) loop selects the hyperparameter from a grid, and an outer CV loop estimates how well the resulting models generalize
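To make the second goal concrete, here is a minimal sketch of the nested-CV structure I have in mind. It uses `nested_cv()` from rsample. The data set (`mtcars`), the outcome (`mpg`), the fold counts, and the helper `fit_outer()` are stand-ins chosen so the sketch runs on its own; they are not part of my actual problem. I make no claim that this is the recommended tidymodels way.

```r
library(tidymodels)  # attaches rsample, recipes, parsnip, workflows, tune, yardstick

set.seed(4653)

# ASSUMPTION: mtcars / mpg are stand-ins for the office data / imdb_rating.
dat <- mtcars

# Outer 5-fold CV estimates performance; inner 5-fold CV tunes the penalty.
folds <- nested_cv(dat, outside = vfold_cv(v = 5), inside = vfold_cv(v = 5))

# The recipe is left unprepped; the workflow estimates it on each analysis set.
rec <- recipe(mpg ~ ., data = dat) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_predictors())

spec <- linear_reg(penalty = tune(), mixture = 1) %>%
  set_engine("glmnet")

wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(spec)

lambda_grid <- grid_regular(penalty(), levels = 20)

# Hypothetical helper: tune on the inner resamples of one outer split,
# refit on the outer analysis set, and score on the outer assessment set.
fit_outer <- function(outer_split, inner_rs) {
  tuned <- tune_grid(
    wf,
    resamples = inner_rs,
    grid = lambda_grid,
    metrics = metric_set(rmse)
  )
  best <- select_best(tuned, metric = "rmse")
  final_fit <- fit(finalize_workflow(wf, best), data = analysis(outer_split))
  held_out <- assessment(outer_split)
  preds <- predict(final_fit, new_data = held_out)
  rmse_vec(truth = held_out$mpg, estimate = preds$.pred)
}

# One RMSE per outer fold; their mean is the nested-CV performance estimate.
outer_rmse <- purrr::map2_dbl(folds$splits, folds$inner_resamples, fit_outer)
mean(outer_rmse)
```

The penalty is chosen separately inside each outer fold, so the outer assessment sets are never used for tuning.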
Approach
- Start from Julia Silge's code, which uses bootstrap resampling rather than nested CV
- Add nested CV functionality by replacing resamples with a nested-CV routine.
Issue
The code below throws an error, probably because the LASSO fit fails on the bootstrap resamples.
Code
# Package imports ------
library(readr)
library(tidymodels)
# Data ------
# Prepared according to the Blog post by Julia Silge
# https://juliasilge.com/blog/lasso-the-office/
urlfile = 'https://raw.githubusercontent.com/shudras/office_data/master/office_data.csv'
office = read_csv(url(urlfile))[-1]
#> Warning: Missing column names filled in: 'X1' [1]
#> Parsed with column specification:
#> cols(
#> .default = col_double()
#> )
#> See spec(...) for full column specifications.
# Lasso modeling -------
## Recipe and train it
office_rec <- recipe(imdb_rating ~ ., data = office) %>%
  step_zv(all_numeric(), -all_outcomes()) %>%
  step_normalize(all_numeric(), -all_outcomes()) %>%
  prep(strings_as_factors = FALSE) # Training
## Create workflow
wf <- workflow() %>%
  add_recipe(office_rec)
## Parameter tuning
set.seed(4653)
### Bootstrapping data for resampling
office_boot <- bootstraps(office, times = 5, strata = season)
### Create lambda search grid
lambda_grid <- grid_regular(penalty(), levels = 20)
### The model
tune_spec <- linear_reg(penalty = tune(), mixture = 1) %>%
  set_engine("glmnet")
### Apply the workflow
lasso_grid <- tune_grid(
  wf %>% add_model(tune_spec),
  resamples = office_boot,
  grid = lambda_grid
)
#> ! Bootstrap1: internal: the standard deviation is zero
#> ! Bootstrap2: internal: the standard deviation is zero
#> ! Bootstrap3: internal: the standard deviation is zero
#> ! Bootstrap4: internal: the standard deviation is zero
#> ! Bootstrap5: internal: the standard deviation is zero
#> Error: `x` and `y` must have same types and lengths
Created on 2020-06-10 by the reprex package (v0.3.0)
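One variant that may be worth comparing against the reprex above: the same tuning pipeline, but with the recipe passed to the workflow unprepped, since a workflow estimates the recipe itself on each resample. Whether the `prep()` call is actually the cause of my error is an assumption here, not a confirmed diagnosis. `mtcars` and `mpg` are again stand-ins so the sketch is self-contained, and `rec_unprepped` and `dat_boot` are names made up for this sketch.

```r
library(tidymodels)

set.seed(4653)

# ASSUMPTION: mtcars / mpg stand in for the office data / imdb_rating.
dat <- mtcars

# Same steps as in the reprex, but without the trailing prep() call.
rec_unprepped <- recipe(mpg ~ ., data = dat) %>%
  step_zv(all_numeric(), -all_outcomes()) %>%
  step_normalize(all_numeric(), -all_outcomes())

dat_boot <- bootstraps(dat, times = 5)
lambda_grid <- grid_regular(penalty(), levels = 20)

tune_spec <- linear_reg(penalty = tune(), mixture = 1) %>%
  set_engine("glmnet")

lasso_grid <- tune_grid(
  workflow() %>%
    add_recipe(rec_unprepped) %>%
    add_model(tune_spec),
  resamples = dat_boot,
  grid = lambda_grid,
  # RMSE only: rsq relies on a correlation, which can warn about a zero
  # standard deviation when the predictions are constant.
  metrics = metric_set(rmse)
)

# One row per penalty value, averaged over the bootstrap resamples.
collect_metrics(lasso_grid)
```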
Edit 1: package versions
# Package imports ------
library(readr)
library(tidymodels)
#> ── Attaching packages ─────────────────────────── tidymodels 0.1.0 ──
#> ✓ broom 0.5.6 ✓ recipes 0.1.12
#> ✓ dials 0.0.6 ✓ rsample 0.0.7
#> ✓ dplyr 0.8.5 ✓ tibble 3.0.1
#> ✓ ggplot2 3.3.0 ✓ tune 0.1.0
#> ✓ infer 0.5.1 ✓ workflows 0.1.1
#> ✓ parsnip 0.1.1 ✓ yardstick 0.0.6
#> ✓ purrr 0.3.4
#> ── Conflicts ────────────────────────────── tidymodels_conflicts() ──
#> x purrr::discard() masks scales::discard()
#> x dplyr::filter() masks stats::filter()
#> x dplyr::lag() masks stats::lag()
#> x ggplot2::margin() masks dials::margin()
#> x yardstick::spec() masks readr::spec()
#> x recipes::step() masks stats::step()
Edit 2: package versions
library(readr)
library(tidymodels)
#> ── Attaching packages ───────── tidymodels 0.1.0 ──
#> ✓ broom 0.5.6 ✓ recipes 0.1.12
#> ✓ dials 0.0.6 ✓ rsample 0.0.7
#> ✓ dplyr 1.0.0 ✓ tibble 3.0.1
#> ✓ ggplot2 3.3.1 ✓ tune 0.1.0
#> ✓ infer 0.5.1 ✓ workflows 0.1.1
#> ✓ parsnip 0.1.1 ✓ yardstick 0.0.6
#> ✓ purrr 0.3.4
#> ── Conflicts ──────────── tidymodels_conflicts() ──
#> x purrr::discard() masks scales::discard()
#> x dplyr::filter() masks stats::filter()
#> x dplyr::lag() masks stats::lag()
#> x ggplot2::margin() masks dials::margin()
#> x yardstick::spec() masks readr::spec()
#> x recipes::step() masks stats::step()
library(reprex)