I think the key part while doing cross-validation is to keep the same proportion
of zeros in the data across folds (i.e. stratified cross-validation), and then fit the hurdle model to each fold. In this example, I added a variable art_cat as an indicator variable on whether the observed response was a zero and resampled using this indicator variable to specify the groups.
``` r
library("pscl")
#> Classes and Methods for R developed in the
#> Political Science Computational Laboratory
#> Department of Political Science
#> Stanford University
#> Simon Jackman
#> hurdle and zeroinfl functions by Achim Zeileis
library("caret")
#> Loading required package: lattice
#> Loading required package: ggplot2
library("dplyr")
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library("purrr")
#>
#> Attaching package: 'purrr'
#> The following object is masked from 'package:caret':
#>
#> lift
library("tidyr")
# Using sample datasets
data("bioChemists", package = "pscl")
# Create grouping based on whether reponse is 0.
bioChemists <- bioChemists %>%
mutate(art_cat = ifelse(art == 0, TRUE, FALSE))
# Create stratified folds based on these groups
biochem_kfold <- createFolds(bioChemists[["art_cat"]], k = 10, list = TRUE)
# Use the list of row-indices to create data.frames for each fold
biochem_kdf <- purrr::map_dfr(biochem_kfold, ~bioChemists[.x, ],
.id = "Fold") %>%
select(-art_cat)
# Check proportion of zeros in each fold:
(check_balance <- biochem_kdf %>%
group_by(Fold) %>%
summarise(proportion_zero = mean(art == 0)))
#> # A tibble: 10 x 2
#> Fold proportion_zero
#> <chr> <dbl>
#> 1 Fold01 0.304
#> 2 Fold02 0.304
#> 3 Fold03 0.297
#> 4 Fold04 0.304
#> 5 Fold05 0.297
#> 6 Fold06 0.297
#> 7 Fold07 0.297
#> 8 Fold08 0.297
#> 9 Fold09 0.304
#> 10 Fold10 0.304
(original_balance <- bioChemists %>%
summarise(proportion_zero = mean(art == 0)))
#> proportion_zero
#> 1 0.3005464
# One approach using nested data.frames (tidymodels)
biochem_kdf <- biochem_kdf %>%
group_by(Fold) %>%
nest()
hurdle_model <- function(df) hurdle(art ~ ., data = df, dist = "negbin")
biochem_kdf <- biochem_kdf %>%
mutate(mod_fit = map(data, hurdle_model)) %>% # Fit model
mutate(mod_summary = map(mod_fit, ~summary(.x))) %>% # Store summary
mutate(mod_observed = map(data, "art")) %>% # Extract observed responses
mutate(mod_predicted = map(mod_fit, ~predict(.x))) %>% # Extract predicted responses
mutate(rmse = map2_dbl(mod_observed, mod_predicted, ~RMSE(.x, .y))) # Calculate RMSE for each fold
# For each fold, extract summary for example
biochem_kdf[["mod_summary"]]
#> [[1]]
#>
#> Call:
#> hurdle(formula = art ~ ., data = df, dist = "negbin")
#>
#> Pearson residuals:
#> Min 1Q Median 3Q Max
#> -1.4873 -0.7699 -0.1724 0.5927 3.3427
#>
#> Count model coefficients (truncated negbin with log link):
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -0.06495 0.67310 -0.096 0.92313
#> femWomen 0.03801 0.29251 0.130 0.89662
#> marMarried -0.31786 0.32965 -0.964 0.33494
#> kid5 -0.05288 0.22949 -0.230 0.81775
#> phd 0.09239 0.15424 0.599 0.54919
#> ment 0.04080 0.01416 2.880 0.00397 **
#> Log(theta) 1.26991 0.92054 1.380 0.16773
#> Zero hurdle model coefficients (binomial with logit link):
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 2.08141 1.36882 1.521 0.1284
#> femWomen -0.01926 0.52122 -0.037 0.9705
#> marMarried -2.59290 1.08700 -2.385 0.0171 *
#> kid5 -0.21402 0.31804 -0.673 0.5010
#> phd 0.23177 0.27015 0.858 0.3909
#> ment 0.05934 0.03834 1.548 0.1217
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Theta: count = 3.5605
#> Number of iterations in BFGS optimization: 17
#> Log-likelihood: -138 on 13 Df
#>
#> [[2]] etc.
Created on 2019-12-04 by the reprex package (v0.3.0)