Cross validation with zero-inflated or hurdle model

I apologize in advance for a more general how to question than a specific issue with a code chunk.

I have a dataset which is mostly zero's and I would like to make use of a hurdle or zero-inflated model. I've been looking at pscl package.

My goal is accuracy over inference so I was trying to figure out a way to do cross validation with the functions within pscl, e.g. hurdle().

Having already attempted and failed to use pscl::hurdle() as a custom function in caret, and having done much Google searching for a blog post or Stack Exchange post on the subject, I'm wondering if anyone out there has a repo or example of doing cross validation with a hurdle model.

I found it challenging because the hurdle model has two components, the binary classification part and then the regression part. So, how could one handle both of these within an iteration within a k fold?

If anyone out there has done this before, please share how you did so?! Also, if anyone has any package suggestions for this please let me know too. Someone at work mentioned tidy models but that package looks early in development so I doubt they'd have functionality for an edge case such as this.

Any pointers, tips or relevant code blocks most welcome.

I think the key part while doing cross-validation is to keep the same proportion
of zeros in the data across folds (i.e. stratified cross-validation), and then fit the hurdle model to each fold. In this example, I added a variable art_cat as an indicator variable on whether the observed response was a zero and resampled using this indicator variable to specify the groups.

``` r
library("pscl")
#> Classes and Methods for R developed in the
#> Political Science Computational Laboratory
#> Department of Political Science
#> Stanford University
#> Simon Jackman
#> hurdle and zeroinfl functions by Achim Zeileis
library("caret")
#> Loading required package: lattice
#> Loading required package: ggplot2
library("dplyr")
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library("purrr")
#> 
#> Attaching package: 'purrr'
#> The following object is masked from 'package:caret':
#> 
#>     lift
library("tidyr")

# Using sample datasets
data("bioChemists", package = "pscl")

# Create grouping based on whether reponse is 0.
bioChemists <- bioChemists %>%
  mutate(art_cat = ifelse(art == 0, TRUE, FALSE))

# Create stratified folds based on these groups
biochem_kfold <- createFolds(bioChemists[["art_cat"]], k = 10, list = TRUE)

# Use the list of row-indices to create data.frames for each fold
biochem_kdf <- purrr::map_dfr(biochem_kfold, ~bioChemists[.x, ],
  .id = "Fold") %>%
  select(-art_cat)

# Check proportion of zeros in each fold:

(check_balance <- biochem_kdf %>%
  group_by(Fold) %>%
  summarise(proportion_zero = mean(art == 0)))

#> # A tibble: 10 x 2
#>    Fold   proportion_zero
#>    <chr>            <dbl>
#>  1 Fold01           0.304
#>  2 Fold02           0.304
#>  3 Fold03           0.297
#>  4 Fold04           0.304
#>  5 Fold05           0.297
#>  6 Fold06           0.297
#>  7 Fold07           0.297
#>  8 Fold08           0.297
#>  9 Fold09           0.304
#> 10 Fold10           0.304


(original_balance <- bioChemists %>%
  summarise(proportion_zero = mean(art == 0)))
#>   proportion_zero
#> 1       0.3005464


# One approach using nested data.frames (tidymodels)
biochem_kdf <- biochem_kdf %>%
  group_by(Fold) %>%
  nest()

hurdle_model <- function(df) hurdle(art ~ ., data = df, dist = "negbin")

biochem_kdf <- biochem_kdf %>%
  mutate(mod_fit = map(data, hurdle_model)) %>% # Fit model
  mutate(mod_summary = map(mod_fit, ~summary(.x))) %>% # Store summary
  mutate(mod_observed = map(data, "art")) %>% # Extract observed responses
  mutate(mod_predicted = map(mod_fit, ~predict(.x))) %>% # Extract predicted responses
  mutate(rmse = map2_dbl(mod_observed, mod_predicted, ~RMSE(.x, .y))) # Calculate RMSE for each fold

# For each fold, extract summary for example
biochem_kdf[["mod_summary"]]
#> [[1]]
#> 
#> Call:
#> hurdle(formula = art ~ ., data = df, dist = "negbin")
#> 
#> Pearson residuals:
#>     Min      1Q  Median      3Q     Max 
#> -1.4873 -0.7699 -0.1724  0.5927  3.3427 
#> 
#> Count model coefficients (truncated negbin with log link):
#>             Estimate Std. Error z value Pr(>|z|)   
#> (Intercept) -0.06495    0.67310  -0.096  0.92313   
#> femWomen     0.03801    0.29251   0.130  0.89662   
#> marMarried  -0.31786    0.32965  -0.964  0.33494   
#> kid5        -0.05288    0.22949  -0.230  0.81775   
#> phd          0.09239    0.15424   0.599  0.54919   
#> ment         0.04080    0.01416   2.880  0.00397 **
#> Log(theta)   1.26991    0.92054   1.380  0.16773   
#> Zero hurdle model coefficients (binomial with logit link):
#>             Estimate Std. Error z value Pr(>|z|)  
#> (Intercept)  2.08141    1.36882   1.521   0.1284  
#> femWomen    -0.01926    0.52122  -0.037   0.9705  
#> marMarried  -2.59290    1.08700  -2.385   0.0171 *
#> kid5        -0.21402    0.31804  -0.673   0.5010  
#> phd          0.23177    0.27015   0.858   0.3909  
#> ment         0.05934    0.03834   1.548   0.1217  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
#> 
#> Theta: count = 3.5605
#> Number of iterations in BFGS optimization: 17 
#> Log-likelihood:  -138 on 13 Df
#> 
#> [[2]] etc.

Created on 2019-12-04 by the reprex package (v0.3.0)

Thank you so much for this. I was really struggling here. I've not yet used tidy models but I'm going to take a dive in now.

As an aside, why would one wrap code within parenthesis?

e.g. above

(check_balance <- biochem_kdf %>%
  group_by(Fold) %>%
  summarise(proportion_zero = mean(art == 0)))

I see this sometimes out in the wild and do not understand why this is done?

Wrapping code in parentheses is the equivalent of using a print function. Compare

x <- 2

and

(x <- 2)

when executed in the console.

1 Like

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.