How to pass pscl::hurdle() two part formula input to caret?

I would like to fit a hurdle model within the caret framework and am reading the documentation page on adding your own model.

The formula input to a hurdle model has two parts separated by |: the count (regression) part on the left, then the zero hurdle (binary classification) part on the right, e.g.

y ~ x1 + x2 | x1 + x2 + x3

This will run a hurdle model with y as a function of x1 + x2 for the regression part, conditional on the classification part, where y being zero or non-zero is a function of x1 + x2 + x3.
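To make the shape of that formula concrete, here is a minimal sketch of how such a two-part formula can be assembled from two character vectors in base R. The names count_terms and zero_terms are placeholders I made up for illustration, not anything from pscl or caret.

```r
# Placeholder predictor names for each part of the model (illustrative only)
count_terms <- c("x1", "x2")        # regression (count) part, left of |
zero_terms  <- c("x1", "x2", "x3")  # zero hurdle part, right of |

# Paste the two right-hand sides together around "|" and convert to a formula
rhs <- paste(paste(count_terms, collapse = " + "),
             "|",
             paste(zero_terms, collapse = " + "))
f <- as.formula(paste("y ~", rhs))

print(f)
```

The resulting object is an ordinary formula, so it should be usable anywhere a literal y ~ x1 + x2 | x1 + x2 + x3 would be.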

Example hurdle model call:

mod.hurdle.utility <- pscl::hurdle(
  formula = spend_30d ~ d7_utility_sum + IOS + is_publisher_organic + is_publisher_facebook | d7_utility_sum + IOS + is_publisher_organic + is_publisher_facebook,
  data = pdata_small,
  dist = "negbin"
)

In this case the predictors for both the classification and regression models are the same, but, using caret, I would like to experiment by adding and dropping features from each part of the hurdle model.

Since a hurdle model is not one of the models caret supports out of the box (as far as I can see), I got this far in building a custom model to use with caret:

library(caret)
library(pscl)
library(dplyr) # for %>%, sample_n(), select() and mutate_all()
set.seed(123)
pdata_small <- pdata %>% sample_n(100000)


## caret custom model for hurdle
pscl_hurdle <- list(type = c("Classification", "Regression"),
                    library = "pscl",
                    loop = NULL,
                    
                    # parameters distinct to the custom model
                    parameters = data.frame(parameter = c("dist", "zero.dist"),
                                            class = rep("character", 2),
                                            label = c("dist", "zero.dist")),
                    
                    # required grid parameter
                    grid = function(x, y, len = NULL, search = "grid") {
                      
                      if(search == "grid") {
                      
                      expand.grid(dist = c("poisson", "geometric", "negbin"),
                                       zero.dist = c("poisson", "geometric", "negbin", "binomial")) %>% 
                      mutate_all(as.character)}
                      
                      else {
                        data.frame(
                          dist = "poisson",
                          zero.dist = "binomial"
                        )
                      }
                      
                      },
                    
                    # define the fit function
                    fit = function(x, y, wts, param, lev, last, weights, classProbs, ...) {
                      pscl::hurdle(
                        dist = param$dist,
                        zero.dist = param$zero.dist,
                        formula = 
                        
                      )
                    },
                    
                    # figure these out after figuring out the formula part
                    predict = NULL,
                    prob = NULL
                    )

# create train control object for using the same 10 folds across all models
train_control <- trainControl(
  method = "cv",
  number = 3, # change to 10 when production ready
  savePredictions = "final",
  verboseIter = TRUE,
  allowParallel = TRUE
)


mod.hurdle.utility <- train(
  x = pdata_small %>% select(d7_utility_sum, spend_7d, IOS, is_publisher_facebook, is_publisher_organic),
  y = pdata_small$spend_30d,
  method = pscl_hurdle,
  trControl = train_control
)

Within pscl_hurdle$fit I need to somehow take the inputs from train() and construct a formula of the form y ~ x1 + x2 | x1 + x2 + x3.

Ideal world example:

mod.hurdle.utility <- train(
  count_model_formula = spend_30d ~ d7_utility_sum + IOS + is_publisher_organic + is_publisher_facebook,
  hurdle_model_formula = d7_utility_sum + IOS + is_publisher_organic + is_publisher_facebook,
  method = pscl_hurdle,
  trControl = train_control
)

If I could take those custom inputs, count_model_formula and hurdle_model_formula, and combine them inside pscl_hurdle$fit so that it reads formula = spend_30d ~ d7_utility_sum + IOS + is_publisher_organic + is_publisher_facebook | d7_utility_sum + IOS + is_publisher_organic + is_publisher_facebook, I think that might work, but I'm not sure since I've never done this before. I also doubt that caret::train() would run without either a formula or the x and y parameters.
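One direction that might work, sketched here under assumptions rather than tested against real data: as far as I understand, train() forwards extra arguments given in ... to the model's fit function, so the two sets of predictor names could be passed as character vectors instead of formulas. The argument names count_vars and zero_vars, the helper make_hurdle_formula, and the .outcome column name are all hypothetical choices of mine.

```r
# Hypothetical helper: build "response ~ count part | zero part" from names
make_hurdle_formula <- function(count_vars, zero_vars, response = ".outcome") {
  as.formula(paste(response, "~",
                   paste(count_vars, collapse = " + "),
                   "|",
                   paste(zero_vars, collapse = " + ")))
}

# Sketch of a fit function for the custom model list.
# count_vars / zero_vars are assumed to arrive via train()'s ... argument;
# both default to all columns of x.
hurdle_fit <- function(x, y, wts, param, lev, last, weights, classProbs,
                       count_vars = colnames(x), zero_vars = colnames(x), ...) {
  dat <- as.data.frame(x)
  dat$.outcome <- y  # attach the response under a fixed, known name
  pscl::hurdle(
    formula   = make_hurdle_formula(count_vars, zero_vars),
    data      = dat,
    dist      = as.character(param$dist),
    zero.dist = as.character(param$zero.dist),
    ...
  )
}

# Quick check of the formula the helper produces
print(make_hurdle_formula(c("d7_utility_sum", "IOS"),
                          c("d7_utility_sum", "IOS", "is_publisher_facebook")))
```

With fit = hurdle_fit in the pscl_hurdle list, the call would then keep x and y as usual and add something like count_vars = c("d7_utility_sum", "IOS") and zero_vars = c("d7_utility_sum", "IOS", "is_publisher_facebook") to train(). Every name used there would need to be a column of x.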

Is what I'm trying to do possible? Am I on the right track? Is there a better way?

How can I integrate a hurdle model into caret? More specifically, how can I generate a custom formula within pscl_hurdle$fit of the form y ~ x1 + x2 | x1 + x2 + x3 based on the inputs when calling train()?
