I would like to fit a hurdle model within the caret framework and am reading the documentation page on adding your own model.
The formula input to a hurdle model has two parts, the regression (count) part followed by the binary classification (zero hurdle) part, e.g.
y ~ x1 + x2 | x1 + x2 + x3
This will run a hurdle model with y as a function of x1 + x2 for the regression part, conditional on the classification part, where y being zero or non-zero is a function of x1 + x2 + x3.
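As a small illustration of that two-part structure, a formula of this shape can also be assembled from character vectors in base R. This is only a sketch: the helper name build_hurdle_formula and its arguments are hypothetical and not part of pscl or caret.

```r
# Hypothetical helper (not from pscl or caret): build a two-part formula
# "outcome ~ count predictors | zero-hurdle predictors" from character vectors.
build_hurdle_formula <- function(outcome, count_vars, zero_vars = count_vars) {
  rhs_count <- paste(count_vars, collapse = " + ")  # regression (count) part
  rhs_zero  <- paste(zero_vars, collapse = " + ")   # classification (zero) part
  stats::as.formula(paste(outcome, "~", rhs_count, "|", rhs_zero))
}

f <- build_hurdle_formula("y", c("x1", "x2"), c("x1", "x2", "x3"))
print(f)
```

The right-hand side of the resulting formula is a single call to `|`, which is the shape pscl::hurdle() expects for separate count and zero-hurdle predictors.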
Example hurdle model call:
mod.hurdle.utility <- pscl::hurdle(
formula = spend_30d ~ d7_utility_sum + IOS + is_publisher_organic + is_publisher_facebook | d7_utility_sum + IOS + is_publisher_organic + is_publisher_facebook,
data = pdata_small,
dist = "negbin"
)
In this case the predictors for both the classification and regression models are the same, but, using caret, I would like to experiment by adding and dropping features from each part of the hurdle model.
Since a hurdle model is not among caret's built-in models (as far as I can see), I tried building a custom model to use with caret and got this far:
library(caret)
library(pscl)
library(dplyr) # needed for %>%, sample_n(), select() and mutate_all()
set.seed(123)
pdata_small = pdata %>% sample_n(100000)
## caret custom model for hurdle
pscl_hurdle <- list(type = c("Classification", "Regression"),
library = "pscl",
loop = NULL,
# parameters distinct to the custom model
parameters = data.frame(parameter = c("dist", "zero.dist"),
class = rep("character", 2),
label = c("dist", "zero.dist")),
# required grid parameter
grid = function(x, y, len = NULL, search = "grid") {
if(search == "grid") {
expand.grid(dist = c("poisson", "geometric", "negbin"),
zero.dist = c("poisson", "geometric", "negbin", "binomial")) %>%
mutate_all(as.character)}
else {
data.frame(
dist = "poisson",
zero.dist = "binomial"
)
}
},
# define the fit function
fit = function(x, y, wts, param, lev, last, classProbs, ...) {
pscl::hurdle(
dist = param$dist,
zero.dist = param$zero.dist,
formula =
)
},
# figure these out after figuring out the formula part
predict = NULL,
prob = NULL
)
# create train control object for using the same 10 folds across all models
train_control <- trainControl(
method = "cv",
number = 3, # change to 10 when production ready
savePredictions = "final",
verboseIter = TRUE,
allowParallel = TRUE
)
mod.hurdle.utility <- train(
x = pdata_small %>% select(d7_utility_sum, spend_7d, IOS, is_publisher_facebook, is_publisher_organic),
y = pdata_small$spend_30d,
method = pscl_hurdle,
trControl = train_control
)
Within pscl_hurdle$fit I need to somehow take inputs from train() and construct something of the form y ~ x1 + x2 | x1 + x2 + x3.
Ideal world example:
mod.hurdle.utility <- train(
count_model_formula = spend_30d ~ d7_utility_sum + IOS + is_publisher_organic + is_publisher_facebook,
hurdle_model_formula = d7_utility_sum + IOS + is_publisher_organic + is_publisher_facebook,
method = pscl_hurdle,
trControl = train_control
)
If I could take those custom inputs, count_model_formula and hurdle_model_formula, and then combine them in pscl_hurdle$fit to read
formula = spend_30d ~ d7_utility_sum + IOS + is_publisher_organic + is_publisher_facebook | d7_utility_sum + IOS + is_publisher_organic + is_publisher_facebook
I think that might work, but I'm not sure since I've never done this before. I also doubt caret::train() would run without either a formula or x and y arguments.
Is what I'm trying to do possible? Am I on the right track? Is there a better way?
How can I integrate a hurdle model into caret? More specifically, how can I generate a custom formula within pscl_hurdle$fit of the form y ~ x1 + x2 | x1 + x2 + x3 based on the inputs when calling train()?
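To make the question concrete, here is a rough, untested-against-caret sketch of the kind of fit function I have in mind. It assumes that extra arguments given to train() are forwarded to the fit function through `...`; count_vars and zero_vars are hypothetical argument names of my own, and both default to every column of x. The `.outcome` column name is just a convention for holding y. The predict function relies on predict() for hurdle objects with type = "response".

```r
# Sketch only: a fit function for a custom caret model wrapping pscl::hurdle().
# count_vars / zero_vars are hypothetical extra arguments, assumed to arrive
# via `...` from train(); by default both parts use all columns of x.
hurdle_fit <- function(x, y, wts, param, lev, last, classProbs,
                       count_vars = colnames(x), zero_vars = colnames(x), ...) {
  dat <- if (is.data.frame(x)) x else as.data.frame(x)
  dat$.outcome <- y  # outcome stored next to the predictors
  f <- stats::as.formula(paste(
    ".outcome ~", paste(count_vars, collapse = " + "),
    "|", paste(zero_vars, collapse = " + ")
  ))
  pscl::hurdle(formula = f, data = dat,
               dist = as.character(param$dist),
               zero.dist = as.character(param$zero.dist), ...)
}

# Sketch of a matching predict function returning expected counts.
hurdle_predict <- function(modelFit, newdata, submodels = NULL) {
  predict(modelFit, newdata = as.data.frame(newdata), type = "response")
}

# Small simulated check, run only if pscl is installed.
if (requireNamespace("pscl", quietly = TRUE)) {
  set.seed(1)
  x <- data.frame(x1 = rnorm(200), x2 = rnorm(200), x3 = rnorm(200))
  y <- rpois(200, exp(0.5 + 0.3 * x$x1)) * rbinom(200, 1, 0.7)
  fit <- hurdle_fit(x, y, wts = NULL,
                    param = data.frame(dist = "poisson", zero.dist = "binomial"),
                    lev = NULL, last = TRUE, classProbs = FALSE,
                    count_vars = c("x1", "x2"),
                    zero_vars = c("x1", "x2", "x3"))
  preds <- hurdle_predict(fit, x)
}
```

If the forwarding assumption holds, the idea would be to set fit = hurdle_fit and predict = hurdle_predict in pscl_hurdle and call train(x = ..., y = ..., method = pscl_hurdle, count_vars = ..., zero_vars = ...), but I have not confirmed that this is how train() treats unknown arguments.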