Hello,
I'm trying to fit an ensemble model with a sparklyr pipeline, ideally with cross-validation, but I've hit a roadblock and can't find any useful information online.
Let's say I want to fit a random forest and a logistic regression to a dataset, then use their predictions to train a GBM (ideally, if I had more models, with a step in between to select variables based on the correlation between the models' predictions), and finally use cross-validation to select the best models. Can this be done with a single pipeline?
I tried the following, but the pipeline breaks because I have a transformer after an unfit estimator:
library(sparklyr)

# Assumed setup: a local Spark connection (not shown in the original post)
sc <- spark_connect(master = "local")

iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)

# Create a pipeline
pipeline <- ml_pipeline(sc) %>%
  ft_r_formula(Species ~ .) %>%
  ml_random_forest_classifier(
    prediction_col = "prediction_rf",
    probability_col = "probability_rf",
    raw_prediction_col = "rawPrediction_rf",
    uid = "random_forest"
  ) %>%
  ml_logistic_regression(
    prediction_col = "prediction_log",
    probability_col = "probability_log",
    raw_prediction_col = "rawPrediction_log"
  ) %>%
  ft_r_formula(Species ~ probability_rf + probability_log) %>%
  ml_gbt_classifier()
# Breaks here: a transformer (ft_r_formula) follows an unfit estimator
# Specify hyperparameter grid
grid <- list(
  random_forest = list(
    num_trees = c(5, 10),
    max_depth = c(5, 10),
    impurity = c("entropy", "gini")
  ),
  logistic = list(
    elastic_net_param = seq(0, 1, 0.1)
  ),
  gbt = list(
    max_iter = c(20, 40),
    max_depth = c(3, 5)
  )
)
# Create the cross validator object
cv <- ml_cross_validator(
  sc,
  estimator = pipeline,
  estimator_param_maps = grid,
  evaluator = ml_multiclass_classification_evaluator(sc),
  num_folds = 3,
  parallelism = 4
)

# Train the models
cv_model <- ml_fit(cv, iris_tbl)
Is this feasible with a single pipeline, or do I have to fit the steps separately and then join the predictions?
Thank you!