I'm trying to fit an ensemble model with a sparklyr pipeline, ideally with cross-validation, but I've hit a roadblock and can't find any useful information online.
Say I want to fit a random forest and a logistic regression to a dataset, then use their predictions to train a GBM (ideally, if I had more models, with an intermediate step that selects variables based on the correlation between models), and finally use cross-validation to select the best models. Can this be done with a pipeline?
I tried the following, but the pipeline breaks because I have a transformer after an unfit estimator:
    iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)

    # Create a pipeline
    pipeline <- ml_pipeline(sc) %>%
      ft_r_formula(Species ~ .) %>%
      ml_random_forest_classifier(
        prediction_col = "prediction_rf",
        probability_col = "probability_rf",
        raw_prediction_col = "rawPrediction_rf",
        uid = "random_forest"
      ) %>%
      ml_logistic_regression(
        prediction_col = "prediction_log",
        probability_col = "probability_log",
        raw_prediction_col = "rawPrediction_log"
      ) %>%
      ft_r_formula(Species ~ probability_rf + probability_log) %>%
      ml_gbt_classifier() # breaks here because estimator -> transformer

    # Specify hyperparameter grid
    grid <- list(
      random_forest = list(
        num_trees = c(5, 10),
        max_depth = c(5, 10),
        impurity = c("entropy", "gini")
      ),
      logistic = list(
        elastic_net_param = seq(0, 1, 0.1)
      ),
      gbt = list(
        max_iter = c(20, 40),
        max_depth = c(3, 5)
      )
    )

    # Create the cross validator object
    cv <- ml_cross_validator(
      sc,
      estimator = pipeline,
      estimator_param_maps = grid,
      evaluator = ml_multiclass_classification_evaluator(sc),
      num_folds = 3,
      parallelism = 4
    )

    # Train the models
    cv_model <- ml_fit(cv, iris_tbl)
Is this feasible with a single pipeline, or do I have to fit the steps separately and then join the predictions?
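For reference, the "separate steps" fallback could look roughly like the sketch below. It is a minimal illustration under several assumptions, not a recommended solution: it assumes a local Spark installation reachable with `spark_connect(master = "local")`, the names `prepared`, `with_rf`, `with_both`, `stacked`, `meta_features` and the `*_meta` columns are made up for the example, and the meta-learner is a random forest rather than a GBM because, as far as I know, Spark's GBT classifier only handles binary labels while `iris` has three classes. Instead of joining prediction tables, each fitted model is applied on top of the previous output so the probability columns accumulate in one table.

```r
library(sparklyr)

# Assumption: a local Spark installation is available
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)

# Build the shared "features" and "label" columns once
prepared <- ft_r_formula(iris_tbl, Species ~ .)

# Base learner 1: random forest, with its own output column names
rf_model <- ml_random_forest_classifier(
  sc,
  prediction_col = "prediction_rf",
  probability_col = "probability_rf",
  raw_prediction_col = "rawPrediction_rf"
) %>%
  ml_fit(prepared)
with_rf <- ml_transform(rf_model, prepared)

# Base learner 2: logistic regression, fitted on the same features
log_model <- ml_logistic_regression(
  sc,
  prediction_col = "prediction_log",
  probability_col = "probability_log",
  raw_prediction_col = "rawPrediction_log"
) %>%
  ml_fit(prepared)
with_both <- ml_transform(log_model, with_rf)

# Combine both probability vectors into one meta-feature vector
stacked <- ft_vector_assembler(
  with_both,
  input_cols = c("probability_rf", "probability_log"),
  output_col = "meta_features"
)

# Meta-learner (random forest swapped in for the GBM, see note above)
meta_model <- ml_random_forest_classifier(
  sc,
  features_col = "meta_features",
  label_col = "label",
  prediction_col = "prediction_meta",
  probability_col = "probability_meta",
  raw_prediction_col = "rawPrediction_meta"
) %>%
  ml_fit(stacked)

final <- ml_transform(meta_model, stacked)
```

Note that this sketch scores the base learners on the same rows they were trained on, so the meta-learner sees optimistic inputs; a proper stacking setup would presumably feed it out-of-fold predictions, and none of the three models is tuned by cross-validation here.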