Ensemble model in Sparklyr

Hello,

I'm trying to fit an ensemble model with a sparklyr pipeline, ideally with cross-validation, but I've hit a roadblock and can't find any useful information online.

Let's say I want to fit a random forest and a logistic regression to a dataset, then use their predictions to train a GBM (ideally, if I had more base models, with a step in between that selects variables based on the correlation between the models' predictions), and finally use cross-validation to select the best models. Can this be done with a single pipeline?

I tried the following, but the pipeline breaks because I have a transformer after an unfitted estimator:

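library(sparklyr)

# Assumes a local Spark installation; adjust `master` for a cluster
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)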

# Create a pipeline
pipeline <- ml_pipeline(sc) %>%
  ft_r_formula(Species ~ .) %>%
  ml_random_forest_classifier(prediction_col = "prediction_rf",
                              probability_col = "probability_rf",
                              raw_prediction_col = "rawPrediction_rf",
                              uid = "random_forest") %>%
  ml_logistic_regression(prediction_col = "prediction_log",
                         probability_col = "probability_log",
                         raw_prediction_col = "rawPrediction_log") %>%
  ft_r_formula(Species ~ probability_rf + probability_log) %>%
  ml_gbt_classifier()
# breaks here because estimator -> transformer

# Specify hyperparameter grid
grid <- list(
  random_forest = list(
    num_trees = c(5,10),
    max_depth = c(5,10),
    impurity = c("entropy", "gini")
  ),
  logistic = list(
    elastic_net_param = seq(0, 1, 0.1)
  ),
  gbt = list(
    max_iter = c(20, 40),
    max_depth = c(3, 5)
  )
)

# Create the cross validator object
cv <- ml_cross_validator(
  sc, 
  estimator = pipeline, 
  estimator_param_maps = grid,
  evaluator = ml_multiclass_classification_evaluator(sc),
  num_folds = 3,
  parallelism = 4
)

# Train the models
cv_model <- ml_fit(cv, iris_tbl)
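
For a sense of how much work this cross-validator would schedule, here is a rough count in base R. It assumes that the grid is expanded into every combination of the listed values, both within and across the three stages, and that the whole pipeline is refit at least once per fold. `count_combinations` is a hypothetical helper written for this illustration, not part of sparklyr.

```r
# Same grid as in the question, repeated so the snippet runs on its own.
grid <- list(
  random_forest = list(
    num_trees = c(5, 10),
    max_depth = c(5, 10),
    impurity = c("entropy", "gini")
  ),
  logistic = list(
    elastic_net_param = seq(0, 1, 0.1)
  ),
  gbt = list(
    max_iter = c(20, 40),
    max_depth = c(3, 5)
  )
)

# Hypothetical helper: number of parameter combinations, assuming a full
# cross product within each stage and across stages.
count_combinations <- function(grid) {
  per_stage <- vapply(grid, function(params) prod(lengths(params)), numeric(1))
  prod(per_stage)
}

n_combos <- count_combinations(grid)  # 8 * 11 * 4
num_folds <- 3
n_fits <- n_combos * num_folds

cat("combinations:", n_combos, "pipeline fits:", n_fits, "\n")
```

Under those assumptions the grid gives 352 combinations, so roughly 1056 pipeline fits with 3 folds. It may be worth shrinking the grid while debugging the pipeline itself.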

Is this feasible within a single pipeline, or do I have to fit the steps separately and then join the predictions?
Thank you!
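
For reference, the "separate steps" alternative could look roughly like the sketch below. It is not a confirmed solution, and it rests on several assumptions: a local Spark installation; a simple holdout split instead of cross-validation; and a multinomial logistic regression standing in as the meta-learner, because Spark's GBT classifier only handles binary labels and iris has three classes. The names `base_pipeline`, `meta_pipeline`, and `features_meta` are made up for this illustration.

```r
library(sparklyr)
library(dplyr)

# Assumption: a local Spark installation is available.
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)

# Hold out part of the data so the meta-learner is trained on predictions
# for rows the base learners have not seen.
splits <- sdf_random_split(iris_tbl, base = 0.6, meta = 0.4, seed = 42)

# Stage 1: both base learners read the RFormula output ("features", "label")
# and write to distinct output columns.
base_pipeline <- ml_pipeline(sc) %>%
  ft_r_formula(Species ~ .) %>%
  ml_random_forest_classifier(
    prediction_col = "prediction_rf",
    probability_col = "probability_rf",
    raw_prediction_col = "rawPrediction_rf"
  ) %>%
  ml_logistic_regression(
    prediction_col = "prediction_log",
    probability_col = "probability_log",
    raw_prediction_col = "rawPrediction_log"
  )

base_model <- ml_fit(base_pipeline, splits$base)
meta_tbl <- ml_transform(base_model, splits$meta)

# Stage 2: assemble the two probability vectors into a new feature column
# (a fresh name avoids clashing with the existing "features" column) and
# fit the stand-in meta-learner on the existing "label" column.
meta_pipeline <- ml_pipeline(sc) %>%
  ft_vector_assembler(
    input_cols = c("probability_rf", "probability_log"),
    output_col = "features_meta"
  ) %>%
  ml_logistic_regression(
    features_col = "features_meta",
    label_col = "label"
  )

meta_model <- ml_fit(meta_pipeline, meta_tbl)
stacked <- ml_transform(meta_model, meta_tbl)

# Compare base and stacked predictions side by side.
stacked %>%
  select(label, prediction_rf, prediction_log, prediction) %>%
  head()
```

This only shows how the predictions can be chained by hand. It does not tune any hyperparameters, and the stacked predictions are evaluated on the same rows the meta-learner was fit on, so a further held-out set would be needed for an honest estimate.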
