I want to tune a random forest model (with ranger) using the tidymodels framework. Because it takes too long locally, I'm trying to make it work on Azure Databricks.
I first called the tune_grid() function in the R notebook with a tiny grid and it worked (I verified the output) but without parallelization.
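For context, the call has roughly the following shape. This is only a minimal sketch and not my actual code: the data (mtcars), the object names (val_split, rf_spec, rf_wf, rf_grid) and the grid values are placeholders chosen so the example is self-contained. It assumes the tidymodels and ranger packages are installed; validation_split() may emit a deprecation warning in recent rsample versions.

```r
library(tidymodels)

set.seed(1)

# Placeholder data: a single 75/25 validation split of mtcars
val_split <- validation_split(mtcars, prop = 0.75)

# Random forest via ranger, with two hyperparameters marked for tuning
rf_spec <- rand_forest(mtry = tune(), min_n = tune(), trees = 100) %>%
  set_engine("ranger") %>%
  set_mode("regression")

rf_wf <- workflow() %>%
  add_model(rf_spec) %>%
  add_formula(mpg ~ .)

# Tiny grid: 2 x 2 = 4 candidate combinations
rf_grid <- expand.grid(mtry = c(2L, 4L), min_n = c(2L, 5L))

# Sequential run (no parallel backend registered)
res <- tune_grid(rf_wf, resamples = val_split, grid = rf_grid)

# A successful run has a populated .metrics column
collect_metrics(res)
```

When this works, the .metrics column holds a tibble per split rather than NULL.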
Next, I loaded the sparklyr package and called the following code, before trying the same tune_grid() call again:
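library(sparklyr)
# Connect to the Spark cluster behind the Databricks notebook
sc <- spark_connect(method = 'databricks')
# Register Spark as the foreach parallel backend
registerDoSpark(sc)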
According to this blog post, this method should work.
However, although Spark jobs pop up in the notebook, they finish quickly with lots of 'skipped stages', and no error is reported.
When I print the resulting object, I get:
# Tuning results
# Validation Set Split (0.75/0.25)
# A tibble: 1 x 4
  splits                id         .metrics .notes
  <list>                <chr>      <list>   <list>
1 <split [66222/22074]> validation <NULL>   <tibble [0 × 1]>
Does anyone have any idea why this happens, or how to diagnose the problem? Is there a better way to do tuning with tidymodels on Azure Databricks? (I'm a complete novice when it comes to cluster computing.) I have seen several other options for doing machine learning on Azure Databricks, but if possible I'd prefer to stick with tidymodels, because I like the framework and would like to keep using essentially the same code as on my laptop.
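One check that might help narrow this down is asking foreach which backend it currently sees. The following is only a sketch under assumptions: backend_info is a hypothetical helper name, and I have not confirmed what these calls report on the cluster after registerDoSpark(sc). The functions getDoParRegistered(), getDoParName() and getDoParWorkers() come from the foreach package.

```r
library(foreach)

# Hypothetical helper: summarise the foreach backend that is currently registered
backend_info <- function() {
  list(
    registered = getDoParRegistered(),  # TRUE if a parallel backend is registered
    name       = getDoParName(),        # backend name, or NULL if none
    workers    = getDoParWorkers()      # number of workers foreach will use
  )
}

info <- backend_info()
print(info)

# Tiny smoke test: if no backend is registered, foreach warns and
# runs this loop sequentially instead of in parallel
squares <- foreach(i = 1:3, .combine = c) %dopar% {
  i^2
}
print(squares)
```

Running this once before and once after registerDoSpark(sc) would at least show whether the registration took effect and whether a trivial %dopar% loop returns values through Spark.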
Thanks!