SMOTE in sparklyr


Hi everyone.
I'm trying to apply a random forest model to a highly unbalanced data set in Spark using sparklyr.
Does anyone know how to apply the SMOTE method in sparklyr?
Or do you have any other suggestions for dealing with unbalanced data sets in Spark?


There seem to be two implementations in Spark: SMOTE-MR and SMOTE-BD.

SMOTE-MR is an approximation, and I was not able to find its sources. SMOTE-BD is available under majobasgall/smote-bd and licensed under Apache 2.0. There are a few options to run SMOTE-BD, depending on what you are more comfortable working with:

a) Run SMOTE-BD using spark-submit, following the instructions in the GitHub repo. Input files would have to be saved with save.keel() from SDR v0.7.0.0.

b) Create a sparklyr extension in Scala that wraps the SMOTE-BD source code and makes it easily available to R users.

c) Reimplement the algorithm in sparklyr. The first algorithm referenced in the publication can probably be implemented using only dplyr; the second part (creating the synthetic data) would most likely need to be implemented with spark_apply().
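
For option c), the synthetic-data step comes down to the standard SMOTE interpolation: take a minority-class row, pick one of its k nearest minority-class neighbours, and create a new row at a random position on the line between the two. Below is a rough sketch of that idea in plain Python, just to show the logic. It is not sparklyr code and it is not the SMOTE-BD implementation. The function names `k_nearest` and `smote` are made up for illustration, and it assumes all features are numeric. In sparklyr, the same logic would have to be written as an R function and passed to spark_apply().

```python
import math
import random


def k_nearest(point, others, k):
    """Return the k points in `others` closest to `point` (Euclidean distance)."""
    return sorted(others, key=lambda other: math.dist(point, other))[:k]


def smote(minority, n_synthetic, k=5, seed=None):
    """Create `n_synthetic` rows by interpolating between minority-class neighbours.

    `minority` is a list of equal-length tuples of numeric features.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        # Pick a random minority sample as the base point.
        idx = rng.randrange(len(minority))
        base = minority[idx]
        # Candidate neighbours are all the other minority samples.
        others = minority[:idx] + minority[idx + 1:]
        neighbour = rng.choice(k_nearest(base, others, min(k, len(others))))
        # Place the new row at a random position between base and neighbour.
        gap = rng.random()
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy minority class with two numeric features (made-up values).
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote(minority, n_synthetic=6, k=2, seed=42)
```

Note that this brute-force neighbour search compares every pair of minority rows, so it only makes sense within a partition or group of manageable size. How to split the data across partitions without losing too many true neighbours is exactly the part a distributed version has to approximate.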
