I'm trying to apply a random forest model to a highly unbalanced data set in Spark using sparklyr.
Does anyone know how to apply 'smote' method in sparklyr?
Or do you have any suggestions to deal with unbalanced datasets in spark?
There seem to be two implementations for Spark:
- SMOTE-MR: Distributed Synthetic Minority Oversampling Technique.
- SMOTE-BD: An Exact and Scalable Oversampling Method for Imbalanced Classification in Big Data.
SMOTE-MR is an approximation and I was not able to find the sources.
SMOTE-BD is available under majobasgall/smote-bd and licensed under Apache 2.0. There are a few options to run SMOTE-BD, depending on what you find yourself more comfortable working with:
a) Use spark-submit following the instructions in the GitHub repo; input files would have to be saved with save.keel() from SDR v0.7.0.0.
b) Create a sparklyr extension in Scala that wraps the SMOTE-BD source code and makes it easily available to R users.
c) Reinterpret the algorithm in sparklyr. The first algorithm referenced in the publication can probably be implemented using only dplyr; the second part (creating the synthetic data) would most likely need to be implemented with
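
For reference, the "creating the synthetic data" step in classic SMOTE boils down to interpolating between a minority sample and one of its nearest minority neighbours. Below is a minimal single-machine sketch of that idea in plain Python. It is not SMOTE-BD and not sparklyr code; the function names smote and k_nearest are made up for illustration, and picking the base point at random (instead of looping over every minority sample) is a simplification on my part.

```python
import math
import random


def k_nearest(point, others, k):
    """Return the k points in `others` closest to `point` (Euclidean distance)."""
    return sorted(others, key=lambda o: math.dist(point, o))[:k]


def smote(minority, n_synthetic, k=5, seed=None):
    """Create n_synthetic points by interpolating between minority neighbours.

    `minority` is a list of equal-length numeric tuples (minority class only).
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other points
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        others = minority[:i] + minority[i + 1:]
        neighbour = rng.choice(k_nearest(base, others, k))
        gap = rng.random()  # position on the segment from base to neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


if __name__ == "__main__":
    minority_rows = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
    for row in smote(minority_rows, n_synthetic=3, k=2, seed=42):
        print(row)
```

Every generated row lies on a line segment between two existing minority rows, so it stays inside the region the minority class already occupies. The brute-force neighbour search here is what becomes expensive on large data, and that is the part a distributed implementation has to handle differently.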