Use sparklyr package with Oracle database connection



Hi there,

I would like some guidance on how to combine the following R packages:
- odbc: used to connect to an existing Oracle data source
- sparklyr: used to process that data on a standalone Spark cluster

Here is what I have done:
- On my client computer, I used the dbConnect() function from the odbc R package to connect to an existing Oracle database. This Oracle database is hosted on a Windows server.
- Separately, I set up a Spark standalone cluster on some computers located on the same local network but isolated from the Windows server. Using the master URL of this Spark cluster, I would like to use the spark_connect() function from the sparklyr package to connect my client computer (which is connected to my Oracle database) to the Spark cluster.

In summary, my objective is to use the Spark standalone cluster to run parallel processing (e.g. ml_regression_trees) on data stored in my Oracle database.

So I would like to know whether I necessarily need to copy the data from my Oracle database to my client computer in order to pass it as a data.frame to sparklyr functions. If so, how should I store the data to benefit from the Spark cluster?

Otherwise, is there a way to do all of this directly with the sparklyr package? (I mean: connection to the Oracle database + big data processing with Spark.)

More generally, does anyone have a recommended way to process massive data stored in an Oracle database while keeping an R interface?

Thank you very much for your help (any advice is welcome!)


Hi, yes, the idea would be to pull the Oracle data directly into Spark using a JDBC connector, so it never has to pass through your client machine as a data.frame. In sparklyr, you can use spark_read_jdbc to do that. @javierluraschi's PR for that has an example of how to set it up:
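
The example from that PR is not reproduced here. Separately, as a rough sketch of what a JDBC read from Oracle might look like, assuming an Oracle JDBC driver jar is available to Spark: the host, port, service name, table, credentials and jar path are all placeholders, and oracle_jdbc_options() and read_oracle_table() are hypothetical helpers written for this illustration, not part of sparklyr.

```r
# Hypothetical helper: builds the option list handed to spark_read_jdbc().
# The keys (url, driver, dbtable, user, password) are Spark's JDBC source options.
oracle_jdbc_options <- function(host, port, service, table, user, password) {
  list(
    url      = sprintf("jdbc:oracle:thin:@//%s:%d/%s", host, as.integer(port), service),
    driver   = "oracle.jdbc.OracleDriver",
    dbtable  = table,
    user     = user,
    password = password
  )
}

# Hypothetical helper: connects to the Spark cluster and registers the Oracle
# table as a Spark DataFrame. Needs sparklyr plus a reachable cluster and database.
read_oracle_table <- function(master, jdbc_jar, opts, name = "oracle_tbl") {
  config <- sparklyr::spark_config()
  # Assumption: jdbc_jar is the path to the Oracle JDBC driver jar. On a
  # multi-node cluster the workers may also need access to this jar.
  config$`sparklyr.shell.driver-class-path` <- jdbc_jar
  sc <- sparklyr::spark_connect(master = master, config = config)
  # memory = FALSE avoids eagerly caching a potentially very large table.
  tbl <- sparklyr::spark_read_jdbc(sc, name = name, options = opts, memory = FALSE)
  list(sc = sc, tbl = tbl)
}

# Example usage (placeholder values, left commented out):
# opts <- oracle_jdbc_options("oracle-host", 1521, "ORCLPDB1", "MYSCHEMA.MYTABLE",
#                             "my_user", "my_password")
# res  <- read_oracle_table("spark://master-host:7077", "/path/to/ojdbc8.jar", opts)
```

With this approach the Spark workers read from Oracle themselves, so they need network access to the database server. The returned tbl is a Spark table reference that can be passed to dplyr verbs and sparklyr's ml_* functions.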