Best practices for working with RStudio Open Source on Databricks

I am looking for best practices for working with RStudio Open Source installed on a cluster on (Azure) Databricks. Locally I currently organize my R code in projects, which is best practice for a local machine. However, is this also the suggested way of working with R code on Databricks given that you want to use RStudio Open Source as the IDE instead of the Databricks notebook IDE?

Questions / Discussion points:

  • Should I organize my R code in projects?
  • Where should I save my R code?
  • Other tips on how to work with RStudio Open Source on Databricks? For example, what are the pros and cons of installing packages via the Databricks UI versus install.packages()?

This introductory video on how to use RStudio on Azure Databricks is somewhat useful, but it does not discuss the points that I have listed above.

In general, my impression is that Databricks provides much less practical information and code examples on how to use their platform with R compared to Python and Scala, which I think is a shame for the R community.

Any comments, suggestions, and links to resources are most welcome.


I haven't used Databricks, but I just wanted to point out that they have a forum as well, where you might have more luck if no one around here has the answers!
https://forums.databricks.com/index.html


Hi mara, I have tried Databricks' forum, but in my experience it is very inactive. That is why I am posting my question here: RStudio Community is an actively managed forum, with RStudio employees responding in addition to the community at large. For example, I'm sure the people behind sparklyr have ideas and suggestions.


@samuel
RStudio is the best IDE for data science, and Databricks provides commercial Spark support that you can use from RStudio.

  1. install.packages() is more reproducible than clicking through the UI (see the sketch below the list).
  2. R code can be organized very well in RStudio, which has integrated Git support.
  3. R's grammar is more elegant for data science than other programming languages, and the learning cost is pretty low. For example: dplyr with sparklyr, sf, geospark, PostGIS.
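To illustrate the first point, here is a minimal sketch of keeping package installation in a script you can version alongside the project, rather than clicking through the Databricks Libraries UI. The package names and CRAN mirror are just placeholders, not a recommendation:

```r
# Illustrative setup script: the package list and repos URL are placeholders.
pkgs <- c("sparklyr", "dplyr", "sf")

# Install only what is missing, so re-running the script on a fresh cluster
# (or after a restart) is cheap and reproducible.
missing <- setdiff(pkgs, rownames(installed.packages()))
if (length(missing) > 0) {
  install.packages(missing, repos = "https://cloud.r-project.org")
}
```

Because the script lives in the project repo, a new cluster can be brought back to the same package state by sourcing it once.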

Hi Samuel, I am also just getting started with RStudio on Databricks, but will be using it intensively for a project. My emerging working practice is:

  • When setting up RStudio, don't save the one-time password (as it is one time only). If RStudio refuses to log in and stubbornly sits there, close the browser down and log in to Azure and Databricks again. It's a pain, but it works.

  • Run sparklyr locally when working up code, using samples of the actual datasets (see the sketch after this list). As you are charged while a cluster is running, this saves on cost and also avoids weird errors such as lost connections when halting code on the cluster.

  • Assume that no R code you write in the cluster's RStudio instance will be saved. The cluster starts with a fresh RStudio instance each time (a good thing in terms of best practice). I imagine there is a way to write R documents to the Databricks File System (DBFS), but the clean instance each time seems better to me.

  • Set up a GitHub repository containing an R project to import into your cluster's RStudio instance. Follow chapter 15 of Jenny Bryan and Jim Hester's guide Happy Git and GitHub for the useR (here) to set up an RStudio project linked to that repo. Then add your start-up scripts etc. (my emerging version is here and nothing special). When the cluster's RStudio has started, create a new project that uses version control, using your repo link (e.g. https://github.com/poldham/databricks.git). This will clone the repo into your RStudio cluster session with your scripts. If you write code you want to commit and push to the GitHub repo, you will need to use your username and password from the Git shell in RStudio.
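
On the "work locally first" point, this is roughly what I mean: a sketch assuming a local Spark installation and a sampled extract of the real data. The file path, table name, and the `year` column are all illustrative:

```r
library(sparklyr)
library(dplyr)

# Work up the code against a local Spark instance and a small sample file;
# nothing here touches the Databricks cluster, so no cluster charges accrue.
sc <- spark_connect(master = "local")

# A sampled extract of the real dataset, exported beforehand (path is a placeholder).
sample_tbl <- spark_read_csv(sc, name = "sample_data",
                             path = "data/sample_records.csv")

# Develop the dplyr pipeline against the sample.
sample_tbl %>%
  group_by(year) %>%
  summarise(n = n()) %>%
  collect()

spark_disconnect(sc)
```

Once the code behaves as expected on the sample, the same dplyr pipeline can be pointed at the cluster from the hosted RStudio instance without rewriting it.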

That is where I have got to so far. The main challenge I have encountered is:

  • Importing data using R code (there are examples with SparkR here). When importing from an Azure blob I have so far found myself using Python to copy the data into the Databricks File System (DBFS). Once the data is in DBFS, sparklyr works fine from the RStudio side (see the sketch below), and it should be possible to figure out a better R-only import approach somewhere along the line.
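
For reference, a sketch of the "sparklyr works fine once the data is in DBFS" step, run from the cluster's RStudio instance. The DBFS path, file format, and table name are placeholders:

```r
library(sparklyr)
library(dplyr)

# Connect to the cluster's own Spark from RStudio running on Databricks.
sc <- spark_connect(method = "databricks")

# Read a file that was previously copied into DBFS (path is illustrative).
records <- spark_read_parquet(sc, name = "records",
                              path = "dbfs:/mnt/project/records.parquet")

# From here the usual dplyr verbs are translated to Spark SQL on the cluster.
records %>% count() %>% collect()
```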

For small files (below 2 GB) I find that importing via Data > Add Data and then from a notebook (Python version) is easy. For large files, I tried the Databricks CLI to upload, but it was epically slow. Instead it is better to upload to an Azure blob using Storage Explorer or directly to the blob from the Azure portal. When importing large files I used Python code for the storage-key setup etc. For downloading files, the Databricks CLI worked very well.

Things to maybe explore

  • According to the documentation you can import R Markdown into Databricks notebooks (not tried yet), so this may provide a future way to address what currently seem to be cluster-side tasks.

These are not best practices, more just practices. Hopefully, RStudio users working with Databricks will start to write up blog posts or guides. In any case, I have found it a great solution when working with 188 million records from the Microsoft Academic Graph.

