sparklyr read from sql server database



I'm trying to import data into a Spark DataFrame using sparklyr. I've seen that this is easy with CSV files via spark_read_csv, but in my case I want to read directly from a SQL Server database into a Spark DataFrame using spark_read_jdbc. I have no idea how to set it up and can't find anything on the web.

Has anyone done it before?




I can chime in with a Databricks solution which worked for me.

Install JDBC driver (Maven)

Make sure you are using a cluster that has the Microsoft SQL Server JDBC driver installed.

Note: to add this library to your own cluster, go to your cluster's settings, open the Libraries tab, click 'Install New', select Maven, and enter the driver's Maven coordinate (e.g. com.microsoft.sqlserver:mssql-jdbc).

Use stored secrets

First we use Python to fetch the secret and store it in the Spark conf.

Databricks uses the dbutils.secrets.get() function to pull secrets stored in the Databricks environment; it is a Python function. If you print the value of secret you will see '[REDACTED]'. This is not an error; secret values are always redacted in notebook output.

spark.conf.set("secret", secret) stores the token in the Spark conf so it can be read back later from R.

secret = dbutils.secrets.get(scope = "keyvault-allusers", key = "<your_key>")
spark.conf.set("secret", secret)

Set up R environment

If you want to take advantage of Spark in R, you have two options: SparkR and sparklyr. SparkR is analogous to PySpark in that it uses Spark-style functions, while sparklyr uses dplyr verbs where possible. If you are familiar with dplyr, sparklyr is recommended.

# Installing the latest version of Rcpp
install.packages("Rcpp")

# Installing sparklyr takes a few minutes, because it installs 10+ dependencies.
install.packages("sparklyr")

# Load the sparklyr package.
library(sparklyr)

# Alternatively, load the SparkR package.
library(SparkR)

# Use dplyr for data manipulation.
library(dplyr)

Fetch the secrets in spark.conf

Again, the output of secret is redacted: the secret value is replaced with [REDACTED] whenever it is displayed in an output cell. This is expected behavior, intended to prevent printing the value to standard output.

sc <- sparklyr::spark_connect(method = "databricks")

# Read the secret back out of the Spark conf.
# This can probably also be done in sparklyr (future work).
secret <- SparkR::sparkR.conf("secret")

df <- sparklyr::spark_read_jdbc(sc,
                name = "my_table",
                options = list(url = "jdbc:sqlserver://<servername>:<port>;database=<databasename>",
                               user = "<username>",
                               driver = "com.microsoft.sqlserver.jdbc.SQLServerDriver",
                               password = secret$secret,
                               dbtable = "<tablename>"))

Use dplyr commands to wrangle data

sparklyr allows the use of dplyr while distributing the computation across the Spark cluster. This makes the bridge from development to production smoother than with SparkR, since many familiar dplyr functions can be applied to cluster-backed tables without changing syntax for Spark.
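As a minimal sketch of what that looks like, assuming the df table reference created by spark_read_jdbc above and hypothetical columns status and amount:

```r
# dplyr verbs are translated to Spark SQL and executed on the cluster
active_totals <- df %>%
  dplyr::filter(status == "active") %>%   # 'status' is a hypothetical column
  dplyr::group_by(status) %>%
  dplyr::summarise(total = sum(amount, na.rm = TRUE))  # 'amount' too

# Nothing is pulled into the R session until you collect()
local_df <- dplyr::collect(active_totals)
```

Until collect() is called, active_totals is just a lazy reference to a Spark query, so large tables stay on the cluster.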

The display command is provided by Databricks. It renders pretty tables and lets you build quick plots from the data via a GUI.