sparklyr: read from SQL Server database

hello,

I'm trying to import some data into a Spark DataFrame using sparklyr, and I have seen it can easily be done with CSV files using spark_read_csv.
In my case I'm trying to read directly from a SQL Server database into a Spark DataFrame using spark_read_jdbc, but I have no idea how I should set it up and I can't find anything on the web.

Has anyone done it before?

Thanks


I can chime in with a Databricks solution which worked for me.

Install JDBC driver (Maven)

Make sure you are using a cluster with the Microsoft SQL Server JDBC driver installed.

Note: to add this library to your own cluster, go to your cluster's settings, open the Libraries tab, click 'Install New', select Maven, and use the coordinate com.microsoft.azure:azure-sqldb-spark:1.0.2
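
If you are not on Databricks, a rough equivalent (just a sketch, assuming plain sparklyr with a local Spark install; the Maven coordinate and version are examples, so pick the mssql-jdbc build that matches your JRE) is to attach the driver via spark-submit's --packages flag through the connection config:

library(sparklyr)

# Sketch: attach the SQL Server JDBC driver as a Spark package at connect time.
# "com.microsoft.sqlserver:mssql-jdbc:8.4.1.jre8" is an example coordinate.
conf <- spark_config()
conf$sparklyr.shell.packages <- "com.microsoft.sqlserver:mssql-jdbc:8.4.1.jre8"
sc <- spark_connect(master = "local", config = conf)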

Use stored secrets

First we use Python to get the secret and set it in the Spark conf:

Databricks uses the dbutils.secrets.get() function to pull secrets stored in the Databricks environment. This is a Python function. If you view the output of secret you will see '[REDACTED]'; this is not an error.

spark.conf.set("secret", secret) sets the secret in the Spark conf.

%py
# Pull the secret from a Databricks secret scope and expose it via the Spark conf
secret = dbutils.secrets.get(scope = "keyvault-allusers", key = "<your_key>")
spark.conf.set("secret", secret)

Set up R environment

If you want to take advantage of Spark in R you have two options: SparkR and sparklyr. SparkR is analogous to PySpark in that it uses Spark-style functions, while sparklyr uses dplyr verbs where possible. If a user is already familiar with dplyr, then sparklyr is recommended (see the short contrast sketch after the setup block below).

%r
# Installing latest version of Rcpp
install.packages("Rcpp")

# Installing sparklyr takes a few minutes, because it installs 10+ dependencies.
install.packages("sparklyr")

# Load sparklyr package.
library(sparklyr)

# Also load the SparkR package (used below to fetch the secret from the Spark conf)
library(SparkR)

# Use for data manipulation
library(dplyr)
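
To make the difference concrete, here is a minimal contrast sketch (assuming a Spark connection sc, which is created in the next step, and the built-in mtcars data set):

%r
# SparkR style: Spark-specific functions on a SparkDataFrame
sdf <- SparkR::createDataFrame(mtcars)
SparkR::head(SparkR::filter(sdf, sdf$mpg > 25))

# sparklyr style: the same filter with familiar dplyr verbs
tbl <- sparklyr::copy_to(sc, mtcars, overwrite = TRUE)
tbl %>% dplyr::filter(mpg > 25) %>% head()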

Fetch the secret from spark.conf

Again, the output of secret will show as [REDACTED]: the secret value is replaced with [REDACTED] whenever it is displayed in an output cell. Secret redaction is expected behavior that prevents printing the value to standard output.

%r
sc <- sparklyr::spark_connect(method = "databricks")

# Fetch the secret from the Spark conf. sparkR.conf() returns a named list,
# so the value itself is in secret$secret.
# (This can probably also be done in sparklyr; future work.)
secret <- SparkR::sparkR.conf("secret")
secret$secret
%r
df <- sparklyr::spark_read_jdbc(
  sc,
  name = "my_table",
  options = list(
    url = "jdbc:sqlserver://<servername>.database.windows.net:<port>;database=<databasename>",
    user = "<username>",
    driver = "com.microsoft.sqlserver.jdbc.SQLServerDriver",
    password = secret$secret,
    dbtable = "<tablename>"
  )
) %>%
  sparklyr::collect()
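
Note: the trailing collect() pulls the entire table into an R data frame on the driver. Drop it if you want to keep the data in Spark and wrangle it lazily with dplyr, collecting only the final result.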

Use dplyr commands to wrangle data

sparklyr lets you use dplyr while the work is distributed across the Spark cluster. This makes the bridge from development to production smoother than with SparkR, since many of the same dplyr functions can be applied on the cluster without changing syntax for Spark.
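
As a quick sketch (assuming df was read without the trailing collect() above, so it is still a Spark table reference; some_column is a hypothetical column name):

%r
# Group, count, and sort on the Spark side; only the small summary is collected.
# `some_column` is a placeholder for a real column in your table.
df %>%
  dplyr::group_by(some_column) %>%
  dplyr::summarise(n = dplyr::n()) %>%
  dplyr::arrange(dplyr::desc(n)) %>%
  dplyr::collect()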

The display command is provided by Databricks. It renders pretty tables and allows quick plots of the data via a GUI.

%r
display(df)

Just wanted to make an update on my post.

The sparklyr way to set a secret is:

%r
conf <- spark_config()
conf$secret <- secret  # `conf` is then passed to spark_connect(config = conf)

This is an alternative to:

%r
secret <- SparkR::sparkR.conf("secret") 
secret$secret

This assumes the secret was set with dbutils at the top of the notebook/script, as follows:

%py
secret = dbutils.secrets.get(scope = "some_keyvault", key = "a_key_with_DB_read_privileges")
spark.conf.set("secret", secret)

My experience is with Databricks only. Your mileage may vary.
