I can chime in with a Databricks solution which worked for me.
Install JDBC driver (Maven)
Make sure you are using a cluster with the Microsoft SQL Server JDBC driver installed.
Note: to add this library to your own cluster, go to your cluster's settings, open the Libraries tab, click 'Install New', select Maven, and enter the driver's Maven coordinate.
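For reference, the SQL Server JDBC driver is published on Maven Central under the `com.microsoft.sqlserver:mssql-jdbc` coordinate; the version placeholder below is illustrative, so check Maven Central for a release compatible with your cluster's runtime:

```
com.microsoft.sqlserver:mssql-jdbc:<version>
```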
Use stored secrets
First we use Python to get the token and set it in the Spark conf.

Databricks provides the `dbutils.secrets.get()` function (a Python function) to pull secrets stored in the Databricks environment. If you view the output of `secret` you will see `'[REDACTED]'`; this is not an error. Secret redaction is expected behavior that prevents the value from being printed to standard output. `spark.conf.set("secret", secret)` then stores the token in the Spark conf so it can be read from R later:

```python
secret = dbutils.secrets.get(scope = "keyvault-allusers", key = "<your_key>")
spark.conf.set("secret", secret)
```
Set up R environment
If you want to take advantage of Spark in R you have two options: SparkR and sparklyr. SparkR is analogous to PySpark in that it uses Spark-like functions, while sparklyr uses dplyr functions where possible. If a user is already familiar with dplyr, sparklyr is recommended.
```r
# Installing latest version of Rcpp
install.packages("Rcpp")

# Installing sparklyr takes a few minutes, because it installs 10+ dependencies.
install.packages("sparklyr")

# Load sparklyr package.
library(sparklyr)

# Alternatively load the SparkR package
# library(SparkR)

# Use dplyr for data manipulation
library(dplyr)
```
Fetch the secret from the Spark conf
Again, the output of `secret` is `[REDACTED]`: the secret value is replaced with `[REDACTED]` whenever it is displayed in an output cell. This redaction is expected behavior to prevent the value from being printed to standard output. Note that `SparkR::sparkR.conf()` returns a named list, which is why the token is accessed below as `secret$secret`.

```r
sc <- sparklyr::spark_connect(method = "databricks", config = spark_config("secret"))

# Fetch the secret from the Spark conf.
# This can probably also be done in sparklyr (future work).
secret <- SparkR::sparkR.conf("secret")
```
```r
df <- sparklyr::spark_read_jdbc(
  sc,
  name = "my_table",
  options = list(
    url = "jdbc:sqlserver://<servername>.database.windows.net:<port>;database=<databasename>",
    user = "<username>",
    driver = "com.microsoft.sqlserver.jdbc.SQLServerDriver",
    password = secret$secret,
    dbtable = "<tablename>"
  )
)
```
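Once the table is registered you can sanity-check the connection. A minimal sketch (assuming the read above succeeded):

```r
# Row count is computed on the Spark cluster, not locally.
sparklyr::sdf_nrow(df)

# Preview the first few rows of the remote table.
head(df)
```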
Use dplyr commands to wrangle data
sparklyr allows the use of dplyr verbs while distributing the work across the Spark cluster. This makes the bridge from development to production smoother than with SparkR, since many of the same dplyr functions can be applied to Spark tables without changing syntax for Spark.
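For example, a short sketch (`df` is the JDBC table from above; `some_column` is a hypothetical column name, so replace it with one from your table):

```r
library(dplyr)

summary_tbl <- df %>%
  group_by(some_column) %>%   # executed on the Spark cluster, not locally
  summarise(n = n()) %>%
  collect()                   # bring the (small) aggregated result into local R
```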
The `display` command is provided by Databricks. It renders pretty tables and allows quick plotting of data via the GUI.