I can chime in with a Databricks solution which worked for me.
Install JDBC driver (Maven)
Make sure you are using a cluster with the Microsoft SQL Server JDBC driver installed.
Note: to add this library to your own cluster, go to your cluster's settings, open the Libraries tab, click 'Install New', select Maven, and enter the driver's Maven coordinate.
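For reference, the SQL Server JDBC driver is published on Maven Central under the `com.microsoft.sqlserver:mssql-jdbc` coordinate; the version placeholder below is illustrative, so check Maven Central for a release compatible with your cluster's runtime:

```
com.microsoft.sqlserver:mssql-jdbc:<version>
```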
Use stored secrets
First we use Python to get the token and set it in the Spark conf.

Databricks provides the `dbutils.secrets.get()` function (a Python function) to pull secrets stored in the Databricks environment. If you view the output of `secret` you will see `'[REDACTED]'`; this is not an error. Secret redaction is expected behavior that prevents the value from being printed to standard output. `spark.conf.set("secret", secret)` then stores the token in the Spark conf so it can be read from R later:

```python
secret = dbutils.secrets.get(scope = "keyvault-allusers", key = "<your_key>")
spark.conf.set("secret", secret)
```
Set up R environment
If you want to take advantage of Spark in R you have two options: SparkR and sparklyr. SparkR is analogous to PySpark in that it uses Spark-like functions, while sparklyr uses dplyr functions where possible. If a user is already familiar with dplyr, sparklyr is recommended.
```r
# Installing latest version of Rcpp
install.packages("Rcpp")

# Installing sparklyr takes a few minutes, because it installs 10+ dependencies.
install.packages("sparklyr")

# Load sparklyr package.
library(sparklyr)

# Alternatively load the SparkR package
# library(SparkR)

# Use dplyr for data manipulation
library(dplyr)
```
Fetch the secret from the Spark conf
Again, the output of `secret` is `[REDACTED]`: the secret value is replaced with `[REDACTED]` whenever it is displayed in an output cell. This redaction is expected behavior to prevent the value from being printed to standard output. Note that `SparkR::sparkR.conf()` returns a named list, which is why the token is accessed below as `secret$secret`.

```r
sc <- sparklyr::spark_connect(method = "databricks", config = spark_config("secret"))

# Fetch the secret from the Spark conf.
# This can probably also be done in sparklyr (future work).
secret <- SparkR::sparkR.conf("secret")
```
```r
df <- sparklyr::spark_read_jdbc(
  sc,
  name = "my_table",
  options = list(
    url = "jdbc:sqlserver://<servername>.database.windows.net:<port>;database=<databasename>",
    user = "<username>",
    driver = "com.microsoft.sqlserver.jdbc.SQLServerDriver",
    password = secret$secret,
    dbtable = "<tablename>"
  )
)
```
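Once the table is registered you can sanity-check the connection. A minimal sketch (assuming the read above succeeded):

```r
# Row count is computed on the Spark cluster, not locally.
sparklyr::sdf_nrow(df)

# Preview the first few rows of the remote table.
head(df)
```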
Use dplyr commands to wrangle data
sparklyr allows the use of dplyr verbs while distributing the work across the Spark cluster. This makes the bridge from development to production smoother than with SparkR, since many of the same dplyr functions can be applied to Spark tables without changing syntax for Spark.
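For example, a short sketch (`df` is the JDBC table from above; `some_column` is a hypothetical column name, so replace it with one from your table):

```r
library(dplyr)

summary_tbl <- df %>%
  group_by(some_column) %>%   # executed on the Spark cluster, not locally
  summarise(n = n()) %>%
  collect()                   # bring the (small) aggregated result into local R
```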
The `display` command is provided by Databricks. It renders pretty tables and allows quick plotting of data via the GUI.