Can I connect to Azure Data Lake using sparklyr in RStudio?

Hi, I've got a connection to Azure Databricks that I can successfully access through sparklyr in RStudio. But now I want to access data in Azure Data Lake using that spark cluster. I can do this in a Databricks notebook in the cloud using the following Python code:

Python

spark.conf.set(
  "fs.azure.account.key.{OurStorageAccount}.dfs.core.windows.net",
  "{OurAccessKey}")

I'm using the approach put forth in this RStudio guide, which led me to believe I could perhaps do something like this in RStudio using sparklyr:

R

library(sparklyr)

conf <- spark_config()
conf$fs.azure.account.key.stddatalake.dfs.core.windows.net <- "{OurAccessKey}"

sc <- spark_connect(method = "databricks", 
                    spark_home = "/Users/{...}/opt/anaconda3/lib/python3.8/site-packages/pyspark",
                    config = conf)

But then running a spark_read_csv call fails with an error mentioning getStorageAccountKey:

Error: com.databricks.service.SparkServiceRemoteException: Failure to initialize configuration
at shaded.databricks.{...}.apache.hadoop.fs.azurebfs.services.SimpleKeyProvider.getStorageAccountKey(SimpleKeyProvider.java:51)
at shaded.databricks.{...}.apache.hadoop.fs.azurebfs.AbfsConfiguration.getStorageAccountKey(AbfsConfiguration.java:412)
at shaded.databricks.{...}.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.initializeClient(AzureBlobFileSystemStore.java:1016)
at shaded.databricks.{...}.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.<init>(AzureBlobFileSystemStore.java:151)
at shaded.databricks.{...}.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.initialize(AzureBlobFileSystem.java:137)

...

So the question is: how should the conf$... <- "{OurAccessKey}" assignment be written so that the storage account key is handled correctly?

Many thanks in advance! 🙂


I've figured it out, and I'm posting the solution for posterity.

The access key can be passed in the options argument of the spark_read_* function, as a named list item.

R

library(sparklyr)

sc <- spark_connect(method = "databricks", 
                    spark_home = "/Users/{...}/opt/anaconda3/lib/python3.8/site-packages/pyspark")


storage_root <- "abfss://{OurContainerName}@{OurStorageAccount}.dfs.core.windows.net/"
file_path <- paste0(storage_root, "Sandbox/Demo/NycTaxi/yellow_trips/Year=2020/yellow_tripdata_2020-01.csv")

taxi_data <- spark_read_csv(sc, 
                            path = file_path,
                            header = TRUE,
                            infer_schema = TRUE,
                            options = list("fs.azure.account.key.{OurStorageAccount}.dfs.core.windows.net" = "{OurAccessKey")
)
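
For completeness, here is an alternative I have not tested: the same key can probably be set once per session by invoking the Spark session's runtime configuration through sparklyr, mirroring the spark.conf.set() call from the Databricks notebook. The {OurStorageAccount} and {OurAccessKey} placeholders are the same as above.

R

library(sparklyr)

# Untested sketch: set the account key on the live Spark session,
# the sparklyr equivalent of spark.conf.set(...) in the notebook.
spark_session(sc) %>%
  invoke("conf") %>%
  invoke("set",
         "fs.azure.account.key.{OurStorageAccount}.dfs.core.windows.net",
         "{OurAccessKey}")

With the key set this way, spark_read_csv should resolve the storage account without needing the options = list(...) entry on every read.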
