Accessing a CSV file in Azure Blob Storage via RStudio Server with sparklyr's spark_read_csv()

I have provisioned an Azure HDInsight cluster of type ML Services (R Server), operating system Linux, version ML Services 9.3 on Spark 2.2 with Java 8 (HDI 3.6).

Within RStudio Server I am trying to read a CSV file from my blob storage.

origins <- file.path("wasb://MYDefaultContainer@MyStorageAccount.blob.core.windows.net",
                     "user/RevoShare")
df2 <- spark_read_csv(sc,
                      path = origins,
                      name = 'Nov-MD-Dan',
                      memory = FALSE)

When I run this I get the following error:

Error: java.lang.IllegalArgumentException: invalid method csv for object 235
	at sparklyr.Invoke$.invoke(invoke.scala:122)
	at sparklyr.StreamHandler$.handleMethodCall(stream.scala:97)
	at sparklyr.StreamHandler$.read(stream.scala:62)
	at sparklyr.BackendHandler.channelRead0(handler.scala:52)
	at sparklyr.BackendHandler.channelRead0(handler.scala:14)
	at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
	at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:244)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
	at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
	at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
	at java.lang.Thread.run(Thread.java:748)

Any help would be awesome!

Here are the connection settings I am using:


Sys.setenv(SPARK_HOME="/usr/hdp/current/spark-client")
Sys.setenv(YARN_CONF_DIR="/etc/hadoop/conf")
Sys.setenv(HADOOP_CONF_DIR="/etc/hadoop/conf")
Sys.setenv(SPARK_CONF_DIR="/etc/spark/conf")

options(rsparkling.sparklingwater.version = "2.2.28")
library(sparklyr)
library(dplyr)
library(h2o)
library(rsparkling)

sc <- spark_connect(master = "yarn-client",
                    version = "2.2.0")

This worked just fine for me in Azure HDInsight with RStudio Server (default settings in HDInsight):

library(sparklyr)
sc <- spark_connect(master = "yarn")

# iris.csv manually uploaded through the Azure portal
origins <- file.path("wasb://sparklyr-test-2018-11-17t17-40-50-339z@javiersparklyr.blob.core.windows.net/user/sshuser/iris.csv")

spark_read_csv(sc, path = origins, name = "iris")
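If the file reads but the columns come out wrong, `spark_read_csv()` also accepts explicit parsing arguments. A sketch assuming the same `origins` path and an active connection `sc`:

```r
# Read the same file with explicit parsing options (illustrative values)
iris_tbl <- spark_read_csv(sc,
                           name = "iris",
                           path = origins,
                           header = TRUE,        # first line holds column names
                           delimiter = ",",      # field separator
                           infer_schema = TRUE,  # let Spark guess column types
                           memory = FALSE)       # don't cache the table eagerly
```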

Note that by default HDInsight installs sparklyr 0.6.3, which is a really old version. It also seems to restrict installations to CRAN and devtools, so I'm not sure how you would install the latest version of sparklyr (currently 0.9.2).
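To confirm which sparklyr build your cluster actually loaded (a quick sanity check, no Spark connection needed):

```r
# Print the installed sparklyr version; on a default HDInsight image
# this reportedly shows 0.6.3
packageVersion("sparklyr")
```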

The error message seems related to this fix: https://github.com/rstudio/sparklyr/pull/1751, so it would be ideal to upgrade with something like the following (which fails in HDInsight, so maybe Microsoft support can help troubleshoot):

install.packages("devtools")
devtools::install_github("rstudio/sparklyr")

which fails with:

Installation failed: error setting certificate verify locations:
  CAfile: microsoft-r-cacert.pem
  CApath: none

Could you share your connection code, or try connecting with the code I mentioned above?


I would also try connecting after adding the following before the spark_connect() call:

Sys.setenv(SPARK_HOME_VERSION="2.2.0")
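Putting that together with the connection code above, a minimal end-to-end sketch (the container, storage account, and file names are placeholders; adjust to your cluster):

```r
library(sparklyr)

# Pin the Spark version before connecting (assumption: matches the
# cluster's Spark 2.2 install)
Sys.setenv(SPARK_HOME_VERSION = "2.2.0")

sc <- spark_connect(master = "yarn")

# Placeholder wasb:// path; substitute your own container/account/file
origins <- file.path(
  "wasb://<container>@<account>.blob.core.windows.net",
  "user/RevoShare/myfile.csv")

df2 <- spark_read_csv(sc, name = "mydata", path = origins, memory = FALSE)

spark_disconnect(sc)
```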