Strange error trying to work with SparkR on RStudio Server

We have a new RStudio Server setup on JupyterHub, built via jupyter-rsession-proxy, as available here (not sure how relevant this is):

https://github.com/jupyterhub/jupyter-rsession-proxy/blob/master/jupyter_rsession_proxy/__init__.py

I am trying to run some simple SparkR commands but hit an error that doesn't happen in a terminal R session or in the Jupyter IRkernel.

The following is in ~/.Rprofile:

# Default Spark configuration passed to sparkR.session()
DEFAULT_CONFIG = list(
  spark.cores.max = '8',
  spark.sql.sources.partitionColumnTypeInference.enabled = 'false',
  spark.executor.memory = '1g',
  spark.task.maxFailures = '10',
  spark.kubernetes.container.image = 'x' # suppressed
)

# Attach SparkR and magrittr, then start a Spark session in .GlobalEnv
.start_sandbox = function() {
  library(SparkR, pos = 3L)
  library(magrittr, pos = 3L)
  eval(substitute(sparkR.session(
    appName = 'myTestApp',
    enableHiveSupport = TRUE,
    sparkConfig = DEFAULT_CONFIG
  )), envir = .GlobalEnv)
}

Now if I start RStudio and run

.start_sandbox()

iris = iris
names(iris) = gsub('.', '_', names(iris), fixed = TRUE)
irisSDF = createDataFrame(iris)
irisSDF %>% head

it errors:

Error in (function (cl, name, valueClass)  : 
  assignment of an object of class "list" is not valid for @'sdf' in an object of class "SparkDataFrame"; is(value, "jobj") is not TRUE

But it works as expected when R is invoked from the IRkernel or from a terminal:

  Sepal_Length Sepal_Width Petal_Length Petal_Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

It also works if we get the head by piping without assignment:

iris %>% createDataFrame %>% head

The only difference I can see that might lead to this is in how R is invoked:

commandArgs()
# RStudio
# [1] "RStudio"       "--interactive"
# Terminal R
# [1] "/opt/conda/lib/R/bin/exec/R"
# Jupyter IRkernel
# [1] "/opt/conda/lib/R/bin/exec/R" "--slave" "-e" "IRkernel::main()" "--args" "/home/jovyan/.local/share/jupyter/runtime/kernel-4696a305-a022-45a0-be30-74bb2d4e8fa4.json"
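
In case it helps, here is a quick sketch for capturing more of the session state to diff across the three frontends (the SPARK_HOME and R_PROFILE_USER variables are just guesses at what might differ):

# Run this in each frontend and compare the results
list(
  args     = commandArgs(),
  libPaths = .libPaths(),
  sparkr   = c(version = as.character(packageVersion("SparkR")),
               path    = find.package("SparkR")),
  env      = Sys.getenv(c("SPARK_HOME", "R_PROFILE_USER"))
)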

Any idea what is happening here?

Sorry, I don't have a great idea. If I were debugging, I would try the following:

  1. Use options(error = recover) to see exactly which method is failing (a sketch follows this list);

  2. Compare the objects within that method in both the "working" case and the "non-working" case -- is there something different?

  3. If not, verify that the environments truly are the same -- e.g. is the version of SparkR the same in each case? Is it being loaded from the same library in each case?
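
For example, something along these lines (a sketch; the inspection at the recover() prompt is interactive, and which frame to pick depends on the traceback you see):

# 1. Stop in the failing frame and inspect it interactively
options(error = recover)
irisSDF <- createDataFrame(iris)  # re-run the failing call
# At the recover() prompt, pick the frame doing the S4 slot
# assignment and look at `value` -- is it a list instead of a jobj?

# 3. Check which SparkR each frontend is actually loading
packageVersion("SparkR")  # same version everywhere?
find.package("SparkR")    # same library path everywhere?
.libPaths()               # full library search path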

Hopefully, this gets you closer to the answer.

Thanks Kevin, quite helpful tips as always :slight_smile:

Of course, as soon as I tried to dig deeper, the problem stopped showing up reliably :man_facepalming:
