Running Spark on RStudio Cloud?

rstudiocloud
spark
sparklyr

#1

I am trying to run the sparklyr package on RStudio Cloud. Installation works, and connecting sometimes works; other times I get this error:

Invalid maximum heap size: -Xmx0.8
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.

Even when the connection succeeds, I still have problems copying data to Spark using the copy_to() or spark_read_csv() commands. Neither seems to work.


#2

Unfortunately, rstudio.cloud instances are currently limited to 1GB of RAM, so they aren't a great place to be running Spark.


#3

It is for teaching, so an example with a small amount of data would be fine as well. Even when I configure Spark to use less than 1GB, it still gives problems.


#4

"0.8" is not a valid value for -Xmx. You need to supply an integer, and it can be followed by "k", "m" or "g".

https://docs.oracle.com/javase/8/docs/technotes/tools/unix/java.html#BABHDABI
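For example, a connection configured like the sketch below should avoid the heap-size error. This is only an illustration under the assumption that the memory limit was set via sparklyr's `sparklyr.shell.driver-memory` option (which maps to spark-submit's --driver-memory and ultimately the JVM's -Xmx flag); the exact option the original poster used is not shown in the thread.

```r
library(sparklyr)

conf <- spark_config()

# Valid: an integer followed by "k", "m" or "g".
conf$`sparklyr.shell.driver-memory` <- "800m"

# Invalid: a bare fraction like "0.8" produces
# "Invalid maximum heap size: -Xmx0.8".

sc <- spark_connect(master = "local", config = conf)
```

With "800m" the driver JVM stays under the 1GB instance limit while leaving some room for the R session itself.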


#5

Thanks! Now I get a different error when I run the copy_to() command:

Error: Unexpected state in sparklyr backend, terminating connection: failed to invoke spark command
18/07/07 08:45:05 INFO DAGScheduler: Job 0 finished: collect at utils.scala:43, took 1.197268 s

#6

Just before running copy_to(), what is the output of gc()?
My guess is that you're running out of memory.


#7

Here is the result of gc() just before running copy_to():

          used (Mb) gc trigger  (Mb) max used  (Mb)
Ncells 1093099 58.4    2033582 108.7  2033582 108.7
Vcells 6843222 52.3   22637548 172.8 18639753 142.3