Sparklyr connection to Livy is Very Slow

Hi All,

I have an AWS EMR cluster and RStudio Server installed on a separate EC2 instance. When I connect from RStudio to the EMR master with method 'livy', the connection is very slow: it takes more than 3 minutes to connect, and reading data from each Hive table takes about as long. I'm using EMR 5.20. Any idea why?

library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "http://emr master ip:8998", method = "livy")

I wanted to use Livy because I want a dedicated RStudio Server with multiple users, where each user gets their own dedicated EMR cluster to run their models. So I guess Livy, as a REST interface to Spark, would suit this requirement.

Any help to resolve this slowness issue would be appreciated.

Thanks,
Asif

Hey Asif,

Sorry to hear about the trouble! Unfortunately, I believe Livy is somewhat well-known for being a slow / not very friendly way to access Spark. Is there a reason that you are setting up the RStudio Server separate from the EMR clusters? Would it be feasible to set up an RStudio Server as an edge node on each cluster? That is our generally recommended architecture, as it allows interfacing with Spark directly and not having to go through Livy.
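To put a number on the difference between going through Livy and connecting directly from an edge node, a small timing helper can wrap each connection attempt. This is only a sketch: `time_connect` is a hypothetical helper name, and the host in the usage comments is a placeholder rather than a real address.

```r
# Time how long a connection function takes. Returns the elapsed wall-clock
# seconds together with whatever the function returned (for example, a
# sparklyr connection object).
time_connect <- function(connect_fn) {
  result <- NULL
  timing <- system.time(result <- connect_fn())
  list(connection = result, seconds = timing[["elapsed"]])
}

# Usage (needs sparklyr and a reachable cluster; the host is a placeholder):
# livy <- time_connect(function() sparklyr::spark_connect(
#   master = "http://<emr-master-ip>:8998", method = "livy"))
# yarn <- time_connect(function() sparklyr::spark_connect(master = "yarn"))
# c(livy = livy$seconds, yarn = yarn$seconds)
```

The direct `master = "yarn"` call only works from a machine that has the cluster's Spark and Hadoop client configuration, which is what the edge-node setup provides.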

Is this RStudio Server Pro, or the open source version? If RStudio Server Pro is the reason for limiting yourself to a single node, you could always ask about named-user pricing, which is priced per user and allows as many servers as you like.

Thanks!

Cole


Thanks Cole for your reply.

I currently have RStudio on an edge node, but we are planning to have a transient AWS EMR cluster for each data scientist while using a single RStudio Server. I thought Livy would be good for this, so that each user makes a Spark connection to their dedicated transient cluster's Livy server.

I guess I can also set up an edge node for each user with different environment variables, so each user connects to a different EMR cluster's YARN. I'm trying this but having issues with the Spark lib jars: "Failed during initialize_connection: org.apache.hadoop.ipc.RemoteException(java.io.IOException)"... I'm investigating this now.
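The per-user environment variable idea could look roughly like the sketch below. Everything specific in it is an assumption: the user names, the `/etc/emr-conf/...` paths, and the helper names `cluster_env` and `connect_to_own_cluster` are hypothetical. It also assumes each directory holds a copy of that cluster's Hadoop/YARN client configs and that `SPARK_HOME` already points at a Spark install matching the cluster. It does not address the RemoteException above.

```r
# Hypothetical mapping from RStudio user to the directory holding the client
# configs (core-site.xml, yarn-site.xml, ...) of that user's EMR cluster.
cluster_conf_dirs <- c(
  alice = "/etc/emr-conf/alice",
  bob   = "/etc/emr-conf/bob"
)

# Return the environment variables that point Spark at one user's cluster.
cluster_env <- function(user, conf_dirs = cluster_conf_dirs) {
  if (!user %in% names(conf_dirs)) {
    stop("No cluster configured for user: ", user)
  }
  conf_dir <- conf_dirs[[user]]
  c(HADOOP_CONF_DIR = conf_dir, YARN_CONF_DIR = conf_dir)
}

# Set those variables for the current R session, then connect through YARN.
# Assumes sparklyr is installed and SPARK_HOME is set on this machine.
connect_to_own_cluster <- function(user = Sys.info()[["user"]]) {
  do.call(Sys.setenv, as.list(cluster_env(user)))
  sparklyr::spark_connect(master = "yarn", config = sparklyr::spark_config())
}
```

Each user would then call `sc <- connect_to_own_cluster()` at the start of their session. Because the variables are set per R session, two users on the same server can target different clusters at the same time.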

But is this the only way to have multiple RStudio users connect to different EMR clusters independently, or is there another option? Please advise.

Currently this is a POC, but my client may be interested in buying Pro once this is proven.

Thanks,
Asif

Unfortunately, I am definitely not the most experienced person when it comes to YARN and java.io.IOException issues. I think this community is the right place for those questions though!

I am also not aware of any other options when it comes to multiple RStudio users connecting to a different EMR cluster independently. We generally recommend that all of our customers set up RStudio on an edge node of the cluster, which I understand is troubling for your infrastructure.

This is not a solution today, but in RStudio 1.2, we are building more flexibility around where R and the user's R session get executed. As a result, it is plausible that in the future, there may be more support for this paradigm using this architecture. Our aim in version 1.2 is focusing on orchestration tools like Kubernetes and Slurm.

This blog post has more reading on the topic
