Connect to remote Spark from a local Windows environment

I am having trouble connecting to a remote Spark environment with sparklyr. The following is the code I used to make the connections (to YARN and Spark).

spark_connect(master = "thrift://remote spark ip:9083", spark_home = 'spark home in remote spark', version = '2.3.1')
spark_connect(master = "spark://remote spark ip:7077", spark_home = 'spark home in remote spark', version = '2.3.1')

Both of the above threw the same error message:

Error in start_shell(master = master, spark_home = spark_home, spark_version = version,  : 
  SPARK_HOME directory '/usr/hdp/current/spark2-client/' not found

Additional info:
R version : 3.5.2
sparklyr version : 0.9.3
Spark version : 2.3.1

thrift:// connections are not provided by Apache Spark but rather by Apache Hive. If you need to connect to a Thrift server for data analysis using Hive (not for modeling, distributed processing, streaming, etc.), consider using the odbc package with a generic ODBC driver or an RStudio Professional Driver: rstudio.com/products/drivers.
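As a rough sketch of the odbc route, assuming a Hive ODBC driver is already installed on the Windows machine: the driver name "Hive" and the host "remote-host" below are placeholders, not values from this thread, and the exact connection keywords depend on the driver you install. Port 10000 is the usual HiveServer2 default; 9083 is normally the Hive metastore port, which does not accept SQL queries.

```r
library(DBI)

# All values below are placeholders (assumptions); adjust them to the
# ODBC driver installed locally and to your cluster.
con <- dbConnect(
  odbc::odbc(),
  Driver = "Hive",         # name of the locally installed ODBC driver
  Host   = "remote-host",  # hostname or IP of the HiveServer2 node
  Port   = 10000           # common HiveServer2 default; 9083 is usually the metastore
)

# Run a simple query to confirm the connection works.
dbGetQuery(con, "SHOW TABLES")

dbDisconnect(con)
```

This gives you SQL access to Hive tables only; it does not create a sparklyr connection.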

To connect to a remote Spark instance there are two options, and both have tradeoffs:

  1. Install Spark on your local machine, configure it properly, and connect to the remote machine using the local spark_home; however, this approach requires careful configuration and high network bandwidth between the local machine and the remote machine. Therefore, unless your local machine is inside the Spark cluster, this is not recommended.
  2. Use Apache Livy to connect remotely through HTTP; however, Livy has significant performance tradeoffs that make it painful, but possible, to work remotely against a Spark cluster. Therefore, this approach is also not recommended.
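For reference, the two options above could look roughly like the sketch below. It assumes that spark_home is read as a path on the machine running R, which appears to be why passing the cluster's /usr/hdp/current/spark2-client/ produced the "not found" error. The host "remote-host" is a placeholder, and port 8998 is only Livy's usual default, so check both against your cluster.

```r
library(sparklyr)

# Option 1: use a local Spark installation that matches the cluster version.
# spark_install() downloads Spark to the local machine; leaving spark_home
# unset lets sparklyr locate that local copy by version.
spark_install(version = "2.3.1")
sc <- spark_connect(
  master  = "spark://remote-host:7077",  # placeholder host
  version = "2.3.1"
)
spark_disconnect(sc)

# Option 2: connect through Apache Livy over HTTP.
# Requires a Livy server running on the cluster.
sc <- spark_connect(
  master  = "http://remote-host:8998",   # placeholder host, usual Livy port
  method  = "livy",
  version = "2.3.1"
)
spark_disconnect(sc)
```

Neither call needs the remote SPARK_HOME path; both still carry the bandwidth and performance caveats described above.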

Instead, the recommended approach is to install RStudio Server or RStudio Server Pro in the Spark cluster, which you can then connect to from a local web browser. You can install RStudio Server from https://www.rstudio.com/products/rstudio/.

Thanks much. That's helpful.
