Request for help making a JDBC connection for use with spark_read_jdbc() in sparklyr

I am trying to establish a JDBC connection. The goal is to use the JDBC connection in conjunction with Spark. Ideally, this would be accomplished with spark_read_jdbc().

I receive an error when trying to define the driver. Any help is appreciated. Code is below, and please let me know if I can provide additional information.

# Point rJava at the system JDK
Sys.setenv(JAVA_HOME = "/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.242.b08-0.el7_7.x86_64")

require(RJDBC)
require(DBI)
require(rJava)

# Sybase jConnect driver jar and driver class
jconnect_jar_file <- "/opt/app/SybaseIQ/jConnect-16_0/classes/jconn4.jar"
jconnect_jar_class <- "com.sybase.jdbc4.jdbc.SybDriver"

# This is the call that fails
jconnect_driver <- JDBC(jconnect_jar_class, jconnect_jar_file)

Here is the error I receive:

Error occurred during initialization of VM
java/lang/NoClassDefFoundError: java/lang/Object

Hi @tmoorman,

Is there a reason you need to use JDBC?

We generally see people having a better time using the sparklyr package if they don't need to use JDBC.

Thank you for the reply, Alex! My thinking was to use R to access one of our large databases. We have three environments: a server with R installed, a cluster with Hadoop/Spark installed, and a server with the database. The plan was to go from the R server to Hadoop/Spark, and then to the database. As far as I can tell, spark_read_jdbc() is the way to do this while leveraging dplyr/sparklyr.

Would you suggest something else?

Is there a reason you need to connect to the database from the Spark cluster?

There are a variety of ways to connect to a database directly from R. We have some information at db.rstudio.com that may be helpful.

You can connect directly to a Spark cluster using the sparklyr package, no JDBC needed.
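
For example, once things are configured, a minimal connection might look like the sketch below. The master URL and the "flights" table are placeholders for whatever your cluster and data actually use:

library(sparklyr)
library(dplyr)

# Placeholder master URL; use whatever is appropriate for your cluster
sc <- spark_connect(master = "yarn")

# Tables already available to the cluster can then be queried with dplyr
flights <- tbl(sc, "flights")   # "flights" is a hypothetical table name
flights %>% count()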

Thank you for the follow-up questions, Alex. The database is large (petabytes), and I think we would benefit from using the Spark cluster's distributed compute. We can currently make an ODBC connection to the database from R. How would you recommend getting the data into the Spark cluster? Would it be better to have an instance of R installed on a Spark cluster edge node? Or would it be better to copy the data over from R to Spark after making the ODBC connection? My past experience with R has been limited to extracting data from a shared drive onto a desktop install. The new infrastructure, with the database, Spark, and R on different servers, is a mystery to me, and I'm not sure how best to combine them. Your guidance and suggestions are appreciated.

@tmoorman it looks like you've got the right idea. When working with Spark and large data in external databases, getting the data into Spark can be a challenge. The right approach is usually either a data-source-specific connector or a JDBC connection. While ODBC is the recommended method for connecting to external databases from R, Spark relies on JDBC for these connections.

I wouldn't suggest using ODBC to read the data into R and then transferring it to Spark. In that scenario, R becomes a bottleneck that all of the data has to pass through, which is especially problematic with petabytes of data. Instead, the idea is to have Spark connect directly to the database using JDBC.

Details about this process can be found in Chapter 8 of Mastering Spark with R. The section on JDBC connections is most relevant, but the entire chapter provides a good overview of connecting to different data sources from Spark.
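
To give you a feel for the shape of it, here's an untested sketch along the lines of what that chapter shows. The jar path and driver class are the ones from your first post; the master, URL (host, port, database), table name, and credentials are placeholders, and sparklyr.jars.default is just one way I've seen to make a driver jar available to the session:

library(sparklyr)

config <- spark_config()
# Ship the jConnect driver jar with the Spark session
config$sparklyr.jars.default <- "/opt/app/SybaseIQ/jConnect-16_0/classes/jconn4.jar"

sc <- spark_connect(master = "yarn", config = config)

# Have Spark (not R) pull the table over JDBC
my_tbl <- spark_read_jdbc(
  sc,
  name = "my_tbl",
  options = list(
    url      = "jdbc:sybase:Tds:db-host:2638/my_database",  # placeholder host/port/database
    driver   = "com.sybase.jdbc4.jdbc.SybDriver",
    dbtable  = "my_table",                                   # placeholder table name
    user     = "user",
    password = "password"
  )
)

From there the result behaves like any other Spark table and can be queried with dplyr.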

I don't personally have experience connecting to Sybase from Spark. However, I'm hopeful these resources can point you in the right direction for getting things configured appropriately. If you have additional questions, don't hesitate to update here.


Thank you, @Blair09M. This is helpful. I do have some follow-up questions based on what I saw in the Mastering Spark with R book. Does the Java software for making the JDBC connection need to be installed on the Spark/Hadoop cluster (as opposed to the R server)? If so, do you know what shell commands are needed to identify the paths to "Java Home" and "Spark Home"? Thanks again for your help!

You're correct - Java and the necessary JDBC components need to be available in the Spark/Hadoop cluster. The sparklyr package has two functions that may be helpful here:
- sparklyr::spark_get_java() will return the path to JAVA_HOME
- sparklyr::spark_home_dir() will return the path to SPARK_HOME
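
To check what your R session currently sees, something like this should work (nothing Sybase-specific here, just inspecting the environment):

library(sparklyr)

# Local Spark installations that sparklyr knows about
spark_installed_versions()
spark_home_dir()

# Environment variables visible to the current R session
Sys.getenv("JAVA_HOME")
Sys.getenv("SPARK_HOME")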

@Blair09M, I may have figured out the problem. Even if I am connecting to a Spark cluster on another computer in the network, it looks like I still need a locally installed version of Spark to communicate with the cluster.

I used RStudio's IDE to try to make a connection and received the following message:

"Connecting with a Spark cluster requires that you are on a system able to communicate with the cluster in both directions, and requires that the SPARK_HOME environment variable refers to a locally installed version of Spark that is configured to communicate with the cluster.

Your system doesn't currently have the SPARK_HOME environment variable defined. Please contact your system administrator to ensure that the server is properly configured to connect with the cluster."

Hopefully we can get a local installation of Spark on the server, and then the connection should work.

Thanks for the update - hopefully you're able to get the issue resolved. Chapter 7 in Mastering Spark with R contains some good information about different connections that can be created to Spark clusters.
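
If it helps while you sort out the server, sparklyr can also download and install a local copy of Spark for the current user. The version below is just an example and should generally match what your cluster runs:

library(sparklyr)

# Download and install a local Spark distribution (example version)
spark_install(version = "2.4.3")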

@Blair09M I installed a local instance of Spark, but I'm still getting crazy errors. If I try to make a Livy connection, the connection times out. Any suggestions?

Code:

sc <- spark_connect(
  master = "spark://##.###.##.##:7077",
  spark_home = spark_home_dir(),
  username = "user",
  password = "password"
)

Error:

Error in force(code) : Failed during initialize_connection: java.lang.IllegalArgumentException: requirement failed: Can only call getServletHandlers on a running MetricsSystem
  at scala.Predef$.require(Predef.scala:224)
  at org.apache.spark.metrics.MetricsSystem.getServletHandlers(MetricsSystem.scala:91)
  at org.apache.spark.SparkContext.<init>(SparkContext.scala:515)
  at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2486)
  at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:930)
  at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:921)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:921)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at sparklyr.Invoke.invoke(invoke.scala:147)
  at sparklyr.StreamHandler.handleMethodCall(stream.scala:122)
  at sparklyr.StreamHandler.read(stream.scala:65)
  at sparklyr.BackendHandler.channelRead0(handler.scala:53)
  at sparklyr.BackendHandler.channelRead0(handler.scala:12)
  at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
  at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
  at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
  at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
  at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
  at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
  at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
  at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
  at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:310)
  at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:284)
  at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
  at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
  at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
  at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1359)
  at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
  at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
  at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:935)
  at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:138)
  at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645)
  at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
  at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
  at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
  at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
  at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
  at java.lang.Thread.run(Thread.java:748)

Log: /tmp/RtmpXpr6rL/filed8c74171f8d6_spark.log

---- Output Log ----
2020-04-30 13:03:20 INFO CoarseGrainedSchedulerBackend$DriverEndpoint:54 - Asking each executor to shut down
2020-04-30 13:03:20 INFO BlockManagerMasterEndpoint:54 - Registering block manager xxx####xxxxxxxxx:42582 with 366.3 MB RAM, BlockManagerId(driver, xxx####xxxxxxxxx, 42582, None)
2020-04-30 13:03:20 WARN StandaloneAppClient$ClientEndpoint:66 - Drop UnregisterApplication(null) because has not yet connected to master
2020-04-30 13:03:20 INFO BlockManagerMaster:54 - Registered BlockManager BlockManagerId(driver, mtb0120ppcdwmerb, 42582, None)
2020-04-30 13:03:20 INFO BlockManager:54 - Initialized BlockManager: BlockManagerId(driver, mtb0120ppcdwmerb, 42582, None)
2020-04-30 13:03:20 INFO MapOutputTrackerMasterEndpoint:54 - MapOutputTrackerMasterEndpoint stopped!
2020-04-30 13:03:20 INFO MemoryStore:54 - MemoryStore cleared
2020-04-30 13:03:20 INFO BlockManager:54 - BlockManager stopped
2020-04-30 13:03:20 INFO BlockManagerMaster:54 - BlockManagerMaster stopped
2020-04-30 13:03:20 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:54 - OutputCommitCoordinator stopped!
2020-04-30 13:03:20 ERROR SparkContext:91 - Error initializing SparkContext. java.lang.IllegalArgumentException: requirement failed: Can only call getServletHandlers on a running MetricsSystem
  (stack trace repeats as above, truncated)

Livy can be difficult to work with, especially in this context, where you're trying to initiate a connection to a remote Spark environment and then, from there, initiate a JDBC connection. Instead, I'd advise running R and sparklyr from an edge node of the Spark cluster itself. Is that an option for you? Unfortunately, I'm rather limited in my experience with Livy.

@Blair09M This is what I was wondering. Using sparklyr from an edge node is a possibility. In that case, would I be making a "local" connection? Or, is there some kind of "yarn" connection that is preferable? Thanks again for chiming in. It sounds like I may be getting closer.

@tmoorman this will be a function of how your Spark cluster is set up. Chapter 6 in Mastering Spark with R contains details about different cluster configurations and the connection methods used for them. If you aren't directly involved in the standup and maintenance of the Spark cluster, it would likely be beneficial to work with someone who is to determine the appropriate method for connecting.
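
As a rough illustration of the difference you're asking about (this assumes a YARN-managed cluster, which is worth confirming with whoever maintains it):

library(sparklyr)

# From an edge node of a YARN-managed cluster, a client-mode connection
# runs the driver locally but uses the cluster's executors:
sc <- spark_connect(master = "yarn-client")

# A "local" connection, by contrast, runs Spark entirely on the machine
# where R is running and doesn't use the cluster at all:
# sc <- spark_connect(master = "local")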
