dplyr::tbl not working with Hive

I am interacting with a Hive database using RStudio and RJDBC.

I can create a DB connection and use it as follows.

library(dplyr)
library(dbplyr)
library(DBI)
library(tidyr)
library(readr)
library(haven)
library(stringi)
library(readxl)
library(RSQLite)
library(StatMatch)
library(transport)
library(purrr)
library(sparklyr)
library(tcltk)

options(java.parameters = "-Xmx8G")  # adjust heap size as needed; must be set before the JVM starts
library(rJava)
library(RJDBC)

cp <- c(
  "/usr/lib/hive/hive-jdbc.jar",
  "/usr/lib/hive/hadoop-common.jar",
  "/usr/lib/hive/libthrift-0.9.2.jar",
  "/usr/lib/hive/hive-service.jar",
  "/usr/lib/hive/httpclient-4.2.5.jar",
  "/usr/lib/hive/httpcore-4.2.5.jar",
  "/usr/lib/hive/hive-jdbc-standalone.jar"
)
.jinit(classpath = cp)

drv <- JDBC("org.apache.hive.jdbc.HiveDriver", "hive-jdbc.jar")
conn <- dbConnect(drv,
                  "jdbc:hive2://myhadoopcluster:10000/default",
                  "myusername", "mypassword")
databases <- dbGetQuery(conn, "show databases")

This works fine.

kpfilenames <- dbListTables(conn)

This also gives me a list of tables.

However, when I try to get a connection to a table using the tbl() function, I get the following error:

kp1<-tbl(src_dbi(conn),"coreprofileformammain")
Error in .verify.JDBC.result(r, "Unable to retrieve JDBC result set for ", :
Unable to retrieve JDBC result set for SELECT *
FROM "coreprofileformammain" AS "zzz1"
WHERE (0 = 1) (Error while compiling statement: FAILED: ParseException line 2:5 cannot recognize input near '"coreprofileformammain"' 'AS' '"zzz1"' in join source)

The error seems to indicate that tbl is putting double quotes around the table names in the SQL call.

In fact, I get the identical error in HUE if I put quotes around the table names in the query.

Error while compiling statement: FAILED: ParseException line 2:5 cannot recognize input near '"coreprofileformammain"' 'AS' '"zzz1"' in join source

But in HUE the query of course runs correctly if the quotes are not present.

How do I get tbl to NOT pass the quotes through to the database driver?
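
For reference, the quotes most likely come from how identifiers are escaped for this connection: dbplyr escapes identifiers via DBI, and the default dbQuoteIdentifier() method wraps names in ANSI double quotes, whereas Hive expects backticks (or unquoted names). A quick way to inspect what the connection produces (an illustration, not a fix):

# check the identifier quoting this JDBC connection falls back to;
# if this shows ANSI double quotes, that's what ends up in the generated SQL
DBI::dbQuoteIdentifier(conn, "coreprofileformammain")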

Have you tried it without the src_dbi bit?

i.e.
kp1 <- tbl(conn, "coreprofileformammain")

I use odbc to connect to Hive, and it works very well with dbplyr, which you have in your list.

RJDBC is not the preferred connection method, as mentioned in another thread.

Also check:
https://db.rstudio.com/

Yes, I have tried it with and without src_dbi. Where did you get the ODBC driver? (This is an experimental system; I can't afford to pay thousands of dollars just to get a driver, and that seems to be the going rate.)

I downloaded them from the Cloudera site for free.
https://www.cloudera.com/downloads.html

They also have JDBC drivers, but I never used them.

If you happen to be an RStudio professional customer, we bundle a bunch of ODBC drivers for free.

Hi @mikeC, we have a Hive-specific page on the site here: http://db.rstudio.com/databases/hive/

It may be possible to get your current RJDBC connection working by pointing the connection class to the Hive translation:

sql_translate_env.JDBCConnection <- dbplyr:::sql_translate_env.Hive
db_analyze.JDBCConnection        <- dbplyr:::db_analyze.Hive
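
A sketch of how those overrides would be applied, reusing the RJDBC conn created earlier: define them at the top level (the global environment, where S3 dispatch on the JDBCConnection class should find them), then retry the lazy table.

# with the two JDBCConnection overrides above defined, retry the lazy table
kp1 <- tbl(conn, "coreprofileformammain")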

Thanks, I hadn't been able to find these before. I will give it a try.

The sql_translate_env approach did not work, but the blog page and other info are helpful. Now I have an ODBC driver to try to install.


Installed the ODBC driver from Cloudera; now tbl works.

> kp1<-tbl(con,"coreprofileformammain")

> kp1
# Source:   table<coreprofileformammain> [?? x 221]
# Database: Hive 1.1.0-cdh5.12.1[mconklin@Hive/HIVE]
   coreprofileformammain.mno coreprofileformammain.stno coreprofileformammain.hhno
             <S3: integer64>            <S3: integer64>            <S3: integer64>
 1                        87                          6                         46
 2                        88                          6                         46
 3                        89                          6                         46
 4                        98                          0                         50
 5                        99                          0                         50
 6                       106                          6                         55
 7                       153                          0                         76
 8                       154                          0                         76
 9                       267                          0                        146
10                       268                          0                        146
# ... with more rows, and 218 more variables: coreprofileformammain.ppage <S3:
#

Now, however, I cannot seem to access any of the variables in this tbl.

> select(kp1,coreprofileformammain.mno)
Error in overscope_eval_next(overscope, expr) : 
  object 'coreprofileformammain.mno' not found
> select(kp1,mno)
Error in overscope_eval_next(overscope, expr) : object 'mno' not found

But if I pull the data over (via collect), the names seem right. The problem is that I need to do the joins in the database instead of pulling the data into R and joining there. The join functions claim that the by variables (specified as coreprofileformammain.mno) can't be found, just like the select calls.

names(collect(kp1))
  [1] "coreprofileformammain.mno"             "coreprofileformammain.stno"           
  [3] "coreprofileformammain.hhno"            "coreprofileformammain.ppage"          
  [5] "coreprofileformammain.ppagecat"        "coreprofileformammain.ppagect4"       
  [7] "coreprofileformammain.ppagect5"        "coreprofileformammain.ppeduc"         

Anyone have any ideas?

So, if I specify the schema correctly instead of using the default schema, it works. Thanks for the pointers and advice, everyone.

con <- DBI::dbConnect(odbc::odbc(),
                      Driver = "[your driver's name]",
                      Host   = "[your server's path]",
                      Schema = "[your schema's name]",
                      UID    = rstudioapi::askForPassword("Database user"),
                      PWD    = rstudioapi::askForPassword("Database password"),
                      Port   = 10000)
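
With the schema specified in the connection, the dplyr verbs now work against the bare column names, so the joins can stay in Hive. A minimal sketch of the pattern (othertable is a hypothetical second table; only collect() pulls rows into R):

kp1 <- tbl(con, "coreprofileformammain")
select(kp1, mno, stno, hhno)          # translated to Hive SQL, nothing pulled yet

# in-database join; othertable is a hypothetical table sharing the mno key
kp2 <- tbl(con, "othertable")
kp1 %>%
  inner_join(kp2, by = "mno") %>%
  collect()                           # rows come into R only at this point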

@mikeC, you can also do this so you don't need to set up multiple connections if you need to access different schemas:

kp1 <- tbl(con, in_schema("schema", "table"))
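
in_schema() comes from dbplyr, which you already have loaded. For example, two lazy tables from different schemas on the same connection can still be joined in a single Hive query (the schema and table names here are hypothetical):

kp1 <- tbl(con, in_schema("schema_a", "coreprofileformammain"))
kp2 <- tbl(con, in_schema("schema_b", "othertable"))

inner_join(kp1, kp2, by = "mno")   # one cross-schema query, executed in Hive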
