Dplyr::tbl not working with HIVE

mikeC · November 16, 2017, 9:56pm

I am interacting with a HIVE database using RStudio and RJDBC.

I can create a connection a db connection and use it as follows.

library(dplyr)
library(dbplyr)
library(DBI)
library(tidyr)
library(readr)
library(haven)
library(stringi)
library(readxl)
library(RSQLite)
library(StatMatch)
library(transport)
library(purrr)
library(sparklyr)
library(dplyr)
library("DBI")
library(tcltk)
options(java.parameters = "-Xmx8G") # or whatever g u wanna set to
library("rJava")

library("RJDBC")

cp = c("/usr/lib/hive/hive-jdbc.jar"
, "/usr/lib/hive/hadoop-common.jar"
, "/usr/lib/hive/libthrift-0.9.2.jar"
, "/usr/lib/hive/hive-service.jar"
, "/usr/lib/hive/httpclient-4.2.5.jar"
, "/usr/lib/hive/httpcore-4.2.5.jar"
, "/usr/lib/hive/hive-jdbc-standalone.jar")
.jinit(classpath=cp)
drv <- JDBC("org.apache.hive.jdbc.HiveDriver"
, "hive-jdbc.jar" )
conn <- dbConnect(drv
, "jdbc:hive2://myhadoopcluster:10000/default"
, "myusername", "mypassword")
databases <- dbGetQuery(conn, "show databases")

works fine

kpfilenames<-dbListTables(conn)
also gives me a list of tables

However, when I want to get a connection to a table using the tbl function I get the following error:

kp1<-tbl(src_dbi(conn),"coreprofileformammain")
Error in .verify.JDBC.result(r, "Unable to retrieve JDBC result set for ", :
Unable to retrieve JDBC result set for SELECT *
FROM "coreprofileformammain" AS "zzz1"
WHERE (0 = 1) (Error while compiling statement: FAILED: ParseException line 2:5 cannot recognize input near '"coreprofileformammain"' 'AS' '"zzz1"' in join source)

The error seems to indicate that tbl is putting double quotes around the table names in the sql call

In fact, I get the identical error in HUE if I put quotes around the table names in the query.

Error while compiling statement: FAILED: ParseException line 2:5 cannot recognize input near '"coreprofileformammain"' 'AS' '"zzz1"' in join source

But in HUE the query of course runs correctly if the quotes are not present.

How do I get tbl to NOT pass the quotes through the database driver?

martin.R · November 16, 2017, 10:15pm

Have you tried it without the src_dbi bit?

i.e.
kp1<-tbl(conn,“coreprofileformammain”)

I use odbc to connect to Hive and it works very well with dbplyr which you have in your list.

RJDBC is not the preferred connection method as mentioned in another thread:

Also check:
https://db.rstudio.com/

mikeC · November 16, 2017, 10:21pm

Yes I have tried it with and without src_dbi. Where did you get the ODBC driver? (This is an experimental system, I can't afford to pay thousands of dollars just to get a driver and that seems to be the going rate)

martin.R · November 16, 2017, 10:28pm

I downloaded them from the Cloudera site for free.
https://www.cloudera.com/downloads.html

They also have JDBC drivers, but I never used them.

hadley · November 16, 2017, 10:57pm

If you happen to be an RStudio professional customer, we bundle a bunch of ODBC drivers for free

edgararuiz · November 17, 2017, 4:34pm

Hi @mikeC, we have a Hive specific page in the site here: http://db.rstudio.com/databases/hive/

It may be possible to get your current RJDBC connection working by pointing the connection class to the Hive translation:

sql_translate_env.JDBCConnection <- dbplyr:::sql_translate_env.Hive 
db_analyze.JDBCConnection <- dbplyr:::db_analyze.Hive

mikeC · November 17, 2017, 7:25pm

Thanks, I hadn't been able to find these before....I will give it a try

mikeC · November 17, 2017, 7:26pm

The sql_translate_env stuff did not work but the blog page and other info is helpful. Now I have an ODBC driver to try and install.

mikeC · November 20, 2017, 4:52pm

Installed odbc driver from Cloudera - now tbl works

> kp1<-tbl(con,"coreprofileformammain")

> kp1
# Source:   table<coreprofileformammain> [?? x 221]
# Database: Hive 1.1.0-cdh5.12.1[mconklin@Hive/HIVE]
   coreprofileformammain.mno coreprofileformammain.stno coreprofileformammain.hhno
             <S3: integer64>            <S3: integer64>            <S3: integer64>
 1                        87                          6                         46
 2                        88                          6                         46
 3                        89                          6                         46
 4                        98                          0                         50
 5                        99                          0                         50
 6                       106                          6                         55
 7                       153                          0                         76
 8                       154                          0                         76
 9                       267                          0                        146
10                       268                          0                        146
# ... with more rows, and 218 more variables: coreprofileformammain.ppage <S3:
#

Now however, I cannot seem to access any of the variables in this tbl

> select(kp1,coreprofileformammain.mno)
Error in overscope_eval_next(overscope, expr) : 
  object 'coreprofileformammain.mno' not found
> select(kp1,mno)
Error in overscope_eval_next(overscope, expr) : object 'mno' not found

But if I pull the data over (via collect) the names seem right. The problem is that I need to do joins in the database instead of pulling the data into R and doing the joins. The join functions claim that the by variables (specified as coreprofileformammain.mno) can't be found (just like the select command).

names(collect(kp1))
  [1] "coreprofileformammain.mno"             "coreprofileformammain.stno"           
  [3] "coreprofileformammain.hhno"            "coreprofileformammain.ppage"          
  [5] "coreprofileformammain.ppagecat"        "coreprofileformammain.ppagect4"       
  [7] "coreprofileformammain.ppagect5"        "coreprofileformammain.ppeduc"

Anyone have any ideas?

mikeC · November 20, 2017, 8:28pm

So if I use the specify the schema correctly instead of using the default schema it works. Thanks for the pointers and advice everyone.

con <- DBI::dbConnect(odbc::odbc(),
                      Driver = "[your driver's name]",
                      Host   = "[your server's path]",
                      Schema = "[your schema's name]",
                      UID    = rstudioapi::askForPassword("Database user"),
                      PWD    = rstudioapi::askForPassword("Database password"),
                      Port   = 10000)

martin.R · November 21, 2017, 5:55pm

@mikeC, you can also do this so you don't need to specify multiple connections if you need to access different schemas:

kp1 <- tbl(con, in_schema("schema", "table"))