I extract data from Hive databases on our Cloudera Hadoop cluster using
dbplyr combined with
Now IT is imposing a change in that odbc Hive connections should be made using native queries to reduce performance overhead. On Linux this means including
UseNativeQuery=1 in the
odbc.ini file. On Windows it's under the advanced options in the
ODBC Data Source Adminstrator tool.
This change causes issues when using
DBI still works with the appropriate SQL though.
Please note that this issue may share common elements with this previous post:
I have uploaded a copy of
mtcars to our container, so the following is not directly reproducible, but hopefully it demonstrates the issues and errors:
suppressPackageStartupMessages(library(dplyr)) # 0.8.0.1 #> Warning: package 'dplyr' was built under R version 3.4.4 suppressPackageStartupMessages(library(dbplyr)) # 1.4.0 # odbc connection using native queries con_native <- DBI::dbConnect(odbc::odbc(), "authentication") # unsuccessful with DBI, but managed to upload copy of mtcars to Hive via: # RODBC::sqlSave(con, mtcars, "container_name.tmp_mtcars", rownames = "car") # container_name hidden mtcars_native <- tbl(con_native, in_schema("container_name", "tmp_mtcars")) # output is as expected with table name prefix for native queries # `mpg` is recognised as column name even without table name prefix mtcars_native %>% filter(mpg > 30) #> # Source: lazy query [?? x 12] #> # Database: Hive 1.1.0-cdh5.14.4[@Hive/HIVE] #> tmp_mtcars.car tmp_mtcars.mpg tmp_mtcars.cyl tmp_mtcars.disp #> <chr> <dbl> <dbl> <dbl> #> 1 Fiat 128 32.4 4 78.7 #> 2 Honda Civic 30.4 4 75.7 #> 3 Toyota Corolla 33.9 4 71.1 #> 4 Lotus Europa 30.4 4 95.1 #> # ... with 8 more variables: tmp_mtcars.hp <dbl>, tmp_mtcars.drat <dbl>, #> # tmp_mtcars.wt <dbl>, tmp_mtcars.qsec <dbl>, tmp_mtcars.vs <dbl>, #> # tmp_mtcars.am <dbl>, tmp_mtcars.gear <dbl>, tmp_mtcars.carb <dbl> # same result (not run) # head(DBI::dbGetQuery(con_native, "SELECT * FROM container_name.tmp_mtcars WHERE (`mpg` > 30.0)")) # column names not recognised mtcars_native %>% select(car, mpg) #> Error in .f(.x[[i]], ...): object 'car' not found # output successful, but without table name prefix head(DBI::dbGetQuery(con_native, "SELECT `car`, `mpg` FROM container_name.tmp_mtcars")) #> car mpg #> 1 Mazda RX4 21.0 #> 2 Mazda RX4 Wag 21.0 #> 3 Merc 280C 17.8 #> 4 Merc 450SE 16.4 #> 5 Merc 450SL 17.3 #> 6 Merc 450SLC 15.2 # same scenario with mutate (not run) # mtcars_native %>% # mutate(kpl = mpg * 0.425) # output successful even with mutate, but without table name prefix mtcars_native %>% group_by(cyl) %>% summarise(mpg_mean = mean(mpg, na.rm = TRUE)) %>% mutate(kpl_mean = mpg_mean * 0.425) #> # Source: lazy query [?? x 3] #> # Database: Hive 1.1.0-cdh5.14.4[@Hive/HIVE] #> cyl mpg_mean kpl_mean #> <dbl> <dbl> <dbl> #> 1 4 26.7 11.3 #> 2 6 19.7 8.39 #> 3 8 15.1 6.42 # same result (not run) # head(DBI::dbGetQuery(con_native, "SELECT `cyl`, `mpg_mean`, `mpg_mean` * 0.425 AS `kpl_mean` FROM # (SELECT `cyl`, AVG(`mpg`) AS `mpg_mean` FROM container_name.tmp_mtcars GROUP BY `cyl`) `dbplyr_001`")) DBI::dbDisconnect(con_native)
Created on 2019-06-05 by the reprex package (v0.3.0)
mutate) queries fail with
dbplyr due to the column names not being recognised without the table name prefix, but are successful with
DBI. However, other standard syntax (
summarise) works with just the raw column names.
mutate are, however, successful following such a
Therefore my questions are:
dbplyrbe enhanced to allow the use of Hive native queries?
- Or are there any other workarounds using native queries which do not involve having to use SQL strings which become unreadable with complex queries?
I'm happy to provide any further system details I can, but I may be limited because the system is not managed by our team.