dbplyr Issues/Errors with Hive Native Queries

I extract data from Hive databases on our Cloudera Hadoop cluster using dbplyr combined with DBI and odbc.

Now IT is imposing a change in that odbc Hive connections should be made using native queries to reduce performance overhead. On Linux this means including UseNativeQuery=1 in the odbc.ini file. On Windows it's under the advanced options in the ODBC Data Source Adminstrator tool.

This change causes issues when using dbplyr. DBI still works with the appropriate SQL though.

Please note that this issue may share common elements with this previous post:

I have uploaded a copy of mtcars to our container, so the following is not directly reproducible, but hopefully it demonstrates the issues and errors:

suppressPackageStartupMessages(library(dplyr))  # 0.8.0.1
#> Warning: package 'dplyr' was built under R version 3.4.4
suppressPackageStartupMessages(library(dbplyr)) # 1.4.0

# odbc connection using native queries
con_native <- DBI::dbConnect(odbc::odbc(), "authentication")

# unsuccessful with DBI, but managed to upload copy of mtcars to Hive via:
# RODBC::sqlSave(con, mtcars, "container_name.tmp_mtcars", rownames = "car")

# container_name hidden
mtcars_native <- tbl(con_native, in_schema("container_name", "tmp_mtcars"))

# output is as expected with table name prefix for native queries
# `mpg` is recognised as column name even without table name prefix
mtcars_native %>% 
  filter(mpg > 30)
#> # Source:   lazy query [?? x 12]
#> # Database: Hive 1.1.0-cdh5.14.4[@Hive/HIVE]
#>   tmp_mtcars.car tmp_mtcars.mpg tmp_mtcars.cyl tmp_mtcars.disp
#>   <chr>                   <dbl>          <dbl>           <dbl>
#> 1 Fiat 128                 32.4              4            78.7
#> 2 Honda Civic              30.4              4            75.7
#> 3 Toyota Corolla           33.9              4            71.1
#> 4 Lotus Europa             30.4              4            95.1
#> # ... with 8 more variables: tmp_mtcars.hp <dbl>, tmp_mtcars.drat <dbl>,
#> #   tmp_mtcars.wt <dbl>, tmp_mtcars.qsec <dbl>, tmp_mtcars.vs <dbl>,
#> #   tmp_mtcars.am <dbl>, tmp_mtcars.gear <dbl>, tmp_mtcars.carb <dbl>

# same result (not run)
# head(DBI::dbGetQuery(con_native, "SELECT * FROM container_name.tmp_mtcars WHERE (`mpg` > 30.0)"))

# column names not recognised
mtcars_native %>% 
  select(car, mpg)
#> Error in .f(.x[[i]], ...): object 'car' not found

# output successful, but without table name prefix
head(DBI::dbGetQuery(con_native, "SELECT `car`, `mpg` FROM container_name.tmp_mtcars"))
#>             car  mpg
#> 1     Mazda RX4 21.0
#> 2 Mazda RX4 Wag 21.0
#> 3     Merc 280C 17.8
#> 4    Merc 450SE 16.4
#> 5    Merc 450SL 17.3
#> 6   Merc 450SLC 15.2

# same scenario with mutate (not run)
# mtcars_native %>% 
#   mutate(kpl = mpg * 0.425)

# output successful even with mutate, but without table name prefix
mtcars_native %>% 
  group_by(cyl) %>% 
  summarise(mpg_mean = mean(mpg, na.rm = TRUE)) %>% 
  mutate(kpl_mean = mpg_mean * 0.425)
#> # Source:   lazy query [?? x 3]
#> # Database: Hive 1.1.0-cdh5.14.4[@Hive/HIVE]
#>     cyl mpg_mean kpl_mean
#>   <dbl>    <dbl>    <dbl>
#> 1     4     26.7    11.3 
#> 2     6     19.7     8.39
#> 3     8     15.1     6.42

# same result (not run)
# head(DBI::dbGetQuery(con_native, "SELECT `cyl`, `mpg_mean`, `mpg_mean` * 0.425 AS `kpl_mean` FROM 
#                      (SELECT `cyl`, AVG(`mpg`) AS `mpg_mean` FROM container_name.tmp_mtcars GROUP BY `cyl`) `dbplyr_001`"))

DBI::dbDisconnect(con_native)

Created on 2019-06-05 by the reprex package (v0.3.0)

Standard select (and mutate) queries fail with dbplyr due to the column names not being recognised without the table name prefix, but are successful with DBI. However, other standard syntax (filter, group_by, summarise) works with just the raw column names. select/mutate are, however, successful following such a summarise operation.

Therefore my questions are:

  • Can dbplyr be enhanced to allow the use of Hive native queries?
  • Or are there any other workarounds using native queries which do not involve having to use SQL strings which become unreadable with complex queries?

I'm happy to provide any further system details I can, but I may be limited because the system is not managed by our team.

Hi, it may be better to move this to an Issue in dbplyr's GitHub repository, can you open one there please? https://github.com/tidyverse/dbplyr/issues

Done:

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.