Problem with odbc sql transformation using dbplyr

dbplyr
odbc
tbl

#1

(This is related to a recent post however I'm asking a clean and fresh version of the question)

My dbplyr sql commands seem to be going unrecognized and causing errors. These are very basic commands (select, filter, mutate) that I believe are expected to translate well, leading me to believe I have a issue elsewhere.

I have a connection to our Hive instance using odbc and as I mentioned I'm using dbplyr package to pull data from hive into R.

Without actually providing a odbc connection to our Hive instance to people on the internet, it's tricky to provide a reproducible example. However, I'm hoping that the information I provide here can help diagnose.

Here's the top of my script as I connect to a Hive table:

library(odbc)
library(tidyverse)
library(dbplyr)
library(lubridate)
library(zoo)

# hive connection
con <- dbConnect(odbc::odbc(), "HiveProd")


# tables
order_discrepancy <- tbl(con, sql("select * 
                                  from myname.pro_programe_compare_uds_order
                                  where (inorder is null) or (intos is null)"))

Now, if I glimpse order_discrepancy:

> glimpse(order_discrepancy)
Observations: ??
Variables: 6
$ czycmgxdvd.pro_shopper_id  <chr> "100319101", "100944143", "101742363", "102141637", "10290598", "103123012", "10421410", "104398857", "10536771", "105412334", "105686556...
$ czycmgxdvd.tos_signed_date <date> NA, 2015-03-19, NA, 2016-01-21, NA, NA, 2015-10-07, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 2016-10-03, NA, NA, NA, NA, 2017-02-22, ...
$ czycmgxdvd.uds_order_date  <date> 2018-12-07, NA, 2018-12-07, NA, 2018-12-07, 2018-12-07, NA, 2018-09-17, 2018-12-09, 2018-12-09, 2018-12-08, 2018-12-10, 2018-12-10, 2018...
$ czycmgxdvd.date_diff_days  <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ czycmgxdvd.inorder         <lgl> TRUE, NA, TRUE, NA, TRUE, TRUE, NA, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, NA, TRUE, TRUE, TRUE, TRUE, NA, TRU...
$ czycmgxdvd.intos           <lgl> NA, TRUE, NA, TRUE, NA, NA, TRUE, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, TRUE, NA, NA, NA, NA, TRUE, NA, NA, NA, NA, NA, NA, TRUE, T...

So everything looks good, I see some data.
Here's a very basic command that works:

> order_discrepancy %>% filter(inorder == TRUE) %>% head()
# Source:   lazy query [?? x 6]
# Database: Hive 1.2.2[@Hive/HIVE]
  czuqppwbbq.pro_shopper_id czuqppwbbq.tos_signed_date czuqppwbbq.uds_order_date czuqppwbbq.date_diff_days czuqppwbbq.inorder czuqppwbbq.intos
  <chr>                     <date>                     <date>                                        <int> <lgl>              <lgl>           
1 100319101                 NA                         2018-12-07                                       NA TRUE               NA              
2 101742363                 NA                         2018-12-07                                       NA TRUE               NA              
3 10290598                  NA                         2018-12-07                                       NA TRUE               NA              
4 103123012                 NA                         2018-12-07                                       NA TRUE               NA              
5 104398857                 NA                         2018-09-17                                       NA TRUE               NA              
6 10536771                  NA                         2018-12-09                                       NA TRUE               NA 

So far so good. But, look what happens if I add select command:

> order_discrepancy %>% filter(inorder == TRUE) %>% select(pro_shopper_id:inorder) %>% head()
Error in is_character(x, encoding = encoding, n = 1) : 
  object 'pro_shopper_id' not found

So, then I tried just selecting field 'inorder' which I just successfully filtered on in the block 2 above:

> order_discrepancy %>% filter(inorder == TRUE) %>% select(inorder) %>% head()
Error in .f(.x[[i]], ...) : object 'inorder' not found

Interestingly, if I remove the filter condition and try using select only, I receive the same error message. I also tried:

> order_discrepancy %>% mutate(blah = 1) %>% head()
Error in new_result(connection@ptr, statement) : 
  nanodbc/nanodbc.cpp:1344: 42000: [Hortonworks][Hardy] (80) Syntax or semantic analysis error thrown in server while executing query. Error message from server: Error while compiling statement: FAILED: SemanticException [Error 10004]: Line 1:7 Invalid table alias or column reference 'zzz1.pro_shopper_id': (possible column names are: pro_shopper_id, tos_signed_date, uds_order_date, date_diff_days, inorder, intos)

Here is the result of show_query() command for the above code blocks that failed:

> order_discrepancy %>% mutate(blah = 1) %>% head() %>% show_query()
<SQL>
SELECT `zzz1.pro_shopper_id`, `zzz1.tos_signed_date`, `zzz1.uds_order_date`, `zzz1.date_diff_days`, `zzz1.inorder`, `zzz1.intos`, 1.0 AS `blah`
FROM (select * 
                                  from gcameron.pro_programe_compare_uds_order
                                  where (inorder is null) or (intos is null)) `tgzubexmdm`
LIMIT 6

This one won't show the query but I tried:

> order_discrepancy %>% select(inorder) %>% head() %>% show_query()
Error in .f(.x[[i]], ...) : object 'inorder' not found

Here's the one that does actually work:

> order_discrepancy %>% filter(inorder == TRUE) %>% head() %>% show_query()
<SQL>
SELECT *
FROM (select * 
                                  from gcameron.pro_programe_compare_uds_order
                                  where (inorder is null) or (intos is null)) `smwemchtqz`
WHERE (`inorder` = TRUE)
LIMIT 6

So, it looks like I can use some commands, e.g. filter, but not others.

In case it's relevant, here is my .odbc.ini configuration with our company data hidden:

[ODBC]
# Specify any global ODBC configuration here such as ODBC tracing.

[ODBC Data Sources]
HiveProd=Hortonworks Hive ODBC Driver 64-bit

[HiveProd]

# Description: DSN Description.
# This key is not necessary and is only to give a description of the data source.
Description=Hortonworks Hive ODBC Driver (64-bit) DSN

# Driver: The location where the ODBC driver is installed to.
Driver=/usr/lib/hive/lib/native/Linux-amd64-64/libhortonworkshiveodbc64.so

# When using No Service Discovery, specify the IP address or host name of the Hive server.
# When using ZooKeeper as the Service Discovery Mode, specify a comma-separated list of ZooKeeper
# servers in the following format:
#	<zk_host1:zk_port1>,<zk_host2:zk_port2>,...
HOST=[our hosts here]

# The TCP port Hive server is listening. This is not required when using ZooKeeper as the service
# discovery mode as the port is specified in the HOST connection attribute.
PORT=

# The name of the database schema to use when a schema is not explicitly specified in a query.
Schema=default

# Set to 0 to when connecting directory to Hive Server 2 (No Service Discovery).
# Set to 1 to do Hive Server 2 service discovery using ZooKeeper.
# Note service discovery is not support when using Hive Server 1.
ServiceDiscoveryMode=1

# The namespace on ZooKeeper under which Hive Server 2 znodes are added. Required only when doing
# HS2 service discovery with ZooKeeper (ServiceDiscoveryMode=1).
ZKNamespace=/hive/hiveserver2

# Set to 1 if you are connecting to Hive Server 1. Set to 2 if you are connecting to Hive Server 2.
HiveServerType=2

# The authentication mechanism to use for the connection.
#   Set to 0 for No Authentication
#   Set to 1 for Kerberos
#   Set to 2 for User Name
#   Set to 3 for User Name and Password
# Note only No Authentication is supported when connecting to Hive Server 1.
AuthMech=1

# The Thrift transport to use for the connection.
#	Set to 0 for Binary
#	Set to 1 for SASL
#	Set to 2 for HTTP
# Note for Hive Server 1 only Binary can be used.
ThriftTransport=1

# When this option is enabled (1), the driver does not transform the queries emitted by an
# application, so the native query is used.
# When this option is disabled (0), the driver transforms the queries emitted by an application and
# converts them into an equivalent from in HiveQL.
UseNativeQuery=0

# Set the UID with the user name to use to access Hive when using AuthMech 2 to 8.
UID=

# The following is settings used when using Kerberos authentication (AuthMech 1 and 10)

# The fully qualified host name part of the of the Hive Server 2 Kerberos service principal.
# For example if the service principal name of you Hive Server 2 is:
#   hive/myhs2.mydomain.com@EXAMPLE.COM
# Then set KrbHostFQDN to myhs2.mydomain.com
KrbHostFQDN=hive.hadoop.ourdomain.com

# The service name part of the of the Hive Server 2 Kerberos service principal.
# For example if the service principal name of you Hive Server 2 is:
#   hive/myhs2.mydomain.com@EXAMPLE.COM
# Then set KrbServiceName to hive
KrbServiceName=hive

# The realm part of the of the Hive Server 2 Kerberos service principal.
# For example if the service principal name of you Hive Server 2 is:
#   hive/myhs2.mydomain.com@EXAMPLE.COM
# Then set KrbRealm to EXAMPLE.COM
KrbRealm=HADOOP.PROD.INT.OURNAME.COM

# Set to 1 to enable SSL. Set to 0 to disable.
SSL=0

# Set to 1 to enable two-way SSL. Set to 0 to disable. You must enable SSL in order to
# use two-way SSL.
TwoWaySSL=0

# The file containing the client certificate in PEM format. This is required when using two-way SSL.
ClientCert=

# The client private key. This is used for two-way SSL authentication.
ClientPrivateKey=

# The password for the client private key. Password is only required for password protected
# client private key.
ClientPrivateKeyPassword=

Any information to help overcome these issues would be very much appreciated


#2

Hi, yes, it seems that the root cause is using a SQL query to do some pre-filtering. This makes dbplyr assign a name to your query, so your fields become prefixed with the name of the table, for instance, order_id is now asldkjfa.order_id, and that's why it can't be found

I'd recommend to use just the name of the table in your tbl() command, and then filter the nulls using filter()


#3

Hi thanks for the suggestion however using table names and not sql queries seems to introduce a new problem (it's what led me to use sql queries initially).

Example:

uds_order_122018 <- tbl(con, in_schema("my_schema", "uds_order")) %>% 
  filter(order_date == "2018-12-01")

Then:

> glimpse(uds_order_122018)
Observations: ??
Variables: 188
$ uds_order.order_id                                      <chr> "1366153303", "1382193988", "1382343850", "1383961850", "1386257046", "1386404658", "1387493299", "13883...
$ uds_order.row_id                                        <int> 0, 1, 0, 5, 1, 0, 16, 0, 1, 0, 0, 4, 6, 5, 10, 0, 4, 1, 0, 0, 2, 0, 1, 0, 2, 0, 8, 2, 0, 1, 0, 0, 1, 5, ...
$ uds_order.order_ts ...
$ lots of other fields here

However, if I try adding a select condition:

> uds_order_122018 <- tbl(con, in_schema("my_schema", "uds_order")) %>% 
+   filter(order_date == "2018-12-01") %>% 
+   select(order_id:order_date, private_label_id)
> glimpse(uds_order_122018)
Error in is_character(x, encoding = encoding, n = 1) : 
  object 'order_id' not found

I tried adding the schema name to the table name in select like this:

> uds_order_122018 <- tbl(con, in_schema("my_schema", "uds_order")) %>% 
+   filter(order_date == "2018-12-01") %>% 
+   select(uds_order.order_id)
> glimpse(uds_order_122018)
Error in .f(.x[[i]], ...) : object 'uds_order.order_id' not found

So using the table names seems to work only with the initial dplyr::filter() condition but after that I hit errors. Is this expected? Any ideas on how to overcome? Thanks for your feedback thus far!


#4

Ok, when you let the auto-complete in select() run, what field names does it return?


#5

Hi, I checked that just now but autocomplete does not show any field names when using in this way.

Example:
mtcars %>% select(mpg) // autocomplete shows a drop down with mpg

However:
uds_order_122018 %>% select(... // no field names in drop down

If you look at my glimpse() code above, one field is called uds_order.row_id. If I try typing uds I just get:

Then if I try typing "row" in the hope of seeing field "row_id" I get:

So just r functions but no field names.

Any ideas on where to go from here or how to be able to use dplyr verbs with a tbl odbc connection?


#6

Would it be appropriate for me to add this as an issue on Github?


#7

Can we take a look a what show_query() returns for the code I quoted returns?


#8

Hi, here is the result of show query:

> order_discrepancy %>% filter(inorder == TRUE) %>% head() %>% show_query()
<SQL>
SELECT *
FROM (select * 
                                  from gcameron.pro_programe_compare_uds_order
                                  where (inorder is null) or (intos is null)) `euamuphige`
WHERE (`inorder` = TRUE)
LIMIT 6

And just for completeness here's the result:

> order_discrepancy %>% filter(inorder == TRUE) %>% head() 
# Source:   lazy query [?? x 6]
# Database: Hive 1.2.2[@Hive/HIVE]
  qyujrgplpw.pro_shopper_id qyujrgplpw.tos_signed_date qyujrgplpw.uds_order_date qyujrgplpw.date_diff_days qyujrgplpw.inorder qyujrgplpw.intos
  <chr>                     <date>                     <date>                                        <int> <lgl>              <lgl>           
1 1232411                 NA                         2018-12-19                                       NA TRUE               NA              
2 567812                 NA                         2018-12-19                                       NA TRUE               NA              
3 123abc                  NA                         2018-12-19                                       NA TRUE               NA              
4 324678                 NA                         2018-09-17                                       NA TRUE               NA              
5 107888                 NA                         2018-12-19                                       NA TRUE               NA              
6 107777                 NA                         2018-12-19                                       NA TRUE               NA      

But...

> order_discrepancy %>% select(pro_shopper_id) %>% head()
Error in .f(.x[[i]], ...) : object 'pro_shopper_id' not found
> order_discrepancy %>% select(pro_shopper_id) %>% head() %>% show_query()
Error in .f(.x[[i]], ...) : object 'pro_shopper_id' not found

Also

> order_discrepancy %>% mutate(blah = 1) %>% head()
Error in new_result(connection@ptr, statement) : 
  nanodbc/nanodbc.cpp:1344: 42000: [Hortonworks][Hardy] (80) Syntax or semantic analysis error thrown in server while executing query. Error message from server: Error while compiling statement: FAILED: SemanticException [Error 10004]: Line 1:7 Invalid table alias or column reference 'zzz3.pro_shopper_id': (possible column names are: pro_shopper_id, tos_signed_date, uds_order_date, date_diff_days, inorder, intos) 
> order_discrepancy %>% mutate(blah = 1) %>% head() %>% show_query()
<SQL>
SELECT `zzz3.pro_shopper_id`, `zzz3.tos_signed_date`, `zzz3.uds_order_date`, `zzz3.date_diff_days`, `zzz3.inorder`, `zzz3.intos`, 1.0 AS `blah`
FROM (select * 
                                  from gcameron.pro_programe_compare_uds_order
                                  where (inorder is null) or (intos is null)) `hhjsrfszaz`
LIMIT 6

Let me know if there's any other information I can provide?


#9

Hi @ edgararuiz
After extensive Google searching, the only lead I have for hoping to solve this is our thread!
Does this look like a bug or do you think I've got an issue in my set up? I wonder if my .odbc.ini file would shed any light here or if it's irrelevant? Should I share it? Would you recommend I post an issue on dbplyr Github?


closed #10

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.