Send hiveconfs with an odbc hive query

I have a hive query with hive configurations:

-- Hive configurations
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.dynamic.partition=true;
set hive.exec.max.dynamic.partitions=10000;
set mapreduce.map.memory.mb=7168;
set mapreduce.reduce.memory.mb=7168;
set hive.exec.max.dynamic.partitions.pernode=10000;
set hive.exec.compress.output=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
set hive.execution.engine=mr;

-- Query
select 
  fruits,
  count(1) as n
from table
group by fruits;

This query works when I run it directly on our EMR cluster. I would like to run it from within R using my odbc hive_conn.

I can already make simple queries over this connection with odbc, so I know the connection itself is fine.
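For example, something like this (the table name is just a placeholder) returns rows with no trouble:

library(DBI)

# A trivial single-statement query works over the same connection
test <- dbGetQuery(hive_conn, "select * from some_table limit 10")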

However, this particular query needs to run with those configurations, otherwise it does not work.

I tried a few things:

# Send the hive configurations together with the query via dbGetQuery(),
# where rawd_query is the sql above including the hive configurations
rawd <- DBI::dbGetQuery(hive_conn, rawd_query)

This runs more or less instantly and returns an empty data frame. The same query called directly on our cluster takes a couple of minutes and returns many thousands of records.

I tried moving the hive configurations into a separate call with DBI::dbExecute():

hiveconfs <- "
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.dynamic.partition=true;
set hive.exec.max.dynamic.partitions=10000;
set mapreduce.map.memory.mb=7168;
set mapreduce.reduce.memory.mb=7168;
set hive.exec.max.dynamic.partitions.pernode=10000;
set hive.exec.compress.output=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
set hive.execution.engine=mr;
"
DBI::dbExecute(hive_conn, hiveconfs)

This runs and returns 0. I assume this means a successful run, since no error message is returned.
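A variation I considered (a sketch only, untested against the Simba driver) was splitting the string and issuing each set statement on its own, in case the driver only executes one statement per call:

# Sketch: issue each SET as its own statement, since ODBC drivers
# commonly execute only one statement per call
stmts <- strsplit(trimws(hiveconfs), ";")[[1]]
for (stmt in stmts) {
  DBI::dbExecute(hive_conn, trimws(stmt))
}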

I then tried DBI::dbGetQuery() again, this time without the hive configurations and with just the sql part starting at select:

rawd <- DBI::dbGetQuery(hive_conn, rawd_query)

This time the query runs, but without the configurations being applied. For example, when I check my running query on our cluster I see that it is being executed with the tez engine, so the configuration set hive.execution.engine=mr has not been applied.
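In Hive, a bare set <property> with no value echoes the current setting, so (assuming the Simba driver passes it through as a result set) the session's engine can also be checked from R:

# Should return one row like "hive.execution.engine=tez"
DBI::dbGetQuery(hive_conn, "set hive.execution.engine")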

Is it possible to set hive configurations before running a hive command using odbc and DBI?

[edit] ----
Adding some info that might be helpful. Here is a sanitized copy of my hive odbc settings, in case it's relevant:

[Hive]
Description=Simba Hive ODBC Driver (64-bit)
Driver=/opt/amazon/hiveodbc/lib/64/libamazonhiveodbc64.so
AwsRegion=us-east-1
Schema=default
AuthenticationType=Instance Profile
LogPath=/var/log/
ErrorMessagesPath=/opt/amazon/hiveodbc/ErrorMessages/
LogLevel=0
SwapFilePath=/tmp
UID=bi_shared
Host=emr.bi.aws.us.examplecompany.com
Port=10000

Just some new information.

I found a reference to someone else doing this with RJDBC over here: https://www.develop-bugs.com/article/10681393/dbGetQuery+and+dbReadTable+failing+to+return+a+really+large+table+DBI

There's a line over there of the form:

dbSendUpdate(conn, "set hive.resultset.use.unique.column.names=false")

Since I'm using odbc, not RJDBC, I don't have the option to use dbSendUpdate(). I'm not sure, but the odbc equivalent of dbSendUpdate() looks like what I want.

Found similar here: https://stackoverflow.com/questions/30737141/run-hive-query-in-r-with-config

Looks like another way to ask my question could be:
"What is the odbc equivalent of RJDBC::dbSendUpdate(conn, statement)?"

I was able to figure this out, so I'm leaving the answer here in case anyone else lands here looking for a way to set hive configurations with odbc.

I'm using the Simba driver, but I found similar documentation for the Cloudera one: https://www.simba.com/products/Hive/doc/ODBC_InstallGuide/linux/content/odbc/hi/configuring/serverside.htm

The key search term turned out to be 'configure server-side properties'.

You have to set these properties when you create the connection object: either in your odbc.ini file or as arguments to dbConnect(). Prepend the string 'SSP_' (for server-side property) to each Hive configuration name and pass them as name-value pairs.

Example:

library(DBI)
library(odbc)

hive_conn <- dbConnect(odbc(), 
                       dsn = "Hive", 
                       SSP_hive.execution.engine = "mr",
                       SSP_hive.exec.dynamic.partition.mode = "nonstrict",
                       SSP_hive.exec.dynamic.partition = "true",
                       SSP_hive.exec.max.dynamic.partitions = 10000,
                       SSP_mapreduce.map.memory.mb = 7168,
                       SSP_mapreduce.reduce.memory.mb = 7168,
                       SSP_hive.exec.max.dynamic.partitions.pernode = 10000,
                       SSP_hive.exec.compress.output = "true",
                       SSP_mapred.output.compression.codec = "org.apache.hadoop.io.compress.SnappyCodec"
                       )
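Alternatively, per the Simba documentation, the same SSP_-prefixed properties can go in the DSN entry in odbc.ini. A sketch of what that would look like for my DSN above:

[Hive]
# ...existing Driver/Host/Port settings as above...
SSP_hive.execution.engine=mr
SSP_hive.exec.dynamic.partition.mode=nonstrict
SSP_hive.exec.dynamic.partition=true
SSP_hive.exec.max.dynamic.partitions=10000

With the connection created this way, the query from the top of the post runs with the configurations applied.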
