I have a hive query with hive configurations:
-- Hive configurations
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.dynamic.partition=true;
set hive.exec.max.dynamic.partitions=10000;
set mapreduce.map.memory.mb=7168;
set mapreduce.reduce.memory.mb=7168;
set hive.exec.max.dynamic.partitions.pernode=10000;
set hive.exec.compress.output=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
set hive.execution.engine=mr;
-- Query
select
fruits,
count(1) as n
from table
group by fruits;
This query works when I run it directly on our EMR cluster. I would like to run it from within R using my odbc hive_conn.
I can already make simple queries using my hive conn with odbc, so I know that my conn is fine.
However, for this particular query I need to run it with the configurations otherwise the query does not work.
I tried a few things:
# Send the hive configurations with dbSendQuery()
# where rawd_query is the sql above including the hive configurations
rawd <- DBI::dbGetQuery(hive_conn, rawd_query)
This runs more or less instantly and returns an empty data frame. The same query called directly on our cluster would take a couple of minutes and will return many thousands of records.
I tried moving the hive configurations into a separate call with DBI::dbExcecute()
hiveconfs <- "
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.dynamic.partition=true;
set hive.exec.max.dynamic.partitions=10000;
set mapreduce.map.memory.mb=7168;
set mapreduce.reduce.memory.mb=7168;
set hive.exec.max.dynamic.partitions.pernode=10000;
set hive.exec.compress.output=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
set hive.execution.engine=mr;
"
DBI::dbExcecute(hive_conn, hiveconfs)
This runs and returns a 0. I assume this means a successful run since no error message is returned.
I then tried to run DBI::dbGetQuery()
except without the hive configurations and just the sql part starting with select:
rawd <- DBI::dbGetQuery(hive_conn, rawd_query)
This time the query runs but without the configurations being applied. For example, when I check my running query on our cluster I see that it is being executed with tez engine, the configuration set hive.execution.engine=mr
has not been applied.
Is it possible to set hive configurations before running a hive command using odbc and DBI?
[edit] ----
Adding any info that I think could possibly be helpful. Here is a sanitized copy of my hive odbc settings, in case it's relevant:
[Hive]
Description=Simba Hive ODBC Driver (64-bit)
Driver=/opt/amazon/hiveodbc/lib/64/libamazonhiveodbc64.so
AwsRegion=us-east-1
Schema=default
AuthenticationType=Instance Profile
LogPath=/var/log/
ErrorMessagesPath=/opt/amazon/hiveodbc/ErrorMessages/
LogLevel=0
LogPath=4
SwapFilePath=/tmp
UID=bi_shared
Host=emr.bi.aws.us.examplecompany.com
Port=10000