I have linked RStudio of our Hive instance using an ODBC driver.
Since doing so I have noticed a few things and wanted to ensure I'm doing this "right".
library(odbc)
library(tidyverse)
# hive connection
con <- dbConnect(odbc::odbc(), "myHive")
After running this I see a new connections pane open in Rstudio and all of the available schemas appear.
A few things:
- Pulling a tbl takes a very long time. My understanding is that tbl does not pull in all the databases' table till calling collect(). Don't quote me but somewhere out there I think I read that maybe ten rows are read in pre collect?
system.time(tbl(con, in_schema("somedb", "a_table")))
user system elapsed
0.025 0.000 33.048
Then
> system.time(tbl(con, in_schema("another_db", "a_big_table")))
user system elapsed
0.159 0.120 769.661
The creation of the tbl for the big table takes a long time. If tbl only pulls in a small number of rows till collect() is called, is it expected that it takes this long?
- If I use the dim() function on the tbl I'm returned NA for rows and, in this case, 72 for the dimension count. Is this expected? I then tried nrow(mytbl) and got NA. I then tried
my_tbl %>% collect() %>% nrow()
however that gave me a "cannot allocate a vector of memory" error. I then tried changing collect() to compute() since I read that this uses a temp table or in memory on the db to calculate but that returned another error "Error: <SQL> 'CREATE TEMPORARY TABLE uedjkbxfkf
AS SELECT zzz4.year_month
, zzz4.shopper_id
,...."
How can I get the count of rows / length of my_tbl?
- I wanted to check the unique values of field "year_month" in my_tbl so tried
unique(my_tbl$zzz4.year_month)
RStudio automatically suggested zzz4.year_month rather than just year_month. Why is that? Either way, both variations return NULL. How can I get back the unique instances of year_month from my_tbl?