Behavior of a tbl based on a odbc connection - slow and some functions not returning expected results

I have linked RStudio of our Hive instance using an ODBC driver.

Since doing so I have noticed a few things and wanted to ensure I'm doing this "right".

    # hive connection
    con <- dbConnect(odbc::odbc(), "myHive")

After running this I see a new connections pane open in Rstudio and all of the available schemas appear.

A few things:

  1. Pulling a tbl takes a very long time. My understanding is that tbl does not pull in all the databases' table till calling collect(). Don't quote me but somewhere out there I think I read that maybe ten rows are read in pre collect?
system.time(tbl(con, in_schema("somedb", "a_table")))
   user  system elapsed 
  0.025   0.000  33.048


> system.time(tbl(con, in_schema("another_db", "a_big_table")))
   user  system elapsed 
  0.159   0.120 769.661 

The creation of the tbl for the big table takes a long time. If tbl only pulls in a small number of rows till collect() is called, is it expected that it takes this long?

  1. If I use the dim() function on the tbl I'm returned NA for rows and, in this case, 72 for the dimension count. Is this expected? I then tried nrow(mytbl) and got NA. I then tried
my_tbl %>% collect() %>% nrow()

however that gave me a "cannot allocate a vector of memory" error. I then tried changing collect() to compute() since I read that this uses a temp table or in memory on the db to calculate but that returned another error "Error: <SQL> 'CREATE TEMPORARY TABLE uedjkbxfkf AS SELECT zzz4.year_month, zzz4.shopper_id,...."
How can I get the count of rows / length of my_tbl?

  1. I wanted to check the unique values of field "year_month" in my_tbl so tried

RStudio automatically suggested zzz4.year_month rather than just year_month. Why is that? Either way, both variations return NULL. How can I get back the unique instances of year_month from my_tbl?

Hi, yes, collect() will bring back everything on that table, you should use my_tbl %>% tally() instead

For this one, I would try this instead:

my_tbl %>%
   group_by(year_month) %>%

Wouldn't dplyr::distinct(year_month) be more efficient in this case?

I would assign the tbl() to a variable to keep take advantage of the laziness.

Yeah, that's how I usually start, and then want to see more stats from the groups, so it's easy for me to just add code inside summarise() :slight_smile:

Thanks for all of the info @edgararuiz! Just one follow up. " I would assign the tbl() to a variable to keep take advantage of the laziness.". What does laziness mean in this context? Yes, I'm storing any table I intend to use as a variable, but for one of the tbls it's drawing from a very large table in Hive. It takes 5 - 10 minutes to create and assign the tbl based on the odbc connection. Is this expected?

Right, if you do my_table <- tbl(con, in_schema("somedb", "a_table")) no SQL statement will actually be executed on the cluster. It's until after running something such as:

my_table %>%
   group_by(year) %>%

Since there's no assignment, it will assume that you want to print the results, so it'll execute the command and bring back the results, but at this time, the results are not at the atomic level, they are summarized

I understand now thanks. But actually, when I do my_table <- tbl(con, in_schema("somedb", "a_table")) just now in the console, it does actually take a long time to do whatever it's doing. By that I mean that the red dot appears to the top right of the console pane and I'm unable to write any other r code in the meantime. This is not so important for me right now, I was more just curious if I was following the intended use and flow of working with a direct database connection using dbplyr, which it sounds like I'm doing right. I might just need a little patience :slight_smile:

Oh right! It's pulling the column names that makes the autofill work. Forgot about that, sorry

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.