bigrquery - large datasets

Hi, we're in the process of switching our data warehouse to BigQuery. Overall it's going great and I'm enjoying using the bigrquery package.

The only issue I have is that if I write a query with a large result set (e.g. 40m records), it takes about an hour to download the data, which is dramatically slower than saving the output as a CSV in Google Cloud Storage and curling the result manually.

I've tried with and without writing to an intermediate table, and also setting a much larger page size, but nothing seems to make a difference. I'd be really interested to know if anybody else has run into this or has worked out a way around it.
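For reference, the download is basically just this (a minimal sketch - the project name and query are placeholders, and the arguments are the ones from the pre-1.0 query_exec() interface):

```r
library(bigrquery)

# Placeholder project and query - the real ones point at our warehouse
sql <- "SELECT product_id, avail_date, qty FROM warehouse.availability"

df <- query_exec(
  sql,
  project   = "my-gcp-project",
  page_size = 100000,   # tried the default 10k up to 100k
  max_pages = Inf       # pull the whole ~40m-row result set
)
```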

Thanks
Jacob


Interesting - I don't have a solution, but I found something similar when importing data: it was much faster to upload multiple JSON files to Google Cloud Storage and then import them into a table.

Interesting - which function were you using to write? I think there are a couple of options, but dbWriteTable is very fast - 10m records in about 2 minutes during my test!
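The upload I timed was along these lines (a rough sketch - the project and dataset are placeholders, and I'm assuming the DBI backend that bigrquery exposes via bigquery(); older versions used a different driver function):

```r
library(DBI)
library(bigrquery)

# Placeholder project/dataset - swap in your own
con <- dbConnect(
  bigrquery::bigquery(),
  project = "my-gcp-project",
  dataset = "staging"
)

# `records` is an in-memory data frame (~10m rows in my test)
dbWriteTable(con, "availability_upload", records, overwrite = TRUE)

dbDisconnect(con)
```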

How many columns are in that 40m set?

Can't quite remember, but currently I'm using bigrquery to create the API call that loads in the JSON files (https://github.com/iainmwallace/DataDepository/blob/master/R/loadJsonFiles2BQ.r).

Now that I think about it, the advantage for me of loading JSON files was that the automatic schema detection worked better than loading a CSV file, but I could have been doing something weird.
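For what it's worth, with the newer bigrquery API the load looks roughly like this (a sketch only - the bucket and table names are made up, and I'm assuming the source format argument is passed through to the underlying load job; check ?bq_perform_load for your version):

```r
library(bigrquery)

# Placeholder table and GCS paths
tbl <- bq_table("my-gcp-project", "staging", "events")

# Load newline-delimited JSON straight from Cloud Storage;
# no explicit schema supplied here, so BigQuery's detection takes over
bq_table_load(
  tbl,
  source_uris   = "gs://my-bucket/events/*.json",
  source_format = "NEWLINE_DELIMITED_JSON"
)
```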

Hi, I'm just curious about the reason for downloading that volume of data? Is it modeling, visualization?

It's just 4 columns, total data size is about 1GB.

@edgar - this particular exercise is about visualisation; it's basically the forward-looking availability by day for a massive number of products.

This is probably the largest dataset I deal with in R on a regular basis, but 10m-ish rows isn't unusual when analysing transactional data or site traffic data.

And are you just using query_exec() from the bigrquery package? Are you processing this in RStudio on your local computer or on RStudio Server?

Cool - maybe this article can help with your needs: http://db.rstudio.com/best-practices/visualization/. I'm also working on a package that will help you push the calculations for some plot types to the database and collect only the results (like a histogram's bins) for visualization. It's very much a WIP, but maybe it can further assist: https://github.com/edgararuiz/dbplot
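As a rough illustration of that pattern (made-up table and column names, using dplyr over a bigrquery DBI connection): the grouping and counting run inside BigQuery, and collect() only brings back the summarised rows.

```r
library(DBI)
library(dplyr)
library(bigrquery)

con <- dbConnect(bigrquery::bigquery(),
                 project = "my-gcp-project",
                 dataset = "warehouse")

availability <- tbl(con, "availability")   # lazy reference, nothing downloaded yet

# Summarise the 40m rows in BigQuery; only one row per day comes back to R
daily <- availability %>%
  group_by(avail_date) %>%
  summarise(products_available = n_distinct(product_id)) %>%
  collect()
```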

@Hoyt - yes, I'm using query_exec(), and it's RStudio Server on a remote machine.

@edgararuiz - thanks, that's interesting stuff. In this case I'm actually combining the data with data from another source, then doing a little bit of modelling before writing a table back to the data warehouse - the actual visualisation happens in Tableau (unfortunately).

Perhaps it may help, but I have a BigQuery package that extracts the data to GCS. For most cases bigrquery is by far superior, but for edge cases like this it may help: http://code.markedmondson.me/bigQueryR/query.html It does what you describe with curl, but via R, so not much improvement I guess.


Thanks @MarkeD - looks great, going to give it a go!

Hi JacobB,
I encountered exactly the same problem as you described. Retrieving data with query_exec() takes about 2 hours for 1.2 GB. Have you found out why this is the case and how to speed things up?

Thanks for your comments.

Hi Jess, it's basically a 'feature' of BQ - changing the page size to 50k or 100k helps, but the only way to really get good speed is to export the table that your query creates to Google Cloud Storage and then download the CSV(s) into R data frames.

I described how I automated this process in this issue: https://github.com/r-dbi/bigrquery/issues/199
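In rough outline it looks like this (a sketch using current bigrquery (>= 1.0) and googleCloudStorageR functions rather than the exact script in the issue; the project, dataset and bucket names are placeholders):

```r
library(bigrquery)
library(googleCloudStorageR)
library(readr)
library(dplyr)

# Placeholder project / dataset / bucket names throughout
sql  <- "SELECT product_id, avail_date, qty FROM warehouse.availability"
dest <- bq_table("my-gcp-project", "scratch", "availability_export")

# 1. Run the query into a destination table instead of downloading it
bq_project_query("my-gcp-project", sql, destination_table = dest)

# 2. Extract that table to sharded CSVs in Cloud Storage
#    (see ?bq_table_save for format and compression options)
bq_table_save(dest, "gs://my-bucket/exports/availability-*.csv")

# 3. Download the shards and bind them into one data frame
gcs_global_bucket("my-bucket")
shards <- gcs_list_objects(prefix = "exports/availability-")$name

df <- shards %>%
  lapply(function(obj) {
    local_file <- basename(obj)
    gcs_get_object(obj, saveToDisk = local_file, overwrite = TRUE)
    read_csv(local_file)
  }) %>%
  bind_rows()
```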
