Handling Big Data, what tools when

JidduAlexander · September 26, 2017, 2:01pm

Hello,

I'm super interested to know how people decide what tools to use to tackle a specific situation. The situation I'd like to explore here is the following:

I have a large database with tables that do not fit into memory. I want to explore, clean (filter bad rows), apply models to, visualise, and report on the data.

Do you use dplyr to return subsets of the data, so that you can use your standard (fit-in-memory) tidyverse tools?
Do you use Spark? (Is it actually possible to transfer a large database table directly to Spark with sparklyr?)
Or do you just buy more memory?

What is your plan of attack, and why?

Best,
Jiddu

RobertMyles · September 26, 2017, 5:48pm

The (in-development) chunked read functions in readr look pretty interesting for this type of thing. Pandas in Python has something similar: you can read in part of a big file, do what you need, save the result, then read the next part, and so on. I don't know how far along the read_*_chunked functions are, but they'll be a welcome addition, that's for sure.
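
For anyone who hasn't seen the pandas version of this, here is a rough sketch of the read-a-part, process, save, read-the-next-part loop. The file names, the column names and the "value must be non-negative" cleaning rule are all made up for illustration, and the tiny CSV written at the top only stands in for a file that is actually too big to load at once.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for a file too large to load in one go.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "big.csv")
dst = os.path.join(workdir, "clean.csv")
pd.DataFrame({"id": range(10), "value": [1.0, -1.0] * 5}).to_csv(src, index=False)

rows_kept = 0
# With chunksize set, read_csv returns an iterator of DataFrames
# (here at most 4 rows each) instead of one big DataFrame.
for i, chunk in enumerate(pd.read_csv(src, chunksize=4)):
    # Assumed cleaning rule: keep only rows with a non-negative value.
    good = chunk[chunk["value"] >= 0]
    # Overwrite and write the header for the first chunk, then append.
    good.to_csv(dst, mode="w" if i == 0 else "a", header=(i == 0), index=False)
    rows_kept += len(good)

print(rows_kept)
```

Only one chunk is held in memory at a time, so the chunk size is the knob for trading memory against the number of passes. Anything that needs all rows at once (a global sort, say) won't fit this pattern without extra work.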