I’m super interested to know how people decide what tools to use to tackle a specific situation. The situation I’d like to explore here is the following:
I have a large database with tables that do not fit into memory. I want to explore the data, clean it (filter out bad rows), apply models, visualise, and report on the results.
- Do you use dplyr to pull subsets of the data that fit in memory, and then work on those with your standard tidyverse tools?
- Do you use Spark? (Is it actually possible to transfer a large database table directly to Spark with sparklyr?)
- Do you buy more memory?
What is your plan of attack, and why?