While reading the dbplyr documentation (http://dbplyr.tidyverse.org/articles/dbplyr.html), I came across this quote advising when to use a database:
"If your data fits in memory there is no advantage to putting it in a database: it will only be slower and more frustrating."
I have nothing but profound respect for Hadley, and without his contributions R wouldn't be half as great to use as it is - but here I think he's flat-out wrong. Unfortunately, he isn't the only one.
I agree that this can be true for small datasets, where a database would have to scan the whole set anyway. But suppose I have 64 GB of RAM and I load 32 GB of data into memory. Now I want to apply a filter. My R code, using dplyr or not, has to scan the entire dataset row by row and return a huge subset to me. Worse still, if I have two datasets of 10 million rows each and I decide to perform a join, the row-by-row scanning needed to satisfy the join is going to be brutally slow.
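To make the scenario concrete, here is a minimal sketch of the in-memory version, assuming two hypothetical tables (orders and customers) roughly along the lines of the sizes above:

```r
library(dplyr)

# Hypothetical in-memory tables, roughly the sizes described above
orders    <- tibble(order_id    = 1:1e7,
                    customer_id = sample(1e6, 1e7, replace = TRUE))
customers <- tibble(customer_id = 1:1e6,
                    region      = sample(letters, 1e6, replace = TRUE))

# Filter: with no index, the predicate is evaluated against every row
big_subset <- orders %>% filter(customer_id < 500)

# Join: both tables must be processed in full, entirely in memory
joined <- orders %>% inner_join(customers, by = "customer_id")
```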
On the other hand, database vendors have spent decades perfecting their engines to maximize data-retrieval efficiency, with sophisticated query planners and indexes. The R code can submit a query to the database engine and retrieve only the results, and those results can come back thousands of times faster because the engine has indexes that support things like the above-mentioned filter and join.
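As a rough sketch of what I mean (assuming the same hypothetical orders/customers tables live in a SQLite file; Postgres, SQL Server, etc. would work the same way), dbplyr lets the engine do the work and only the result set travels back to R:

```r
library(DBI)
library(dplyr)
library(dbplyr)

# Hypothetical connection and table names
con <- dbConnect(RSQLite::SQLite(), "orders.sqlite")

# An index lets the engine satisfy the filter and join without scanning every row
dbExecute(con, "CREATE INDEX IF NOT EXISTS idx_orders_customer ON orders (customer_id)")

orders_db    <- tbl(con, "orders")
customers_db <- tbl(con, "customers")

# dplyr/dbplyr build the SQL lazily; only the final result set is pulled into R
result <- orders_db %>%
  filter(customer_id < 500) %>%
  inner_join(customers_db, by = "customer_id") %>%
  collect()
```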
The idea that something is always faster just because it's in memory is simply not true. Saying so is bound to lead R coders in the wrong direction.
Where am I wrong here?