RStudio, is it usable for large data sets (9 GB+)?

This is a really common question, and I think we might need to write up a FAQ because it's both important and frequently asked.

Typically, when this is asked online, you get answers like the ones in this thread, which reduce to:

  1. use data.table
  2. use a database
  3. "chunk" your data (in combination with a database and/or with base R)

But those answers don't give enough color to help you understand why each approach works or when to use it.

As a pedantic point, RStudio is not consuming your RAM; R is. RStudio is just the IDE we use to control R.

So in R (and in Python and many other languages) you can only operate on as much data as you can fit in memory. At first glance that feels like a tough constraint: "I want to operate on 8 GB of data but only have 8 GB of RAM. I'm screwed." Thankfully, it's not that simple. If you think about your data flow, it likely has (at least) two major parts:

  1. Munging the data to get it ready for analysis. Things like joins and normalization.
  2. The actual analysis (typically on subsets of the data). This is often a regression or some other model.

Step 1 often does not need to take place in R's memory at all. I'm a big fan of using databases for this step and controlling them with dplyr in R (a sketch of that approach is below). Others prefer data.table, which (as I understand it) loads the data into memory once but then modifies it by reference rather than copying it around as you make changes. So you have to have enough RAM to hold your data, but if it fits, data.table is really fast.
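
For concreteness, here is a minimal sketch of the database + dplyr (dbplyr) approach. The database file and the table and column names are made up for illustration, not from a real project:

```r
# A minimal sketch of the database + dplyr approach (via dbplyr).
# The file "sales.db" and the tables "orders"/"customers" are hypothetical.
library(DBI)
library(dplyr)

con <- dbConnect(RSQLite::SQLite(), "sales.db")

orders    <- tbl(con, "orders")     # lazy reference; nothing is read into RAM yet
customers <- tbl(con, "customers")

# The join and filter are translated to SQL and run inside the database.
prepped <- orders %>%
  inner_join(customers, by = "customer_id") %>%
  filter(order_date >= "2023-01-01")

# Only the (much smaller) result is pulled into R's memory.
prepped_df <- collect(prepped)

dbDisconnect(con)
```

By contrast, data.table's `:=` operator updates columns in place, which is why it avoids copying overhead once the data is loaded.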

If your work involves moving on to step 2 (building a model), you'll need to fit the model's data into RAM to use R. But it's frequently the case that we want to fit a model to subsets of the data. We might want a different model for each geographic bucket (US state, for example). Or we might have a model with a dummy variable, like sex. Well, "dummy variable" (or "one hot encoding" for all y'all under 40) is just another way of saying "building models on subsets." So instead of including a dummy, we can split the data into subsets and build a separate model on each one (one model for male, one for female, for example). In those cases the only data we need to fit into RAM is the current subset. A common design pattern is to keep all the data in a database, bring it into R one chunk at a time, and write the results back to the database; see the sketch below.
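
Here's a hedged sketch of that chunked pattern, assuming a SQLite database with a `measurements` table and made-up column names (`state`, `outcome`, `x1`, `x2`):

```r
# One-chunk-at-a-time pattern: pull one subset, model it, write results back.
# The database, table, and column names here are hypothetical.
library(DBI)
library(dplyr)

con <- dbConnect(RSQLite::SQLite(), "measurements.db")

states <- dbGetQuery(con, "SELECT DISTINCT state FROM measurements")$state

for (st in states) {
  # Only this state's rows come into RAM (!! forces st to be evaluated in R).
  chunk <- tbl(con, "measurements") %>%
    filter(state == !!st) %>%
    collect()

  fit <- lm(outcome ~ x1 + x2, data = chunk)

  # Write this subset's coefficients back to the database.
  results <- data.frame(state    = st,
                        term     = names(coef(fit)),
                        estimate = unname(coef(fit)))
  dbWriteTable(con, "model_results", results, append = TRUE)
}

dbDisconnect(con)
```

At any point in the loop, only one state's worth of data is in memory.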

Every night the system I work with drops 4 billion records, with dozens of fields, into Amazon Redshift. I use R to access that data, do calculations, build models, and write results into SQL Server, with no problems at all, even though the data exceeds my available RAM by a HUGE amount. The trick is using workflows that only read into R the subsets or aggregates I actually need to work on, along the lines of the sketch below.
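
As an illustration only (the DSN, table, and column names are placeholders, not my real setup), the aggregation runs in the warehouse and just the summary comes back to R:

```r
# Push the heavy lifting to the database; collect only the aggregate.
library(DBI)
library(dplyr)

# Hypothetical ODBC connection; a real Redshift setup uses its own driver/DSN.
con <- dbConnect(odbc::odbc(), dsn = "warehouse_dsn")

daily_summary <- tbl(con, "events") %>%
  group_by(event_date, region) %>%
  summarise(n = n(), revenue = sum(amount, na.rm = TRUE)) %>%
  collect()   # only the grouped summary lands in R's memory

dbDisconnect(con)
```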
