RStudio, is it usable for large data sets (9 GB+)?

Hello everyone,

Please excuse me if this is the wrong place for this, but I have a very general question. Is RStudio useful when writing code to work with large tables? The reason I ask is that RStudio seems to hold on to RAM even after a task is complete. While working on a PC with a 4.0 GHz hexacore CPU and 32 GB of RAM, I can subset a 9 GB table a few times and plot some data before RStudio either crashes or cannot allocate the RAM it then requires. I have seen other reports of this; using gc(), restarting R constantly, or installing a previous version is often reported to help or solve the issue. My question: am I missing something? The impression I get is that I might be using R for the wrong reasons and should simply learn a different language for large data, or learn to drop the interface.

Thanks,

Luis

If you are going to allocate those 9 GB in memory, I think the language you choose is not going to make a big difference. But if you take a different approach, such as putting your data into a SQL database (or another on-disk solution) and using tools like dbplyr to manipulate (wrangle) your data on the SQL server, you can use R and RStudio without problems.
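
For instance, a minimal sketch of that approach, using an SQLite file as a stand-in for a SQL server (the file, table, and column names are just placeholders):

    library(DBI)
    library(dplyr)
    library(dbplyr)

    # Connect to an on-disk database instead of loading the 9 GB table into RAM
    con <- dbConnect(RSQLite::SQLite(), "big_data.sqlite")

    # tbl() creates a lazy reference; no data is read into memory yet
    big_tbl <- tbl(con, "measurements")

    # These verbs are translated to SQL by dbplyr and run inside the database
    summary_tbl <- big_tbl %>%
      filter(year == 2018) %>%
      group_by(site) %>%
      summarise(mean_value = mean(value, na.rm = TRUE))

    # Only the small summary result is pulled into R
    result <- collect(summary_tbl)

    dbDisconnect(con)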


You definitely can use R for very large data sets, but you need to be aware of when objects are copied, because that is when you will face problems. (I have worked with well over 9 GB on a 32 GB RAM laptop.)

To minimise such problems you could use the data.table package, which is highly memory-efficient.
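
A small sketch of the modify-by-reference style that makes data.table memory-friendly (file and column names here are hypothetical):

    library(data.table)

    # fread() reads large delimited files quickly and efficiently
    DT <- fread("big_file.csv")

    # := adds or updates a column by reference; the table is not copied
    DT[, log_value := log(value)]

    # Subset and aggregate in one step, again without copying the whole table
    means <- DT[year == 2018, .(mean_value = mean(log_value)), by = site]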

That said, you will hit a limit at some stage with RAM. You could then use a database to store and retrieve data as mentioned by andresrcs.


Thank you for the advice, I did not think of that.

L

Thank you for the response. I will give this a shot and see what happens.

L

Also see this recent thread, where various options for large/out-of-memory data processing are discussed:

This is a really common question and I think we might need to write up a FAQ because it's both important and frequently asked.

Typically, when this is asked online, you get answers like the ones you see here, which reduce to:

  1. use data.table
  2. use a database
  3. "chunk" your data (in combination with a database and/or with base R)

But this lacks enough color to help you understand why these work or when to use each approach.

As a pedantic point, RStudio is not consuming your RAM; R is. RStudio is just the IDE we're using to control R.
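
You can see this from within R itself, for example (the object and its size are purely illustrative):

    x <- rnorm(1e7)                        # roughly 80 MB of doubles
    format(object.size(x), units = "MB")   # size of this one object
    gc()                                   # R's report of the memory it is using
    rm(x); gc()                            # the memory is freed inside R, though the
                                           # OS may not show it released immediately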

So in R (and in Python and many other languages as well) you can only operate on as much data as you can fit in memory. That feels, at first glance, like a tough constraint: "I want to operate on 8 GB of data but only have 8 GB of RAM. I'm screwed." But it's not that simple. If you think about your data flow, it likely has (at least) two major parts:

  1. Munging the data to get it ready for analysis. Things like joins and normalization.
  2. The actual analysis (typically on subsets of the data). This is often things like regressions or some other model.

Step 1 often does not need to take place in memory with R. I'm a big fan of using databases for this step and controlling them with dplyr in R. Others prefer data.table, which (as I understand it) reads the data into memory but does not copy it around as you make changes. So you have to have enough RAM to hold your data, but if it fits, data.table is really fast.
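
To make the database-plus-dplyr idea concrete, here is a hypothetical example (placeholder database, tables, and columns); show_query() shows that the verbs only build SQL, and nothing is materialised in R until you collect():

    library(DBI)
    library(dplyr)

    con <- dbConnect(RSQLite::SQLite(), "big_data.sqlite")  # placeholder database
    measurements <- tbl(con, "measurements")
    sites        <- tbl(con, "sites")

    joined <- measurements %>%
      left_join(sites, by = "site") %>%                      # the join runs in the database
      mutate(flag = if_else(value > 100, "high", "normal"))  # so does the transform

    show_query(joined)  # inspect the generated SQL; no rows have been pulled into R yet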

If your work involves moving on to step 2 (building a model), you'll need to fit the model data into RAM to use R. It's frequently the case that we want to fit a model to subsets of the data. So we might want a different model for each geographic bucket (each US state, for example). Or we might have a model with a dummy variable, like sex. Well, "dummy variable" (or "one hot encoding" for all y'all under 40) is just another way of saying "building models on subsets." So instead of having a dummy, we can split the data into subsets and build a model on each subset (one model for male, one for female, for example). In those cases the only data we need to fit into RAM is the subset. A common design pattern is to keep all the data in a database, bring it into R one chunk at a time, and then write results back to the database.
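
A rough sketch of that chunk-at-a-time pattern, with made-up table, column, and model names:

    library(DBI)
    library(dplyr)

    con <- dbConnect(RSQLite::SQLite(), "big_data.sqlite")  # placeholder database
    measurements <- tbl(con, "measurements")

    # Pull back only the list of groups -- tiny compared with the full table
    states <- measurements %>% distinct(state) %>% collect() %>% pull(state)

    for (st in states) {
      # Bring one state's rows into RAM (!! injects the local value into the query)
      chunk <- measurements %>% filter(state == !!st) %>% collect()

      # Fit a model on just this subset, then write the results back
      fit   <- lm(outcome ~ predictor, data = chunk)
      coefs <- data.frame(state    = st,
                          term     = names(coef(fit)),
                          estimate = unname(coef(fit)))
      dbWriteTable(con, "model_results", coefs, append = TRUE)
    }

    dbDisconnect(con)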

Every night the system I work with drops 4 billion records with dozens of fields into Amazon Redshift, and I use R to access that data, do calculations, build models, and write results into SQL Server. No problems at all, even though the data exceeds my available RAM by a HUGE amount, because I use workflows that only read into R the subsets or aggregated data I need to work on.


:rofl::rofl::rofl::rofl::rofl: this was just precious. Just a small nitpick: even though they're conceptually exactly the same thing, the data structure used to encode a dummy variable is usually a vector of integers (i.e., a factor), while "one hot encoding" corresponds to using a Boolean matrix to perform the same encoding. That's the reason we have the functions to_categorical() and sparse_categorical_crossentropy() in keras. It's definitely a minor issue, but people coming from Statistical Learning/GLMs can easily spend a couple of minutes parsing the average Deep Learning library error message before realizing that class labels are expected to be encoded as Boolean matrices rather than integer vectors when using categorical_crossentropy() (or its equivalent) as a loss function.
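
In base R terms the distinction looks roughly like this (toy data):

    sex <- factor(c("male", "female", "female", "male"))

    # Regression-style "dummy variable": the labels live in one factor column,
    # and model.matrix() expands it to k - 1 indicator columns inside the model
    model.matrix(~ sex)

    # Deep-learning-style "one hot encoding": the labels themselves are stored
    # as a full k-column 0/1 matrix before they ever reach the model
    one_hot <- diag(nlevels(sex))[as.integer(sex), ]
    colnames(one_hot) <- levels(sex)
    one_hot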

This is pretty amazing! I also sometimes model datasets which are multiple TBs in size, but I admit I slacked off and chose to use Apache Drill + python + feather/arrow in the end, rather than spending time to learn how to do it in R. Would you be willing to write a blog post detailing your workflow? Also, do you have any comments on Redshift vs BigQuery? Any reasons to prefer one over the other?


I think I may be rapidly becoming one of the "grey beards" I recall from my youth!

Totally fair point on the one-hot nitpick. It's an implementation detail I totally glossed over.

This is pretty amazing! I also sometimes model datasets which are multiple TBs in size, but I admit I slacked off and chose to use Apache Drill + python + feather/arrow in the end, rather than spending time to learn how to do it in R.

I will try to cook up some examples for a blog post. But conceptually, I do three things in the database:

  1. Simple transforms (case_when) or something slightly more complex, like normalizing by group.
  2. Subset the data to only the bits I need to use (filter).
  3. Aggregate the data in some way (group_by).

I do all that with dplyr and a database back end. That way I don't have to "context switch" back and forth between SQL and R. Just staying in R makes my life so much easier.
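
Put together, those three steps might look something like this with a hypothetical orders table:

    library(DBI)
    library(dplyr)

    con    <- dbConnect(RSQLite::SQLite(), "warehouse.sqlite")  # placeholder database
    orders <- tbl(con, "orders")                                # placeholder table

    result <- orders %>%
      # 1. simple transform, done in the database
      mutate(size = case_when(amount >= 1000 ~ "large", TRUE ~ "small")) %>%
      # 2. subset to only the bits I need
      filter(year == 2018) %>%
      # 3. aggregate
      group_by(region, size) %>%
      summarise(n = n(), total = sum(amount, na.rm = TRUE)) %>%
      # by now the result is usually small enough to pull into R
      collect()

    dbDisconnect(con)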

Often by the time I'm done doing those three things my data is small enough to just suck back into R with a collect statement. If it's not, I'll pull back a table with only the groupings. Then I'll iterate over the groupings locally, pulling the data into R one grouping at a time, and operate on each grouping to do whatever I need to do: build a model, or something else I can't do in SQL.

Also, do you have any comments on Redshift vs BigQuery? Any reasons to prefer one over the other?

I'm always quick to tell people, "it depends on exactly what you are optimizing or trying to do." In our case we had had an Actian Matrix cluster for a few years. Redshift was built when Amazon bought the source for Matrix and forked it. Actian was discontinuing Matrix, we could very easily migrate to Redshift, and we knew it would work... so it wasn't a hard decision :slight_smile:

