RStudio consuming more and more RAM

arnyeinstein · October 19, 2022, 2:52pm

Hi
I am loading a tsv-file of 51 MB. I am quite astonished that RStudio uses 8 GB of RAM for this.
As I keep working, RAM usage keeps increasing, making RStudio slower and slower.

This is the script (the tsv-file, a Eurostat data file, can be downloaded here).

library(tidyverse)
supply <- read.table("naio_10_cp15.tsv",   sep = '\t', header = TRUE, fill = TRUE)  %>%
  as_tibble() %>%
  pivot_longer(2:last_col(), names_to = "Year", values_to = "Value") %>%
  mutate(Value = as.numeric(Value),
         Year = as.numeric(str_replace(Year, "X","" ))) %>%
  rename(X1 = 1) %>%
  separate(X1, into = c("X1", "X2", "X3", "X4", "X5"), sep = ",")

Any idea what is causing this high usage?

Cheers
Renger

dvetsch75 · October 19, 2022, 3:39pm

I am going to venture a guess at pivot_longer being the source of your problems. The reason that I say this is that it can be easy to lose sight of how large of a data.frame pivot_longer is actually going to create. For example:

library(dplyr)
library(tidyr)

wide_df <- lapply(
    1:100,
    function(x) {
        runif(10)
    }
) %>% 
    bind_cols()

dim(wide_df)
#> [1]  10 100

# Total number of values in the dataframe
wide_df %>% dim %>% prod
#> [1] 1000
pivot_longer(wide_df, everything()) %>% dim %>% prod
#> [1] 2000

Created on 2022-10-19 by the reprex package (v1.0.0)

This example illustrates that it may not be obvious that you are creating a much bigger dataset.

So, I think to get to the bottom of your issue: what were the dimensions of supply after calling as_tibble() versus the dimensions after calling pivot_longer?
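One way to check this is to compare dimensions and approximate in-memory size before and after pivoting. The following is only a minimal sketch on made-up data: `wide_df`, `long_df` and the `key` column are illustrative stand-ins, not the Eurostat table.

```r
library(tidyr)

# Hypothetical stand-in for a wide table: 1 key column + 50 value columns
wide_df <- as.data.frame(matrix(runif(1000 * 50), nrow = 1000))
wide_df <- cbind(key = sprintf("row%04d", seq_len(1000)), wide_df)

# Every value column becomes its own set of rows; the key is repeated for each
long_df <- pivot_longer(wide_df, cols = -key,
                        names_to = "Year", values_to = "Value")

dim(wide_df)
dim(long_df)

# object.size() gives a rough estimate of the memory each object occupies
format(object.size(wide_df), units = "MB")
format(object.size(long_df), units = "MB")
```

The long form stores the key and a name alongside every single value, so it can take noticeably more memory than the wide form even though it holds the same data.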

Flm · October 19, 2022, 3:55pm

I just tried:
An empty project on my Mac uses 90 MB; after I loaded and ran your script, RAM usage increased to ~9 GB. Running the script took about 2 minutes, which is acceptable for a table this big (11901408 rows, 7 variables).
Once the dataframe has been imported, you can work on a subset to speed up operations and scale up to the full data only at the end, for example:

small <- head(supply, 3000)

arnyeinstein · October 19, 2022, 6:30pm

Thanks. But this doesn't seem to be the issue: if I save the data in an RDS file and load it in a new session, RAM usage is around 1.4 GB instead of 8 GB.
I tested a little bit more:

I tried garbage collection with gc(), but this had no effect.
If I remove the long table with rm(supply), the RAM usage drops a little, but is still around 7.5 GB.
If I repeat the loading and calculations and assign the result to supply2, the RAM usage goes up to 10 GB.

I also ran the script in the Rgui: same problem, so it seems that tidyverse is causing the immense use of RAM.
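For comparing what R itself counts as used with what the operating system shows, the matrix returned by gc() can be inspected directly. This is only a sketch on made-up data; `used_mb` is a hypothetical helper name. The figure the OS reports for the process can stay higher than R's own count, because memory that R has freed is not necessarily handed back to the OS right away; how much is returned depends on the platform.

```r
# Hypothetical helper: total memory R reports as in use, in Mb.
# Column 2 of the gc() matrix is the "(Mb)" value next to "used".
used_mb <- function(g) sum(g[, 2])

# Build a large temporary list, similar in shape to what string splitting returns
x <- strsplit(rep("A,B,C,D,E", 1e5), ",", fixed = TRUE)
before <- gc()

rm(x)
after <- gc()

used_mb(before)
used_mb(after)
```

If R's own count drops after rm() and gc() while the process size in the task manager does not, the objects were released inside R and the remaining difference lies in how memory is returned to the system.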

arnyeinstein · October 20, 2022, 12:22pm

I have found the "culprit". The separate command is taking most of the time and causes the increase in RAM.
It is still weird, however, that after finishing the command R still uses so much memory and it is not freed.
Perhaps someone could explain that.
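One thing that might reduce the cost is to run separate() before pivot_longer(), while each key still occupies a single row, so far fewer strings have to be split. This is an untested idea for the full file; the sketch below uses a made-up miniature table, and `wide`, `key` and `key_cols` are illustrative names. Whether it also changes how much memory is released afterwards is an open assumption.

```r
library(dplyr)
library(tidyr)

# Hypothetical miniature of the wide file: one comma-joined key column plus year columns
wide <- tibble(
  key   = c("A,B,C,D,E", "F,G,H,I,J"),
  X2019 = c("1", "2"),
  X2020 = c("3", "4")
)

key_cols <- paste0("X", 1:5)

long <- wide %>%
  # Split the key while there is still only one row per key
  separate(key, into = key_cols, sep = ",") %>%
  # Then reshape everything except the key columns
  pivot_longer(!all_of(key_cols), names_to = "Year", values_to = "Value") %>%
  mutate(Value = as.numeric(Value),
         Year  = as.numeric(sub("^X", "", Year)))

dim(long)
```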

martin.R · October 20, 2022, 12:42pm

I haven't got a specific answer, but some operations copy the entire dataframe in memory more than once because the functions are not written to minimise memory requirements.

If you wish to minimise memory overhead, try using data.table. tstrsplit() is the equivalent of separate().
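A minimal sketch of what that could look like, on made-up data (`dt` and the `code` column are illustrative names, not taken from the original script):

```r
library(data.table)

# Hypothetical stand-in for the long table; 'code' holds the comma-joined key
dt <- data.table(
  code  = c("A,B,C,D,E", "F,G,H,I,J"),
  Value = c(1, 2)
)

# tstrsplit() splits each string and transposes the result into one vector per piece;
# := adds the new columns by reference, without copying the whole table
dt[, paste0("X", 1:5) := tstrsplit(code, ",", fixed = TRUE)]

# Drop the original combined column, again by reference
dt[, code := NULL]

dt
```

Setting fixed = TRUE treats the comma as a literal string rather than a regular expression, which is usually faster for a simple delimiter.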

arnyeinstein · October 20, 2022, 12:46pm

Thanks, it looks like separate (or str_split) is producing huge lists (I used str_split on the vector and it produced a list of 4.5 GB). The funny thing is that the temporary list seems to remain in memory and you can't get rid of it.
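The size of such a list can be illustrated on a small scale. The sketch below uses base strsplit() as a stand-in for str_split(), since both return a list with one character vector per input string; the data is made up.

```r
# 100,000 copies of a five-part key, as a plain character vector
x <- rep("A,B,C,D,E", 1e5)

# Splitting returns a list holding one small character vector per input element
parts <- strsplit(x, ",", fixed = TRUE)

size_vec  <- object.size(x)
size_list <- object.size(parts)

format(size_vec,  units = "MB")
format(size_list, units = "MB")
```

Each of the small vectors in the list carries its own header and pointers, so the list ends up considerably larger than the vector it was built from.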
Thanks for the advice on data.table. I will see how that compares to separate.

andresrcs · October 20, 2022, 1:00pm

I think you should file an issue report

arnyeinstein · October 20, 2022, 1:55pm

I submitted an issue.
