Memory usage and R's global string pool


#1

I’m working on a project where I’m manipulating a lot of string data. I’ve noticed that as I work, my R session uses more and more memory according to Linux’s top command, but from within R (using gc() or pryr::mem_used()) I don’t see any extra memory usage.

I’ve read the Memory chapter in Advanced R, so I know that there is a global string pool, but I can’t find any resources about investigating it or clearing it. Does anyone here have any pointers?
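For context, here is a minimal sketch of what I understand the string pool to do: each distinct string is stored once, and character vectors only hold pointers to it. It assumes that object.size() counts each distinct string in a vector only once; the variable names are just for illustration.

```r
# Two character vectors of the same length with 10-character strings.
n <- 1e4

# One distinct string repeated n times: every element points at the same pooled string.
shared <- rep("abcdefghij", n)

# n distinct strings: each element needs its own entry in the pool.
distinct <- sprintf("%010d", seq_len(n))

shared_size <- as.numeric(object.size(shared))
distinct_size <- as.numeric(object.size(distinct))

# The vector of repeated strings should be reported as much smaller.
c(shared = shared_size, distinct = distinct_size)
```

This only shows the sharing effect on object sizes; it does not say anything about when pooled strings are released.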

I did find this Stack Overflow post, but here’s some code that I think proves the answer wrong. This code makes a large data.frame with a lot of unique strings and shows the memory output I’m seeing from top and from R. I think this confirms that not all of the memory from the strings is returned to the OS, because the top output doesn’t show a decrease at the end.

NUM_LETTERS <- 10
NUM_ROWS <- 1e5
NUM_COLS <- 50


big_char_df <- lapply(seq_len(NUM_COLS), function(c) {
  vapply(
    seq_len(NUM_ROWS), 
    function(r) paste(sample(letters, NUM_LETTERS, TRUE), collapse = ""),
    ""
  )
})
names(big_char_df) <- paste0("var", seq_len(NUM_COLS))
big_char_df <- as.data.frame(big_char_df, stringsAsFactors = FALSE)
gc()
format(object.size(big_char_df), units = "Mb")

rm(big_char_df)
gc()


# Return from gc() before removing
#            used  (Mb) gc trigger  (Mb) max used  (Mb)
# Ncells  5434692 290.3    8273852 441.9  7530900 402.2
# Vcells 19952745 152.3   30316406 231.3 30316406 231.3

# Return from object.size
# [1] "305.2 Mb"

# From top command before removing
# (VIRT) 1272.5m (RES) 713.8m  (SHR) 25.5m

# Return from gc() after removing
#           used (Mb) gc trigger  (Mb) max used  (Mb)
# Ncells  527651 28.2    6619081 353.5  7795622 416.4
# Vcells 5131561 39.2   24253124 185.1 30316406 231.3

# From top command after removing
# (VIRT) 1272.5m (RES) 713.9m  (SHR) 25.6m

#2

Just tried this from my home computer, and it doesn’t appear to affect macOS, but at work it affected both Linux and Windows machines.


#3

Is this memory causing an issue on the machines where you are running this?

It would help us help you if you used a reprex (https://www.tidyverse.org/help/) with embedded comments rather than pasting code directly with separate comments. Reprexes are not only useful for asking questions; they are also useful during development, because they make it easier to create reusable snippets of code to try things out.

Here is a reprex based on your example with embedded comments. It’s a lot easier to see exactly where in the code each measurement is made, and it makes it a lot easier for us to duplicate what you are doing.

This was run on macOS, and it shows that removing big_char_df returns the memory it was using. The example starts off using about 35 MB of memory, rises to 468 MB after building big_char_df, then drops back to about 68 MB after big_char_df is removed, according to pryr::mem_used().

Because of layers of caching and virtual memory, it might take a while before the OS realizes the memory is available; in fact, it might not know until it tries to allocate more memory. And of course it’s also possible that R has not actually freed the memory. That’s why I asked if the memory usage in this example was causing problems for other applications.

NUM_LETTERS <- 10
NUM_ROWS <- 1e5
NUM_COLS <- 50

# initial memory usage
pryr::mem_used()
#> 34.6 MB
gc()
#>          used (Mb) gc trigger (Mb) max used (Mb)
#> Ncells 489139 26.2     940480 50.3   750400 40.1
#> Vcells 921423  7.1    1650153 12.6  1223402  9.4
big_char_df <- lapply(seq_len(NUM_COLS), function(c) {
    vapply(
        seq_len(NUM_ROWS),
        function(r) paste(sample(letters, NUM_LETTERS, TRUE), collapse = ""),
        ""
    )
})
names(big_char_df) <- paste0("var", seq_len(NUM_COLS))

big_char_df <- as.data.frame(big_char_df, stringsAsFactors = FALSE)
# after big_char_df memory usage
pryr::mem_used()
#> 468 MB
gc()
#>            used  (Mb) gc trigger  (Mb) max used  (Mb)
#> Ncells  5490142 293.3    8273852 441.9  6573678 351.1
#> Vcells 20052566 153.0   31186712 238.0 25922260 197.8
# after gc complete
pryr::mem_used()
#> 468 MB
# size big_char_df
format(object.size(big_char_df), units = "Mb")
#> [1] "305.2 Mb"
# pryr check of size of big_char_df (the value printed below is in bytes)
format(pryr::object_size(big_char_df), units = "Mb")
#> [1] "320005712"
rm(big_char_df)
# after big_char_df removed
pryr::mem_used()
#> 67.9 MB
gc()
#>           used (Mb) gc trigger  (Mb) max used  (Mb)
#> Ncells  490904 26.3    5295264 282.8  8273852 441.9
#> Vcells 5054091 38.6   19959495 152.3 29133986 222.3
# after 2nd gc
pryr::mem_used()
#> 67.9 MB
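If you want the OS-level number inside a reprex as well, one option is to ask ps for the resident set size of the R process, which is roughly what top shows as RES. This is only a sketch: rss_mib() is a hypothetical helper, and it assumes a Unix-like system (Linux or macOS) where `ps -o rss= -p <pid>` prints the value in KiB. It uses a smaller vector than the example above so it runs quickly.

```r
# Hypothetical helper: resident set size (RSS) of this R process in MiB.
# Assumes `ps -o rss= -p <pid>` is available and reports KiB (Linux, macOS).
rss_mib <- function() {
  out <- system2("ps", c("-o", "rss=", "-p", Sys.getpid()), stdout = TRUE)
  as.numeric(trimws(out[1])) / 1024
}

rss_before <- rss_mib()

# Build a vector of unique-looking random strings, as in the example above.
x <- vapply(
  seq_len(1e5),
  function(i) paste(sample(letters, 10, TRUE), collapse = ""),
  ""
)
rss_with <- rss_mib()

# Remove the vector, collect garbage, and measure again.
rm(x)
invisible(gc())
rss_after <- rss_mib()

c(before = rss_before, with_strings = rss_with, after_rm = rss_after)
```

Comparing after_rm with what gc() reports at the same point would show whether R has released the memory internally while the OS still counts it against the process.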

#4

Thanks for taking a look!

Yeah, I struggled with the reprex part because my main point is about the output from top, which doesn’t get captured because it’s not R code. So I feel that your post misses the point: the output from R (which is captured by reprex) is the same regardless of which OS I use. However, top shows high memory usage on Linux even after removing the data.frame.

I have seen sporadic problems and slowdowns in other applications, and I believe they are related to the high memory use, but these things are hard to make minimal reproducible examples of, so that’s why I focused on the output from top. Do you have ideas for other things to test?


#5

When you say that it affects Linux and Windows but does not affect macOS, what effect are you referring to? Do apps slow down on Linux and Windows but not on macOS?

I don’t know the details of R’s implementation, in particular how it implements garbage collection, so I’m not sure where you might look next.

Just try to directly verify that R is what is slowing down those other apps before chasing that rabbit :grinning:

By the way, pryr::mem_used() calls gc() and returns the total of the used column that gc() reports, converted to bytes.
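A rough way to check that relationship with base R alone is sketched below. It assumes the second column of the gc() matrix is the used "(Mb)" column in units of 1024^2 bytes, and that pryr prints units of 10^6 bytes; the numbers earlier in the thread fit that, since 293.3 + 153.0 = 446.3 from gc() corresponds to about 468 MB. The comparison with pryr is skipped if the package is not installed.

```r
# gc() returns a matrix; column 2 is the used "(Mb)" column for Ncells and Vcells.
g <- gc()

# Total used memory in units of 1024^2 bytes, as gc() reports it.
used_mib <- sum(g[, 2])

# The same total in units of 10^6 bytes (assumed to be what pryr prints).
used_mb <- used_mib * 1024^2 / 1e6

c(gc_used_MiB = used_mib, gc_used_MB = used_mb)

# Optional comparison; the two numbers should be close but not identical,
# because each call runs its own garbage collection.
if (requireNamespace("pryr", quietly = TRUE)) {
  print(pryr::mem_used())
}
```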