Is this an indication of a memory leak in readr?

I am trying to use read_lines_chunked to read a huge fixed-width file in chunks. In my actual code I do something with the d object (hence SideEffectChunkCallback), but for this reprex I am simply reading the data in; I don't want it to return anything. And yet, for whatever reason, R is holding on to a bunch of memory. My assumption was that the advantage of reading in chunks is that the chunks aren't all held in memory. Am I misunderstanding what's happening here, or how to use read_lines_chunked? Why is R hanging on to that 2 MB? I know this is small, but my sense is that it should be much closer to zero.

TIA

Sam

library(readr)
library(lobstr)
#> Warning: package 'lobstr' was built under R version 4.0.5
suppressPackageStartupMessages(library(gdata, warn.conflicts = FALSE))
#> Warning in system(cmd, intern = intern, wait = wait | intern,
#> show.output.on.console = wait, : running command 'C:\WINDOWS\system32\cmd.exe /c
#> ftype perl' had status 2
#> Warning in system(cmd, intern = intern, wait = wait | intern,
#> show.output.on.console = wait, : running command 'C:\WINDOWS\system32\cmd.exe /c
#> ftype perl' had status 2


## flexible fn to make fixed width
make_fwf <- function(nrows, file) {
  dat <- data.frame(
    x = runif(nrows),
    y = runif(nrows)
  )
  gdata::write.fwf(dat, file, colnames = FALSE)
  rm(dat)
  gc()
  file
}

fwf_sample <- make_fwf(1E6, "fwf-eg.fwf")

(start <- mem_used())
#> 60,273,208 B

f <- function(x, pos) {
  d <- read_fwf(fwf_sample, fwf_empty(fwf_sample, col_names = c("x", "y")), col_types = c("dd"))
  rm(d)
  gc()
}
read_lines_chunked(
  file = fwf_sample,
  callback = SideEffectChunkCallback$new(f),
  chunk_size = 50000,
  progress = FALSE
)
#> NULL
## Memory taken up
mem_used()
#> 62,609,880 B
## Memory added
mem_used() - start
#> 2,337,608 B


## Size of file
file.info(fwf_sample)$size
#> [1] 2.7e+07
Session info
sessioninfo::session_info()
#> - Session info ---------------------------------------------------------------
#>  setting  value                       
#>  version  R version 4.0.4 (2021-02-15)
#>  os       Windows 10 x64              
#>  system   x86_64, mingw32             
#>  ui       RTerm                       
#>  language (EN)                        
#>  collate  English_Canada.1252         
#>  ctype    English_Canada.1252         
#>  tz       America/Los_Angeles         
#>  date     2021-04-13                  
#> 
#> - Packages -------------------------------------------------------------------
#>  package     * version date       lib source        
#>  backports     1.2.1   2020-12-09 [1] CRAN (R 4.0.3)
#>  cli           2.4.0   2021-04-05 [1] CRAN (R 4.0.4)
#>  crayon        1.4.1   2021-02-08 [1] CRAN (R 4.0.3)
#>  debugme       1.1.0   2017-10-22 [1] CRAN (R 4.0.2)
#>  digest        0.6.27  2020-10-24 [1] CRAN (R 4.0.3)
#>  ellipsis      0.3.1   2020-05-15 [1] CRAN (R 4.0.0)
#>  evaluate      0.14    2019-05-28 [1] CRAN (R 4.0.0)
#>  fansi         0.4.2   2021-01-15 [1] CRAN (R 4.0.3)
#>  fs            1.5.0   2020-07-31 [1] CRAN (R 4.0.2)
#>  gdata       * 2.18.0  2017-06-06 [1] CRAN (R 4.0.4)
#>  glue          1.4.2   2020-08-27 [1] CRAN (R 4.0.2)
#>  gtools        3.8.2   2020-03-31 [1] CRAN (R 4.0.3)
#>  highr         0.8     2019-03-20 [1] CRAN (R 4.0.0)
#>  hms           1.0.0   2021-01-13 [1] CRAN (R 4.0.3)
#>  htmltools     0.5.1.1 2021-01-22 [1] CRAN (R 4.0.3)
#>  knitr         1.31    2021-01-27 [1] CRAN (R 4.0.3)
#>  lifecycle     1.0.0   2021-02-15 [1] CRAN (R 4.0.4)
#>  lobstr      * 1.1.1   2019-07-02 [1] CRAN (R 4.0.5)
#>  magrittr      2.0.1   2020-11-17 [1] CRAN (R 4.0.3)
#>  pillar        1.5.1   2021-03-05 [1] CRAN (R 4.0.4)
#>  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.0.0)
#>  purrr         0.3.4   2020-04-17 [1] CRAN (R 4.0.0)
#>  R.cache       0.14.0  2019-12-06 [1] CRAN (R 4.0.0)
#>  R.methodsS3   1.8.1   2020-08-26 [1] CRAN (R 4.0.2)
#>  R.oo          1.24.0  2020-08-26 [1] CRAN (R 4.0.2)
#>  R.utils       2.10.1  2020-08-26 [1] CRAN (R 4.0.2)
#>  R6            2.5.0   2020-10-28 [1] CRAN (R 4.0.3)
#>  Rcpp          1.0.6   2021-01-15 [1] CRAN (R 4.0.3)
#>  readr       * 1.4.0   2020-10-05 [1] CRAN (R 4.0.2)
#>  rematch2      2.1.2   2020-05-01 [1] CRAN (R 4.0.0)
#>  reprex        2.0.0   2021-04-02 [1] CRAN (R 4.0.5)
#>  rlang         0.4.10  2020-12-30 [1] CRAN (R 4.0.3)
#>  rmarkdown     2.7     2021-02-19 [1] CRAN (R 4.0.4)
#>  sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 4.0.0)
#>  stringi       1.5.3   2020-09-09 [1] CRAN (R 4.0.2)
#>  stringr       1.4.0   2019-02-10 [1] CRAN (R 4.0.2)
#>  styler        1.4.1   2021-03-30 [1] CRAN (R 4.0.4)
#>  tibble        3.1.0   2021-02-25 [1] CRAN (R 4.0.4)
#>  utf8          1.2.1   2021-03-12 [1] CRAN (R 4.0.5)
#>  vctrs         0.3.7   2021-03-29 [1] CRAN (R 4.0.5)
#>  withr         2.4.1   2021-01-26 [1] CRAN (R 4.0.3)
#>  xfun          0.22    2021-03-11 [1] CRAN (R 4.0.4)
#>  yaml          2.2.1   2020-02-01 [1] CRAN (R 4.0.0)
#> 
#> [1] C:/Users/salbers/R/win-library/4.0
#> [2] C:/Program Files/R/R-4.0.4/library

This is mainly just a "cost of doing business". As for the last 2 MB that the routine just won't let go of: when I ran your code and added mem_used() - start at the end, it seems to go down pretty quickly:

> mem_used() - start
2,512 B
> mem_used() - start
694,384 B

And here's the memory involved simply by having the libraries loaded, without doing any read or write operations:

library(lobstr)
gc()
#>           used (Mb) gc trigger (Mb) max used (Mb)
#> Ncells  630527 33.7    1450414 77.5   881415 47.1
#> Vcells 1156304  8.9    8388608 64.0  1847276 14.1
mem_used()
#> 44,597,664 B
library(ggplot2)
mem_used()
#> 56,156,888 B
suppressPackageStartupMessages({
  library(gdata)
})
mem_used()
#> 56,662,288 B
sessionInfo()
#> R version 4.0.4 (2021-02-15)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Pop!_OS 20.10
#> 
#> Matrix products: default
#> BLAS:   /usr/local/lib/R/lib/libRblas.so
#> LAPACK: /usr/local/lib/R/lib/libRlapack.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] gdata_2.18.0  ggplot2_3.3.3 lobstr_1.1.1 
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.6        pillar_1.6.0      compiler_4.0.4    highr_0.8        
#>  [5] tools_4.0.4       digest_0.6.27     evaluate_0.14     lifecycle_1.0.0  
#>  [9] tibble_3.1.0      gtable_0.3.0      pkgconfig_2.0.3   rlang_0.4.10     
#> [13] reprex_2.0.0      DBI_1.1.1         yaml_2.2.1        xfun_0.22        
#> [17] withr_2.4.1       styler_1.4.1      stringr_1.4.0     dplyr_1.0.5      
#> [21] knitr_1.31        gtools_3.8.2      generics_0.1.0    fs_1.5.0         
#> [25] vctrs_0.3.7       grid_4.0.4        tidyselect_1.1.0  glue_1.4.2       
#> [29] R6_2.5.0          fansi_0.4.2       rmarkdown_2.7     purrr_0.3.4      
#> [33] magrittr_2.0.1    backports_1.2.1   scales_1.1.1      ellipsis_0.3.1   
#> [37] htmltools_0.5.1.1 assertthat_0.2.1  colorspace_2.0-0  utf8_1.2.1       
#> [41] stringi_1.5.3     munsell_0.5.0     crayon_1.4.1

Right, I see what you are saying. My problem is that when I use read_lines_chunked on a huge flat file (3 GB), I am able to process the whole file, but at the end I am left with a huge RAM load that doesn't correspond to any objects in the R environment. I think my reprex may not be capturing my situation correctly. I still think there is a memory leak somewhere: processing the 3 GB file results in a ~3 GB memory load, which seems counter to the intent of read_lines_chunked. I'll try to work on a better reprex.
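
For reference, here is a minimal sketch of the pattern read_lines_chunked is meant to support: the callback works on the lines it is handed in x and keeps only a small running summary between chunks. Note that the callback in the reprex above ignores x and re-reads the whole file with read_fwf on every chunk. The file contents, the totals environment, and the use of scan() to parse the two columns are assumptions made for illustration, not part of the original code.

```r
library(readr)

# Small illustrative file: two space-separated numeric columns (made-up data).
path <- tempfile(fileext = ".txt")
writeLines(sprintf("%.6f %.6f", (1:10) / 10, (10:1) / 10), path)

# Running totals live in an environment so the callback can update them;
# only these few numbers survive between chunks.
totals <- new.env()
totals$n <- 0L
totals$sum_x <- 0

f <- function(x, pos) {
  # x is a character vector holding this chunk's lines;
  # pos is the line number of the first line in the chunk.
  vals <- matrix(scan(text = x, quiet = TRUE), ncol = 2, byrow = TRUE)
  totals$n <- totals$n + nrow(vals)
  totals$sum_x <- totals$sum_x + sum(vals[, 1])
  invisible(NULL)
}

read_lines_chunked(
  file = path,
  callback = SideEffectChunkCallback$new(f),
  chunk_size = 4,
  progress = FALSE
)

totals$n      # number of lines processed
totals$sum_x  # running sum of the first column
```

With this shape, peak memory should depend on chunk_size rather than on the size of the file, as long as the reader itself does not leak.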

Yeah, seems odd. One thing I've heard, from Hadley, is that R generally does a good job of garbage collection, but the OS doesn't always cooperate about taking the freed memory back.
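
To illustrate the R side of that, here is a small base-R sketch (used_mb is a hypothetical helper, not something from readr or lobstr). It shows R's own accounting dropping back after rm() and gc(). It says nothing about what the OS reports for the process, which may stay higher for a while.

```r
# Hypothetical helper: memory (in MB) used by R objects according to gc().
used_mb <- function() {
  g <- gc()      # runs a collection and returns a usage matrix
  sum(g[, 2])    # column 2 is "used (Mb)" for Ncells and Vcells
}

before <- used_mb()
big <- numeric(1e7)   # about 80 MB of doubles
during <- used_mb()
rm(big)
after <- used_mb()    # gc() runs again inside the helper

c(before = before, during = during, after = after)
```

If after comes back close to before while the task manager still shows a large footprint, the difference is memory that R has freed but the OS has not yet reclaimed, or memory allocated outside R's heap (for example by compiled code in a package).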

Did you see the update to dev readr yesterday that avoids a memory leak when reading chunks? Might be relevant.
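
A quick way to check which side of that update you are on is to compare version numbers. This is only a sketch: is_newer_than_1_4_0 is a hypothetical helper, 1.4.0 is the CRAN version shown in the session info above, and being newer than 1.4.0 does not by itself prove that a given build contains the fix.

```r
# Hypothetical helper: TRUE when `v` is newer than the CRAN release 1.4.0,
# for example a development version such as 1.4.0.9000.
is_newer_than_1_4_0 <- function(v) {
  package_version(as.character(v)) > package_version("1.4.0")
}

is_newer_than_1_4_0("1.4.0")
is_newer_than_1_4_0("1.4.0.9000")

# Against the installed package (assumes readr is installed):
# is_newer_than_1_4_0(utils::packageVersion("readr"))
```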

Thanks @mara! Do you have any idea of the release cycle of readr? As in, do you know when this bug fix will hit CRAN?

@technocrat it turns out compression was the missing piece in my poor reprex: the leak shows up once the file is gzipped. Have a look here:

library(readr)
library(lobstr)
#> Warning: package 'lobstr' was built under R version 4.0.5
suppressPackageStartupMessages(library(gdata, warn.conflicts = FALSE))
#> Warning in system(cmd, intern = intern, wait = wait | intern,
#> show.output.on.console = wait, : running command 'C:\WINDOWS\system32\cmd.exe /c
#> ftype perl' had status 2
#> Warning in system(cmd, intern = intern, wait = wait | intern,
#> show.output.on.console = wait, : running command 'C:\WINDOWS\system32\cmd.exe /c
#> ftype perl' had status 2

mem_used()
#> 51,673,568 B
# flexible fn to make fixed width
make_fwf <- function(nrows, file) {
  dat <- data.frame(
    x = runif(nrows),
    y = runif(nrows)
  )
  gdata::write.fwf(dat, file, colnames = FALSE)
  rm(dat)
  gc()

  R.utils::gzip(file)
}

fwf_sample <- make_fwf(1E6, "fwf-eg.fwf")

(start <- mem_used())
#> 61,211,352 B

f <- function(x, pos) {
  d <- read_fwf(fwf_sample, fwf_empty(fwf_sample, col_names = c("x", "y")), col_types = c("dd"))
  rm(d)
  gc()
}
read_lines_chunked(
  file = fwf_sample,
  callback = SideEffectChunkCallback$new(f),
  chunk_size = 50000,
  progress = FALSE
)
#> NULL
## Memory taken up
mem_used()
#> 1,713,272,496 B
## Memory added
mem_used() - start
#> 1,652,060,232 B


## Size of file
file.info(fwf_sample)$size
#> [1] 9579982
Session info
sessioninfo::session_info()
#> - Session info ---------------------------------------------------------------
#>  setting  value                       
#>  version  R version 4.0.4 (2021-02-15)
#>  os       Windows 10 x64              
#>  system   x86_64, mingw32             
#>  ui       RTerm                       
#>  language (EN)                        
#>  collate  English_Canada.1252         
#>  ctype    English_Canada.1252         
#>  tz       America/Los_Angeles         
#>  date     2021-04-14                  
#> 
#> - Packages -------------------------------------------------------------------
#>  package     * version date       lib source        
#>  backports     1.2.1   2020-12-09 [1] CRAN (R 4.0.3)
#>  cli           2.4.0   2021-04-05 [1] CRAN (R 4.0.4)
#>  crayon        1.4.1   2021-02-08 [1] CRAN (R 4.0.3)
#>  debugme       1.1.0   2017-10-22 [1] CRAN (R 4.0.2)
#>  digest        0.6.27  2020-10-24 [1] CRAN (R 4.0.3)
#>  ellipsis      0.3.1   2020-05-15 [1] CRAN (R 4.0.0)
#>  evaluate      0.14    2019-05-28 [1] CRAN (R 4.0.0)
#>  fansi         0.4.2   2021-01-15 [1] CRAN (R 4.0.3)
#>  fs            1.5.0   2020-07-31 [1] CRAN (R 4.0.2)
#>  gdata       * 2.18.0  2017-06-06 [1] CRAN (R 4.0.4)
#>  glue          1.4.2   2020-08-27 [1] CRAN (R 4.0.2)
#>  gtools        3.8.2   2020-03-31 [1] CRAN (R 4.0.3)
#>  highr         0.8     2019-03-20 [1] CRAN (R 4.0.0)
#>  hms           1.0.0   2021-01-13 [1] CRAN (R 4.0.3)
#>  htmltools     0.5.1.1 2021-01-22 [1] CRAN (R 4.0.3)
#>  knitr         1.31    2021-01-27 [1] CRAN (R 4.0.3)
#>  lifecycle     1.0.0   2021-02-15 [1] CRAN (R 4.0.4)
#>  lobstr      * 1.1.1   2019-07-02 [1] CRAN (R 4.0.5)
#>  magrittr      2.0.1   2020-11-17 [1] CRAN (R 4.0.3)
#>  pillar        1.5.1   2021-03-05 [1] CRAN (R 4.0.4)
#>  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.0.0)
#>  purrr         0.3.4   2020-04-17 [1] CRAN (R 4.0.0)
#>  R.cache       0.14.0  2019-12-06 [1] CRAN (R 4.0.0)
#>  R.methodsS3   1.8.1   2020-08-26 [1] CRAN (R 4.0.2)
#>  R.oo          1.24.0  2020-08-26 [1] CRAN (R 4.0.2)
#>  R.utils       2.10.1  2020-08-26 [1] CRAN (R 4.0.2)
#>  R6            2.5.0   2020-10-28 [1] CRAN (R 4.0.3)
#>  Rcpp          1.0.6   2021-01-15 [1] CRAN (R 4.0.3)
#>  readr       * 1.4.0   2020-10-05 [1] CRAN (R 4.0.2)
#>  rematch2      2.1.2   2020-05-01 [1] CRAN (R 4.0.0)
#>  reprex        2.0.0   2021-04-02 [1] CRAN (R 4.0.5)
#>  rlang         0.4.10  2020-12-30 [1] CRAN (R 4.0.3)
#>  rmarkdown     2.7     2021-02-19 [1] CRAN (R 4.0.4)
#>  sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 4.0.0)
#>  stringi       1.5.3   2020-09-09 [1] CRAN (R 4.0.2)
#>  stringr       1.4.0   2019-02-10 [1] CRAN (R 4.0.2)
#>  styler        1.4.1   2021-03-30 [1] CRAN (R 4.0.4)
#>  tibble        3.1.0   2021-02-25 [1] CRAN (R 4.0.4)
#>  utf8          1.2.1   2021-03-12 [1] CRAN (R 4.0.5)
#>  vctrs         0.3.7   2021-03-29 [1] CRAN (R 4.0.5)
#>  withr         2.4.1   2021-01-26 [1] CRAN (R 4.0.3)
#>  xfun          0.22    2021-03-11 [1] CRAN (R 4.0.4)
#>  yaml          2.2.1   2020-02-01 [1] CRAN (R 4.0.0)
#> 
#> [1] C:/Users/salbers/R/win-library/4.0
#> [2] C:/Program Files/R/R-4.0.4/library

I don't. I think your best bet is to ask Jim on the GH issue.

And just for completeness' sake, here is the reprex with the fixed version of readr:

library(readr)
library(lobstr)
#> Warning: package 'lobstr' was built under R version 4.0.5
suppressPackageStartupMessages(library(gdata, warn.conflicts = FALSE))
#> Warning in system(cmd, intern = intern, wait = wait | intern,
#> show.output.on.console = wait, : running command 'C:\WINDOWS\system32\cmd.exe /c
#> ftype perl' had status 2
#> Warning in system(cmd, intern = intern, wait = wait | intern,
#> show.output.on.console = wait, : running command 'C:\WINDOWS\system32\cmd.exe /c
#> ftype perl' had status 2

mem_used()
#> 54,031,456 B
# flexible fn to make fixed width
make_fwf <- function(nrows, file) {
  dat <- data.frame(
    x = runif(nrows),
    y = runif(nrows)
  )
  gdata::write.fwf(dat, file, colnames = FALSE)
  rm(dat)
  gc()

  R.utils::gzip(file)
}

fwf_sample <- make_fwf(1E6, "fwf-eg.fwf")

(start <- mem_used())
#> 63,573,320 B

f <- function(x, pos) {
  d <- read_fwf(fwf_sample, fwf_empty(fwf_sample, col_names = c("x", "y")), col_types = c("dd"))
  rm(d)
  gc()
}
read_lines_chunked(
  file = fwf_sample,
  callback = SideEffectChunkCallback$new(f),
  chunk_size = 50000,
  progress = FALSE
)
#> NULL
## Memory taken up
mem_used()
#> 65,777,192 B
## Memory added
mem_used() - start
#> 2,202,944 B


## Size of file
file.info(fwf_sample)$size
#> [1] 9580166
Session info
sessioninfo::session_info()
#> - Session info ---------------------------------------------------------------
#>  setting  value                       
#>  version  R version 4.0.4 (2021-02-15)
#>  os       Windows 10 x64              
#>  system   x86_64, mingw32             
#>  ui       RTerm                       
#>  language (EN)                        
#>  collate  English_Canada.1252         
#>  ctype    English_Canada.1252         
#>  tz       America/Los_Angeles         
#>  date     2021-04-15                  
#> 
#> - Packages -------------------------------------------------------------------
#>  package     * version    date       lib source                          
#>  backports     1.2.1      2020-12-09 [1] CRAN (R 4.0.3)                  
#>  cli           2.4.0      2021-04-05 [1] CRAN (R 4.0.4)                  
#>  clock         0.2.0      2021-04-12 [1] CRAN (R 4.0.5)                  
#>  crayon        1.4.1      2021-02-08 [1] CRAN (R 4.0.3)                  
#>  debugme       1.1.0      2017-10-22 [1] CRAN (R 4.0.2)                  
#>  digest        0.6.27     2020-10-24 [1] CRAN (R 4.0.3)                  
#>  ellipsis      0.3.1      2020-05-15 [1] CRAN (R 4.0.0)                  
#>  evaluate      0.14       2019-05-28 [1] CRAN (R 4.0.0)                  
#>  fansi         0.4.2      2021-01-15 [1] CRAN (R 4.0.3)                  
#>  fs            1.5.0      2020-07-31 [1] CRAN (R 4.0.2)                  
#>  gdata       * 2.18.0     2017-06-06 [1] CRAN (R 4.0.4)                  
#>  glue          1.4.2      2020-08-27 [1] CRAN (R 4.0.2)                  
#>  gtools        3.8.2      2020-03-31 [1] CRAN (R 4.0.3)                  
#>  highr         0.8        2019-03-20 [1] CRAN (R 4.0.0)                  
#>  hms           1.0.0      2021-01-13 [1] CRAN (R 4.0.3)                  
#>  htmltools     0.5.1.1    2021-01-22 [1] CRAN (R 4.0.3)                  
#>  knitr         1.31       2021-01-27 [1] CRAN (R 4.0.3)                  
#>  lifecycle     1.0.0      2021-02-15 [1] CRAN (R 4.0.4)                  
#>  lobstr      * 1.1.1      2019-07-02 [1] CRAN (R 4.0.5)                  
#>  magrittr      2.0.1      2020-11-17 [1] CRAN (R 4.0.3)                  
#>  pillar        1.5.1      2021-03-05 [1] CRAN (R 4.0.4)                  
#>  pkgconfig     2.0.3      2019-09-22 [1] CRAN (R 4.0.0)                  
#>  purrr         0.3.4      2020-04-17 [1] CRAN (R 4.0.0)                  
#>  R.cache       0.14.0     2019-12-06 [1] CRAN (R 4.0.0)                  
#>  R.methodsS3   1.8.1      2020-08-26 [1] CRAN (R 4.0.2)                  
#>  R.oo          1.24.0     2020-08-26 [1] CRAN (R 4.0.2)                  
#>  R.utils       2.10.1     2020-08-26 [1] CRAN (R 4.0.2)                  
#>  R6            2.5.0      2020-10-28 [1] CRAN (R 4.0.3)                  
#>  Rcpp          1.0.6      2021-01-15 [1] CRAN (R 4.0.3)                  
#>  readr       * 1.4.0.9000 2021-04-15 [1] Github (tidyverse/readr@68c2406)
#>  rematch2      2.1.2      2020-05-01 [1] CRAN (R 4.0.0)                  
#>  reprex        2.0.0      2021-04-02 [1] CRAN (R 4.0.5)                  
#>  rlang         0.4.10     2020-12-30 [1] CRAN (R 4.0.3)                  
#>  rmarkdown     2.7        2021-02-19 [1] CRAN (R 4.0.4)                  
#>  sessioninfo   1.1.1      2018-11-05 [1] CRAN (R 4.0.0)                  
#>  stringi       1.5.3      2020-09-09 [1] CRAN (R 4.0.2)                  
#>  stringr       1.4.0      2019-02-10 [1] CRAN (R 4.0.2)                  
#>  styler        1.4.1      2021-03-30 [1] CRAN (R 4.0.4)                  
#>  tibble        3.1.0      2021-02-25 [1] CRAN (R 4.0.4)                  
#>  tzdb          0.1.0      2021-03-04 [1] CRAN (R 4.0.5)                  
#>  utf8          1.2.1      2021-03-12 [1] CRAN (R 4.0.5)                  
#>  vctrs         0.3.7      2021-03-29 [1] CRAN (R 4.0.5)                  
#>  withr         2.4.1      2021-01-26 [1] CRAN (R 4.0.3)                  
#>  xfun          0.22       2021-03-11 [1] CRAN (R 4.0.4)                  
#>  yaml          2.2.1      2020-02-01 [1] CRAN (R 4.0.0)                  
#> 
#> [1] C:/Users/salbers/R/win-library/4.0
#> [2] C:/Program Files/R/R-4.0.4/library