Killed on long file read

I keep getting my commands killed by RStudio Cloud (with only the message Killed). Is there a way I can get more information about whether it's memory or CPU (or something else?!) that I'm exceeding?

Is there a way to force all my functions (including those in libraries) to believe that there isn't more memory/CPU for them to use, so they "self police" and don't need to be killed by RStudio Cloud?

I'm not ready to post the actual link yet (for one thing I need to get the student's permission to share their session or the time to distill it down to a reprex) so for now I'm just asking for high-level strategies for diagnosing the sorts of problems that lead to Killed messages and for forcing (possibly third-party) functions to avoid behavior that will get them killed.

Thanks.

It is likely that if you are seeing "Killed" messages, you are running out of memory. Memory on RStudio Cloud is limited to 1 GB. When the total memory used by the project approaches 1 GB, the system automatically kills the process using the most memory, which sounds like what is happening to you.

Any details you can provide might be useful to troubleshoot further.

The specific point where the killing usually happens is when I use readxl::read_xlsx() on a large file. When the file goes beyond 4000 rows, that process gets killed, though there is some variation in exactly how many rows it takes from one run to another.

So far I have found the following ways to reduce the likelihood of being killed:

  • devtools::install_github("krlmlr/ulimit"); library(ulimit); memory_limit(size = 900)
  • Converting the files to csv so I don't have to use the readxl library

The problem is that these are probabilistic. There ought to be a way to do one or more of the following (in order of importance):

  1. Programmatically check how close I am to the memory cap so I can home in on the exact spot in the code that is at fault.
  2. Reliably make my scripts act like they are running on a computer with, say, 0.9 GB of RAM and, beyond that... I guess use swap?
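For the first item, a rough sketch of what is possible from within R using only base functions. This assumes the R session is the dominant memory consumer in the container (the `mem_used_mb` and `check_headroom` helper names and the 900 MB budget are made up for illustration):

```r
# Approximate R's own memory footprint using base gc().
# Assumption: the R session is the main memory user in the container.
mem_used_mb <- function() {
  g <- gc(full = TRUE)   # force a garbage collection and get the usage table
  # Column 2 ("(Mb)") holds megabytes used for Ncells and Vcells
  sum(g[, 2])
}

# Hypothetical example: warn when usage nears a 900 MB budget
check_headroom <- function(limit_mb = 900) {
  used <- mem_used_mb()
  if (used > 0.9 * limit_mb) {
    warning(sprintf("Memory usage %.0f MB is near the %d MB budget",
                    used, limit_mb))
  }
  invisible(used)
}
```

Note that `gc()` only reports memory R has allocated for its own heap; memory held by compiled code outside R's allocator will not show up here.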

I tried manipulating the skip and n_max arguments to force read_xlsx() to read the file in chunks, but it gets killed on the first chunk, so presumably readxl parses the whole file regardless of which rows it returns.

The practical workaround I'm using for now is encouraging people to use csv files if they can, or to limit the size of the xlsx files they try to use.
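One reason the csv workaround helps: a csv can genuinely be read incrementally with skip/nrows, whereas an xlsx file is a zipped XML archive that has to be parsed as a whole. A minimal base-R sketch of chunked csv processing (the function name and chunk size are illustrative, not from any library):

```r
# Read a large CSV in fixed-size chunks using only base R.
# Unlike xlsx, a CSV can be consumed incrementally via skip/nrows,
# so peak memory stays bounded by the chunk size.
process_csv_in_chunks <- function(path, chunk_rows = 1000, fun = nrow) {
  header <- names(read.csv(path, nrows = 1))
  skip <- 1                      # skip the header line on subsequent reads
  results <- list()
  repeat {
    chunk <- tryCatch(
      read.csv(path, skip = skip, nrows = chunk_rows,
               header = FALSE, col.names = header),
      error = function(e) NULL   # reading past EOF raises an error
    )
    if (is.null(chunk) || nrow(chunk) == 0) break
    results[[length(results) + 1]] <- fun(chunk)
    if (nrow(chunk) < chunk_rows) break
    skip <- skip + chunk_rows
  }
  results
}
```

`fun` is applied to each chunk, so only the per-chunk summaries accumulate in memory rather than the full table.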

Restricting memory usage can be a bit tricky, especially in a containerized environment, but you can compare the limit with your current usage by inspecting two files that describe the current state of the cgroup memory controller (which governs memory for containers):

rstudio-user@application-1051893-deployment-3412526-fbhzf:/cloud/project$ cat /sys/fs/cgroup/memory/memory.limit_in_bytes
1073741824
rstudio-user@application-1051893-deployment-3412526-fbhzf:/cloud/project$ cat /sys/fs/cgroup/memory/memory.usage_in_bytes
86827008
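The same numbers can be read from inside R, which makes the "programmatic check" from the question possible. A defensive sketch (the `cgroup_mem` name is made up; it returns NA outside a container, and the paths shown are for cgroup v1, which is what the transcript above reflects):

```r
# Read the cgroup v1 memory limit and current usage from within R.
# Returns values in bytes, or NA when the files are absent
# (e.g. outside a container, or on a host using cgroup v2).
cgroup_mem <- function() {
  read_num <- function(path) {
    if (file.exists(path)) as.numeric(readLines(path, n = 1)) else NA_real_
  }
  limit <- read_num("/sys/fs/cgroup/memory/memory.limit_in_bytes")
  usage <- read_num("/sys/fs/cgroup/memory/memory.usage_in_bytes")
  c(limit_bytes = limit,
    usage_bytes = usage,
    pct_used    = 100 * usage / limit)
}
```

Calling this periodically between pipeline steps (or from a wrapper around suspect calls) should narrow down which line pushes usage toward the 1 GB cap.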