Multicore processing: best practices when working in RStudio?

I've been happily using RStudio via a rocker container running on Ubuntu 22.04.

Recently I noticed that a sizeable chunk of code I run bi-annually has stopped working.
The code uses mclapply to produce 1000+ visualisations of sales data, one per item.
The script was still working at least 6 months ago (2022/10).

My guess was that something changed in R or the parallel package.
While troubleshooting the issue I came across multiple warnings that "use of forking in the RStudio IDE is dangerous and not recommended",
which felt a little odd to me, since I've been using mclapply and other functions from the parallel package with great success for years.

Do people not use the parallel package when working with RStudio?
Or do you ignore the warning and just use these functions, like I did successfully for 4 years?

Is there a Posit-approved, "official" way to use all the cores on our computer?
This particular PC is an Intel 14-core (28 HT), 256 GB RAM machine.
When mclapply works it is hugely fast. Not being able to use it fully is causing much crying in the office.

Forking is disallowed on Windows, and I suppose working with a GUI on Linux is when it's dangerous...
The alternative to forking is via sockets. Search for PSOCK.
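For example, a minimal sketch of the PSOCK route (my_big_tbl is just a placeholder name; this only uses the parallel package that ships with R):

library(parallel)

# Toy stand-in for a large data set (my_big_tbl is a placeholder name)
my_big_tbl <- data.frame(item = rep(1:10, each = 100), sales = rnorm(1000))

# A PSOCK cluster starts fresh R worker processes instead of forking,
# so it also works on Windows and inside the RStudio IDE.
cl <- makeCluster(4)

# Unlike forking, data (and any packages) must be shipped to the workers explicitly.
clusterExport(cl, varlist = "my_big_tbl")

results <- parLapply(cl, 1:10, function(i) {
  # each worker has its own copy of my_big_tbl
  sum(my_big_tbl$sales[my_big_tbl$item == i])
})

stopCluster(cl)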

I wonder what "working with a GUI" actually means here?

I build a custom Docker container based on the rocker/verse image:
https://hub.docker.com/r/rocker/verse

The machine is a headless, Docker-only R/Python machine with an NVIDIA GPU.
It is running Ubuntu 22.04 Server and no graphical environment is running on the box.

Is this still considered a "using with a GUI" situation?

Thanks for the reminder about using PSOCK!
I should try parLapply, I suppose.

So, it does not involve the RStudio IDE? If it doesn't, then I wouldn't think you'd have a problem, but you did mention the RStudio IDE in your opening post.

Was there some specific error that you encountered?

Apologies for my loose usage of terms.
I am using RStudio in a browser window, so no GUI environment (as in GNOME running on a local Ubuntu install) is interacting with RStudio, but I am nonetheless using the IDE as a web app.

My current issue with mclapply is not a specific error message, but memory exhaustion when running with any number of cores >= 2L, in the worst case making the whole server unstable.

My function references a ~5 GB tibble in the global environment, without modifying it.
There are many more objects in memory, and according to htop on the server, memory usage is around 15 GB before running.

mclapply with mc.cores = 1L runs the function fine, but with any number >= 2L the memory usage increases indefinitely, eventually exhausts all 256 GB of RAM plus the swap file, and then the server becomes unresponsive to network requests.
Restarting the container via Docker rectifies the situation by releasing the memory, if I wait patiently for the SSH login.
Using a higher number of cores only makes the exhaustion faster.

Maybe there have been changes to RStudio or the Linux kernel, and forking behaviour has changed in the past 6 months?
The last time it ran successfully I was on Ubuntu 20.04; the machine was recently updated to Ubuntu 22.04.

It is my understanding that forking via mclapply is copy-on-write: objects in memory are not "copied" until I try to modify them.
Even if I fork 10 processes, it shouldn't immediately use 10x more memory, unless I modify 10 unique versions of the object, which I do not. (Each time the function runs, it takes a subset of the 5 GB tibble, filtered down to about 1% of the original size.)
And even if it did use 10x more memory, there would still be plenty of headroom on a 256 GB RAM machine.
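To illustrate the pattern I mean, here is a minimal sketch with a made-up big_tbl standing in for the real 5 GB tibble:

library(parallel)
library(dplyr)

# Stand-in for the large tibble living in the global environment
big_tbl <- tibble::tibble(
  item  = rep(seq_len(1000), each = 100),
  sales = rnorm(100000)
)

# Each forked child only *reads* big_tbl and returns a small summary,
# so with copy-on-write the children should share the parent's memory pages.
results <- mclapply(unique(big_tbl$item), function(id) {
  big_tbl %>%
    filter(item == id) %>%      # small subset, allocated fresh in the child
    summarise(total = sum(sales))
}, mc.cores = 4)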

I have read cautions against using libraries that themselves try to use multiple cores inside a function that is already being called via mclapply. Again, I do not do this; the function is mostly dplyr data juggling.
I do call ggplot(), but the result is immediately assigned to an object so that no graphical device is used inside the function, roughly like the sketch below.
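Something along these lines (a simplified sketch; plot_one_item is a made-up stand-in for my real function):

library(parallel)
library(ggplot2)

# Hypothetical per-item plotting function: builds the plot object only,
# never calls print() and never touches a graphics device.
plot_one_item <- function(id) {
  ggplot(data.frame(x = rnorm(50), y = rnorm(50)), aes(x, y)) +
    geom_point() +
    ggtitle(paste("item", id))
}

plots <- mclapply(1:8, plot_one_item, mc.cores = 4)

# Writing the files afterwards in the parent process keeps the workers
# away from any graphics device entirely.
for (i in seq_along(plots)) {
  ggsave(sprintf("item_%02d.png", i), plots[[i]], width = 6, height = 4)
}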

Wonder if anyone else has experienced memory exhaustion issues like mine?

Here is some info on why the RStudio IDE has historically not supported forking:


I've had success in the past using foreach to facilitate parallel processing in R.
Using the foreach package (r-project.org)


Thanks for the info! I wonder if the situation has improved since 2019?

And one more thanks for the suggestion to use foreach with doParallel.
%dopar% did indeed allow me to use multiple cores after I adjusted my code from the mclapply style to the foreach %dopar% notation.

I confirmed via htop that more cores are being utilised.

However, performance seems to be greatly reduced compared with mclapply.
I am guessing this is due to the overhead of the data needing to be replicated for each worker process, which is not required when forking. (I hope someone corrects me if I am wrong.)

I just found a case possibly related to mine.

I have split my function into two: one for the data juggling and one for saving the ggplot images.
With ggsave in place, the process is so slow it almost seems frozen, but there is no memory exhaustion.

If I comment out ggsave, the process at least runs to completion.

This is probably obvious, but just in case: you can still run processes like this through either R launched in a terminal or an Rscript command-line call. I usually develop my code in RStudio but then execute the full job from R in a terminal.


Here is an example of a socket-cluster foreach with ggsave; this code works.

library(foreach)
library(doParallel)
library(ggplot2)

# Create a cluster of 4 cores
cl <- makeCluster(4)
registerDoParallel(cl)

# Define a vector of file names for saving the plots
filenames <- paste0("plot", 1:4, ".png")

# Use foreach to loop over the file names and create and save a plot for each one
foreach(i = 1:4, .packages = "ggplot2") %dopar% {
  # Create a simple plot with 42 points
  p <- ggplot(data.frame(x = rnorm(42), y = rnorm(42)), aes(x, y)) +
    geom_point()
  
  # Save the plot to the corresponding file name
  ggsave(filename = filenames[i], plot = p)
}

# Shut down the worker processes when finished
stopCluster(cl)

Thank you for all the helpful comments!!
First I need to apologize for blaming mclapply as the cause of the problem.

After looking into my code again, I found there had been a major spike in the amount of incoming data;
the data passed to the function was much bigger than I anticipated.

I cleaned up the preprocessing and changed the way I pass data to each worker call.
(Instead of passing the 5 GB tibble every time, I grouped and split the tibble into much smaller chunks containing just what is needed for each visualisation; see the sketch below.)

This made a massive improvement in memory usage, and processing is much, much faster, which I imagine is due to better cache locality on the pre-grouped tibbles.
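Roughly what the change looks like (a simplified sketch; big_tbl and process_chunk are stand-ins for my real data and function):

library(parallel)
library(dplyr)

# Stand-in for the large incoming tibble
big_tbl <- tibble::tibble(
  item  = rep(seq_len(1000), each = 100),
  sales = rnorm(100000)
)

# Pre-split once, so each call of the worker function only touches
# the small chunk it actually needs.
chunks <- big_tbl %>%
  group_by(item) %>%
  group_split()

# Hypothetical per-chunk worker: in the real code this does the dplyr
# juggling and builds the visualisation for one item.
process_chunk <- function(chunk) {
  chunk %>% summarise(item = first(item), total = sum(sales))
}

results <- mclapply(chunks, process_chunk, mc.cores = 4)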

After this was done, both %dopar% and mclapply work beautifully, both via the R console directly and in RStudio!!

In my testing, mclapply was around 15% faster than the %dopar% cluster approach.


So to conclude: yes, you can use mclapply and other multi-process tools in RStudio!
My summary:

  1. You can run your script in parallel. Just do not access devices that are not safe to use from multiple processes (i.e. never plot directly to the plot pane; use ggsave() to save directly to a file).
  2. When something unexpected happens (like above), you can't catch it from within RStudio. Be prepared to stop and restart your server. Docker is your friend.
  3. It's probably wise not to try this on a server you don't have control of. Try your code on your own machine before running it on others'.

I will keep using forking tools since I have control of most machines here.
It is strangely satisfying to see CPU usage jump to 100% when you hit enter!
Hope someone finds this summary useful.

