I've used doParallel and foreach to run my PhD analysis code on an HPC cluster, and it went fine.
It required only minimal changes to my existing loops. Instead of:
for (i in my_list) {
  # do the thing
}
it's:
foreach(i = my_list, ...) %dopar% {
  # do the thing
}
And if you change %dopar% to just %do%, it'll run sequentially on a single thread, which is handy for debugging.
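For context, the full setup around that loop looks roughly like this with doParallel. It's a minimal sketch: the worker count, my_list, and the loop body are placeholders for whatever your code actually does.

library(foreach)
library(doParallel)

cl <- makeCluster(4)      # placeholder: match this to the cores you request
registerDoParallel(cl)    # register the cluster as the %dopar% backend

my_list <- 1:10           # placeholder input

results <- foreach(i = my_list) %dopar% {
  # do the thing; each iteration runs on a worker and the results come back as a list
  i * 2
}

stopCluster(cl)

Swapping %dopar% for %do% there runs the same loop in your current session without needing the cluster at all.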
There are a few gotchas, though. You need to add arguments to the foreach() call naming the packages your child processes need (they aren't exported by default), and you also need to pass along any existing variables they'll use. This is the vignette I used to get started.

Finally, output from the child processes isn't automatically logged, so you'll need to provide an argument for that too. If you do, I recommend attaching some sort of child process identifier to any debugging output: all of the child processes write their logs to one file, mixed together (since they run simultaneously), and you'll likely want to filter down to a single worker at a time when debugging.
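To make those arguments concrete, here's a rough sketch of how they fit together. The .packages and .export arguments and makeCluster()'s outfile argument are the real parameter names; "stats", some_threshold, and workers.log are just stand-ins for whatever your code actually needs.

library(foreach)
library(doParallel)

# Workers write their output to one shared log file; since they run at the
# same time, lines from different workers will be interleaved.
cl <- makeCluster(4, outfile = "workers.log")
registerDoParallel(cl)

my_list <- 1:10           # placeholder input, as above
some_threshold <- 0.5     # an existing variable the workers need

results <- foreach(
  i = my_list,
  .packages = c("stats"),          # packages each worker should load
  .export = c("some_threshold")    # variables to ship to each worker
) %dopar% {
  # Tag messages with the worker's PID so the shared log can be filtered per worker
  message(sprintf("[worker %d] processing item %s", Sys.getpid(), i))
  i > some_threshold
}

stopCluster(cl)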
If I were re-writing my code now, I'd probably also use futures and purrr. The former adds some niceties to doParallel and foreach if you choose to stick with that syntax, like automatically exporting packages, and it's backend-agnostic, from what I understand. Here's the article I read on it.
Whatever route you take, on a PBS system (like the supercomputer I use, Raijin at the National Computational Infrastructure) it's as simple as requesting multiple CPU cores for the job you queue up. For example, I wrap my master R script in a shell script for the purposes of queueing it, which lets me specify resource limits as comments at the top of the script (instead of passing them as arguments to the qsub command).
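For illustration, the wrapper looks something like the sketch below. The directive values, module name, and script name are placeholders (and your system may require extra directives, like a project code), so check your cluster's documentation for the exact flags.

#!/bin/bash
#PBS -q normal
#PBS -l ncpus=16
#PBS -l mem=32GB
#PBS -l walltime=04:00:00
#PBS -l wd

module load R                # module name/version varies by system
Rscript master_analysis.R   # placeholder name for the master R script

Queueing it is then just qsub on that script.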
I hope that helps!