Parallel for loop on linux cluster - How to?


#1

I have a loop, where each iteration takes ~1 hour. As each iteration in the for loop is independent, I would like to perform this in parallel on a linux cluster, such that runtime will be ~1 hour, rather then 1h x n_iterations.
Pseudo code:

for( i in interations ){
  do time heavy iteration independent calculation
  save results to file
  go to next iteration
}
Read in and concatenate results
write concatenated results to single file

I am looking for experience in running such parallel loops?


#2

Are those iterations embarrassingly parallel or they depend on each other? If former, then I can suggest looking in future and/or furrr (purrr + future) packages. They make it easy to run parallel processes. I don't think I've done it on a cluster, but they for sure have some way to run it even in some sort of HPC environments, so clusters should be covered as well.


#3

I am not as brave as @Leon :slight_smile: I can't get too excited about dev code. There's a CRAN task view for parallel computing that should be helpful: https://cran.r-project.org/web/views/HighPerformanceComputing.html

It really depends on what parallel backend you want to run (Hadoop, Snow cluster, etc). You will need an R construct to do the parallelization, I recommend foreach but then you need a backend. The nice thing about foreach is that you can run local multicore loops for development. So you use a local backend to test things, then when you've run a loop or two you can use a parallel multi machine backend to blow your calculations out to your cluster and you only have to change one line of code: the line that sets up the backend.

Do you already have an existing cluster? What's running on it?

Credential signaling: I used to maintain an R package for doing this type of thing on Amazon's Elastic Map Reduce (EMR). I gave a lightening talk at R in Finance in 2011 about this topic. The prez is less than 2 min long, because I'm fast af:


#4
x86_64 GNU/Linux

With more than plenty of cores, disk and ram - I am lucky enough to have access to Danish National Supercomputer for Life Sciences. And it is indeed embarrassingly parallel, so I really do 'just' need to spawn children from the for loop


#5

what clustering software? Not OS. Is there already a framework running on the cluster for R, like snow? Is R installed on the cluster? Are you going to use a "custom virtual cluster" ... have others run R on the cluster in parallel? What did they use?


#6

I'd start by asking your cluster maintainers the above question.


#7

Take a look at these



https://hpcf.umbc.edu/other-packages/how-to-run-r-programs-on-maya/#heading_toc_j_5


#8

So the thing is - I know how to do it old-school style: Write a script, which can be called commandline-style with arguments, pre-partition the data and write a bash script calling the first script x number of times with corresponding data and arguments, putting each process in the back ground using &, but... I want to find out how to use R and do this staying in the RStudio IDE.

But I think you're right, I need to seek out other users of the super-computer and see what they've done :slightly_smiling_face:


#9

Thanks, I will take a look at this :slightly_smiling_face:


#10

I've used doparallel and foreach to run my PhD analysis code on an HPC cluster, and it went fine :slight_smile: It required only minimally modifying my existing loops—instead of:

for (i in mylist) {
  # do the thing
}

it's:

for (i = my_list, ...) %dopar% {
  # do the thing
}

And if you change %dopar% to just %do% it'll run on a single thread, which is handy for debugging purposes.

There are a few gotchas, though. You need to add arguments to the for statement that porovides for the packages your child processes need (they aren't exported by default), and you also need to pass any existing variables that they might need. This is the vignette I used to get started. Finally, output in the child processes isn't automatically logged, so you'll need to provide an argument for that too. If you do, I recommend that you attach some sort of child process identifier to any debugging output, because all of the child processes will have their logs in one file mixed together (since they run simultaneously), and you'll likely want to filter down to a single thread at a time when debugging.

If I were re-writing my code now, I'd probably also use futures and purrr. The former adds some niceties to doparallel and foreach if you choose to use that syntax, like automatically exporting packages. And it's backend-agnostic, from what I understand. Here's the article I read on it.

Whatever route you take, on a PBS system (like the supercomputer I use, Raijin at the National Computational Infrastructure) it's as simple as passing arguments to the job you queue up requesting multiple CPU cores. For example, I wrap my master R script in a shell script for the purposes of queueing it up, and that allows me to specify resource limits as comments at the top of the script (instead of poassing them as arguments to the qsub command).

I hope that helps!


#11

Cheers @rensa!

We too have to use qsub to submit jobs, so it seems we have a similar setup (Though I envy your access to multiple GPUs :metal: Computerome for now has a focus on CPUs)

You wouldn't happen to have a tiny 'dummy' script demonstrating your HPC workflow lying around?


#12

I haven't been brave enough to try and do analysis work with CUDA :sweat_smile: Maybe when I get a new laptop and have a weird problem to work on! One of my colleagues has been getting into it, though!

I can do you one better: my PhD code is in a public repo. Take a look at sched.sh in the root folder; that's my PBS script. The ones it calls are in the raijin folder; take a look at lines 35–45 and 100 of raijin/2-extract.r for an example of a parallel process. It's commented out at the moment (I was doing some debugging last time I ran that script), but I think that shows how easily the parallel loop drops in in place of the regular one. Also note, on line 100, that I explicitly stop the cluster (mostly because I start a new one right after, and I had problems with child processes crashing but not exiting cleanly—not fun to discover 12 hours later).

I hope that helps! Unfortunately there's nothing dummy about it, but in each case I basically build a data frame from my directory listing, chop it up into groups and feed each group into a child process. The child processes uses that subgroup of file listings to open input files, operate on them and write back out.


#13

You are likely using Sun Grid Engine as a scheduler. SGE is supported by batchtools so you can either use batchtools directly or you can leverage the much higher level of abstraction provided by the future family of packages. Specifically, future.batchtools implements the connection between future and batchtools.

If you want to structure your code as a for loop, doFuture makes the whole machinery available as a backend for the popular foreach package. We have been using the stack foreach -> doFuture -> future -> future.batchtools with SLURM for several years and are very happy with it.


#14

@davis has a really cool package you may be interested too: https://github.com/DavisVaughan/furrr


#15

If they do use qsub and Sun Grid Engine and can get an example going with future.batchtools then they can make it work with furrr. It should be as simple as setting plan(batchtools_sge) once everything is set up, then calling future_map().


#16

Thank you for a lot of input! Have any of you tried working with:
https://ropensci.github.io/drake/articles/parallelism.html by Will Landau?


#17

Drake can definitely add parallelism and HPC. See https://ropensci.github.io/drake/articles/parallelism.html#future-and-future_lapply for how to talk to clusters like SLURM and SGE. It won't parallelize a for loop though. Drake's parallelism is about walking through a directed acyclic graph of tasks and running each task at most once. The embedded videos at https://ropensci.github.io/drake/articles/parallelism.html#scheduling-algorithms demonstrate the parallel algorithms you can choose from.