Parallelise a for loop inside an R script using Slurm

I have thousands of data frames and I want to parallelise their analysis with Slurm.
Here is a simplified example:

I have an Rscript that I call: test.R

test.R contains these commands:
library(tidyverse)

df1 <- tibble(col1=c(1,2,3),col2=c(4,5,6))
df2 <- tibble(col1=c(7,8,9),col2=c(10,11,12))
files <- list(df1,df2)


for(i in seq_along(files)){
  df3 <- as.data.frame(files[[i]]) %>% 
    summarise(across(everything(), list(mean=mean,sd=sd)))
  
  write.table(df3, paste0("df",i))
}

Created on 2022-04-15 by the reprex package (v2.0.1)

I want to parallelise the for loop so that the analysis of each data frame runs as a separate job.
Any help, guidance, or tutorials would be appreciated.

Would the --array option help?

#!/bin/bash
### Job name
#SBATCH --job-name=parallel
#SBATCH --ntasks=1
#SBATCH --nodes=1
#SBATCH --time=10:00:00
#SBATCH --mem=16G
#SBATCH --array=1-2

module load R/4.1.3
Rscript test.R $SLURM_ARRAY_TASK_ID

Hi,

Take a look at the foreach package. It makes it very easy to convert a regular loop into a parallel one:
https://cran.r-project.org/web/packages/foreach/vignettes/foreach.html
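For the loop in the question, a minimal sketch using foreach with a doParallel backend might look like the following (the worker count of 2 is just an assumption; match it to the CPUs Slurm gives you):

library(tidyverse)
library(foreach)
library(doParallel)

df1 <- tibble(col1=c(1,2,3), col2=c(4,5,6))
df2 <- tibble(col1=c(7,8,9), col2=c(10,11,12))
files <- list(df1, df2)

# register a parallel backend with 2 workers (adjust to your allocation)
cl <- makeCluster(2)
registerDoParallel(cl)

# %dopar% sends each iteration to a separate worker
foreach(i = seq_along(files), .packages = c("dplyr", "tibble")) %dopar% {
  df3 <- as.data.frame(files[[i]]) %>%
    summarise(across(everything(), list(mean=mean, sd=sd)))
  write.table(df3, paste0("df", i))
}

stopCluster(cl)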

Hope this helps,
PJ

I'm personally not used to job arrays, but you can get this done with several CPUs (= cores = threads) within a single task.

#!/bin/bash
#SBATCH --job-name=parallel
#SBATCH --cpus-per-task=10
#SBATCH --time=10:00:00
#SBATCH --mem=16G

module load R/4.1.3
Rscript test.R $SLURM_CPUS_PER_TASK

Then you can start your R script with retrieving the parameter:

args <- commandArgs(TRUE)
nb_cpus <- as.integer(args[[1]])

And use that number of CPUs with {foreach} or {furrr}.
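As an illustration, here is a minimal sketch of the {furrr} route, building on the args snippet above (the multisession plan is an assumption; pick whichever plan suits your cluster):

library(tidyverse)
library(future)
library(furrr)

args <- commandArgs(TRUE)
nb_cpus <- as.integer(args[[1]])

# one R worker per CPU allocated by Slurm
plan(multisession, workers = nb_cpus)

df1 <- tibble(col1=c(1,2,3), col2=c(4,5,6))
df2 <- tibble(col1=c(7,8,9), col2=c(10,11,12))
files <- list(df1, df2)

# each data frame is summarised on its own worker
future_walk(seq_along(files), function(i) {
  df3 <- as.data.frame(files[[i]]) %>%
    summarise(across(everything(), list(mean=mean, sd=sd)))
  write.table(df3, paste0("df", i))
})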

With this method, there is a single job, so a single instance of R. That can be an advantage (you only need to run the boilerplate code once, and it's usually simpler to reason about), or a drawback (since the loop is executed in parallel, R will try to load all these data frames at once and can run out of memory).

The job array (or, similarly, a multi-node (= multi-process) approach) is different: you write an R script that processes a single file, and you ask Slurm to start many jobs, each of which runs that script independently. This is conceptually equivalent to calling sbatch many times. So, in that case, there is no for loop in the R code.

[I have not really used job arrays in the past, so I might be missing a better solution.] My impression is that the job array will only provide you with $SLURM_ARRAY_TASK_ID, so you need to independently keep track of which file a given R job has to open. Something like this:

batch file:

#!/bin/bash
### Job name
#SBATCH --job-name=parallel
#SBATCH --time=10:00:00
#SBATCH --mem=16G
#SBATCH --array=1-2

module load R/4.1.3
Rscript test.R $SLURM_ARRAY_TASK_ID

R file:

# Boilerplate ----
library(tidyverse)
df1 <- tibble(col1=c(1,2,3),col2=c(4,5,6))
df2 <- tibble(col1=c(7,8,9),col2=c(10,11,12))
files <- list(df1,df2)

# Find out where we are in the array ----
args <- commandArgs(TRUE)
current_task <- as.integer(args[[1]])

# Run it ----
df3 <- as.data.frame(files[[current_task]]) %>% 
  summarise(across(everything(), list(mean=mean,sd=sd)))

write.table(df3, paste0("df", current_task))

If the boilerplate code is big, you might want to pre-compute it, write df1, df2, etc. to a directory somewhere, then launch a job array whose R script looks like:

all_files <- list.files("/path/to/dir", full.names = TRUE)

df <- read.table(all_files[[current_task]]) %>%
  as.data.frame() %>%
  summarise(across(everything(), list(mean=mean,sd=sd)))

write.table(df, paste0("df", current_task))
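The one-time pre-compute step that writes those input files could be as simple as the sketch below, run once before submitting the array (the directory and the input_ file names are just placeholders):

# prepare_data.R -- run once before submitting the job array
library(tidyverse)

df1 <- tibble(col1=c(1,2,3), col2=c(4,5,6))
df2 <- tibble(col1=c(7,8,9), col2=c(10,11,12))
files <- list(df1, df2)

# write each data frame to its own file for the array tasks to pick up
dir.create("/path/to/dir", showWarnings = FALSE)
for (i in seq_along(files)) {
  write.table(files[[i]], file.path("/path/to/dir", paste0("input_", i, ".txt")))
}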

Thank you AlexisW, this looks great.
I'll give it a shot.
