Caret and OpenMPI on cluster

Hi there,

I've got the following issue when I try to use the caret package on a personal cluster (4 nodes of 20 cores each) managed by the SLURM task manager.

My error message:

> # Train model -----------------------------------------------------
> model <- train(Temp ~ ., data = data, method = "lm")
Error in { : 
  task 1 failed - "object 'requireNamespaceQuietStop' not found"
Calls: train ... train.default -> nominalTrainWorkflow -> %op% -> <Anonymous>
Execution halted

It seems that caret's internal function requireNamespaceQuietStop cannot be found on the workers.

How I launch my computation:

Command

sbatch job.sh

File job.sh

#!/bin/bash
#SBATCH --job-name=test_caret
#SBATCH --mail-type=ALL
#SBATCH --nodes=1                 # number of nodes
#SBATCH --ntasks-per-node=20      # number of cores
#SBATCH --time=48:00:00           # walltime
#SBATCH --output=r_job.out
#SBATCH --error=error.err

module load R
module load openmpi

mpirun R CMD BATCH script.R

File script.R

# Load library ---------------------------------------------------
library(doMPI)
library(caret)
library(dplyr)

# Create parallel backend ----------------------------------------
# Create repository for mpi logs
dir.create("log", showWarnings = FALSE)
# Create cluster
cl <- startMPIcluster(verbose = TRUE, logdir = "log")
registerDoMPI(cl)
nbCore <- clusterSize(cl)
print(nbCore)

# Load data -------------------------------------------------------
data("airquality")
data <- airquality %>% na.omit()

# Train model -----------------------------------------------------
model <- train(Temp ~ ., data = data, method = "lm")

# Stop parallel backend -------------------------------------------
stopCluster(cl)
mpi.quit()

Software versions

Currently Loaded Modulefiles:
  1) hwloc/1.11.2               3) R/3.5.2
  2) openmpi/psm2/gcc49/2.0.1
> packageVersion('caret')
[1] 6.0.81
> packageVersion('doMPI')
[1] 0.2.2
> packageVersion('Rmpi')
[1] 0.6.6
> packageVersion('foreach')
[1] 1.4.4

Output

error.err

(empty)

r_job.out

-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[39630,1],0]
  Exit code:    1
--------------------------------------------------------------------------

script.Rout


R version 3.5.2 (2018-12-20) -- "Eggshell Igloo"
Copyright (C) 2018 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> # Load library ---------------------------------------------------
> library(doMPI)
Loading required package: foreach
Loading required package: iterators
Loading required package: Rmpi
> library(caret)
Loading required package: lattice
Loading required package: ggplot2
> library(dplyr)

Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

> 
> # Create parallel backend ----------------------------------------
> # Create repository for mpi logs
> dir.create("log", showWarnings = FALSE)
> # Create cluster
> cl <- startMPIcluster(verbose = TRUE, logdir = "log")
> registerDoMPI(cl)
> nbCore <- clusterSize(cl)
> print(nbCore)
[1] 19
> 
> # Load data -------------------------------------------------------
> data("airquality")
> data <- airquality %>% na.omit()
> 
> # Train model -----------------------------------------------------
> model <- train(Temp ~ ., data = data, method = "lm")
Error in { : 
  task 1 failed - "object 'requireNamespaceQuietStop' not found"
Calls: train ... train.default -> nominalTrainWorkflow -> %op% -> <Anonymous>
Execution halted

Thanks in advance for any help!

That error occurs because, once a worker is started, caret has not been (re)loaded on it. It should work, since we explicitly tell the foreach package to load caret on the workers. The different parallel processing technologies inherit environments very differently, and to be honest I haven't used MPI in about 10 years.
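Roughly, the mechanism looks like this: foreach has a .packages argument, and the parallel backend is supposed to attach those packages on each worker before running the task body. A minimal sketch with a toy loop and a toy fork backend (for illustration only, not caret's actual internals):

library(foreach)
library(doParallel)

# toy backend just to demonstrate the mechanism
cl <- makeForkCluster(nnodes = 2)
registerDoParallel(cl)

# .packages asks the backend to attach caret on every worker before the body
# runs, which is how a package's functions become visible remotely
res <- foreach(i = 1:2, .packages = "caret") %dopar% {
  exists("train")   # should be TRUE on each worker
}

stopCluster(cl)

If that attachment does not happen on the MPI workers for some reason, internal caret functions won't be found, which matches the error you see.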

Can you try doMC or doParallel instead to see if those work, using either:

library(doMC)
registerDoMC(cores = 4)

or

library(doParallel)
cl <- makeForkCluster(nnodes = 4)
registerDoParallel(cl)
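With either backend registered, the rest of your script should stay the same. For example, a sketch of the doParallel variant on the same airquality example (untested on your cluster):

library(caret)
library(doParallel)
library(dplyr)

# fork-based cluster on the local node; train() will use the registered backend
cl <- makeForkCluster(nnodes = 4)
registerDoParallel(cl)

data("airquality")
data <- airquality %>% na.omit()

model <- train(Temp ~ ., data = data, method = "lm")

stopCluster(cl)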

Hi Max,

Thank you for the answer.

I've tried both methods and they don't crash!

Nevertheless, I'm not sure that the computation is efficient with either of these methods.

I think the problem comes from how the parallelization is configured between SLURM and R.

Indeed, I've tried a simple foreach example with both doMC and doParallel, and the result is the same.

It consists of computing the mean of 1 million random numbers, 1000 times:

# Import packages
library(doParallel)  # also attaches foreach, iterators and parallel
library(Rmpi)

# Number of iterations
iters <- 1000

# Get number of cores
print(parallel::detectCores())
nbCore <- mpi.universe.size()
print(nbCore)

# Open cluster
cl <- makeForkCluster(nnodes = nbCore)
registerDoParallel(cl)
foreach::getDoParWorkers()

# Start time
start <- Sys.time()

# Parallel Loop
result <- foreach(icount(iters)) %dopar% mean(rnorm(1e6))

# Time
print(Sys.time() - start)

# Close cluster
stopCluster(cl)

But I get considerable differences in computation time depending on how I launch my job with SLURM.

The basic job.sh file is:

#!/bin/bash
#SBATCH --job-name=test_foreach
#SBATCH --nodelist=node00[1-2]
#SBATCH --ntasks=20
#SBATCH --time=48:00:00 # walltime

module load R
module load openmpi

mpirun R CMD BATCH --no-save test.R

and it is launched with the following command: sbatch job.sh

Here is a table of what I get when I change the SLURM parameter --ntasks:

nbNode  --ntasks  Rmpi::mpi.universe.size()  parallel::detectCores()  foreach::getDoParWorkers()  Sys.time() - start
2       20        20                         20                       20                          45 sec.
2       15        15                         20                       15                          37 sec.
2       10        10                         20                       10                          28 sec.
2       5         5                          20                       5                           19 sec.
2       1         1                          20                       1                           70 sec.

I have a hard time estimating the number of CPUs that are really allocated to my computation (which is easy on a local computer with parallel::detectCores()).

On the cluster, parallel::detectCores() always returns 20 no matter how I change the --ntasks parameter. Why?

So I prefer to rely on Rmpi::mpi.universe.size(), which gives me the same number as --ntasks and which makes me think it corresponds to the number of workers I have created. And I use this number to register my parallel backend.

foreach::getDoParWorkers() returns the same number.

So my question is: how can I make sure that a foreach loop in my R script (and therefore caret) really uses the maximum number of cores I made available through mpirun?

I have no experience with SLURM and haven't used Rmpi for 15 years, so I can't help you there.

I've made the same sort of test with caret, fitting a linear model on a big matrix, and the results are equivalent.
I will open another topic for that type of issue.
Thanks by the way.

For computing with SLURM, you can look at

and related work.
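One thing that might help with the core-count question: parallel::detectCores() only reports the hardware cores of the node the R process is running on, while SLURM describes the actual allocation through environment variables such as SLURM_NTASKS and SLURM_CPUS_PER_TASK. A rough sketch of sizing a backend from those variables (assuming a standard sbatch job, where SLURM_NTASKS is set; note that a fork cluster can only use cores of the node R runs on, so spanning several nodes still needs an MPI-style backend):

library(doParallel)

# read the allocation that SLURM granted to this job
ntasks <- as.integer(Sys.getenv("SLURM_NTASKS", unset = "1"))
cpus_per_task <- as.integer(Sys.getenv("SLURM_CPUS_PER_TASK", unset = "1"))
nbCore <- ntasks * cpus_per_task

cl <- makeForkCluster(nnodes = nbCore)
registerDoParallel(cl)
print(foreach::getDoParWorkers())   # should match the SLURM allocation

stopCluster(cl)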

