Reformatting Code to Run in Parallel

I am working with the R programming language. I am trying to fit a Random Forest model on a very large dataset (over 100 million rows) with imbalanced classes (i.e. binary response variable ratio 95% to 5%). To do this, the R code I wrote:

  • Step 1: Creates a training set and a test set for the sake of this question
  • Step 2: Uses sampling with replacement to create many random (smaller) subsets from the training set with a better distribution of the response variable (this is an attempt to increase the "true accuracy" of the model)
  • Step 3: Fits a Random Forest model to each of these random subsets and saves each model to the working directory (in case the computer crashes). Note - I am using the "ranger" package instead of the "randomForest" package because I read that the "ranger" package is faster.
  • Step 4: Combines all these models into a single model - and then makes predictions on the test set with this combined model

Below, I have included the R code for these steps:

Step 1: Create Data for Problem

# Step 1: Randomly create data and make initial training/test set:


library(dplyr)
library(ranger)

# 10,000 rows of class 1 and 100 rows of class 0 (imbalanced on purpose)
original_data = rbind(
  data.frame(class = 1, height = rnorm(10000, 180, 10), weight = rnorm(10000, 90, 10), salary = rnorm(10000, 50000, 10000)),
  data.frame(class = 0, height = rnorm(100, 160, 10), weight = rnorm(100, 100, 10), salary = rnorm(100, 40000, 10000))
)

original_data$class = as.factor(original_data$class)
original_data$id = 1:nrow(original_data)

# Hold out 30 rows of class 0 and 2000 rows of class 1 as the test set
test_set = rbind(
  original_data[sample(which(original_data$class == "0"), 30, replace = FALSE), ],
  original_data[sample(which(original_data$class == "1"), 2000, replace = FALSE), ]
)

# Everything that is not in the test set becomes the training set (matched on id)
train_set = anti_join(original_data, test_set, by = "id")

Step 2: Create "Balanced" Random Subsets:

# Step 2: Create "Balanced" Random Subsets:

results <- list()
for (i in 1:100) {
    # Draw 50 rows of class 0 and 60 rows of class 1, with replacement
    sample_i = rbind(
        train_set[sample(which(train_set$class == "0"), 50, replace = TRUE), ],
        train_set[sample(which(train_set$class == "1"), 60, replace = TRUE), ]
    )

    results_tmp = data.frame(iteration_i = i, sample_i)
    results_tmp$iteration_i = as.factor(results_tmp$iteration_i)
    results[[i]] <- results_tmp
}

results_df <- do.call(rbind.data.frame, results)

# One data frame per iteration (full column name, no reliance on partial matching)
X <- split(results_df, results_df$iteration_i)

# Optional: also expose each subset as train_set_1, ..., train_set_100 in the global environment
invisible(lapply(seq_along(results),
       function(i, x) {assign(paste0("train_set_", i), x[[i]], envir = .GlobalEnv)},
       x = results))
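
Because each balanced subset is drawn independently of the others, Step 2 looks like a natural fit for `parLapply()` from the base `parallel` package. The following is only a sketch under assumptions: `make_balanced_subset()` is a hypothetical helper, the toy `train_set` is regenerated so the block runs on its own, and two workers are used purely for illustration. Sampling is cheap compared to model fitting, so any speed-up from this step alone may be small.

```r
library(parallel)

# Toy stand-in for train_set so the sketch runs on its own (assumption)
set.seed(1)
train_set <- data.frame(
  class  = factor(c(rep(1, 1000), rep(0, 70))),
  height = rnorm(1070, 170, 10),
  weight = rnorm(1070, 95, 10),
  salary = rnorm(1070, 45000, 10000)
)

# Hypothetical helper: draw n0 rows of class 0 and n1 rows of class 1, with replacement
make_balanced_subset <- function(i, data, n0 = 50, n1 = 60) {
  idx0 <- which(data$class == "0")
  idx1 <- which(data$class == "1")
  rows <- c(sample(idx0, n0, replace = TRUE), sample(idx1, n1, replace = TRUE))
  data[rows, ]
}

cl <- makeCluster(2)                  # two workers, for illustration only
clusterSetRNGStream(cl, iseed = 123)  # reproducible random streams on the workers
X <- parLapply(cl, 1:100, make_balanced_subset, data = train_set)
stopCluster(cl)

length(X)            # number of subsets
table(X[[1]]$class)  # class counts in the first subset
```

The result is a plain list of data frames, so `X[[i]]` can be passed to the model-fitting step in the same way as before.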

Step 3: Train Models on Each Subset

# Step 3: Train Models on Each Subset:

# Training
wd = getwd()
results_1 <- list()

for (i in 1:100) {
    model_i <- ranger(class ~ height + weight + salary, data = X[[i]], probability = TRUE)
    # Save each model under the working directory, e.g. <wd>/model_1.RDS
    saveRDS(model_i, file.path(wd, paste0("model_", i, ".RDS")))
    results_1[[i]] <- model_i
}

Step 4: Combine All Models and Use Combined Model to Make Predictions on the Test Set:

# Step 4: Combine All Models and Use Combined Model to Make Predictions on the Test Set:
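results_2 <- list()
for (i in 1:100) {
    # Predict with the i-th stored model (model_i alone would always be the last model fitted)
    predict_i <- data.frame(predict(results_1[[i]], data = test_set)$predictions)
    # Key each row by the test-set id so that predictions are averaged per observation
    predict_i$id = test_set$id
    results_2[[i]] <- predict_i
}

# Average the predicted class probabilities across the 100 models
final_predictions = aggregate(. ~ id, do.call(rbind, results_2), mean)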
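
The combination rule in Step 4 is a plain average of the per-model class probabilities. A minimal sketch of that averaging, using made-up probability matrices in place of real `predict()` output (the matrices and their values are illustrative assumptions), avoids stacking all predictions into one long data frame first:

```r
# Three made-up probability matrices standing in for predict(...)$predictions
# (rows = test observations, columns = classes "0" and "1")
cls <- list(NULL, c("0", "1"))
p1 <- matrix(c(0.2, 0.8, 0.6, 0.4), nrow = 2, byrow = TRUE, dimnames = cls)
p2 <- matrix(c(0.4, 0.6, 0.5, 0.5), nrow = 2, byrow = TRUE, dimnames = cls)
p3 <- matrix(c(0.3, 0.7, 0.7, 0.3), nrow = 2, byrow = TRUE, dimnames = cls)
pred_list <- list(p1, p2, p3)

# Element-wise sum of all matrices, divided by the number of models
avg_pred <- Reduce(`+`, pred_list) / length(pred_list)
avg_pred
```

This relies on every matrix having the same row order and the same column order, which holds when all models predict on the same `test_set`.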

My Question: I would like to see if I can incorporate parallel computing into Steps 2, 3 and 4 to make the code I have written run faster. I consulted other posts (e.g. "parallel execution of random forest in R" on Stack Overflow and "Parallelizing Random Forest learning in R changes the class of the RF object" on Cross Validated) and I would like to reformat my code to use similar parallel computing functions, along these lines:

library(parallel)
library(doParallel)
library(foreach)

#Try to parallelize
cl <- makeCluster(detectCores()-1)
registerDoParallel(cl)

# Insert Reformatted Step 2 - Step 4 Here:

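# A cluster created explicitly with makeCluster() is shut down with stopCluster();
# stopImplicitCluster() is only needed when registerDoParallel() created the cluster itself
stopCluster(cl)
rm(cl)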
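
One way the placeholder above might be filled in is with a single `foreach` loop in which each iteration draws its own balanced subset, fits a model, saves it, and returns test-set probabilities. This is a sketch under assumptions, not a tested solution for 100 million rows: `make_toy()`, `n_models` and `out_dir` are hypothetical names, the toy data stands in for `train_set` and `test_set`, and `num.threads = 1` is set so that ranger's own multithreading does not compete with the workers.

```r
library(foreach)
library(doParallel)
library(ranger)

# Toy data so the sketch runs on its own (assumption)
set.seed(1)
make_toy <- function(n1, n0) data.frame(
  class  = factor(c(rep(1, n1), rep(0, n0)), levels = c(0, 1)),
  height = c(rnorm(n1, 180, 10), rnorm(n0, 160, 10)),
  weight = c(rnorm(n1, 90, 10), rnorm(n0, 100, 10)),
  salary = c(rnorm(n1, 50000, 10000), rnorm(n0, 40000, 10000))
)
train_set <- make_toy(2000, 70)
test_set  <- make_toy(200, 30)

n_models <- 10          # 100 in the question; kept small here
out_dir  <- tempdir()   # getwd() in the question
cl <- makeCluster(2)    # detectCores() - 1 in the question
registerDoParallel(cl)

# Each iteration: balanced subset -> model -> saved file -> test-set probabilities
pred_list <- foreach(i = seq_len(n_models), .packages = "ranger") %dopar% {
  set.seed(i)  # simple per-iteration seed (assumption)
  rows <- c(sample(which(train_set$class == "0"), 50, replace = TRUE),
            sample(which(train_set$class == "1"), 60, replace = TRUE))
  model_i <- ranger(class ~ height + weight + salary, data = train_set[rows, ],
                    probability = TRUE, num.threads = 1)
  saveRDS(model_i, file.path(out_dir, paste0("model_", i, ".RDS")))
  predict(model_i, data = test_set, num.threads = 1)$predictions
}

stopCluster(cl)

# Average the class probabilities over all models
final_predictions <- Reduce(`+`, pred_list) / length(pred_list)
head(final_predictions)
```

Returning only the prediction matrix from each iteration keeps the amount of data sent back from the workers small. With a PSOCK cluster, `train_set` and `test_set` are copied to every worker, which could become the bottleneck on a very large dataset.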

But I am still new to the world of parallel computing and still trying to figure out how to reformat my code so that this will work.

Can someone please show me how to do this?

Thanks!

Note:

  • In the previous questions that I consulted (e.g. "parallel execution of random forest in R" on Stack Overflow and "Parallelizing Random Forest learning in R changes the class of the RF object" on Cross Validated), the "randomForest" package is used instead of "ranger". I am also open to using the "randomForest" package if this will make it easier to parallelize.
  • I acknowledge that the overall structure of my code might not be optimally written - I am open to suggestions for re-writing my code if this will make it easier to parallelize.
  • I realize that there are several popular packages in R that can be used to parallelize code (e.g. "doSNOW" on CRAN) - I am also open to using any of these packages for parallelizing my code.
  • Finally, I am aware that there are some standard packages in R that are used for training Machine Learning models such as "caret" and "tidymodels". Perhaps what I am trying to accomplish can be done more easily using one of these packages - but I am not sure if this is the case. Regardless, if this can be done using "caret"/"tidymodels" - I am also open to this option.
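
On the choice between "ranger" and "randomForest": `ranger()` already grows its trees on several threads through its `num.threads` argument, so a single call is parallel without any cluster setup. A minimal sketch on made-up data (the data frame `toy` and the thread and tree counts are illustrative assumptions):

```r
library(ranger)

# Made-up data for illustration only
set.seed(1)
toy <- data.frame(
  class  = factor(c(rep(1, 500), rep(0, 500))),
  height = c(rnorm(500, 180, 10), rnorm(500, 160, 10)),
  weight = c(rnorm(500, 90, 10), rnorm(500, 100, 10))
)

# One forest, trees grown on two threads
fit <- ranger(class ~ height + weight, data = toy,
              probability = TRUE, num.trees = 200, num.threads = 2)
fit$num.trees
```

When many models are fitted at once on separate workers, setting `num.threads = 1` inside each worker may avoid using more threads than there are cores.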

This has been cross-posted on multiple sites simultaneously

https://www.reddit.com/r/rstats/comments/vflc0v/reformatting_code_to_run_in_parallel/

Please familiarize yourself with our cross-posting policy
