Does anyone know if the following code can be made to run in "parallel"?

I wrote the code below, which evaluates a function at every point of a very large "grid" (2,981,440 rows, 7 columns) and then summarizes those evaluations into a table. When the "grid" is small (e.g. 100 rows, 7 columns), the code runs successfully. But with the full "grid" it still had not finished after several hours of waiting.

Question: Using libraries such as "parallel", "foreach" and "snow", is it possible to "give the computer an easier time" with this code? Perhaps using all the cores would make it run faster?

I am very new to parallel computing. Could someone please take a look at the code below and let me know if you have any ideas on ways to make it run faster?

library(dplyr)
library(data.table)

results_table <- data.frame()

grid_function <- function(train_data, random_1, random_2, random_3, random_4, split_1, split_2, split_3) {
    
    
    
    #bin data according to random criteria
    train_data <- train_data %>% mutate(cat = ifelse(a1 <= random_1 & b1 <= random_3, "a", ifelse(a1 <= random_2 & b1 <= random_4, "b", "c")))
    
    train_data$cat = as.factor(train_data$cat)
    
    #new splits
    a_table = train_data %>%
        filter(cat == "a") %>%
        select(a1, b1, c1, cat)
    
    b_table = train_data %>%
        filter(cat == "b") %>%
        select(a1, b1, c1, cat)
    
    c_table = train_data %>%
        filter(cat == "c") %>%
        select(a1, b1, c1, cat)
    
    
    #calculate random quantile ("quant") for each bin
    
    table_a = data.frame(a_table%>% group_by(cat) %>%
                             mutate(quant = quantile(c1, prob = split_1)))
    
    table_b = data.frame(b_table%>% group_by(cat) %>%
                             mutate(quant = quantile(c1, prob = split_2)))
    
    table_c = data.frame(c_table%>% group_by(cat) %>%
                             mutate(quant = quantile(c1, prob = split_3)))
    
    
    
    
    #create a new variable ("diff") that flags whether the quantile is bigger than the value of "c1"
    table_a$diff = ifelse(table_a$quant > table_a$c1,1,0)
    table_b$diff = ifelse(table_b$quant > table_b$c1,1,0)
    table_c$diff = ifelse(table_c$quant > table_c$c1,1,0)
    
    #group all tables
    
    final_table = rbind(table_a, table_b, table_c)
    
    #create a table: for each bin, calculate the average of "diff"
    final_table_2 = data.frame(final_table %>%
                                   group_by(cat) %>%
                                   summarize(
                                       mean = mean(diff)
                                   ))
    
    #add "total mean" to this table
    final_table_2 = data.frame(final_table_2 %>% add_row(cat = "total", mean = mean(final_table$diff)))
    
    #format this table: add the random criteria to this table for reference
    final_table_2$random_1 = random_1
    
    final_table_2$random_2 = random_2
    
    final_table_2$random_3 = random_3
    
    final_table_2$random_4 = random_4
    
    final_table_2$split_1 = split_1
    
    final_table_2$split_2 = split_2
    
    final_table_2$split_3 = split_3
    
    
    
    
    results_table <- rbind(results_table, final_table_2)
    
    final_results = dcast(setDT(results_table), random_1 + random_2 + random_3 + random_4 + split_1 + split_2 + split_3 ~ cat, value.var = 'mean')
    
}
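
# Note: the "results_table <- rbind(...)" assignment above only updates a local
# copy of results_table inside the function; the global results_table is never
# modified. Each call therefore returns the dcast of the current parameter
# combination only, which is why the per-row results are combined with
# rbindlist() further down.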

# create some data for this example
a1 = rnorm(1000,100,10)
b1 = rnorm(1000,100,5)
c1 = sample.int(1000, 1000, replace = TRUE)
train_data = data.frame(a1,b1,c1)

#grid


#grid_2
random_1 <- seq(80,100,5)
random_2 <- seq(85,120,5)
random_3 <- seq(85,120,5)
random_4 <- seq(90,120,5)
split_1 =  seq(0,1,0.1)
split_2 =  seq(0,1,0.1)
split_3 =  seq(0,1,0.1)
DF_1 <- expand.grid(random_1 , random_2, random_3, random_4, split_1, split_2, split_3)
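# For reference: expand.grid() here produces 5 * 8 * 8 * 7 * 11 * 11 * 11 = 2,981,440
# rows, which is where the row count mentioned in the question comes from.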

colnames(DF_1) <- c("random_1", "random_2", "random_3", "random_4", "split_1", "split_2", "split_3")

train_data_new <- copy(train_data)


resultdf1 <- apply(DF_1,1, # 1 means rows
                   FUN=function(x){
                       do.call(
                           # Call grid_function with the arguments in
                           # a list
                           grid_function,
                           # force list type for the arguments
                           c(list(train_data_new), as.list(
                               # make the row to a named vector
                               unlist(x)
                           )
                           ))
                   }
)

l = resultdf1
final_output = rbindlist(l, fill = TRUE)

Thanks

Hi,
Yes! Keeping your setup code (the library calls, grid_function, the example data, and the DF_1 grid) exactly as in your post, you can build the apply() arguments once and compare a sequential run against a parallel one with future.apply:


args = list(X = DF_1[1:200, ], MARGIN = 1, FUN = function(x){ # I did not run it entirely because it could take very long, so only the first 200 rows
  do.call(
    # Call grid_function with the arguments in
    # a list
    grid_function,
    # force list type for the arguments
    c(list(train_data_new), as.list(
      # make the row to a named vector
      unlist(x)
    )
    ))
})

library(microbenchmark)
library(future.apply)
plan(multisession)
microbenchmark(do.call(apply, args),
               do.call(future_apply, args),
               times = 5L)
#on my 4 cores laptop under Windows
#Unit: seconds
#                        expr      min       lq     mean   median       uq      max neval
#        do.call(apply, args) 5.552120 5.581287 5.674569 5.694984 5.741311 5.803142     5
# do.call(future_apply, args) 1.783736 1.784927 1.809077 1.798031 1.829703 1.848988     5

@gitdemont : Thank you so much for your answer! I spent the last hour studying your code (I had never seen some of these functions before) and have a few questions:

1) Where are the final results from this code? I am trying to find the "final data frame" produced by this code, the one that should contain the 200 rows. I looked at everything created and stored in the global environment in R, and I can't find anything.

For instance, if I run my code that I posted in the original question (I selected only the first 100 rows from "DF_1" so I can show you the output):

#this is the final output from my code
head(final_output)
   random_1 random_2 random_3 random_4 split_1 split_2 split_3   b         c total  a
1:       80       85       85       90     0.5     0.5     0.5 0.5 0.5000000 0.500 NA
2:       85       85       85       90     0.5     0.5     0.5 0.5 0.5000000 0.500 NA
3:       90       85       85       90     0.5     0.5     0.5 0.5 0.5000000 0.500 NA
4:       95       85       85       90     0.5     0.5     0.5 0.5 0.4994985 0.499  0
5:      100       85       85       90     0.5     0.5     0.5 0.5 0.4994985 0.499  0
6:       80       90       85       90     0.5     0.5     0.5 0.5 0.4989960 0.499 NA

In the code which you kindly provided, I am not sure where to access the "final_output" object that should contain the 200 rows on which the calculations were performed. Can you please show me where it is created?

Am I supposed to run the following code to view the results?

result_1 = do.call(apply, args)
result_2  = do.call(future_apply, args)

When I ran "result_1" and "result_2", both seemed to produce output similar to "final_output". Is my understanding correct? Which of these (result_1 or result_2) is supposed to be the "faster" version? Based on the table in 2), it would appear that result_2 (i.e. future_apply) runs faster (lower min, mean, and median times) and should therefore be used instead of result_1?

2) In the code which you wrote, it seems like the "microbenchmark" is running two different "versions" of this code for comparison purposes (i.e. apply vs future_apply)?

#on my 4 cores laptop under Windows
#Unit: seconds
#                        expr      min       lq     mean   median       uq      max neval
#        do.call(apply, args) 5.552120 5.581287 5.674569 5.694984 5.741311 5.803142     5
# do.call(future_apply, args) 1.783736 1.784927 1.809077 1.798031 1.829703 1.848988     5

Is this correct? Are these two runs of the same code? What is the difference between them?

3) I tried looking online at how people parallelize code in R. Usually, they use functions like the following:

library(parallel)

detectCores()
[1] 8

cl <- makeCluster(8)

results <- parSapply( ....)

# close cluster object
stopCluster(cl)

Can I ask why you did not use these functions?

4) In general, can you please explain the logic you used? What exactly have you done so that the code (e.g. "future_apply") runs faster than what I had written? Realistically, when the number of rows grows very large, how much faster do you think "future_apply" would be compared to running the code the way I wrote it?

5) Last question: regarding the approach you used (i.e. "future_apply"), what are its limitations? For example, if the number of rows grows very large, will you still be able to use this code? If I were to run some R code on my computer that normally resulted in:

  • the R session terminating/aborting (i.e. "crashing")
  • the R session telling me that "I have run out of memory"

Does the "future_apply" approach have any inner workings that can avoid or mitigate these kinds of problems?

I cannot thank you enough for your help! This has been a great learning experience for me. I will continue to study your code and try to further familiarize myself with these new functions.

Thank you so much!

I've seen you ask this question multiple times before. I believe my original answer to you addressed this: at least for your first post, there is no need for a loop. You should be able to use data.table, which uses multiple threads for many operations by default.
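
For reference, here is a rough sketch (an illustration only, not the answer linked above) of how one grid row's evaluation could be written with data.table grouping instead of filter()/rbind(); it assumes the column names from the example data in the first post:

library(data.table)

# Evaluate one parameter combination with data.table grouping.
# DT is the training data as a data.table; the "cat" and "diff" columns are
# added to DT by reference and overwritten on each call.
grid_function_dt <- function(DT, random_1, random_2, random_3, random_4,
                             split_1, split_2, split_3) {
  splits <- c(a = split_1, b = split_2, c = split_3)

  # bin the rows with the same criteria as grid_function()
  DT[, cat := fifelse(a1 <= random_1 & b1 <= random_3, "a",
              fifelse(a1 <= random_2 & b1 <= random_4, "b", "c"))]

  # per row: is the bin's quantile of c1 larger than c1? (the "diff" flag)
  DT[, diff := quantile(c1, probs = splits[[.BY$cat]]) > c1, by = cat]

  # per-bin means, the overall mean, and the parameters for reference
  res <- rbind(DT[, .(mean = mean(diff)), by = cat],
               DT[, .(cat = "total", mean = mean(diff))])
  res[, c("random_1", "random_2", "random_3", "random_4",
          "split_1", "split_2", "split_3") :=
        .(random_1, random_2, random_3, random_4, split_1, split_2, split_3)]
  res[]
}

# one combination (a loop or apply over DF_1 is still needed for the full grid)
DT <- as.data.table(train_data)
grid_function_dt(DT, 80, 85, 85, 90, 0.5, 0.5, 0.5)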


Wow, that's a lot of questions!

1/ and 2/

  • The final results are indeed not there; I just answered your question of whether the code can be run in parallel and showed you a way to do it.
  • The apply version and the future_apply version will indeed produce the same result. I did not store the result in the variable resultdf1 (by the way, there is no need to copy resultdf1 into l).
  • The microbenchmark library (and function) is there to compare the execution time of the two versions (apply vs future_apply).
  • Yes, using future_apply is faster, and both future_apply and apply run your code (I passed your code in through the args variable).

3/

  • You are right, you can also use those functions to run code in parallel; see the sketch just below.
  • Take a look at parallel and future and choose your favourite one.
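
For example, a minimal sketch of the parallel-package route (an illustration only; it assumes grid_function, train_data_new, DF_1 and results_table from the first post exist in the global environment, and the cluster size and the 200-row subset are arbitrary choices):

library(parallel)

cl <- makeCluster(detectCores() - 1)   # leave one core free

# each worker needs the packages and the objects grid_function relies on
clusterEvalQ(cl, { library(dplyr); library(data.table) })
clusterExport(cl, c("grid_function", "train_data_new", "DF_1", "results_table"))

# same 200-row subset as in the benchmark above
res_list <- parLapply(cl, 1:200, function(i) {
  do.call(grid_function, c(list(train_data_new), as.list(unlist(DF_1[i, ]))))
})

stopCluster(cl)

final_output <- rbindlist(res_list, fill = TRUE)
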
4/

  • The logic is simply that I ran your code with both apply and future_apply, and future_apply uses parallelization.
  • I cannot say more than what the example shows about the running time. On my 4-core laptop I got the times reported above. With more cores, more Hz, more ..., there is a good chance it will be faster.
  • However, with too many rows you can indeed exceed your computer's capacity (e.g. RAM).

5/

  • See above.
  • One way to work around the errors you describe is to run fewer rows in one shot and save intermediate results. For instance, you can run rows 1:1000, then 1001:2000, and so on until the end; a sketch of this is shown below.
  • Once again, you have to adjust it to your system and watch RAM and CPU consumption.
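
A rough sketch of that chunk-and-save idea (again just an illustration, assuming grid_function, train_data_new and DF_1 from the first post; the chunk size and file names are made up):

library(future.apply)
library(data.table)
plan(multisession)

chunk_size <- 1000                                  # tune to your RAM / CPU
starts <- seq(1, nrow(DF_1), by = chunk_size)

for (s in starts) {
  idx  <- s:min(s + chunk_size - 1, nrow(DF_1))
  part <- future_apply(DF_1[idx, , drop = FALSE], 1, function(x) {
    do.call(grid_function, c(list(train_data_new), as.list(unlist(x))))
  })
  # save each block so a crash or out-of-memory error loses at most one block
  saveRDS(rbindlist(part, fill = TRUE), sprintf("results_%07d.rds", s))
}

# afterwards, combine the saved pieces
final_output <- rbindlist(
  lapply(list.files(pattern = "^results_.*\\.rds$"), readRDS),
  fill = TRUE)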

My advice is: try and see


Thank you so much for all your help!

I think I found a way to access the results, e.g.

a = do.call(future_apply, args)

This seems to store all the results!

Thank you for your reply! I remember you mentioning that data.table automatically runs code in parallel. I was still trying to find a way to manage running enormous amounts of calculations on a standard computer. Thank you so much!
