Is this the correct way to “parallelize” code in R?

I am working with the R programming language. I came across this link, which shows how to "parallelize" your code: Running R Code in Parallel | R-bloggers

As far as I understand, to "parallelize" code means to strategically allocate your computer's resources, e.g. by splitting independent pieces of work across CPU cores, so that the code runs faster.
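For example, my understanding is that a toy task like the one below can be parallelized, because each element is processed independently of the others (this is just an illustration, not my real code):

library(parallel)

cl <- makeCluster(2)

# square each number; every element is independent of the others,
# so the work can be shared between the two workers
parSapply(cl, 1:10, function(i) i^2)

stopCluster(cl)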

For instance, I can run the code below on my computer, but it takes a while to run:

# load libraries
library(mopsocd)
library(dplyr)


# create some data for this example
set.seed(123)  # make the random data reproducible
a1 <- rnorm(1000, 100, 10)
b1 <- rnorm(1000, 100, 10)
c1 <- sample.int(1000, 1000, replace = TRUE)
train_data <- data.frame(a1, b1, c1)

# define the objective function:

funct_set <- function(x) {
    
    # bin data according to random criteria
    train_data <- train_data %>%
        mutate(cat = ifelse(a1 <= x[1] & b1 <= x[3], "a",
                            ifelse(a1 <= x[2] & b1 <= x[4], "b", "c")))
    
    train_data$cat <- as.factor(train_data$cat)
    
    # split into one table per bin
    a_table <- train_data %>%
        filter(cat == "a") %>%
        select(a1, b1, c1, cat)
    
    b_table <- train_data %>%
        filter(cat == "b") %>%
        select(a1, b1, c1, cat)
    
    c_table <- train_data %>%
        filter(cat == "c") %>%
        select(a1, b1, c1, cat)
    
    # calculate an indicator ("quant") for each bin
    table_a <- data.frame(a_table %>% group_by(cat) %>%
                              mutate(quant = ifelse(c1 > x[5], 1, 0)))
    
    table_b <- data.frame(b_table %>% group_by(cat) %>%
                              mutate(quant = ifelse(c1 > x[6], 1, 0)))
    
    table_c <- data.frame(c_table %>% group_by(cat) %>%
                              mutate(quant = ifelse(c1 > x[7], 1, 0)))
    
    f1 <- mean(table_a$quant)
    f2 <- mean(table_b$quant)
    f3 <- mean(table_c$quant)
    
    # group all tables and calculate the total mean: this is what needs to be optimized
    final_table <- rbind(table_a, table_b, table_c)
    f4 <- mean(final_table$quant)
    
    return(c(f1, f2, f3, f4))
}
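For reference, a single evaluation of this function on one candidate vector (seven values chosen to lie inside the bounds defined below) returns the four objective values; the exact numbers depend on the random data:

# one hypothetical candidate vector, just to show the shape of the output
funct_set(c(85, 100, 85, 100, 150, 250, 400))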


# constraint function: each element is TRUE when the corresponding
# ordering constraint on x is satisfied
gn <- function(x) {
    g1 <- x[3] - x[1] >= 0
    g2 <- x[4] - x[2] >= 0
    g3 <- x[7] - x[6] > 0
    g4 <- x[6] - x[5] > 0
    return(c(g1, g2, g3, g4))
}

## Set Arguments

varcount <- 7  # number of decision variables (length of x)
fncount <- 4   # number of objective values returned by funct_set
lbound <- c(80, 90, 80, 90, 100, 200, 300)
ubound <- c(90, 110, 90, 110, 200, 300, 500)
optmin <- 0



# desired part to speed up
ex1 <- mopsocd(funct_set, gn, varcnt = varcount, fncnt = fncount,
               lowerbound = lbound, upperbound = ubound, opt = optmin)

Suppose I want to "speed up" the last part of the above code:

# part to speed up
ex1 <- mopsocd(funct_set, gn, varcnt = varcount, fncnt = fncount,
               lowerbound = lbound, upperbound = ubound, opt = optmin)

Using the instructions from the website, you first need to see how many cores your computer has:

library(parallel)

detectCores()
# [1] 8

cl <- makeCluster(8)

From here, you can now "parallelize" the code:

#parallelize code
results <- parSapply(cl , train_data , mopsocd(funct_set,gn, varcnt=varcount,fncnt=fncount,
                lowerbound=lbound,upperbound=ubound,opt=optmin))

# close cluster object
stopCluster(cl)

Question: the "results" line is still running on my computer. Can someone please tell me whether I have "parallelized" my code correctly?

Thanks

You cannot trivially parallelize arbitrary code by passing it to wrapper functions; it depends on whether the pieces of work you are handing out are independent and can, in principle, run at the same time. I don't see that mopsocd is of that nature: it wants to give you a single result, having consumed and processed your entire data set. So using mopsocd across parallel cores (to any benefit) would require rewriting the internals of mopsocd, which is probably out of scope (and I don't know whether it is even possible).

Parallelizing in the way you are attempting makes sense when you can process a part of the data at a time. If a function needs to consume all the data at once, then you can't parallelize it with this approach.
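For example, a minimal sketch of the pattern that does benefit from this approach, where a stand-in task (the mean of one column) is applied to each chunk of rows independently:

library(parallel)

cl <- makeCluster(4)

# split the rows into 4 chunks and process each chunk on its own core;
# mean(chunk$c1) is just a stand-in for real per-chunk work
chunks <- split(train_data, cut(seq_len(nrow(train_data)), 4))
chunk_means <- parSapply(cl, chunks, function(chunk) mean(chunk$c1))

stopCluster(cl)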

When I look at your code, it seems to me that you will in fact be using all your cores to calculate the same result on each of them, i.e. making your computer do a lot more work, for a longer time, with no extra benefit.
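To make that concrete, here is a cheap stand-in for that situation (not your mopsocd call): every worker evaluates the identical expression, so you do eight times the work and get eight copies of one answer:

library(parallel)

cl <- makeCluster(8)

# all 8 workers compute exactly the same quantity
clusterEvalQ(cl, sum(sqrt(1:1e7)))

stopCluster(cl)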

Thank you for your reply! Can't this be done with the doSNOW or foreach libraries?

None of those packages is capable of opening up the insides of mopsocd and changing how it works, sorry.
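foreach only helps in the same situation: when each iteration is independent of the others. A generic sketch (this does not change how a single mopsocd() call works internally):

library(doParallel)

cl <- makeCluster(4)
registerDoParallel(cl)

# four independent iterations, distributed across the four cores
res <- foreach(i = 1:4, .combine = c) %dopar% {
    sum(sqrt(seq_len(i * 1e6)))
}

stopCluster(cl)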

You could, in parallel, send your whole data for processing against mopsocd on one core, and to other functions on other cores. That would be an option.

Thank you so much for your reply! If you have some time later in the week, can you please show me how to do this option?

"You could in parallel send your whole data for processing against mopsococd on one core, and to other functions on other cores"

Thank you so much! I really appreciate all your help!
