Understanding the impact of “makePSOCKcluster”

Recently, I came across the R function makePSOCKcluster, which can be used to "accelerate" certain tasks in R (makeClusterPSOCK function - RDocumentation).

Based on some posts I have seen, makePSOCKcluster seems to act like a "wrapper" around the code you would like to "accelerate", for example:


library(parallel)

cl <- makePSOCKcluster(6) # use 6 of the 8 CPU cores

### enter your code here

stopCluster(cl) # when finished
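For comparison, the examples I found in the documentation of the parallel package pass the cluster object explicitly to a function such as parLapply, rather than just placing code between the two calls. Below is a minimal sketch of that pattern. The function slow_square is a hypothetical placeholder and I use only 2 workers here; this is not my actual procedure.

```r
library(parallel)

# Hypothetical stand-in for an expensive computation
slow_square <- function(x) {
  Sys.sleep(0.1)
  x^2
}

cl <- makePSOCKcluster(2) # start 2 worker processes

# A plain lapply() runs in the main R session, even though a cluster exists
serial_result <- lapply(1:4, slow_square)

# parLapply() receives the cluster object and sends the work to the workers
parallel_result <- parLapply(cl, 1:4, slow_square)

stopCluster(cl) # shut the workers down when finished
```

Both calls should return the same list; only the second one actually uses the worker processes.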

I tried to adapt this setup to help accelerate an R procedure I am using:

library(parallel)
library(dplyr)
library(data.table)

cl <- makePSOCKcluster(6) # use 6 of the 8 CPU cores

results_table <- data.frame()

grid_function <- function(train_data, random_1, random_2, random_3, random_4, split_1, split_2, split_3) {
    #bin data according to random criteria
    train_data <- train_data %>% mutate(cat = ifelse(a1 <= random_1 & b1 <= random_3, "a", ifelse(a1 <= random_2 & b1 <= random_4, "b", "c")))
    train_data$cat = as.factor(train_data$cat)
    #new splits
    a_table = train_data %>%
        filter(cat == "a") %>%
        select(a1, b1, c1, cat)
    b_table = train_data %>%
        filter(cat == "b") %>%
        select(a1, b1, c1, cat)
    c_table = train_data %>%
        filter(cat == "c") %>%
        select(a1, b1, c1, cat)
    #calculate random quantile ("quant") for each bin
    table_a = data.frame(a_table%>% group_by(cat) %>%
                             mutate(quant = quantile(c1, prob = split_1)))
    table_b = data.frame(b_table%>% group_by(cat) %>%
                             mutate(quant = quantile(c1, prob = split_2)))
    table_c = data.frame(c_table%>% group_by(cat) %>%
                             mutate(quant = quantile(c1, prob = split_3)))
    #create a new variable ("diff") that measures if the quantile is bigger than the value of "c1"
    table_a$diff = ifelse(table_a$quant > table_a$c1,1,0)
    table_b$diff = ifelse(table_b$quant > table_b$c1,1,0)
    table_c$diff = ifelse(table_c$quant > table_c$c1,1,0)
    #group all tables
    final_table = rbind(table_a, table_b, table_c)
    #create a table: for each bin, calculate the average of "diff"
    final_table_2 = data.frame(final_table %>%
                                   group_by(cat) %>%
                                   summarize(mean = mean(diff)))
    #add "total mean" to this table
    final_table_2 = data.frame(final_table_2 %>% add_row(cat = "total", mean = mean(final_table$diff)))
    #format this table: add the random criteria to this table for reference
    final_table_2$random_1 = random_1
    final_table_2$random_2 = random_2
    final_table_2$random_3 = random_3
    final_table_2$random_4 = random_4
    final_table_2$split_1 = split_1
    final_table_2$split_2 = split_2
    final_table_2$split_3 = split_3
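    #append to the results table (this only changes a copy local to the function)
    results_table <- rbind(results_table, final_table_2)
    #reshape to one row per parameter combination and return it
    final_results = dcast(setDT(results_table), random_1 + random_2 + random_3 + random_4 + split_1 + split_2 + split_3 ~ cat, value.var = 'mean')
    final_results
}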

# create some data for this example
a1 = rnorm(1000,100,10)
b1 = rnorm(1000,100,5)
c1 = sample.int(1000, 1000, replace = TRUE)
train_data = data.frame(a1,b1,c1)

random_1 <- seq(80,100,5)
random_2 <- seq(85,120,5)
random_3 <- seq(85,120,5)
random_4 <- seq(90,120,5)
split_1 =  seq(0,1,0.1)
split_2 =  seq(0,1,0.1)
split_3 =  seq(0,1,0.1)
DF_1 <- expand.grid(random_1 , random_2, random_3, random_4, split_1, split_2, split_3)

#reduce the size of the grid for this example
DF_1 = DF_1[1:100000,]

colnames(DF_1) <- c("random_1", "random_2", "random_3", "random_4", "split_1", "split_2", "split_3")

train_data_new <- copy(train_data)

resultdf1 <- apply(DF_1, 1, # 1 means rows
                   FUN = function(x) {
                       do.call(
                           # call grid_function with the arguments in a list
                           grid_function,
                           # force list type for the arguments
                           c(list(train_data_new), as.list(
                               # make the row a named vector
                               unlist(x)
                           ))
                       )
                   })

l = resultdf1
final_output = rbindlist(l, fill = TRUE)

### END

stopCluster(cl) # when finished

Run as-is without a cluster, the code above takes a very long time. I then ran it between the makePSOCKcluster() and stopCluster() calls as shown, and it is still running.
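From what I can tell, for the workers to be used the row-by-row apply() call would have to be replaced by a call that takes the cluster object, such as parLapply. Here is a minimal sketch of how I understand that would look on a much smaller grid. The names toy_function and run_row are hypothetical, toy_function is a simplified stand-in for grid_function that uses base R only, and the data and grid sizes are made up for illustration.

```r
library(parallel)

# Hypothetical, simplified stand-in for grid_function: share of c1 values
# below the split_1 quantile, within the bin a1 <= random_1
toy_function <- function(train_data, random_1, split_1) {
  c1_bin <- train_data$c1[train_data$a1 <= random_1]
  quant <- stats::quantile(c1_bin, probs = split_1, names = FALSE)
  data.frame(random_1 = random_1, split_1 = split_1, mean = mean(c1_bin < quant))
}

# Evaluate row i of the grid; everything it needs is passed in as arguments,
# so nothing has to be looked up in the workers' global environments
run_row <- function(i, grid, train_data, f) {
  f(train_data, grid$random_1[i], grid$split_1[i])
}

set.seed(1)
train_data <- data.frame(a1 = rnorm(200, 100, 10),
                         c1 = sample.int(1000, 200, replace = TRUE))
grid <- expand.grid(random_1 = c(95, 100, 105), split_1 = c(0.25, 0.5, 0.75))

cl <- makePSOCKcluster(2) # 2 workers for this small example

# Each grid row is evaluated on one of the workers
result_list <- parLapply(cl, seq_len(nrow(grid)), run_row,
                         grid = grid, train_data = train_data, f = toy_function)

stopCluster(cl) # when finished

final_output <- do.call(rbind, result_list)
```

For the real grid_function, I assume the workers would also need dplyr and data.table loaded (for example through clusterEvalQ), but I have not verified this.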

Question: I am not sure whether makePSOCKcluster actually makes a difference here, or whether I have used it correctly. Can someone please tell me if what I am doing is correct? Are there any other ways to speed up this code?

