Using the "purrr", "foreach" and "doParallel" libraries to re-write loops

swaheera · July 5, 2021, 10:15pm

I wrote the following code in R which performs (loop) a series of data manipulation operations on some artificially generated data (the final output is called "final_results"):

#load library
    library(dplyr)

library(data.table)

set.seed(123)

# create some data for this example
a1 = rnorm(1000,100,10)
b1 = rnorm(1000,100,5)
c1 = sample.int(1000, 1000, replace = TRUE)
train_data = data.frame(a1,b1,c1)


####
results_table <- data.frame()

for (i in 1:10 ) {
    
    #generate random numbers
    random_1 =  runif(1, 80, 120)
    random_2 =  runif(1, random_1, 120)
    random_3 =  runif(1, 85, 120)
    random_4 =  runif(1, random_3, 120)
    
    #bin data according to random criteria
    train_data <- train_data %>% mutate(cat = ifelse(a1 <= random_1 & b1 <= random_3, "a", ifelse(a1 <= random_2 & b1 <= random_4, "b", "c")))
    
    train_data$cat = as.factor(train_data$cat)
    
    #new splits
    a_table = train_data %>%
        filter(cat == "a") %>%
        select(a1, b1, c1, cat)
    
    b_table = train_data %>%
        filter(cat == "b") %>%
        select(a1, b1, c1, cat)
    
    c_table = train_data %>%
        filter(cat == "c") %>%
        select(a1, b1, c1, cat)
    
    split_1 =  runif(1,0, 1)
    split_2 =  runif(1, 0, 1)
    split_3 =  runif(1, 0, 1)
    
    #calculate 60th quantile ("quant") for each bin
    
    table_a = data.frame(a_table%>% group_by(cat) %>%
                             mutate(quant = quantile(c1, prob = split_1)))
    
    table_b = data.frame(b_table%>% group_by(cat) %>%
                             mutate(quant = quantile(c1, prob = split_2)))
    
    table_c = data.frame(c_table%>% group_by(cat) %>%
                             mutate(quant = quantile(c1, prob = split_3)))
    
    
    
    
    #create a new variable ("diff") that measures if the quantile is bigger tha the value of "c1"
    table_a$diff = ifelse(table_a$quant > table_a$c1,1,0)
    table_b$diff = ifelse(table_b$quant > table_b$c1,1,0)
    table_c$diff = ifelse(table_c$quant > table_c$c1,1,0)
    
    #group all tables
    
    final_table = rbind(table_a, table_b, table_c)
    
    #create a table: for each bin, calculate the average of "diff"
    final_table_2 = data.frame(final_table %>%
                                   group_by(cat) %>%
                                   summarize(
                                       mean = mean(diff)
                                   ))
    
    #add "total mean" to this table
    final_table_2 = data.frame(final_table_2 %>% add_row(cat = "total", mean = mean(final_table$diff)))
    
    #format this table: add the random criteria to this table for reference
    final_table_2$random_1 = random_1
    
    final_table_2$random_2 = random_2
    
    final_table_2$random_3 = random_3
    
    final_table_2$random_4 = random_4
    
    final_table_2$split_1 = split_1
    
    final_table_2$split_2 = split_2
    
    final_table_2$split_3 = split_3
    
    final_table_2$iteration_number = i
    
    
    results_table <- rbind(results_table, final_table_2)
    
    final_results = dcast(setDT(results_table), iteration_number + random_1 + random_2 + random_3 + random_4 + split_1 + split_2 + split_3 ~ cat, value.var = 'mean')
    
}

The above loop works perfectly fine - but I am trying to learn more about R and trying to re-write this loop using other functions from other libraries such as the "doParallel", "foreach" and "purrr" libraries.

Option 1:

I came across the following code in R which shows the general template for writing loops using the "purrr" library (apparently "map_df" is a function that uses the code from the loop):

#option 1
library(dplyr)
library(purrr)
library(tictoc)


data_gen <- function(){ #here you insert your data generating process
  tibble(
    x = runif(100),
    y = runif(100)
  )
}

N <- 10000 #number of datasets do be generated


tic('method A')  #not necessary, measures the time of the code between 'tic' and 'toc'
output <- tibble(
  i = 1:N
) %>%
  split(.$i) %>%
  map_df(
    ~data_gen()
  )
toc()

However, I am not sure how this code can be adapted to fit my example. I first created the map_df function:

#create map_df function:

map_df <- function() {
    #bin data according to random criteria
    train_data <- train_data %>% mutate(cat = ifelse(a1 <= random_1 & b1 <= random_3, "a", ifelse(a1 <= random_2 & b1 <= random_4, "b", "c")))

    train_data$cat = as.factor(train_data$cat)

    #new splits
    a_table = train_data %>%
        filter(cat == "a") %>%
        select(a1, b1, c1, cat)

    b_table = train_data %>%
        filter(cat == "b") %>%
        select(a1, b1, c1, cat)

    c_table = train_data %>%
        filter(cat == "c") %>%
        select(a1, b1, c1, cat)

    split_1 =  runif(1,0, 1)
    split_2 =  runif(1, 0, 1)
    split_3 =  runif(1, 0, 1)

    #calculate 60th quantile ("quant") for each bin

    table_a = data.frame(a_table%>% group_by(cat) %>%
                             mutate(quant = quantile(c1, prob = split_1)))

    table_b = data.frame(b_table%>% group_by(cat) %>%
                             mutate(quant = quantile(c1, prob = split_2)))

    table_c = data.frame(c_table%>% group_by(cat) %>%
                             mutate(quant = quantile(c1, prob = split_3)))




    #create a new variable ("diff") that measures if the quantile is bigger tha the value of "c1"
    table_a$diff = ifelse(table_a$quant > table_a$c1,1,0)
    table_b$diff = ifelse(table_b$quant > table_b$c1,1,0)
    table_c$diff = ifelse(table_c$quant > table_c$c1,1,0)

    #group all tables

    final_table = rbind(table_a, table_b, table_c)

    #create a table: for each bin, calculate the average of "diff"
    final_table_2 = data.frame(final_table %>%
                                   group_by(cat) %>%
                                   summarize(
                                       mean = mean(diff)
                                   ))

    #add "total mean" to this table
    final_table_2 = data.frame(final_table_2 %>% add_row(cat = "total", mean = mean(final_table$diff)))

    #format this table: add the random criteria to this table for reference
    final_table_2$random_1 = random_1

    final_table_2$random_2 = random_2

    final_table_2$random_3 = random_3

    final_table_2$random_4 = random_4

    final_table_2$split_1 = split_1

    final_table_2$split_2 = split_2

    final_table_2$split_3 = split_3

    final_table_2$iteration_number = i


    results_table <- rbind(results_table, final_table_2)

    final_results = dcast(setDT(results_table), iteration_number + random_1 + random_2 + random_3 + random_4 + split_1 + split_2 + split_3 ~ cat, value.var = 'mean')
}

But when I try to run the general template, it produces the following error:

data_gen <- function(){ #here you insert your data generating process
    tibble(
        # create some data for this example
        a1 = rnorm(1000,100,10)
        b1 = rnorm(1000,100,5)
        c1 = sample.int(1000, 1000, replace = TRUE)
        train_data = data.frame(a1,b1,c1)
    )
}

N <- 10000 #number of datasets do be generated


tic('method A')  #not necessary, measures the time of the code between 'tic' and 'toc'
output <- tibble(
    i = 1:N
) %>%
    split(.$i) %>%
    map_df(
        ~data_gen()
    )
toc() 

Error in map_df(., ~data_gen()) : unused arguments (., ~data_gen())

Does anyone know why this error is being produced?

Option 2 :

I am not sure how the "doParallel" and "foreach" libraries can be used in my example. It seems that all examples with "doParallel" require the user to begin by defining the number of "cores" they wish the computer will use:

 library(doParallel)
 cl <- makeCluster(2)
 registerDoParallel(cl)

And in the end, the user has to instruct the computer to stop the process:

stopCluster(cl)

Beyond this, I am not sure how the "doParalell" and the "foreach" library can be used to benefit my example.

Can someone please show me this? Thanks

lucasgraybuck · July 6, 2021, 4:47am

The purrr::map() family of functions takes a vector or list as input, and applies a function to every value in that vector or list.

It looks to me like you've replaced the map_df() function, rather than providing a new function for map_df() to use.

You should change the name of your function to something descriptive (I'll just use demo_function() for now), and call it like so:

tic('method A')  #not necessary, measures the time of the code between 'tic' and 'toc'
output <- tibble(
  i = 1:N
) %>%
  split(.$i) %>%
  map_df(
    demo_function
  )
toc()

This seems like a somewhat convoluted way to call map_df() N times. You could instead simply do:

tic('method A')
output <- map_df(1:N, demo_function)
toc()

as long as your function allows the value (1:N) to be provided as a variable e.g.:

demo_function <- function(i) {
  <Your code here>
}

Watching Hadley Wickham's presentation about purrr is a great way to start getting the hang of these functional methods:

swaheera · July 6, 2021, 6:01am

@lucasgraybuck: thank you for your reply! I am still new to this and a bit confused - if you have time, could you write the full form of the code? much appreciated - thank you!

system · July 27, 2021, 6:02am

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.