Adding "Progress Bars" in R

swaheera · July 31, 2021, 3:19am

I am using the R programming language. I am trying to learn how to add "progress bars" to estimate how much time is remaining while a function is running (progress_bar function - RDocumentation).

For example:

library(progress)

pb <- progress_bar$new(total = 100)
for (i in 1:100) {
  pb$tick()
  Sys.sleep(1 / 100)
}

Suppose I have a function called "grid_function" and a dataset called "DF_1". I am taking each individual row from "DF_1" and feeding this row into "grid_function". "grid_function" performs some calculations using this row, and stores it into a "list" called "resultdf1". Finally, "resultdf1" is converted into a data frame called "final_output". The "feeding process" can be seen below:

resultdf1 <- apply(DF_1,1, # 1 means rows
                   FUN=function(x){
                       do.call(
                           # Call Function grid_function with the arguments in
                           # a list
                           grid_function,
                           # force list type for the arguments
                           c(list(train_data_new), as.list(
                               # make the row to a named vector
                               unlist(x)
                           )
                           ))
                   }
)

l = resultdf1
final_output = rbindlist(l, fill = TRUE)

Question: I would like to add a "progress bar" to the above code.

What I tried: I tried to do this as follows:

library(doParallel)
library(future)

#note: I think "makePSOCKcluster"  is making my code run faster, but I am not sure - I am open to suggestions!

cl <- makePSOCKcluster(6) # 6 cpu cores out of 8

registerDoParallel(cl)

    pb <- progress_bar$new(total = 100)
    
    for (i in 1:100) {
    resultdf1 <- apply(DF_1,1, # 1 means rows
                       FUN=function(x){
                           do.call(
                               # Call Function grid_function2 with the arguments in
                               # a list
                               grid_function,
                               # force list type for the arguments
                               c(list(train_data_new), as.list(
                                   # make the row to a named vector
                                   unlist(x)
                               )
                               ))
                       }
    )
    
    l = resultdf1
    final_output = rbindlist(l, fill = TRUE)
    
    
     pb$tick()
      Sys.sleep(1 / 100)
    }

stopCluster(cl)

This appears to be working, but I am not sure if I did everything correctly. Can someone please tell me if I have done this correctly? Is there any chance that adding this "progress bar" will actually result in the function taking more time to run?

Thanks

andresrcs · July 31, 2021, 1:07pm

The way you are doing it, absolutely, since for some reason you are repeating exactly the same process 100 times in a for loop and advancing the progress bar for each repetition.

I think you should set the total argument for progress_bar$new() to the number of rows in DF_1 and put pb$tick() inside your lambda function for apply() , although, I can't test this since your example is not reproducible because of the lack of sample data.

swaheera · August 1, 2021, 2:45am

@andresrcs Thank you so much for your answer! Here is the full code that I was using:

library(dplyr)
library(data.table)

results_table <- data.frame()

grid_function <- function(train_data, random_1, random_2, random_3, random_4, split_1, split_2, split_3) {
    
    
    
    #bin data according to random criteria
    train_data <- train_data %>% mutate(cat = ifelse(a1 <= random_1 & b1 <= random_3, "a", ifelse(a1 <= random_2 & b1 <= random_4, "b", "c")))
    
    train_data$cat = as.factor(train_data$cat)
    
    #new splits
    a_table = train_data %>%
        filter(cat == "a") %>%
        select(a1, b1, c1, cat)
    
    b_table = train_data %>%
        filter(cat == "b") %>%
        select(a1, b1, c1, cat)
    
    c_table = train_data %>%
        filter(cat == "c") %>%
        select(a1, b1, c1, cat)
    
    
    #calculate random quantile ("quant") for each bin
    
    table_a = data.frame(a_table%>% group_by(cat) %>%
                             mutate(quant = quantile(c1, prob = split_1)))
    
    table_b = data.frame(b_table%>% group_by(cat) %>%
                             mutate(quant = quantile(c1, prob = split_2)))
    
    table_c = data.frame(c_table%>% group_by(cat) %>%
                             mutate(quant = quantile(c1, prob = split_3)))
    
    
    
    
    #create a new variable ("diff") that measures if the quantile is bigger tha the value of "c1"
    table_a$diff = ifelse(table_a$quant > table_a$c1,1,0)
    table_b$diff = ifelse(table_b$quant > table_b$c1,1,0)
    table_c$diff = ifelse(table_c$quant > table_c$c1,1,0)
    
    #group all tables
    
    final_table = rbind(table_a, table_b, table_c)
    
    #create a table: for each bin, calculate the average of "diff"
    final_table_2 = data.frame(final_table %>%
                                   group_by(cat) %>%
                                   summarize(
                                       mean = mean(diff)
                                   ))
    
    #add "total mean" to this table
    final_table_2 = data.frame(final_table_2 %>% add_row(cat = "total", mean = mean(final_table$diff)))
    
    #format this table: add the random criteria to this table for reference
    final_table_2$random_1 = random_1
    
    final_table_2$random_2 = random_2
    
    final_table_2$random_3 = random_3
    
    final_table_2$random_4 = random_4
    
    final_table_2$split_1 = split_1
    
    final_table_2$split_2 = split_2
    
    final_table_2$split_3 = split_3
    
    
    
    
    results_table <- rbind(results_table, final_table_2)
    
    final_results = dcast(setDT(results_table), random_1 + random_2 + random_3 + random_4 + split_1 + split_2 + split_3 ~ cat, value.var = 'mean')
    
}

# create some data for this example
a1 = rnorm(1000,100,10)
b1 = rnorm(1000,100,5)
c1 = sample.int(1000, 1000, replace = TRUE)
train_data = data.frame(a1,b1,c1)




#grid
random_1 <- seq(80,100,5)
random_2 <- seq(85,120,5)
random_3 <- seq(85,120,5)
random_4 <- seq(90,120,5)
split_1 =  seq(0,1,0.1)
split_2 =  seq(0,1,0.1)
split_3 =  seq(0,1,0.1)
DF_1 <- expand.grid(random_1 , random_2, random_3, random_4, split_1, split_2, split_3)

#reduce the size of the grid for this example
DF_1 = DF_1[1:100,]

colnames(DF_1) <- c("random_1" , "random_2", "random_3",                     "random_4", "split_1", "split_2", "split_3")

train_data_new <- copy(train_data)


resultdf1 <- apply(DF_1,1, # 1 means rows
                   FUN=function(x){
                       do.call(
                           # Call Function grid_function2 with the arguments in
                           # a list
                           grid_function,
                           # force list type for the arguments
                           c(list(train_data_new), as.list(
                               # make the row to a named vector
                               unlist(x)
                           )
                           ))
                   }
)

l = resultdf1
final_output = rbindlist(l, fill = TRUE)

Is this how you think I should add the progress bar?

library(doParallel)
library(future)

#note: I think "makePSOCKcluster"  is making my code run faster, but I am not sure - I am open to suggestions!

cl <- makePSOCKcluster(6) # 6 cpu cores out of 8

registerDoParallel(cl)

    pb <- progress_bar$new(total = 2000)
    
    for (i in 1:100) {
    resultdf1 <- apply(DF_1[1:2000,],1, # 1 means rows
                       FUN=function(x){
                           do.call(
                               # Call Function grid_function2 with the arguments in
                               # a list
                               grid_function,
                               # force list type for the arguments
                               c(list(train_data_new), as.list(
                                   # make the row to a named vector
                                   unlist(x)
                               )
                               ))
                       }
    )
    
    l = resultdf1
    final_output = rbindlist(l, fill = TRUE)
    
    
     pb$tick()
      Sys.sleep(1 / 100)
    }

stopCluster(cl)

I changed the "total" argument so that it has the same number of rows as "DF_1".

Thank you so much for all your help!

andresrcs · August 1, 2021, 3:20am

You are still repeating the same thing 100 times inside a for loop without any apparent reason (unless I'm missing something) and advancing the progress bar with each iteration, so a progress bar with total=2000 is never going to end.

Whereas not impossible, it is less likely that someone is willing to work with your specific (and lengthy) code, you would greatly improve your chances of getting more specific help if you provide a minimal REPRoducible EXample (reprex) instead.

swaheera · August 1, 2021, 3:48am

@ andresrcs : thank you for your reply! I am not sure I understand - I thought that the way I posted the full question (in the reply) was reproducible?

andresrcs · August 1, 2021, 12:05pm

Probly it is (I haven't tried) but it is not minimal, if you read the reprex guide in the link I gave you before, you will learn that a proper reprex should include only the minimal amount of code necessary to reproduce your problem.

swaheera · August 2, 2021, 4:38am

thank you for your reply!

This is something I have struggled with - sometimes when I try to post "reproducible questions", people tell me that either I have included too much code, or not enough code. I can never quite seem to figure it out. I try to post enough code so that everyone can copy/paste my code and try to reproduce my results (I usually write some additional code that creates some sample data for the problem).

I tried to work on the "progress bar" today, but I was still having trouble. If you have time this week, can you please try to take a look at it?

thank you so much for everything

system · August 23, 2021, 4:38am

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.