How to reduce processing time of a code in R

JojoSouza · April 22, 2022, 12:04am

I would like some help from you regarding decreasing the processing time for coef calculation. The idea of this code is to generate a coef value for all Id, date and Category, but as seen when date2 > date1 (subset_df1).

As you can see I have a database with 900 obs, which is not much and besides that I generate coef value just for date2 > date1 (subset_df1), leaving only 230 obs. When I generate coef for all data from subset_df1 it took an average of 73.82 sec elapsed. In my opinion, this is time consuming, as there is little data. And my idea is to use a much larger database. How can I decrease this processing time?

library(tidyverse)
library(lubridate)
library(data.table)
library(tictoc)

df1 <- data.frame( Id = rep(1:5, length=900),
                  date1 =  as.Date( "2021-12-01"),
                  date2= rep(seq( as.Date("2021-01-01"), length.out=450, by=1), each = 2),
                  Category = rep(c("ABC", "EFG"), length.out = 900),
                  Week = rep(c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday",
                               "Saturday", "Sunday"), length.out = 900),
                  DR1 = sample( 200:250, 900, repl=TRUE),  
                  setNames( replicate(365, { sample(0:900, 900)}, simplify=FALSE),
                            paste0("DRM0", formatC(1:365, width = 2, format = "d", flag = "0"))))


return_values <- function (df1,idd,dmda, CategoryChosse) {
 
# First idea: Calculate the median of the values resulting from the subtraction between DR1 and the values of the DRM columns

dt1 <- as.data.table(df1)

cols <- grep("^DRM0", colnames(dt1), value = TRUE)

med <- 
  dt1[, (paste0(cols, "_PV")) := DR1 - .SD, .SDcols = cols
  ][, lapply(.SD, median), by = .(Id, Category, Week), .SDcols = paste0(cols, "_PV") ]

# Second idea: After obtaining the median, I add the values found with the values of the DRM columns of my df1 database.

f2 <- function(nm, pat) grep(pat, nm, value = TRUE)
nm1 <- f2(names(df1), "^DRM0\\d+$")
nm2 <- f2(names(med), "_PV")
nm3 <- paste0("i.", nm2)
setDT(df1)[med,(nm2) := Map(`+`, mget(nm1), mget(nm3)), on = .(Id, Category, Week)]
SPV <- df1[, c('Id','date1', 'date2', 'Week','Category', nm2), with = FALSE]#%>%data.frame

# Third idea: Coef values

coef<-SPV %>%
  filter(Id==idd, date2 == ymd(dmda), Category == CategoryChosse) %>%
  pull(as.numeric(ymd(dmda)-ymd(min(df1$date1)))+6)
  
return(coef)

}

subset_df1 <- subset(df1, date2 > date1)

tic()
subset_df1 %>%
  rowwise %>%
  mutate(return_values(df1,Id, date2, Category)) %>%
  select(-c(Week,starts_with('DR'))) 
toc()
# A tibble: 230 x 5
# Rowwise: 
      Id date1      date2      Category `return_values(df1, Id, date2, Category)`
   <int> <date>     <date>     <chr>                                        <dbl>
 1     1 2021-12-01 2021-12-02 ABC                                            206
 2     2 2021-12-01 2021-12-02 EFG                                             22
 3     3 2021-12-01 2021-12-03 ABC                                             -2
 4     4 2021-12-01 2021-12-03 EFG                                            328
 5     5 2021-12-01 2021-12-04 ABC                                            148
 6     1 2021-12-01 2021-12-04 EFG                                            569
 7     2 2021-12-01 2021-12-05 ABC                                           -375
 8     3 2021-12-01 2021-12-05 EFG                                            216
 9     4 2021-12-01 2021-12-06 ABC                                            406
10     5 2021-12-01 2021-12-06 EFG                                            217
# ... with 220 more rows
> toc()
73.82 sec elapsed

nirgrahamuk · April 22, 2022, 10:09am

The set up is presented in a very 'abstract fashion' for my money, so at least for myself as a programmer, its hard to reason about. Also given that it uses data.table syntax which I'm not invested time in interpreting, some things seem very strange the (the nm3, where i. is appended to column names) not sure what that is doing /why.

But I tried a naive optimisation to process less data through the function, and for the example data, the results matched and it ran in half the time.

library(tidyverse)
library(lubridate)
library(data.table)
library(bench)

df1 <- data.frame( Id = rep(1:5, length=900),
                   date1 =  as.Date( "2021-12-01"),
                   date2= rep(seq( as.Date("2021-01-01"), length.out=450, by=1), each = 2),
                   Category = rep(c("ABC", "EFG"), length.out = 900),
                   Week = rep(c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday",
                                "Saturday", "Sunday"), length.out = 900),
                   DR1 = sample( 200:250, 900, repl=TRUE),  
                   setNames( replicate(365, { sample(0:900, 900)}, simplify=FALSE),
                             paste0("DRM0", formatC(1:365, width = 2, format = "d", flag = "0"))))


return_values <- function (df1,idd,dmda, CategoryChosse) {
  
  # First idea: Calculate the median of the values resulting from the subtraction between DR1 and the values of the DRM columns
  
  dt1 <- as.data.table(df1)
  
  cols <- grep("^DRM0", colnames(dt1), value = TRUE)
  
  med <- 
    dt1[, (paste0(cols, "_PV")) := DR1 - .SD, .SDcols = cols
    ][, lapply(.SD, median), by = .(Id, Category, Week), .SDcols = paste0(cols, "_PV") ]
  
  # Second idea: After obtaining the median, I add the values found with the values of the DRM columns of my df1 database.
  
  f2 <- function(nm, pat) grep(pat, nm, value = TRUE)
  nm1 <- f2(names(df1), "^DRM0\\d+$")
  nm2 <- f2(names(med), "_PV")
  nm3 <- paste0("i.", nm2)
  setDT(df1)[med,(nm2) := Map(`+`, mget(nm1), mget(nm3)), on = .(Id, Category, Week)]
  SPV <- df1[, c('Id','date1', 'date2', 'Week','Category', nm2), with = FALSE]#%>%data.frame
  
  # Third idea: Coef values
  
  coef<-SPV %>%
    filter(Id==idd, date2 == ymd(dmda), Category == CategoryChosse) %>%
    pull(as.numeric(ymd(dmda)-ymd(min(df1$date1)))+6)
  
  return(coef)
  
}

return_valuesX <- function (df1,idd,dmda, CategoryChosse) {
  
  # First idea: Calculate the median of the values resulting from the subtraction between DR1 and the values of the DRM columns
  
  dt1 <- as.data.table(df1)
  num_to_pull <- as.numeric(ymd(dmda)-ymd(min(df1$date1)))+6
  cols <- grep("^DRM0", colnames(dt1), value = TRUE)[1:num_to_pull]
  
  med <- 
    dt1[, (paste0(cols, "_PV")) := DR1 - .SD, .SDcols = cols
    ][, lapply(.SD, median), by = .(Id, Category, Week), .SDcols = paste0(cols, "_PV") ]
  
  # Second idea: After obtaining the median, I add the values found with the values of the DRM columns of my df1 database.
  
  f2 <- function(nm, pat) grep(pat, nm, value = TRUE)
  nm1 <- f2(names(df1), "^DRM0\\d+$")[1:num_to_pull]
  nm2 <- f2(names(med), "_PV")[1:num_to_pull]
  nm3 <- paste0("i.", nm2)[1:num_to_pull]

  # df1 <- df1 %>% select(Id,Category,Week,date1,date2,all_of(nm1))
  setDT(df1)[med,(nm2) := Map(`+`, mget(nm1), mget(nm3)), on = .(Id, Category, Week)]
  SPV <- df1[, c('Id','date1', 'date2', 'Week','Category', nm2), with = FALSE]#%>%data.frame
  
  # Third idea: Coef values
  
  coef<-SPV %>%
    filter(Id==idd, date2 == ymd(dmda), Category == CategoryChosse) %>%
    pull(num_to_pull)
  
  return(coef)
  
}

subset_df1 <- subset(df1, date2 > date1)



bench::mark(a=subset_df1 %>%
              rowwise %>%
              mutate(result=return_values(df1,Id, date2, Category)) %>%
              select(-c(Week,starts_with('DR'))) ,
            b=subset_df1 %>% 
              rowwise %>%
              mutate(result=return_valuesX(df1,Id, date2, Category)) %>%
              select(-c(Week,starts_with('DR'))) ,iterations = 10)

Timing Results

# A tibble: 2 x 13
  expression   min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result       memory     time       gc      
  <bch:expr> <bch> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>       <list>     <list>     <list>  
1 a          32.6s  34.4s    0.0291    8.74GB    1.18     10   406      5.72m <rowwise_df> <Rprofmem> <bench_tm> <tibble>
2 b          14.9s  15.5s    0.0640    2.22GB    0.832    10   130       2.6m <rowwise_df> <Rprofmem> <bench_tm> <tibble>

JojoSouza · April 22, 2022, 12:57pm

Thank you so much for reply @nirgrahamuk!
Could you please test your code again, I ran all the code you entered in the answer, but it didn't give the bench result like yours. It appears as if it's running, but it never finishes running, like this:

> bench::mark(a=subset_df1 %>%
+ rowwise %>%
+ mutate(result=return_values(df1,Id, date2, Category)) %>%
+ select(-c(Week,starts_with('DR'))) ,
+ b=subset_df1 %>%
+ rowwise %>%
+ mutate(result=return_valuesX(df1,Id, date2, Category)) %>%
+ select(-c(Week,starts_with('DR'))) ,iterations = 10)

Thank you again!

nirgrahamuk · April 22, 2022, 1:12pm

its calculating the original and the proposed improvement 10 times each. This takes significant time, as you can see from my metrics, the total_time was 5.72+2.6 minutes. My laptop ran your original example once in about half the time as you said it took you, therefore the bench::mark may take twice as long for you as for me. close to 18minutes.
You can reduce the iterations down from 10, in the bench mark parameters, for example you could decide you wanted only 1 iteration.
Alternatively you could reduce the rows you process in both passes before the rowwise() setting, by slicing to send only the first n rows i.e. for 1:10

%>% slice(1:10) %>% rowwise %>% etc.

JojoSouza · April 22, 2022, 2:34pm

Thank you very much @nirgrahamuk, it has considerably improved the time on small changes you made to the code. Do you think there is still a possibility to improve the code to make it faster? I ask this because I have a much larger database, and even with this reduction in time using your code, the time is still high. If you have any advice/tips I would appreciate it.

JojoSouza · April 23, 2022, 12:13am

@nirgrahamuk , I have a suggestion for adjustment, but I don't know how to do it, but I would like to see what you think. And If I reshape the data once, calculate the medians and differences once, and then filter within each Id/Category group based on the date difference. What do you think about this?

system · April 30, 2022, 12:14am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.