Adjust code that uses data.table functions

The idea of the code below is to find the coef value for each date/category. To obtain coef I use the lm function. The code uses data.table. However, is it possible to optimize the code somehow and still get the same result? The goal is to shorten the code in order to reduce the processing time.

library(dplyr)
library(tidyr)
library(lubridate)
library(data.table)

#database
df1 <- structure(
  list(date1= c("2021-06-28","2021-06-28","2021-06-28","2021-06-28","2021-06-28",
                "2021-06-28","2021-06-28","2021-06-28"),
       date2 = c("2021-06-25","2021-06-25","2021-06-27","2021-07-07","2021-07-07","2021-07-09","2021-07-09","2021-07-09"),
       Code = c("FDE","ABC","ABC","ABC","CDE","FGE","ABC","CDE"),
       Week= c("Wednesday","Wednesday","Friday","Wednesday","Wednesday","Friday","Friday","Friday"),
       DR1 = c(4,1,4,3,6,4,3,5),
       DR01 = c(4,1,4,3,3,4,3,6), DR02= c(4,2,6,7,3,2,7,4),DR03= c(9,5,4,3,3,2,1,5),
       DR04 = c(5,4,3,3,6,2,1,9),DR05 = c(5,4,5,3,6,2,1,9),
       DR06 = c(2,4,3,3,5,6,7,8),DR07 = c(2,5,4,4,9,4,7,8),
       DR08 = c(4,0,0,1,2,4,4,4),DR09 = c(2,5,4,4,9,4,7,8),DR010 = c(2,5,4,4,9,4,7,8),DR011 = c(4,7,3,2,2,7,7,7), 
       DR012 = c(4,4,2,3,0,4,4,5),DR013 = c(4,4,1,4,0,3,2,0),DR014 = c(0,3,1,2,0,2,NA,NA)),
  class = "data.frame", row.names = c(NA, -8L))

# Replace NAs with 0 in the DR0 columns

selection = startsWith(names(df1), "DR0")

df1[selection][is.na(df1[selection])] = 0
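
If the NA fill itself matters for speed, one possible tweak (a sketch, not benchmarked on your data) is data.table's setnafill(), which fills numeric columns by reference instead of making the copies the base-R replacement above does:

# Sketch: fill NAs with 0 in the DR0 columns by reference.
library(data.table)
setnafill(df1, type = "const", fill = 0, cols = grep("^DR0", names(df1)))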

# First idea: calculate the median of the difference between DR1 and each DR0 column, by Code and Week

dt1 <- as.data.table(df1)

cols <- grep("^DR0", colnames(dt1), value = TRUE)

medi_ana <- 
  dt1[, (paste0(cols, "_PV")) := DR1 - .SD, .SDcols = cols
  ][, lapply(.SD, median), by = .(Code, Week), .SDcols = paste0(cols, "_PV") ]

# Second idea: after obtaining the medians, add them to the corresponding DR0 columns of df1.
 
f1 <- function(nm, pat) grep(pat, nm, value = TRUE)
nm1 <- f1(names(df1), "^DR0\\d+$")
nm2 <- f1(names(medi_ana), "_PV")
nm3 <- paste0("i.", nm2)
setDT(df1)[medi_ana,  (nm2) := Map(`+`, mget(nm1), mget(nm3)), on = .(Code, Week)]
SPV1 <- df1[, c('date1', 'date2', 'Code', 'Week', nm2), with = FALSE]
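
Since the medians are added straight back onto the DR0 columns, the first and second ideas can in principle be collapsed into one grouped update, skipping the build-and-join of medi_ana entirely. A minimal sketch (same grouping keys; SPV1_alt, dt2, and cols2 are hypothetical names; it assumes df1 as originally loaded, before any _PV columns are attached; not benchmarked):

# Sketch: per (Code, Week) group, new value = DR0x + median(DR1 - DR0x),
# computed in a single pass with no intermediate median table or join.
# Assumes df1 is the NA-filled original, with no _PV columns yet.
dt2 <- as.data.table(df1)
cols2 <- grep("^DR0\\d+$", names(dt2), value = TRUE)
dt2[, (paste0(cols2, "_PV")) := lapply(.SD, function(x) x + median(DR1 - x)),
    by = .(Code, Week), .SDcols = cols2]
SPV1_alt <- dt2[, c("date1", "date2", "Code", "Week", paste0(cols2, "_PV")), with = FALSE]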

# Third idea: filter a specific row of SPV1, depending on what the user chooses; for that, a date (dmda) and a Code must be chosen.

dmda <- "2021-07-09"
code <- "CDE"

SPV2 <- melt(
  SPV1[date2 == dmda & Code == code][,
    lapply(.SD, sum, na.rm = TRUE), by = Code,
    .SDcols = patterns("^DR0")],
  id.var = "Code", variable.name = "name", value.name = "val"
)[, name := readr::parse_number(as.character(name))][]

result <- SPV2[na.omit(SPV2[, .I[(as.Date(dmda) -
  min(as.Date(df1$date1)[df1$Code == first(Code)])):max(name) + 1],
  .(Code)]$V1)]
> result
   Code name val
1:  CDE   12   5
2:  CDE   13   5
3:  CDE   14   5
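
The index arithmetic above is compact but hard to read. An equivalent filter for this example (a sketch; result_alt and days_gap are hypothetical names) computes the day gap explicitly and keeps only the day numbers beyond it:

# Sketch: keep rows whose day number exceeds the gap between the chosen
# date2 (dmda) and the earliest date1 for the chosen Code.
days_gap <- as.numeric(as.Date(dmda) - min(as.Date(df1$date1[df1$Code == code])))
result_alt <- SPV2[name > days_gap]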

# After calculating the result dataset, I use the lm function to obtain the coef value.

mod <- lm(val ~ I(name^2), result)
coef <- as.numeric(coef(mod)[1])
> coef
[1] 5
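
Note that val is constant at 5 across the three rows, so the fit is flat (the slope on I(name^2) is exactly zero) and the intercept is just that common value. A quick check (a sketch; the name = 15 query point is arbitrary):

# Sketch: a constant response gives a flat fit, so any prediction equals 5.
predict(mod, newdata = data.frame(name = 15))  # 5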

Here you go,

library(tidyverse)


df1 <- df1 %>% mutate(across(matches("^DR0"),~replace_na(.x,0)))

medi_ana_tib <- df1 %>% mutate(across(matches("^DR0"),~ DR1 - .x)) %>% 
    group_by(Code, Week) %>% summarise(across(matches("^DR0"),median)) %>% 
    rename_with(.fn = ~ str_glue("{.x}_PV"),.cols = matches("^DR0"))
  
SPV1_tib <- left_join(
  medi_ana_tib %>% pivot_longer(
    cols = -c(Code, Week),
    names_to = c('DRs', 'attrb.'),
    names_sep = '_',
    values_to = 'PVs'
  ),
  df1 %>% pivot_longer(
    cols = -c(date1, date2, Code, Week),
    names_to = 'DRs',
    values_to = 'Vals'
  ),
  by = c('Code', 'Week', 'DRs')
) %>% mutate(adjust_val = Vals + PVs) %>% 
  select(-PVs,-Vals) %>% pivot_wider(
  names_from = c('DRs', 'attrb.'),
  names_sep = '_',
  values_from = 'adjust_val'
)

dmda <- "2021-07-09"
code <- "CDE"

SPV2_tib <- SPV1_tib %>% filter(date2 == dmda & Code == code) %>% 
  group_by(Code) %>% summarise(across(matches("^DR0"),sum)) %>% 
  pivot_longer(-Code, names_to = 'name',values_to ='val') %>% 
  mutate(name = str_remove_all(name,"DR0|_PV") %>% as.numeric)

result_tib <- SPV2_tib %>% filter(
  name > as.Date(dmda) - as.Date(df1 %>% filter(Code==code) %>% pull(date1) %>% min)
)

mod_tib <- lm(val ~ I(name^2), result_tib)
coef_tib <- as.numeric(coef(mod_tib)[1])
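
A quick sanity check (a sketch; it assumes both pipelines were run from a fresh df1, since the data.table version modifies df1 by reference):

# Both approaches should land on the same coefficient.
all.equal(coef, coef_tib)  # TRUE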

Thanks for the reply @yifanliy. From what I understand, you switched to dplyr functions instead of data.table, is that it? Do you think that makes it faster, that is, decreases the processing time?

Generally, dplyr is much slower than data.table. Therefore, tidytable was born:
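
For example, the NA fill and the grouped medians might look like this in tidytable (a sketch, not from the original reply; medi_ana_tt is a hypothetical name; recent tidytable versions use plain dplyr verb names, while older releases used dot-suffixed verbs such as mutate.()):

library(tidytable)

# dplyr-style syntax, executed as data.table under the hood
df1 <- df1 %>%
  mutate(across(starts_with("DR0"), ~ replace_na(.x, 0)))

medi_ana_tt <- df1 %>%
  mutate(across(starts_with("DR0"), ~ DR1 - .x)) %>%
  summarise(across(starts_with("DR0"), median), .by = c(Code, Week))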

Thanks for the reply @pathos.

I also think dplyr is slower, which is why I went with data.table. Do you think tidytable or data.table is faster?

Oh, sorry about that. I didn't notice that you wanted to improve the processing time; I thought you just wanted to reduce the code :joy: dplyr can't compare with data.table in terms of speed.

tidytable is a direct translator from dplyr code to data.table. There is negligible, fixed overhead. You can see benchmarks here: speed_comparisons • tidytable

As you can see, you keep dplyr's readable syntax, a massive readability upgrade over raw data.table code, while paying virtually zero performance penalty for the translation.
