# Adjust code that uses data.table function

The idea of ​​this code below is to find the `coef` value for each `date/category`. To find the value of `coef` I use the `lm` function. The code uses the `data.table` function. However, I would like to know if it is possible to optimize the code somehow and still get the same result? The idea is to reduce the code in order to decrease the processing time.

``````library(dplyr)
library(tidyr)
library(lubridate)
library(data.table)

#database
df1 <- structure(
list(date1= c("2021-06-28","2021-06-28","2021-06-28","2021-06-28","2021-06-28",
"2021-06-28","2021-06-28","2021-06-28"),
date2 = c("2021-06-25","2021-06-25","2021-06-27","2021-07-07","2021-07-07","2021-07-09","2021-07-09","2021-07-09"),
Code = c("FDE","ABC","ABC","ABC","CDE","FGE","ABC","CDE"),
Week= c("Wednesday","Wednesday","Friday","Wednesday","Wednesday","Friday","Friday","Friday"),
DR1 = c(4,1,4,3,6,4,3,5),
DR01 = c(4,1,4,3,3,4,3,6), DR02= c(4,2,6,7,3,2,7,4),DR03= c(9,5,4,3,3,2,1,5),
DR04 = c(5,4,3,3,6,2,1,9),DR05 = c(5,4,5,3,6,2,1,9),
DR06 = c(2,4,3,3,5,6,7,8),DR07 = c(2,5,4,4,9,4,7,8),
DR08 = c(4,0,0,1,2,4,4,4),DR09 = c(2,5,4,4,9,4,7,8),DR010 = c(2,5,4,4,9,4,7,8),DR011 = c(4,7,3,2,2,7,7,7),
DR012 = c(4,4,2,3,0,4,4,5),DR013 = c(4,4,1,4,0,3,2,0),DR014 = c(0,3,1,2,0,2,NA,NA)),
class = "data.frame", row.names = c(NA, -8L))

#Replace by value equal to 0 when you have columns equal to NA in DR0 columns

selection = startsWith(names(df1), "DR0")

df1[selection][is.na(df1[selection])] = 0

# First idea: Calculate the median of the values resulting from the subtraction between DR1 and the values of the DR0 columns

dt1 <- as.data.table(df1)

cols <- grep("^DR0", colnames(dt1), value = TRUE)

medi_ana <-
dt1[, (paste0(cols, "_PV")) := DR1 - .SD, .SDcols = cols
][, lapply(.SD, median), by = .(Code, Week), .SDcols = paste0(cols, "_PV") ]

# Second idea: After obtaining the median, I add the values found with the values of the DRM columns of my df1 database.

f1 <- function(nm, pat) grep(pat, nm, value = TRUE)
nm1 <- f1(names(df1), "^DR0\\d+\$")
nm2 <- f1(names(medi_ana), "_PV")
nm3 <- paste0("i.", nm2)
setDT(df1)[medi_ana,  (nm2) := Map(`+`, mget(nm1), mget(nm3)), on = .(Code, Week)]
SPV1 <- df1[, c('date1', 'date2', 'Code', 'Week', nm2), with = FALSE]

# Third idea: The idea here is to specifically filter a line from SPV1, which will depend on what the user chooses, for that it will be necessary to choose an Date and Code.

dmda<-"2021-07-09"
code<-"CDE"

SPV2<-melt(SPV1[date2 == dmda & Code == code][,
lapply(.SD, sum, na.rm = TRUE), by = Code,
.SDcols = patterns("^DR0")],
id.var = "Code", variable.name = "name", value.name = "val")[,

result<-SPV2[na.omit(SPV2[, .I[(as.Date(dmda) - min(as.Date(df1\$date1) [
df1\$Code == first(Code)])):max(name)+1], .(Code)]\$V1)]
> result
Code name val
1:  CDE   12   5
2:  CDE   13   5
3:  CDE   14   5

## After I calculate the result dataset, I used the lm function to obtain the coef value.

mod <- lm(val ~ I(name^2), result)
coef<-as.numeric(coef(mod))
> coef
 5``````

Here you go,

``````library(tidyverse)

df1 <- df1 %>% mutate(across(matches("^DR0"),~replace_na(.x,0)))

medi_ana_tib <- df1 %>% mutate(across(matches("^DR0"),~ DR1 - .x)) %>%
group_by(Code, Week) %>% summarise(across(matches("^DR0"),median)) %>%
rename_with(.fn = ~ str_glue("{.x}_PV"),.cols = matches("^DR0"))

SPV1_tib <- left_join(
medi_ana_tib %>% pivot_longer(
cols = -c(Code, Week),
names_to = c('DRs', 'attrb.'),
names_sep = '_',
values_to = 'PVs'
),
df1 %>% pivot_longer(
cols = -c(date1, date2, Code, Week),
names_to = 'DRs',
values_to = 'Vals'
),
by = c('Code', 'Week', 'DRs')
) %>% mutate(adjust_val = Vals + PVs) %>%
select(-PVs,-Vals) %>% pivot_wider(
names_from = c('DRs', 'attrb.'),
names_sep = '_',
)

dmda<-"2021-07-09"
code<-"CDE"

SPV2_tib <- SPV1_tib %>% filter(date2 == dmda & Code == code) %>%
group_by(Code) %>% summarise(across(matches("^DR0"),sum)) %>%
pivot_longer(-Code, names_to = 'name',values_to ='val') %>%
mutate(name = str_remove_all(name,"DR0|_PV") %>% as.numeric)

result_tib <- SPV2_tib %>% filter(
name > as.Date(dmda) - as.Date(df1 %>% filter(Code==code) %>% pull(date1) %>% min)
)

mod_tib <- lm(val ~ I(name^2), result_tib)
coef_tib<-as.numeric(coef(mod_tib))
``````

Thanks for reply @yifanliy. From what I understand you changed to `dplyr` function instead of `data.table`, is that it? Do you think that makes it faster, that is, decrease the processing time?

Generally, `dplyr` is much slower than `data.table`. Therefore, `tidytable` was born:

I also think `dplyr` is slower, so I did it for `data.table`. And do you think `tidytable` or `data.table` is faster?
oh, sorry for that, I didn't notice that you wanted to improve the processing time, I thought that you just wanted to reduce the code the `dplyr` can't compare `datatable` in the term of speed
`tidytable` is a direct translator from `dplyr` code to `data.table`. There is negligible, fixed overhead. You can see benchmarks here: speed_comparisons • tidytable
As you can see, the code readability is a massive upgrade from `dplyr` syntax, while virtually achieving 0 performance penalties from translation.