# Calculation of millions of linear regressions is taking days, ideas on a more efficient approach?

I have a very big dataframe with millions of rows and I need to calculate the slopes of linear regressions for each row. I figured out a method to do it but it seems extremely inneficient.

Does anyone knows a more efficient way to do this? Thanks in advance!

The independent variable is a vector that goes from `c(1:years)` in this case goes from `1:6`.
The column name `V1_T1` represents the variable 1 in time 1.

``````library(dplyr)
library(tidyr)
library(broom)

df = data.frame(replicate(12, sample(0:10, 1000, rep = TRUE)))
colnames = c("V1_T1", "V1_T2", "V1_T3", "V1_T4", "V1_T5", "V1_T6", "V2_T1", "V2_T2", "V2_T3", "V2_T4", "V2_T5", "V2_T6")
colnames(df) = colnames

def_slope = function(df, name , first, last, years){
slope = c()
for (i in c(1:nrow(df))){ #in each row
m = df[i,] %>% #select the i row
dplyr::select({{first}}:{{last}}) %>% #select the columns
pivot_longer(c(1:years), names_to = "key", values_to = "value") %>% #transform the data for a linear regression input
mutate(ano = c(1:years)) %>%
lm(value ~ ano, data = .) %>% #calculate a linear regression
tidy() %>%
.[[2,2]] #extract the slope

slope = append(slope, m) #add the slope to the slope vector

}
slope = tibble(slope) %>% #create a column with a defined name
rename("{{name}}" := slope)

return(slope)
}

start = Sys.time()
slope = tibble( #creates a df with all the slopes
def_slope(df, trend_V1,    V1_T1, V1_T6, 6),
def_slope(df, trend_V2,    V2_T1, V2_T6, 6))
end = Sys.time()

end-start

Time difference of 22.93273 secs
``````

I benchmarked this is as much faster

``````library(tidyverse)

set.seed(42)
df = data.frame(replicate(12, sample(0:10, 1000, rep = TRUE)))
colnames = c("V1_T1", "V1_T2", "V1_T3", "V1_T4", "V1_T5", "V1_T6", "V2_T1", "V2_T2", "V2_T3", "V2_T4", "V2_T5", "V2_T6")
colnames(df) = colnames

dolms <- function(df,first,last){
(df_ <- select(df,
{{first}}:{{last}}) %>% t() %>% as_tibble)

(df_lists <- purrr::map(
df_,
~(bind_cols(val=.,ano=1:length(.)))
))

(df_lms <- purrr::map_dbl(
df_lists,
~lm(val~ano,data=.)\$coefficients[]
))
df_lms
}

start = Sys.time()
slope <- tibble(
trend_V1 = dolms(df, V1_T1, V1_T6),
trend_V2 = dolms(df, V2_T1, V2_T6)
)
end = Sys.time()

end-start``````
1 Like

This solution is great, it reduces the duration from 22.93s to 1.89s on my computer. I deeply appreciate your time 1 Like

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.