How to use for loop for linear regression using long data frame in R?

Hello! Any help with my question would be greatly appreciated, and I thank you for your time in advance. I am fairly new to R, so apologies if this is a simple question. Also, I checked for duplicate posts and found a similar post, but my question is slightly different because it deals with the data in a long format.

Here are screenshots of the data I am working with:

The columns I am interested in are 'dma' 'weekly_deaths' and 'gtrends'. I would like to regress the gtrends data onto the weekly_deaths data for each unique 'dma'. So for example, I would like to create a simple linear regression model for gtrends ~ weekly_deaths for all of the rows with a dma =1, then do the same thing for dma =2, so on and so forth. The data for each unique dma comes after each other, as seen from the second screenshot (the dma =2 data starts after the last row for dma =1 data).

There are 210 dmas total, which is why I would like a loop to do the regressions for me instead of running 210 separate ones.

Ultimately, I would like the loop to give me the linear regression coefficient, p-value, and multiple r-squared for each dma regression.

I've tried things like:

reg <- total_gtrends_deaths_df %>% group_by(total_gtrends_deaths_df$dma) %>% do(model=lm(total_gtrends_deaths_df$gtrends ~ total_gtrends_deaths_df$weekly_deaths))

and

for (i in 1:160) {
reg <- lm(weekly_deaths~gtrends, data = subset(total_gtrends_deaths_df, dma==i))
}
to no avail (again, I am very new to R so apologies if these are bad attempts).

Does anyone have any suggestions? Thank you so much! I really appreciate any and all help.

Here is an example of performing multiple linear regressions on a long data frame using purrr instead of loops.

library(dplyr)
library(purrr)
library(broom)

fit_model <- function(df) lm(Sepal.Length ~ Sepal.Width, data = df)
get_slope <- function(mod) tidy(mod)$estimate[2]
get_p_value <- function(mod) tidy(mod)$p.value[2]

iris %>% 
    group_nest(Species) %>% 
    mutate(model = map(data, fit_model),
           slope = map_dbl(model, get_slope),
           p_value = map_dbl(model, get_p_value))
#> # A tibble: 3 × 5
#>   Species                  data model  slope  p_value
#>   <fct>      <list<tibble[,4]>> <list> <dbl>    <dbl>
#> 1 setosa               [50 × 4] <lm>   0.690 6.71e-10
#> 2 versicolor           [50 × 4] <lm>   0.865 8.77e- 5
#> 3 virginica            [50 × 4] <lm>   0.902 8.43e- 4

Created on 2021-07-28 by the reprex package (v2.0.0)

If you need more specific help, please provide a proper REPRoducible EXample (reprex) illustrating your issue.

1 Like

I have switched to nest_by(), another "experimental " function in {dplyr}. It "returns a rowwise data frame, which makes operations on the grouped data particularly elegant."

library(tidyverse)
library(broom)

iris %>% 
  nest_by(Species) %>% 
  mutate(model = list(lm(Sepal.Length ~ Sepal.Width, data))) %>% 
  summarise(tidy(model)) %>% 
  ungroup()
#> `summarise()` has grouped output by 'Species'. You can override using the `.groups` argument.
#> # A tibble: 6 × 6
#>   Species    term        estimate std.error statistic  p.value
#>   <fct>      <chr>          <dbl>     <dbl>     <dbl>    <dbl>
#> 1 setosa     (Intercept)    2.64     0.310       8.51 3.74e-11
#> 2 setosa     Sepal.Width    0.690    0.0899      7.68 6.71e-10
#> 3 versicolor (Intercept)    3.54     0.563       6.29 9.07e- 8
#> 4 versicolor Sepal.Width    0.865    0.202       4.28 8.77e- 5
#> 5 virginica  (Intercept)    3.91     0.757       5.16 4.66e- 6
#> 6 virginica  Sepal.Width    0.902    0.253       3.56 8.43e- 4

Created on 2021-07-27 by the reprex package (v2.0.0)

You were actually not far off. First, specify total_gtrends_deaths_df just once at the start and let the pipe do the work. Otherwise, I think it will group the data but then use the original ungrouped data for each variable. I would include tidy() inside do() and the . at the end of lm() to pass the data to. Without the latter, you will get an error message that it cannot find your variables. I filtered to show only the slope coefficients and not the intercepts.

It is always helpful to supply a sample of your data, perhaps for two of the dma values. We cannot copy and paste data to test with from a screenshot. .

library(tidyverse)
library(broom)

iris  %>% 
  group_by(Species) %>% 
  do(tidy(lm(Sepal.Length ~ Sepal.Width, .))) %>%
  filter(term != "(Intercept)")
#> # A tibble: 3 × 6
#> # Groups:   Species [3]
#>   Species    term        estimate std.error statistic  p.value
#>   <fct>      <chr>          <dbl>     <dbl>     <dbl>    <dbl>
#> 1 setosa     Sepal.Width    0.690    0.0899      7.68 6.71e-10
#> 2 versicolor Sepal.Width    0.865    0.202       4.28 8.77e- 5
#> 3 virginica  Sepal.Width    0.902    0.253       3.56 8.43e- 4

Created on 2021-07-27 by the reprex package (v2.0.0)

1 Like

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.