How to perform group-wise linear regression for a data frame in R

I have updated dplyr and am now getting this error:

Error in UseMethod("nest_by") :
no applicable method for 'nest_by' applied to an object of class "function"

mtcars |> 
  nest_by(cyl)

Can you run this?

Yes, I am able to run this.

mtcars |>
  nest_by(cyl)

# A tibble: 3 × 2
# Rowwise:  cyl
    cyl                data
  <dbl> <list<tibble[,10]>>
1     4           [11 × 10]
2     6            [7 × 10]
3     8           [14 × 10]

Can you provide an example of the code you are trying that gives you an error?

groupLM <- sample |>
  nest_by(bank_year) |>
  mutate(lm_model = list(lm(y ~ x1+x2+x3+x4+x5+x6+x7+x8+x9+x10+x11+x12+x13+x14, d = sample)))

This is a good example of a general lesson: choose good names for your objects, and prefer names that don't clash with base R function names.
base::sample is a function, so when no data frame called sample exists, nest_by() receives the function instead, which is what the error message reports. If you have a data.frame related to some sample, consider names like sample_df.
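To make the name clash concrete, here is a minimal sketch in base R. It assumes no data frame called sample has been created in the session; sample_df is just an illustrative name, and mtcars stands in for the real data.

```r
# With no data frame of that name, `sample` resolves to base::sample,
# which is a function. That is the class reported in the nest_by() error.
class(sample)
#> [1] "function"

# A name that does not clash with base R (sample_df is hypothetical)
sample_df <- mtcars
class(sample_df)
#> [1] "data.frame"
```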

The code now runs without error, but the results are identical for every group. I believe the result for a single bank-year combination is being copied into all the regressions.

library(dplyr)
library(tidyverse)
library(broom)
data_5 <- read.csv("data_sample.csv")

y <- data_5$nse_returns
x1 <- data_5$auto
x2 <- data_5$consumer_durables
x3 <- data_5$FMCG
x4 <- data_5$healthcare
x5 <- data_5$IT
x6 <- data_5$media
x7 <- data_5$metal
x8 <- data_5$oil_gas
x9 <- data_5$pharma
x10 <- data_5$reality
x11 <- data_5$finance
x12 <- data_5$Mkt.RF
x13 <- data_5$SMB
x14 <- data_5$HML
groupLM <- data_5 |>
  nest_by(bank_year) |>
  mutate(lm_model = list(lm(y ~ x1+x2+x3+x4+x5+x6+x7+x8+x9+x10+x11+x12+x13+x14, d = data_5)))

groupLM

# A tibble: 196 × 3
# Rowwise:  bank_year
   bank_year                 data lm_model
   <chr>      <list<tibble[,18]>> <list>  
 1 ALD2018             [246 × 18] <lm>    
 2 ALD2019             [244 × 18] <lm>    
 3 ALD2020              [55 × 18] <lm>    
 4 ANDHRA2018          [246 × 18] <lm>    
 5 ANDHRA2019          [244 × 18] <lm>    
 6 ANDHRA2020           [55 × 18] <lm>    
 7 AUSF2018            [246 × 18] <lm>    
 8 AUSF2019            [244 × 18] <lm>    
 9 AUSF2020            [250 × 18] <lm>    
10 AUSF2021            [248 × 18] <lm>    
# … with 186 more rows
# ℹ Use `print(n = ...)` to see more rows

groupLM |> reframe(glance(lm_model))

# A tibble: 196 × 13
   bank_year  r.squ…¹ adj.r…²  sigma stati…³ p.value    df logLik     AIC     BIC devia…⁴ df.re…⁵  nobs
   <chr>        <dbl>   <dbl>  <dbl>   <dbl>   <dbl> <dbl>  <dbl>   <dbl>   <dbl>   <dbl>   <int> <int>
 1 ALD2018     0.0939  0.0936 0.0278    321.       0    14 93823. -1.88e5 -1.87e5    33.5   43343 43358
 2 ALD2019     0.0939  0.0936 0.0278    321.       0    14 93823. -1.88e5 -1.87e5    33.5   43343 43358
 3 ALD2020     0.0939  0.0936 0.0278    321.       0    14 93823. -1.88e5 -1.87e5    33.5   43343 43358
 4 ANDHRA2018  0.0939  0.0936 0.0278    321.       0    14 93823. -1.88e5 -1.87e5    33.5   43343 43358
 5 ANDHRA2019  0.0939  0.0936 0.0278    321.       0    14 93823. -1.88e5 -1.87e5    33.5   43343 43358
 6 ANDHRA2020  0.0939  0.0936 0.0278    321.       0    14 93823. -1.88e5 -1.87e5    33.5   43343 43358
 7 AUSF2018    0.0939  0.0936 0.0278    321.       0    14 93823. -1.88e5 -1.87e5    33.5   43343 43358
 8 AUSF2019    0.0939  0.0936 0.0278    321.       0    14 93823. -1.88e5 -1.87e5    33.5   43343 43358
 9 AUSF2020    0.0939  0.0936 0.0278    321.       0    14 93823. -1.88e5 -1.87e5    33.5   43343 43358
10 AUSF2021    0.0939  0.0936 0.0278    321.       0    14 93823. -1.88e5 -1.87e5    33.5   43343 43358
# … with 186 more rows, and abbreviated variable names ¹r.squared, ²adj.r.squared, ³statistic,
#   ⁴deviance, ⁵df.residual
# ℹ Use `print(n = ...)` to see more rows

Please format your post.

My advice is to think about the argument that lm() takes to establish which data it should use. If the nest operation produced an appropriate table and stored it in a list column called data, then it is that column that should be passed to lm(), and certainly not the entire unnested dataset (data_5).

I gave a similar recommendation when do() was discussed.
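As a rough illustration of that advice, here is a small sketch on mtcars rather than the real data. The model mpg ~ wt and the names per_group, n_obs and slope_wt are only stand-ins chosen for the example. Note that d = in the code above only works through R's partial argument matching; spelling out data = is clearer.

```r
library(dplyr)

# nest_by() stores each group's rows in a list column called `data`.
# Passing that column to lm() fits one model per group.
per_group <- mtcars |>
  nest_by(cyl) |>
  mutate(lm_model = list(lm(mpg ~ wt, data = data)))

# Each model should report only its own group's observations,
# and the slope should differ from group to group
per_group |>
  reframe(n_obs = nobs(lm_model),
          slope_wt = coef(lm_model)[["wt"]])
```

If the nesting is being used properly, n_obs follows the group sizes (11, 7 and 14 rows for mtcars) instead of repeating the size of the whole dataset.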

> data_5 <- read.csv("data_sample.csv")
> y <- data_5$nse_returns
> x1 <- data_5$auto
> x2 <- data_5$consumer_durables
> x3 <- data_5$FMCG
> x4 <- data_5$healthcare
> x5 <- data_5$IT
> x6 <- data_5$media
> x7 <- data_5$metal
> x8 <- data_5$oil_gas
> x9 <- data_5$pharma
> x10 <- data_5$reality
> x11 <- data_5$finance
> x12 <- data_5$Mkt.RF
> x13 <- data_5$SMB
> x14 <- data_5$HML
> groupLM <- data_5 |> 
+   nest_by(bank_year) |> 
+   mutate(lm_model = list(lm(y ~ x1+x2+x3+x4+x5+x6+x7+x8+x9+x10+x11+x12+x13+x14, d = data_5)))
> groupLM
# A tibble: 196 × 3
# Rowwise:  bank_year
   bank_year                 data lm_model
   <chr>      <list<tibble[,18]>> <list>  
 1 ALD2018             [246 × 18] <lm>    
 2 ALD2019             [244 × 18] <lm>    
 3 ALD2020              [55 × 18] <lm>    
 4 ANDHRA2018          [246 × 18] <lm>    
 5 ANDHRA2019          [244 × 18] <lm>    
 6 ANDHRA2020           [55 × 18] <lm>    
 7 AUSF2018            [246 × 18] <lm>    
 8 AUSF2019            [244 × 18] <lm>    
 9 AUSF2020            [250 × 18] <lm>    
10 AUSF2021            [248 × 18] <lm>    
# … with 186 more rows
# ℹ Use `print(n = ...)` to see more rows

> groupLM |> reframe(glance(lm_model))

# A tibble: 196 × 13
   bank_year  r.squ…¹ adj.r…²  sigma stati…³ p.value    df logLik     AIC     BIC devia…⁴ df.re…⁵  nobs
   <chr>        <dbl>   <dbl>  <dbl>   <dbl>   <dbl> <dbl>  <dbl>   <dbl>   <dbl>   <dbl>   <int> <int>
 1 ALD2018     0.0939  0.0936 0.0278    321.       0    14 93823. -1.88e5 -1.87e5    33.5   43343 43358
 2 ALD2019     0.0939  0.0936 0.0278    321.       0    14 93823. -1.88e5 -1.87e5    33.5   43343 43358
 3 ALD2020     0.0939  0.0936 0.0278    321.       0    14 93823. -1.88e5 -1.87e5    33.5   43343 43358
 4 ANDHRA2018  0.0939  0.0936 0.0278    321.       0    14 93823. -1.88e5 -1.87e5    33.5   43343 43358
 5 ANDHRA2019  0.0939  0.0936 0.0278    321.       0    14 93823. -1.88e5 -1.87e5    33.5   43343 43358
 6 ANDHRA2020  0.0939  0.0936 0.0278    321.       0    14 93823. -1.88e5 -1.87e5    33.5   43343 43358
 7 AUSF2018    0.0939  0.0936 0.0278    321.       0    14 93823. -1.88e5 -1.87e5    33.5   43343 43358
 8 AUSF2019    0.0939  0.0936 0.0278    321.       0    14 93823. -1.88e5 -1.87e5    33.5   43343 43358
 9 AUSF2020    0.0939  0.0936 0.0278    321.       0    14 93823. -1.88e5 -1.87e5    33.5   43343 43358
10 AUSF2021    0.0939  0.0936 0.0278    321.       0    14 93823. -1.88e5 -1.87e5    33.5   43343 43358
# … with 186 more rows, and abbreviated variable names ¹​r.squared, ²​adj.r.squared, ³​statistic,
#   ⁴​deviance, ⁵​df.residual
# ℹ Use `print(n = ...)` to see more rows

> groupLM |> reframe(tidy(lm_model))

# A tibble: 2,940 × 6
   bank_year term        estimate std.error statistic  p.value
   <chr>     <chr>          <dbl>     <dbl>     <dbl>    <dbl>
 1 ALD2018   (Intercept) 1.00     0.000134   7479.    0       
 2 ALD2018   x1          0.000887 0.000161      5.51  3.52e- 8
 3 ALD2018   x2          0.000728 0.000168      4.33  1.51e- 5
 4 ALD2018   x3          0.000531 0.000208      2.56  1.06e- 2
 5 ALD2018   x4          0.00116  0.000634      1.82  6.86e- 2
 6 ALD2018   x5          0.000116 0.000181      0.639 5.23e- 1
 7 ALD2018   x6          0.00144  0.0000893    16.2   1.05e-58
 8 ALD2018   x7          0.000675 0.000107      6.33  2.40e-10
 9 ALD2018   x8          0.00289  0.000185     15.7   3.81e-55
10 ALD2018   x9          0.000389 0.000556      0.699 4.85e- 1
# … with 2,930 more rows
# ℹ Use `print(n = ...)` to see more rows

This is just my opinion, but without extra context from you that would explain or motivate it, copying the columns into separate objects (y, x1 to x14) seems both self-defeating and pointless extra work.

Practically, the negative impact of having done this is that these objects (y, x1 to x14) do not exist as columns in the data_5 that you nest. When they appear in your lm() formula, lm() is possibly too smart for its own good: it does not find them in the data it was given, so it goes directly to the objects you created (y, x1, x2, ...), and the fit is no longer driven by any nesting. You have also persisted in passing data_5 as the d= argument, when I've told you twice already that this does not work and that the data should be the product of the nest.

Question 1) Do you have a requirement to hide the actual variable names and substitute non-descriptive names such as x1-x14?
If you do, we can talk about good approaches, but I would guess that you don't.
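The fallback behaviour described above can be reproduced in base R. This is only a sketch on mtcars: the objects y and x1 mimic the ones created from data_5, and subset_4cyl is a hypothetical stand-in for one nested group.

```r
# Objects built from the FULL data, as in the original post
y  <- mtcars$mpg
x1 <- mtcars$wt

# One group's rows; it has no columns named y or x1
subset_4cyl <- mtcars[mtcars$cyl == 4, ]

# lm() cannot find y or x1 in `data`, so it looks them up in the
# formula's environment and silently uses all 32 rows
fit_global <- lm(y ~ x1, data = subset_4cyl)
nobs(fit_global)
#> [1] 32

# With the real column names, lm() uses the subset it was given
fit_subset <- lm(mpg ~ wt, data = subset_4cyl)
nobs(fit_subset)
#> [1] 11
```

This is why every bank_year row showed the same nobs of 43358: each fit used the global vectors, whatever was passed as data.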

There is no requirement to hide the actual variable names. I didn't use the actual names because I didn't want to make the model messy. I am not an expert in R and don't know many of the conventions. I am really sorry if my silly mistakes are displeasing you. If I use the actual names, will it work?

In terms of the actual names, y is nse_returns and the x variables are the columns from auto to HML.

date name year bank_year nse_returns auto consumer_durables FMCG healthcare IT media metal oil_gas pharma reality finance Mkt.RF SMB HML RF
01-01-2008 ALD 2008 ALD2008 1.09 0.871 -0.528 2.199 -0.097 -1.308 1.599 0.195 -0.26 -0.255 2.907 0.234 0.02 0.01 -0.01 0.01
01-02-2008 ALD 2008 ALD2008 1.02 1.611 -2.091 0.07 -0.85 2.845 -5.311 -0.997 -0.23 -1.14 -4.486 -2.192 1.26 -0.05 -0.14 0.01
01-04-2008 ALD 2008 ALD2008 1 -0.812 0.94 1.871 -0.649 -1.065 -0.586 -1.967 2.471 -0.979 -0.638 -1.127 1.95 -1.41 0.19 0.01
01-07-2008 ALD 2008 ALD2008 0.96 -1.906 -0.632 -1.026 -0.429 0.44 -2.274 -0.653 -0.352 -0.604 -1.537 -1.366 -0.66 0.09 -0.16 0.01
01-08-2008 ALD 2008 ALD2008 1.01 -1.92 -0.546 -1.826 -0.348 1.442 -0.589 0.354 0.844 0.242 -1.053 1.491 -1.09 0.51 0.36 0.01

I've gone through and applied EconProf's approach to what we understand of your data and model needs. I've tried to be more explicit than is strictly needed, by naming the result of nest_by() and using that name as appropriate within lm().


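library(dplyr)  # nest_by(), mutate(), reframe()
library(broom)  # glance(), tidy()

data_5 <- read.csv("data_sample.csv")

# Nest the rows of each bank_year into a list column named nested_data,
# then fit one regression per row using that group's own data
groupLM <- data_5 |> 
  nest_by(bank_year,
          .key = "nested_data") |> 
  mutate(lm_model = list(lm(nse_returns ~ auto +
                            consumer_durables +
                            FMCG +
                            healthcare +
                            IT +
                            media +
                            metal +
                            oil_gas +
                            pharma +
                            reality +
                            finance +
                            Mkt.RF +
                            SMB +
                            HML, data = nested_data)))

# One row of model-level statistics per bank_year
groupLM |> reframe(glance(lm_model))

# One row per coefficient per bank_year
groupLM |> reframe(tidy(lm_model))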
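If the worry is that a long formula looks messy, one option is to build it from the real column names with stats::reformulate() instead of renaming anything. This is a sketch only; predictors and model_formula are names made up for the example, and the column names are taken from the data sample posted above.

```r
# Predictor columns as they appear in the posted data sample
predictors <- c("auto", "consumer_durables", "FMCG", "healthcare", "IT",
                "media", "metal", "oil_gas", "pharma", "reality",
                "finance", "Mkt.RF", "SMB", "HML")

# Builds the formula nse_returns ~ auto + consumer_durables + ... + HML
model_formula <- reformulate(predictors, response = "nse_returns")
model_formula
```

The formula can then be dropped into the pipeline above as lm(model_formula, data = nested_data), so the variables are still looked up in each group's nested data.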

Thank you very much, this worked.

I have posted the sample data. Can you please help me out?


Did you post in the wrong thread?
