Loop regression (by year)

Hello everybody,

I am a new user and would like to ask for help.
My data, stored in "AAA.csv", looks like this:

stockcode date firmreturn year meanindustry lagmeanindustry marketreturn lagmarketreturn
1 AAA 20161125 NA 2016 -0.00269 NA -0.00341199 NA
2 AAA 20161128 -0.01686 2016 -0.01095 -0.002693882 -0.015777714 -0.00341199
3 AAA 20161129 0 2016 -0.00675 -0.010948497 -0.010623046 -0.015777714
4 AAA 20171130 0 2017 0.003903 -0.006747283 0.010292308 -0.010623046
5 AAA 20171201 -0.04168 2017 -0.00066 0.003902706 0.002207855 0.010292308
6 AAA 20171202 0 2017 -0.00325 -0.000657855 -0.002102608 0.002207855
7 AAA 20181205 0.003713 2018 -0.00339 -0.00324846 -0.007439579 -0.002102608
8 AAA 20181206 -0.00367 2018 -0.01001 -0.003385649 -0.013295919 -0.007439579
9 AAA 20181207 -0.00368 2018 -0.00014 -0.010013394 0.003126391 -0.013295919
10 AAA 20181208 0.003684 2018 0.00141 -0.000137325 0.008168162 0.003126391

I would like to run a regression for each year, then save the R-squared for each year to a file.
I would be grateful if anyone can help. Thank you in advance.


lm(AAA$firmreturn ~ AAA$marketreturn+AAA$lagmarketreturn+AAA$meanindustry+AAA$lagmeanindustry)

Hi!
A good place to learn about how to do this is the R for Data Science book, in particular the Many models chapter.


Hi,

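First of all, check your missing data :slight_smile: By default, lm() silently drops every row that has an NA in a variable used by the model, so it is better to deal with those rows yourself. You have two options:

  • Get rid of all rows with missing data in the columns that are going to be used
  • Fill in the missing values with estimates or from other sources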
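
The two options above might look like this in base R. This is only a minimal sketch on a made-up data frame: the name `toy` and its values are illustrative and not taken from the original data, and filling with the column mean is just one possible estimate, not a recommendation.

```r
# Toy data with missing values (illustrative only, not the real AAA.csv)
toy <- data.frame(
  firmreturn   = c(NA, -0.017, 0.000, 0.004),
  marketreturn = c(-0.003, -0.016, NA, 0.010)
)

# Option 1: drop every row that has an NA in a column the model will use
toy_complete <- toy[complete.cases(toy[, c("firmreturn", "marketreturn")]), ]

# Option 2: fill missing predictor values with an estimate (here: the column mean)
toy_filled <- toy
toy_filled$marketreturn[is.na(toy_filled$marketreturn)] <-
  mean(toy$marketreturn, na.rm = TRUE)
```

Whether dropping or filling is appropriate depends on why the values are missing. In the sample data the NAs in the first row come from the lagged columns, so dropping that row is probably the simpler choice.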

OK, now on to running your linear models. It's easier to drop the columns you won't need first, because the formula can then simply use every remaining column as a predictor (firmreturn ~ .).

library("dplyr")
AAA = AAA %>% select(-stockcode, -date)

We can now write the linear model and make sure we filter for one year. Then remove the year column and use all the rest in the model:

myModel = lm(firmreturn ~., data = AAA %>% filter(year == "2016") %>% select(-year))
Rsquared = summary(myModel)$r.squared

To make it even easier, you can run the models for all years at once with purrr::map_df() (you'll need to install the 'purrr' package first):

finalResult = purrr::map_df(unique(AAA$year), function(modelYear){
  myModel = lm(firmreturn ~., data = AAA %>% filter(year == modelYear) %>% select(-year))
  data.frame(year = modelYear, RSquared = summary(myModel)$r.squared)
})
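
To save the R-squared values to a file, as asked in the original question, something like the following should work. It is a sketch only: the `finalResult` data frame here is a stand-in with made-up values so the snippet runs on its own, and the output file name is hypothetical.

```r
# Stand-in for the finalResult data frame built above (values are made up)
finalResult <- data.frame(year = c(2016, 2017, 2018),
                          RSquared = c(0.41, 0.37, 0.52))

# Hypothetical output file; tempdir() keeps the sketch safe to run anywhere.
# Replace it with your own folder when using real data.
out_file <- file.path(tempdir(), "AAA_rsquared.csv")
write.csv(finalResult, out_file, row.names = FALSE)

# Read the file back to check what was written
saved <- read.csv(out_file)
```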

Hope this helps!
PJ


Hi,
Thank you so much for your help. It is perfect.
May I ask one more question?
I have more than 500 stock files, saved as "AAA.csv", "AAM.csv", "ABT.csv", and so on.
Is there any way to run it for all of them in R?
Thanks

If by "run it" you mean reading all the files at once, you can do something like this:

library(tidyverse)

list_of_files <- list.files(path = "path_to_your_folder",
                            pattern = "\\.csv$",
                            full.names = TRUE)
stocks <- list_of_files %>%
    setNames(nm = .) %>% 
    map_dfr(read.csv, .id = "file_name")

Thank you for your reply. I am still confused.
Could you please explain the second command in detail? What is "file_name"?
In addition, once I run the regression for 500 stock files, I will have to save 500 finalResult files. How can I deal with that?
Thank you in advance.

stocks <- list_of_files %>% # This would be the list of the file paths for your 500 .csv files
    setNames(nm = .) %>% # This turns it into a named vector
    map_dfr(read.csv, .id = "file_name") # This applies the read.csv() function to each file, returning a dataframe with the content of all the files merged together

The argument .id = "file_name" sets the name of the column that is added as an identifier for the content of each individual file; in this case it contains the file path of each file. After this, you could nest the data frame by file_name and fit the regressions for the 500 stocks in one step.
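
A rough sketch of that nesting step, following the "many models" pattern, could look like this. The `stocks` data frame below is simulated so the example runs on its own (the file names, the single `marketreturn` predictor, and all values are assumptions); with real data it would be the combined data frame produced by map_dfr() above, with unneeded columns such as stockcode and date removed first.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Simulated stand-in for the combined `stocks` data frame (random values,
# hypothetical file names)
set.seed(1)
stocks <- expand.grid(obs = 1:20,
                      year = c(2016, 2017),
                      file_name = c("AAA.csv", "AAM.csv"),
                      stringsAsFactors = FALSE) %>%
  mutate(marketreturn = rnorm(n(), sd = 0.01),
         firmreturn = 0.8 * marketreturn + rnorm(n(), sd = 0.01)) %>%
  select(-obs)

allResults <- stocks %>%
  group_by(file_name, year) %>%
  nest() %>%  # one row per stock file and year, data in a list column
  mutate(model = map(data, ~ lm(firmreturn ~ ., data = .x)),
         RSquared = map_dbl(model, ~ summary(.x)$r.squared)) %>%
  ungroup() %>%
  select(file_name, year, RSquared)

# One combined file instead of 500 separate ones (hypothetical file name)
write.csv(allResults, file.path(tempdir(), "all_rsquared.csv"), row.names = FALSE)
```

Keeping everything in one result table with file_name and year columns avoids having to manage 500 separate finalResult files; the table can still be filtered by file_name later if per-stock output is needed.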

If you need more learning resources about this approach, read the book Pete recommended to you.
