Creating a loop for a regression model and store results

Hi all,

I am have the following sample dataset that contains stocks being coded as a number( e.g. 10026), and the Fama-French factors mktrf, hml and smb.

I am trying to perform the following regression:

lm(formula= "`10026` ~ mktrf + hml + smb", data= data ,na.action = na.omit)

This works just fine for the single stock, but I need to write a loop so that the I don't have to run the regression over and over again for every stock, especially since my real data set is much bigger.

Once the regression is completed, I will need to compute the SD of residuals on a monthly basis, so ideally I require a data frame that lists the residuals, and has a date column and the respective stock numbers as columns.

Here are my approaches for the loop so far, which all failed:

# creating a list with dependent variables
depVarList <- names(x = retDailyMerged)[2:7670]

# loop 
lapply(X = depVarList,
       FUN = function(t) lm(formula = paste0("`", t, "` ~ mktrf + hml + smb"),  data = retDailyMerged))

# alternative loop
 for (i in depVarList) 
  {lm(formula= "i ~ mktrf + hml + smb", data= retDailyMerged,na.action = na.omit)}`

-> dataset

structure(list(date = structure(c(17533, 17534, 17535, 17536, 
17539), tzone = "UTC", tclass = "Date", class = "Date"), `10026` = c(NA, 
-0.00998786767581905, 0.0138127158236847, -0.00955052427703185, 
0.000741739716790146), `10028` = c(NA, 0.0102061855670104, -0.0205122971731809, 
-0.0103146488851844, 0.0631645436361723), `10032` = c(NA, 0.00130975769482644, 
0.0122629169391759, 0.00492650621870472, 0.0466929197138954), 
    `93436` = c(NA, -0.0102330515084391, -0.00828999211977932, 
    0.00622970567668935, 0.0626382292829057), mktrf = c(0.0085, 
    0.0059, 0.0042, 0.0066, 0.0019), smb = c(0.0036, -0.0039, 
    -0.0026, -0.0034, -0.0016), hml = c(-0.0022, -0.0021, 0.0024, 
    -0.0026, 7e-04)), class = c("data.table", "data.frame"), row.names = c(NA, 
-5L), sorted = "date")

Hi,

Here is my answer, if I understand your question correctly at least.

Let's use the dataset you provided and call it myData:

myData
        date         10026       10028       10032        93436  mktrf     smb     hml
1 2018-01-02            NA          NA          NA           NA 0.0085  0.0036 -0.0022
2 2018-01-03 -0.0099878677  0.01020619 0.001309758 -0.010233052 0.0059 -0.0039 -0.0021
3 2018-01-04  0.0138127158 -0.02051230 0.012262917 -0.008289992 0.0042 -0.0026  0.0024
4 2018-01-05 -0.0095505243 -0.01031465 0.004926506  0.006229706 0.0066 -0.0034 -0.0026
5 2018-01-08  0.0007417397  0.06316454 0.046692920  0.062638229 0.0019 -0.0016  0.0007

Now we can loop over the different columns of interest and create a linear model for each

#Get all column names to run regression on
depVarList = setdiff(colnames(myData), c("date", "mktrf", "hml", "smb"))

#Loop over them and create model for each
allModels = lapply(depVarList, function(x){
  lm(formula= paste0("`", x, "` ~ mktrf + hml + smb"), 
     data= myData ,na.action = na.omit)
  
})

#Name the list of models to the column name
names(allModels) = depVarList

Finally, we create a dataframe with all the residuals of each test

#Recuperate the columns of interest from the original dataframe
allResiduals = myData %>% select(-mktrf , -hml , -smb)

#Replace all values with residuals (ignore NA)
allResiduals[,-1] = sapply(2:ncol(allResiduals), function(x){
  residuals = allResiduals[,x]
  residuals[!is.na(residuals)] = allModels[[x-1]]$residuals
  residuals
})
allResiduals
        date 10026 10028 10032 93436
1 2018-01-02    NA    NA    NA    NA
2 2018-01-03     0     0     0     0
3 2018-01-04     0     0     0     0
4 2018-01-05     0     0     0     0
5 2018-01-08     0     0     0     0

In this case, all residuals happen to be 0, but that's probably because of the limited data.

Hope this helps,
PJ

1 Like

A tidyverse based option, including the standard deviation of residuals on a monthly basis

sample_data <- data.frame(
    date = as.Date(c("2018-01-02", "2018-01-03", "2018-01-04", "2018-01-05",
                     "2018-01-08")),
    X10026 = c(NA, -0.00998786767581905, 0.0138127158236847,
               -0.00955052427703185, 0.000741739716790146),
    X10028 = c(NA, 0.0102061855670104, -0.0205122971731809,
               -0.0103146488851844, 0.0631645436361723),
    X10032 = c(NA, 0.00130975769482644, 0.0122629169391759,
               0.00492650621870472, 0.0466929197138954),
    X93436 = c(NA, -0.0102330515084391, -0.00828999211977932,
               0.00622970567668935, 0.0626382292829057),
    mktrf = c(0.0085, 0.0059, 0.0042, 0.0066, 0.0019),
    smb = c(0.0036, -0.0039, -0.0026, -0.0034, -0.0016),
    hml = c(-0.0022, -0.0021, 0.0024, -0.0026, 7e-04)
)

library(tidyverse)
library(tsibble)

sample_data %>% 
    gather(stock, return, starts_with("X")) %>% 
    group_nest(stock) %>% 
    mutate(model = map(data,
                       ~lm(formula = "return ~ mktrf + hml + smb",
                           data = .x,
                           na.action = na.exclude)
                       ),
           resid = map(model, residuals)
           ) %>% 
    unnest(c(data,resid)) %>% 
    na.omit(return) %>%
    as_tsibble(key = stock, index = date) %>% 
    group_by_key() %>%
    index_by(year_month = ~ yearmonth(.)) %>%
    summarise(sd_residual = sd(resid))
#> # A tsibble: 4 x 3 [?]
#> # Key:       stock [4]
#>   stock  year_month sd_residual
#>   <chr>       <mth>       <dbl>
#> 1 X10026   2018 ene           0
#> 2 X10028   2018 ene           0
#> 3 X10032   2018 ene           0
#> 4 X93436   2018 ene           0

Created on 2019-09-20 by the reprex package (v0.3.0.9000)

1 Like

This code works perfectly well once I cleaned the data to exclude all columns with NA @pieterjanvc . I also get the residuals when using it on my complete data set. Many thanks!!!

@andresrcs, thanks for the tidy verse solution! Very helpful for my future analysis. The only thing is that I get the following error message: "Column stock is unknown". Any idea what I am doing differently?

The stock column is created on this step gather(stock, return, starts_with("X")) %>% so I see no evident cause for that error message. Could you provide a REPRoducible EXample (reprex) for that issue?

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.