Calculating the Mean of the previous 6 data points, but for every nth row

I am having a data table with a date, a unique company name and its stock return that looks a bit like this:

require(data.table)

DATE <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
COMP <- c(replicate(60, "AAPL"), replicate(60, "INTL"), replicate(60, "GOOGL"), replicate(60, "MSFT"))
RET <- rnorm(240)
test1 <- data.table(DATE, COMP, RET)

And now I ty to calculate the previous average 6 same-month calendar return for each data point. So e.g. for the AAPL stock return in January 1990, I want the mean from the returns of Jan89, Jan 88, ... and Jan84 in a new column next to the Jan90 Return. I tried to get accustomed to skipping rows, but now I'm a bit stuck. Here is the code that I used to play a bit with the mechanic:

test1$new1 <- test1$RET[seq.int(from = 1L, to = 20L, by = 6L)]
test1$new2 <- test1$RET[seq.int(from = -20L, to = 0L, by = 6L)]
n = 6
test1$new3 <- rowMeans(test1[seq(from = 1, to = nrow(test1), by = n),])

Does anyone have an idea, how to do this? PS: I am not committed to data-table, I just enjoy this package a lot so far.

First of all, this can certainly be done in data.table, I just don't use it, so dplyr is more natural.
Second, you can't really get 6 same-month calendar return with only 5 years of data (since you have only 5 same-month calendar returns to begin with). But code below should give you an idea of how to do it with your data:

library(tidyverse)
library(TTR)
#> ── Attaching packages ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
#> ✔ ggplot2 2.2.1     ✔ purrr   0.2.5
#> ✔ tibble  1.4.2     ✔ dplyr   0.7.5
#> ✔ tidyr   0.8.1     ✔ stringr 1.3.1
#> ✔ readr   1.1.1     ✔ forcats 0.3.0
#> ── Conflicts ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag()    masks stats::lag()
DATE <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", 
  "Nov", "Dec")
COMP <- c(replicate(60, "AAPL"), replicate(60, "INTL"), replicate(60, "GOOGL"), 
  replicate(60, "MSFT"))

set.seed(42L)
RET <- rnorm(240)
YEAR <- purrr::map(c(1981:1985), ~rep(.x, 12)) %>% purrr::flatten_int() %>% 
  rep(., 4)

df <- data.frame(DATE, COMP, RET, YEAR) %>% 
  dplyr::arrange(YEAR) %>% 
  dplyr::group_by(DATE, COMP) %>% 
  dplyr::mutate(lag_RET = dplyr::lag(RET, n = 1)) %>% 
  dplyr::mutate(lag_mean = TTR::runMean(x = lag_RET, n = 2, cumulative = TRUE))

You first add lag variable inside of your dataframe and then use dplyr::mutate with windowed runMean function to calculate what you need. You can read help file for this function, it'll explain what cumulative and n are used for.

2 Likes

Here's a data.table version (with zoo) to get you started:

test1[, rollmean := zoo::rollmeanr(RET, 2, fill = NA), .(DATE, COMP)]

As you only have 5 months per comp, I have made the example for 2 months for illustration.

2 Likes

Thank you for your replies! That looks really good, but is there a way to exclude the cell value itself? So that only the previous values form the mean? Just a hint would be enough, I don't want to bother you all the time and want to learn coding by doing it myself. Thanks a lot!

2 Likes

For data.table check the shift() function with type="lag".

For dplyr you can use lag().

You may need to experiment to get a solution which works for all your possible scenarios, e.g. where the window size is greater than the number of data points.

It would also be worth your while changing the date from just being the month to add the year (or even just using a date) so that you don't lose the order if the data gets rearranged.

Yes I think there might be problem with the window size for my data set. If I use mishabalyasin solution, it gives back:

Error in mutate_impl(.data, dots) : Columnlag_RETmust be length 2 (the group size) or one, not 910640

And if I try your solution, it returns

Error in[.data.table(CRSP_dttest, ,:=(rollmean, zoo::rollmeanr(demean_ret, : Type of RHS ('double') must match LHS ('logical'). To check and coerce would impact performance too much for the fastest cases. Either change the type of the target column, or coerce the RHS of := yourself (e.g. by using 1L instead of 1)

I tried to make it all numerical (with as.numerical()) but that doesn't do the trick appparently, so I will see if I can do anything else....