Calculate a rolling 90 day average of a daily time series with duplicate dates

Hello,

I am a self-taught R user and I have a 2 part question.

Here is a small snippet of the data that I am working on:

library(tidyverse)
df <- tibble(date = c("2018-10-27", "2018-10-26", "2018-10-26", "2018-10-26", "2018-10-25", "2018-10-25"),
             value = c(2, 3, 4, 5, 4, 7))

df <- df %>%
  mutate(date = as.Date(date))
df
> df
# A tibble: 6 x 2
  date       value
  <date>     <dbl>
1 2018-10-27  2.00
2 2018-10-26  3.00
3 2018-10-26  4.00
4 2018-10-26  5.00
5 2018-10-25  4.00
6 2018-10-25  7.00

Notice that there are multiple duplicates for the date column. I am working with daily data. On some dates, there are several entries. On other dates there may be only one or none.

  1. I want to calculate a 90 row moving average

Normally, when I calculate moving averages, I usually use something like this (example with a 3-row moving average):

moving_avg_3_rows <- function(x){
  (lag(x,2) + lag(x,1) + x)/3
}

df_1 <- df %>%
  mutate(value_2 = moving_avg_3_rows(value))
df_1

> df_1
# A tibble: 6 x 3
  date       value value_2
  <date>     <dbl>   <dbl>
1 2018-10-27  2.00   NA   
2 2018-10-26  3.00   NA   
3 2018-10-26  4.00    3.00
4 2018-10-26  5.00    4.00
5 2018-10-25  4.00    4.33
6 2018-10-25  7.00    5.33

However, if i want to calculate a 90-row moving average, it seems really clumsy to have to type out "lag(x,n)" all the way from 89 to 1. Is there a way to write a function that goes up to "n" for the rolling averages, without having to type them all out by hand? I also want to make "n" customizable.

I tried a few clumsy for loops, but cannot seem to figure this out.

  1. I want to calculate a 90 day moving average

The duplicates dates in my data are creating a problem. I tried using some of the functions from the tidyquant package. From what I can understand, the rolling functions cannot be used here because the zoo package requires unique date information as the index.

(side note: I am using this site as a reference https://www.business-science.io/timeseries-analysis/2017/07/23/tidy-timeseries-analysis-pt-2.html)

Based on my understanding, it looks like I have two options:

a) Summarize/collapse the duplicate dates into one date and use that towards the moving average.
b) Introduce some random noise into the date index (e.g. by adding hours, minutes and seconds) so that each duplicate day can then be converted into a unique date/time stamp.

I am not sure how I would go about either of those two routes. Any help would be really appreciated!

Thanks in advance.

You can always use group_by + summarize from dplyr to do that, but you should decide what is the best way to summarize your data. Since you want to have an average, I guess, summing values that happened on the same date makes sense.

1 Like

Thanks for the suggestion @mishabalyasin

I tried fiddling with the group/sum/then average approach. This is what I have now:

df_4 <- df %>%
  group_by(date) %>%
  summarize(avg_by_date =  sum(value)/length(value)) %>%
  arrange(desc(date))
df_4

> df_4
# A tibble: 3 x 2
  date       avg_by_date
  <date>           <dbl>
1 2018-10-27        2.00
2 2018-10-26        4.00
3 2018-10-25        5.50

I need to work with this a little bit longer to see how this affects the overall trends in my larger dataset. But this has definitely opened up some avenues for me :slight_smile:

If you have any thoughts on part 1 of my question (regarding the 90-row averages), that would be appreciated.

Yes, I do :slight_smile:
Not sure this is the most elegant approach, but it's an approach:

library(magrittr)

lag_n_rows <- function(column, n){
  column <- rlang::as_string(rlang::enexpr(column))
  expr <- purrr::map(1:(n - 1), function(i) glue::glue("dplyr::lag({column}, {i})")) %>%
    purrr::reduce(paste, sep = " + ", .init = glue::glue("{column}")) %>%
    {glue::glue("({.})/{n}")} %>%
    rlang::parse_expr()
  rlang::eval_tidy(expr, env = rlang::child_env(.parent = rlang::caller_env()))
}

tibble::as_tibble(mtcars) %>% 
  dplyr::mutate(lag_2 = lag_n_rows(cyl, 2), lag_10 = lag_n_rows(cyl, 10))
#> # A tibble: 32 x 13
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb lag_2
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4    NA
#>  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4     6
#>  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1     5
#>  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1     5
#>  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2     7
#>  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1     7
#>  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4     7
#>  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2     6
#>  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2     4
#> 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4     5
#> # ... with 22 more rows, and 1 more variable: lag_10 <dbl>

Created on 2018-10-30 by the reprex package (v0.2.1)

Hopefully, it is clear what is going on, but personally it took me couple of tries to get environment part correct, so maybe I'm missing something obvious.

2 Likes

Thank you! @mishabalyasin

Your solution works well. I can't say I fully understand it, but for now, it works for me :slight_smile:

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.