Rolling/cumulative average with reset?

dplyr

#1

I'm trying to generate an "up-to-x-months" rolling average that resets after missing data. In other words, the rolling period should vary based on how many of the contiguously preceding months have available data, up to a limit of x months for the lookback period.

I tried using cumsum and a lag, but I couldn't figure out how to do a variable lag.

Any advice on a good approach for this? Perhaps this is doable with zoo or tidyquant?

library(tidyverse)

max_lookback_mo <- 3    # This will be 12 in real use, using 3 for simpler example.

# I have month and total, and would like to calculate the other two columns.
monthly_data <- tribble(
  ~month,   ~total,   ~desired_mos_incl,  ~desired_avg,
  1,        100,      1,                  100,
  2,        200,      2,                  150,
  3,        NA,       NA,                 NA,
  4,        200,      1,                  200,
  5,        300,      2,                  250,
  6,        400,      3,                  300,
  7,        500,      3,                  400,
  8,        420,      3,                  440
)

#2

I'm sure there must be more pretty ways... but this works:

library(tidyverse)

monthly_data <- tribble(
  ~month,   ~total,
  1,        100,
  2,        200,
  3,        NA,
  4,        200,
  5,        300,
  6,        400,
  7,        500,
  8,        420
)

max_lookback_mo <- 3

mos_incl <- integer(nrow(monthly_data))

for (i in 1:(length(monthly_data$total) - 1)) {
  mos_incl[[1]] <- if_else(is.na(monthly_data$total[[1]]), 0, 1)
  mos_incl[[i + 1]] <- if(is.na(monthly_data$total[[i + 1]])) {
        0
  } else if(mos_incl[[i]] + 1 <= max_lookback_mo) {
    mos_incl[[i]] + 1
  } else {
    max_lookback_mo
  }
}

mos_incl[mos_incl == 0] <- NA

monthly_data$mos_incl <- mos_incl

avg <- double(nrow(monthly_data))

for (i in seq_along(monthly_data$mos_incl)) {
  avg[[i]] <- if(is.na(monthly_data$mos_incl[[i]])) {
    NA
  } else if(monthly_data$mos_incl[[i]] == 1) {
    monthly_data$total[[i]]
    } else if(monthly_data$mos_incl[[i]] == 2) {
      mean(c(monthly_data$total[[i]], monthly_data$total[[i - 1]]))
    } else {
      mean(c(monthly_data$total[[i]], monthly_data$total[[i - 1]], monthly_data$total[[i - 2]]))
    }
}

monthly_data$avg <- avg

monthly_data
#> # A tibble: 8 x 4
#>   month total mos_incl   avg
#>   <dbl> <dbl>    <dbl> <dbl>
#> 1     1   100        1   100
#> 2     2   200        2   150
#> 3     3    NA       NA    NA
#> 4     4   200        1   200
#> 5     5   300        2   250
#> 6     6   400        3   300
#> 7     7   500        3   400
#> 8     8   420        3   440

#3

Now, the first loop will work just fine for any value of max_lookback_mo. But the 2nd loop will not since it stops at 3. Of course, you could keep it going until 12, but that would be awfully tedious and it goes against Hadley's rule of no more than 3 copy-pastes of similar code :smile: . So here is a better solution that will work for any lookback value:

library(tidyverse)

monthly_data <- tribble(
  ~month,   ~total,
  1,        100,
  2,        200,
  3,        NA,
  4,        200,
  5,        300,
  6,        400,
  7,        500,
  8,        420
)

max_lookback_mo <- 3

mos_incl <- integer(nrow(monthly_data))

for (i in 1:(length(monthly_data$total) - 1)) {
  mos_incl[[1]] <- if_else(is.na(monthly_data$total[[1]]), 0, 1)
  mos_incl[[i + 1]] <- if(is.na(monthly_data$total[[i + 1]])) {
    0
  } else if(mos_incl[[i]] + 1 <= max_lookback_mo) {
    mos_incl[[i]] + 1
  } else {
    max_lookback_mo
  } 
}

mos_incl[mos_incl == 0] <- NA

monthly_data$mos_incl <- mos_incl

avg <- double(nrow(monthly_data))

for (i in seq_along(monthly_data$mos_incl)) {
  avg[[i]] <- if(is.na(monthly_data$mos_incl[[i]])) {
    NA
  } else {
    mean(monthly_data$total[i:(i - (monthly_data$mos_incl[[i]] - 1))])
  }
}

monthly_data$avg <- avg

monthly_data
#> # A tibble: 8 x 4
#>   month total mos_incl   avg
#>   <dbl> <dbl>    <dbl> <dbl>
#> 1     1   100        1   100
#> 2     2   200        2   150
#> 3     3    NA       NA    NA
#> 4     4   200        1   200
#> 5     5   300        2   250
#> 6     6   400        3   300
#> 7     7   500        3   400
#> 8     8   420        3   440

#4

Another approach:

library(tidyverse)
max_lookback_mo <- 3  # This will be 12 in real use, using 3 for simpler example.

monthly_data <- tribble(~month, ~total, ~desired_mos_incl, ~desired_avg, 1, 
  100, 1, 100, 2, 200, 2, 150, 3, NA, NA, NA, 4, 200, 1, 200, 5, 300, 2, 250, 
  6, 400, 3, 300, 7, 500, 3, 400, 8, 420, 3, 440)

filter_nas <- function(vec) {
  last_na <- max(which(is.na(vec)))
  res <- vec[seq_along(vec) > last_na]
  if (length(res) == 0) 
    return(NA)
  res
}

calc_avg <- function(x, n_months = max_lookback_mo) {
  vec <- (2 - max_lookback_mo):(length(x))
  positions <- purrr::map(1:length(x), function(pos) {
    vec[pos >= vec & (pos - n_months) < vec & vec > 0]
  })
  purrr::map_dbl(positions, function(position) {
    x[position] %>% filter_nas %>% mean
  })
}

res <- monthly_data %>% dplyr::mutate(res = calc_avg(total))

res$res == res$desired_avg
#> [1] TRUE TRUE   NA TRUE TRUE TRUE TRUE TRUE

#5

I am not sure this answers the PO's problem. Unless it is the beginning of the solution? I am a little confused by your suggestion (sorry if I am missing something) :stuck_out_tongue:

I think the PO gave his wished result tibble with 2 "desired" columns. So as I understood it, the problem was to start without these 2 columns and to create them. The rules they set were to derive cumulated means resetting at each NA and using a maximum of max_lookback_mo values. That is what my answer does, but it does it using 2 loops and I am sure that there must be a more elegant solution using functional programming.


#6

To be honest, I didn't get what is the purpose of the months column, so I forgot to include it in the output. Adding number of months is not so different from my first attempt:

library(tidyverse)
max_lookback_mo <- 3  # This will be 12 in real use, using 3 for simpler example.

monthly_data <- tribble(~month, ~total, ~desired_mos_incl, ~desired_avg, 1, 
  100, 1, 100, 2, 200, 2, 150, 3, NA, NA, NA, 4, 200, 1, 200, 5, 300, 2, 250, 
  6, 400, 3, 300, 7, 500, 3, 400, 8, 420, 3, 440)

filter_nas <- function(vec) {
  last_na <- max(which(is.na(vec)))
  res <- vec[seq_along(vec) > last_na]
  res
}

calc_positions <- function(x, n_months = 3) {
  vec <- (2 - n_months):(length(x))
  
  purrr::map(seq_along(x), function(pos) {
    vec[pos >= vec & (pos - n_months) < vec & vec > 0]
  })
}

add_desireds <- function(x, n_months = max_lookback_mo) {
  total_vec <- x[["total"]]
  positions <- calc_positions(total_vec, n_months = n_months)
  
  vecs <- purrr::map(positions, function(position) {
    total_vec[position] %>% filter_nas
  })
  
  x[["avg"]] <- purrr::map_dbl(vecs, mean)
  x[["months"]] <- purrr::map_dbl(vecs, length)
  x
}

res <- monthly_data %>% add_desireds
res
#> # A tibble: 8 x 6
#>   month total desired_mos_incl desired_avg   avg months
#>   <dbl> <dbl>            <dbl>       <dbl> <dbl>  <dbl>
#> 1     1   100                1         100   100      1
#> 2     2   200                2         150   150      2
#> 3     3    NA               NA          NA   NaN      0
#> 4     4   200                1         200   200      1
#> 5     5   300                2         250   250      2
#> 6     6   400                3         300   300      3
#> 7     7   500                3         400   400      3
#> 8     8   420                3         440   440      3

NaN and 0 in two new columns look a little bit more consistent than NAs, but it is a preference, so if you think you absolutely need NA's it is easy to substitute them with, e.g., ifelse.


#7

Sorry. Now I get it :slight_smile: I was confused by the fact that your input data already had the desired result columns (but I guess it would have been easier if the OP had included a reprex + a desired result table). Now I understand what you were doing with res$res == res$desired_avg. Since you were comparing results within a df rather than 2 df, I had missed your point.

So I guess, there was the more elegant functional programming solution :slight_smile:


#8

Thanks @mishabalyasin and @prosoitos, these are great!

I'm using this to look at trends in monthly attendance across a variety of local museums. A few have moved or expanded, so I wanted the rolling average to reset to capture the new trend. I also wanted to track the number of valid months that are included in the rolling average so that I can visually de-emphasize when the average is based on fewer data points.