Replace NA with previous values' average

I would like to replace NA values with previous two months' average. For example, region A in Dec 2014 has missing v1 for males. How do I replace the NA with the average of region A's previous two values of Dec-Male v1?

|        v1|region |gender | year|month |
|---------:|:------|:------|----:|:-----|
|        NA|A      |Male   | 2014|Dec   |
| 12.055853|A      |Female | 2014|Dec   |
| 10.516935|B      |Male   | 2014|Dec   |
|  2.251625|B      |Female | 2014|Dec   |
| 11.230126|C      |Male   | 2014|Dec   |
|        NA|C      |Female | 2014|Dec   |
#----------------
# Data Example
#----------------
toy_data <- tibble(
  var_missing = rnorm(mean = 10, sd = 3, n = 360),
  region = rep(c("A", "B", "C"), each = 2, times = 60),
  gender = rep(c("Male","Female"), times = 180),
  year = rep(2010:2014, each = 72),
  month = rep(1:12, each = 6, times = 5)) %>% 
  mutate(month = month(month, label = TRUE))     
  
# replace with NA
toy_data[355, 1] <- NA
toy_data[360, 1] <- NA

Hi and welcome, @budugulo! Here's on approach using the relatively new, slider package. I've found it very useful to create moving averages and the like with a tidyverse-friendly approach.

You can use the .before and .after arguments of slide_dbl() to define the window you want to use as you iterate over the rows. Here, we say look back 2 rows, and don't include the current row. If you do this on a grouped data frame, it respects the groups, and will look back within the group -- in this case, region, gender, and month.

library(dplyr, warn.conflicts = FALSE)
library(slider, warn.conflicts = FALSE)
library(lubridate, warn.conflicts = FALSE)

toy_data <- tibble(
    var_missing = rnorm(mean = 10, sd = 3, n = 360),
    region = rep(c("A", "B", "C"), each = 2, times = 60),
    gender = rep(c("Male","Female"), times = 180),
    year = rep(2010:2014, each = 72),
    month = rep(1:12, each = 6, times = 5)
  ) %>%
  mutate(month = month(month, label = TRUE))
  
# replace with NA
toy_data[355, 1] <- NA
toy_data[360, 1] <- NA

toy_data %>% 
  group_by(region, gender, month) %>% 
  mutate(
    var_missing = 
      if_else(
        is.na(var_missing),
        slider::slide_dbl(var_missing, mean, .before = 2, .after = -1),
        var_missing
        )
    ) %>% 
  ungroup()
#> # A tibble: 360 x 5
#>    var_missing region gender  year month
#>          <dbl> <chr>  <chr>  <int> <ord>
#>  1        6.24 A      Male    2010 Jan  
#>  2        8.40 A      Female  2010 Jan  
#>  3        9.71 B      Male    2010 Jan  
#>  4        7.50 B      Female  2010 Jan  
#>  5        8.08 C      Male    2010 Jan  
#>  6        6.72 C      Female  2010 Jan  
#>  7       10.0  A      Male    2010 Feb  
#>  8        9.42 A      Female  2010 Feb  
#>  9       12.2  B      Male    2010 Feb  
#> 10       14.2  B      Female  2010 Feb  
#> # ... with 350 more rows

Created on 2020-06-23 by the reprex package (v0.3.0)

1 Like

Many thanks for introducing the slider package @mfherman! Your code does the job excellently. Thanks again.

If your question's been answered (even by you!), would you mind choosing a solution? It helps other people see which questions still need help, or find solutions if they have similar problems. Here’s how to do it:

1 Like

@mfherman I run into problem when the variable contains consecutive NAs; slider::slide_dbl only replaces the first NA. For example in the following example data, observation 3 gets 80, but observations 3 and 4 do not get replaced. Is there any way overcome this issue?

|month |region | var_missing|
|:-----|:------|-----------:|
|Dec   |A      |        59.5|
|Dec   |A      |       100.5|
|Dec   |A      |          NA|
|Dec   |A      |          NA|
|Dec   |A      |          NA|
|Jan   |B      |        45.0|
library(tidyverse)
library(slider)

# Example Data
toy_data <- tibble(
  month = rep(c("Dec", "Jan"), each = 5, times = 2),
  region = rep(c("A", "B"), each = 5, times = 2),
  var_missing = c(59.5, 100.5, NA, NA, NA, 45, 50, 70, 90, NA, 50:58, NA)
)

# Replace NA with moving average
toy_data %>% 
  group_by(region, month) %>% 
  mutate(
    var_missing = 
      if_else(
        is.na(var_missing),
        slider::slide_dbl(var_missing, mean, .before = 2, .after = -1),
        var_missing
      )
  ) %>% 
  ungroup()
#> # A tibble: 20 x 3
#>    month region var_missing
#>    <chr> <chr>        <dbl>
#>  1 Dec   A             59.5
#>  2 Dec   A            100. 
#>  3 Dec   A             80  
#>  4 Dec   A             NA  
#>  5 Dec   A             NA  
#>  6 Jan   B             45  
#>  7 Jan   B             50  
#>  8 Jan   B             70  
#>  9 Jan   B             90  
#> 10 Jan   B             80  
#> 11 Dec   A             50  
#> 12 Dec   A             51  
#> 13 Dec   A             52  
#> 14 Dec   A             53  
#> 15 Dec   A             54  
#> 16 Jan   B             55  
#> 17 Jan   B             56  
#> 18 Jan   B             57  
#> 19 Jan   B             58  
#> 20 Jan   B             57.5

Created on 2020-06-25 by the reprex package (v0.3.0)

You can set na.rm = TRUE, and then it will ignore the NA value, but when there are two NAs in a row, what would you expect the value to be?

library(tidyverse)
library(slider)

# Example Data
toy_data <- tibble(
  month = rep(c("Dec", "Jan"), each = 5, times = 2),
  region = rep(c("A", "B"), each = 5, times = 2),
  var_missing = c(59.5, 100.5, NA, NA, NA, 45, 50, 70, 90, NA, 50:58, NA)
)

# Replace NA with moving average
toy_data %>% 
  group_by(region, month) %>% 
  mutate(
    var_missing = 
      if_else(
        is.na(var_missing),
        slider::slide_dbl(var_missing, mean, na.rm = TRUE, .before = 2, .after = -1),
        var_missing
      )
  ) %>% 
  ungroup()
#> # A tibble: 20 x 3
#>    month region var_missing
#>    <chr> <chr>        <dbl>
#>  1 Dec   A             59.5
#>  2 Dec   A            100. 
#>  3 Dec   A             80  
#>  4 Dec   A            100. 
#>  5 Dec   A            NaN  
#>  6 Jan   B             45  
#>  7 Jan   B             50  
#>  8 Jan   B             70  
#>  9 Jan   B             90  
#> 10 Jan   B             80  
#> 11 Dec   A             50  
#> 12 Dec   A             51  
#> 13 Dec   A             52  
#> 14 Dec   A             53  
#> 15 Dec   A             54  
#> 16 Jan   B             55  
#> 17 Jan   B             56  
#> 18 Jan   B             57  
#> 19 Jan   B             58  
#> 20 Jan   B             57.5

Created on 2020-06-25 by the reprex package (v0.3.0)

Thank you for your feedback @mfherman. Isn't it possible to update the values on rolling basis i.e. calculation would be based on previous calculation whenever there is NA?

If it is not doable by slider::slide_dbl, could you suggest any other option to interpolate? The idea is NA value would be replaced by previous 2 values' average.

Ahh, I see. So you would need the prior row value to be updated to the average before it is used. I'm not sure if there is a way to do that using the mutate() + slide() pattern and it might require a loop or something similar. When I have some time tomorrow I'll look into other solutions.

1 Like

I hope theres a better solution, but i extended mfhermans approach with a further step that tries to fill the gaps. I'm not superconfident about it, but for the toy_data it appears to work

#assumes mfhermans toy_data and initial slider transform has occurred
fix <- function(df,...){
  nanindex <- which(is.nan(df$var_missing))
  if(length(nanindex)>0){
    allpopindex <- which(!is.nan(df$var_missing))
    popindex <- which(allpopindex<min(nanindex))
  lastpop <- max(popindex)
  df$var_missing[nanindex]<-df$var_missing[[lastpop]]
  }
  return(df)
}
df1 %>% group_by(region,month) %>% group_map(.f=fix,keep=TRUE) %>% bind_rows()

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.