Imputing time serie with last obs only if last obs is recent enough

dplyr
lubridate

#1

Hi,

I am trying to impute missing data in a time serie, but only if the last available observation is recent enough. Is there a good way to do that? (in the tidyverse ideally or in combination with zoo if that's not achievable)

An example dataset:

max_lag <- months(1)

df <- tribble(
  ~date, ~group, ~score,
  "2015-01-01", "A", 10,
  "2015-02-01", "A", NA,
  "2015-12-01", "A", NA,
  "2015-02-01", "A", 5,
  "2015-01-01", "B", 10,
  "2015-02-01", "B", NA,
  "2015-12-01", "B", NA
)

df$date <- as.Date(df$date)

The desired output would copy down the last available score value only for observations 2 and 6 (where the date of the missing obs is <= to the date of the last available score).

I've been struggling with this for 2 days. :stuck_out_tongue_closed_eyes: Any thoughts?


#2

Hi @Julien, @davis 's tibbletime may be able to help, but I'm not sure.


#3

Not sure tibbletime has a lot to help here. This is kind of a custom special case.

For this specific case you can do:

library(dplyr)
library(tibble)

df <- tribble(
  ~date, ~group, ~score,
  "2015-01-01", "A", 10,
  "2015-02-01", "A", NA,
  "2015-12-01", "A", NA,
  "2015-02-01", "A", 5,
  "2015-01-01", "B", 10,
  "2015-02-01", "B", NA,
  "2015-12-01", "B", NA
)

df$date <- as.Date(df$date)

df %>%
  group_by(group) %>%
  mutate(score_lag = lag(score),
         score_imputed = case_when(
           !is.na(score)     ~ score,
           !is.na(score_lag) ~ score_lag,
           TRUE              ~ NA_real_
         ))
#> # A tibble: 7 x 5
#> # Groups:   group [2]
#>   date       group score score_lag score_imputed
#>   <date>     <chr> <dbl>     <dbl>         <dbl>
#> 1 2015-01-01 A       10.       NA            10.
#> 2 2015-02-01 A       NA        10.           10.
#> 3 2015-12-01 A       NA        NA            NA 
#> 4 2015-02-01 A        5.       NA             5.
#> 5 2015-01-01 B       10.       NA            10.
#> 6 2015-02-01 B       NA        10.           10.
#> 7 2015-12-01 B       NA        NA            NA

If you need this to be more general then there might be more work to do, but this does what you request with this small example.


#4

Thanks Davis!
This technically works here but doesn't account for the time distance to the last available observation.
(Huge fan of tibbletime by the way!! :D)

I finally hacked together a solution below, but it is really not super elegant and I'm wondering if it can scale to a large dataset.

max_lag <- 31 # days

count_lag <- function(x) {
  i <- 0
  lagged_obs <- NA
  while (is.na(lagged_obs) == TRUE) {
    lagged_obs <- lag(x, i)[1]
    i <- i + 1
  }
  return(i)
}

df %>%
  group_by(group) %>%
  mutate(lag_count = as.integer((count_lag(score) - row_number() - 1) * -1),
         date_of_last_obs = lag(date, lag_count[1]),
         diff = date - date_of_last_obs,
         imputed_score = if_else(diff <= max_lag & is.na(score),
                                 lag(score, lag_count[1]),
                                 score))

Is there a more elegant solution?


#5

Another possible solution:

df %>%
  group_by(group) %>%
  mutate(imputed_score = pmax(score,
                              lag(score) * NA^(date - lag(date) > 31) * NA^(!is.na(score)),
                              na.rm = TRUE))

which gives;

# A tibble: 7 x 4
# Groups: group [2]
  date       group score imputed
  <date>     <chr> <dbl>   <dbl>
1 2015-01-01 A     10.0    10.0 
2 2015-02-01 A     NA      10.0 
3 2015-12-01 A     NA      NA   
4 2015-02-01 A      5.00    5.00
5 2015-01-01 B     10.0    10.0 
6 2015-02-01 B     NA      10.0 
7 2015-12-01 B     NA      NA

What this does:

  1. In this I take advantage of the fact that taking the power of NA gives either a 1 or a NA.
  2. NA^0 (or NA^FALSE) will return a 1, while all other (power)values will return NA.
  3. date - lag(date) > 31 returns a logical vector.
  4. Using the result of point 3 in the power of NA (NA^(date - lag(date) > 31)) gives only values where score has an NA-value and the date of the previous score is less or equal than 31 days earlier.
  5. Multiply this with lag(score) and NA^(!is.na(score)). The latter one is needed to prevent unnecessary replacements for when score < lag(score).
  6. Finally use pmax to get the values.

#6

This is a beautiful compact solution!

Can't wait to implement on the real dataset.

Many thanks Jaap :smile:


#7

I've updated the solution to take into account for when score < lag(score).