I am trying to impute missing data in a time serie, but only if the last available observation is recent enough. Is there a good way to do that? (in the tidyverse ideally or in combination with zoo if that's not achievable)
The desired output would copy down the last available score value only for observations 2 and 6 (where the date of the missing obs is <= to the date of the last available score).
I've been struggling with this for 2 days. Any thoughts?
Thanks Davis!
This technically works here but doesn't account for the time distance to the last available observation.
(Huge fan of tibbletime by the way!! :D)
I finally hacked together a solution below, but it is really not super elegant and I'm wondering if it can scale to a large dataset.
max_lag <- 31 # days
count_lag <- function(x) {
i <- 0
lagged_obs <- NA
while (is.na(lagged_obs) == TRUE) {
lagged_obs <- lag(x, i)[1]
i <- i + 1
}
return(i)
}
df %>%
group_by(group) %>%
mutate(lag_count = as.integer((count_lag(score) - row_number() - 1) * -1),
date_of_last_obs = lag(date, lag_count[1]),
diff = date - date_of_last_obs,
imputed_score = if_else(diff <= max_lag & is.na(score),
lag(score, lag_count[1]),
score))
# A tibble: 7 x 4
# Groups: group [2]
date group score imputed
<date> <chr> <dbl> <dbl>
1 2015-01-01 A 10.0 10.0
2 2015-02-01 A NA 10.0
3 2015-12-01 A NA NA
4 2015-02-01 A 5.00 5.00
5 2015-01-01 B 10.0 10.0
6 2015-02-01 B NA 10.0
7 2015-12-01 B NA NA
What this does:
In this I take advantage of the fact that taking the power of NA gives either a 1 or a NA.
NA^0 (or NA^FALSE) will return a 1, while all other (power)values will return NA.
date - lag(date) > 31 returns a logical vector.
Using the result of point 3 in the power of NA (NA^(date - lag(date) > 31)) gives only values where score has an NA-value and the date of the previous score is less or equal than 31 days earlier.
Multiply this with lag(score) and NA^(!is.na(score)). The latter one is needed to prevent unnecessary replacements for when score < lag(score).