In April, I posted this question about the code below essentially asking, "Why does this work?" My objective was to impute means for a set of variables based on values in another data frame. It was working perfectly: for `NA` values in `test`, it grabbed the mean of that column in `train` and replaced `NA` with that value.
However, now it's only using the mean of the last column provided to the `mutate_at` function. Previously, the two `NA` values in the `income` column of `test` were replaced with 7.17 (the mean of `train$income`). Now they're replaced with 3.83 (the mean of `train$home_value`). The two `NA` values in the `home_value` column of `test` are also being replaced by 3.83 (as they were before).
- What change was made to a package that caused this code to behave differently?
- Is there another clean and neat solution that will accomplish my objective?
Thanks.

    library(dplyr)  # provides tibble(), %>%, mutate_at(), pull(), if_else()

    train <- tibble(income = c(10, 8, 7, 9, 4, 5),
                    children = c(3, 5, 2, 7, 9, 10),
                    home_value = c(4, 5, 8, 2, 4, 0))

    test <- tibble(income = c(4, 5, 2, NA, 8, NA),
                   children = c(3, 5, 10, 2, 4, NA),
                   home_value = c(3, NA, NA, 4, 1, 5))

    # Columns whose NA values should be filled with the matching train mean
    mvars <- c("income", "home_value")

    test_imp <- test %>%
      mutate_at(mvars,
                list( ~ {
                  imp_mean <- train %>% pull(.) %>% mean(na.rm = TRUE)
                  if_else(is.na(.), imp_mean, .)
                }))
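
One possible way to reach the stated objective is sketched below. It assumes dplyr 1.0.0 or later is installed, since it relies on `across()` and `cur_column()`. The idea is to compute the `train` means once and then look each one up by column name, so `.` is never used inside a nested pipe. That may matter here: in `train %>% pull(.)`, magrittr binds `.` to the left-hand side of the pipe (`train`), and `pull()` with no column given returns the last column. This is consistent with the 3.83 result but is not a confirmed diagnosis of which package changed. The name `train_means` is introduced only for illustration.

```r
library(dplyr)

train <- tibble(income = c(10, 8, 7, 9, 4, 5),
                children = c(3, 5, 2, 7, 9, 10),
                home_value = c(4, 5, 8, 2, 4, 0))

test <- tibble(income = c(4, 5, 2, NA, 8, NA),
               children = c(3, 5, 10, 2, 4, NA),
               home_value = c(3, NA, NA, 4, 1, 5))

mvars <- c("income", "home_value")

# Named numeric vector of train means, one entry per column in mvars
# (train_means is a hypothetical helper name, not from the original post)
train_means <- colMeans(train[mvars], na.rm = TRUE)

# For each column in mvars, replace NA with the train mean of the same name;
# cur_column() returns the name of the column currently being processed
test_imp <- test %>%
  mutate(across(all_of(mvars),
                ~ coalesce(.x, train_means[[cur_column()]])))

# Base R equivalent with no tidyverse dependency, for comparison
test_imp_base <- test
for (v in mvars) {
  missing <- is.na(test_imp_base[[v]])
  test_imp_base[[v]][missing] <- mean(train[[v]], na.rm = TRUE)
}
```

Columns not listed in `mvars` (here `children`) are left untouched, so their `NA` values remain. The base R loop is included because it does not depend on how any package evaluates `.`, which makes it a useful cross-check if the dplyr result looks wrong again.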