In April, I posted this question about the code below, essentially asking, "Why does this work?" My objective was to impute means for a set of variables based on values in another data frame. It was working perfectly: for NA values in test, it grabbed the mean of the matching column in train and replaced NA with that value.
However, now it's only using the mean of the last column passed to mutate_at. Previously, the two NA values in the income column of test were replaced with 7.17 (the mean of train$income). Now they're replaced with 3.83 (the mean of train$home_value). The two NA values in the home_value column of test are still replaced with 3.83, as they were before.
- What change was made to a package that caused this code to behave differently?
- Is there another clean and neat solution that will accomplish my objective?
Thanks.
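library(dplyr)  # also re-exports tibble() and %>%

# Training data: the source of the imputation means
train <- tibble(income = c(10, 8, 7, 9, 4, 5),
                children = c(3, 5, 2, 7, 9, 10),
                home_value = c(4, 5, 8, 2, 4, 0))

# Test data: each NA should be replaced by the mean of the same column in train
test <- tibble(income = c(4, 5, 2, NA, 8, NA),
               children = c(3, 5, 10, 2, 4, NA),
               home_value = c(3, NA, NA, 4, 1, 5))

# Columns to impute
mvars <- c("income", "home_value")

test_imp <- test %>%
  mutate_at(mvars,
            list(~ {
              imp_mean <- train %>% pull(.) %>% mean(na.rm = TRUE)
              if_else(is.na(.), imp_mean, .)
            }))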
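
A minimal sketch of one possible alternative, assuming dplyr 1.0.0 or later (where `across()` and `cur_column()` are available). Instead of relying on how `.` resolves inside the nested pipe, it looks up each training mean by column name. The data is repeated only so the block runs on its own; this is an illustration under those assumptions, not necessarily the recommended fix.

```r
library(dplyr)  # assumes dplyr >= 1.0.0 for across() and cur_column()

train <- tibble(income = c(10, 8, 7, 9, 4, 5),
                children = c(3, 5, 2, 7, 9, 10),
                home_value = c(4, 5, 8, 2, 4, 0))
test <- tibble(income = c(4, 5, 2, NA, 8, NA),
               children = c(3, 5, 10, 2, 4, NA),
               home_value = c(3, NA, NA, 4, 1, 5))
mvars <- c("income", "home_value")

test_imp <- test %>%
  mutate(across(all_of(mvars), ~ {
    # cur_column() gives the name of the column currently being processed,
    # so the mean comes from the train column with the same name
    imp_mean <- mean(train[[cur_column()]], na.rm = TRUE)
    if_else(is.na(.x), imp_mean, .x)
  }))

# NA values in income become mean(train$income), about 7.17;
# NA values in home_value become mean(train$home_value), about 3.83;
# children is not in mvars, so its NA is left untouched.
test_imp
```

Because the lookup is by name, the result should not depend on the order in which the columns are listed in `mvars`.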