Solution using mutate_at no longer working

wikge · September 13, 2019, 2:18pm

In April, I posted this question about the code below, essentially asking, "Why does this work?" My objective was to impute means for a set of variables based on values in another data frame. It was working perfectly: for NA values in test, it grabbed the mean of that column in train and replaced NA with that value.
However, now it's only using the mean of the last column provided to the mutate_at function. So, previously the two NA values in the income column of test were replaced with 7.17 (the mean of train$income). Now, they're replaced with 3.83 (the mean of train$home_value). The two NA values in the home_value column of test are also being replaced by 3.83 (as they were before).

What change was made to a package that caused this code to behave differently?
Is there another clean and neat solution that will accomplish my objective?

Thanks.

library(dplyr)

train <- tibble(income = c(10, 8, 7, 9, 4, 5),
                children = c(3, 5, 2, 7, 9, 10),
                home_value = c(4, 5, 8, 2, 4, 0))
test <- tibble(income = c(4, 5, 2, NA, 8, NA),
               children = c(3, 5, 10, 2, 4, NA),
               home_value = c(3, NA, NA, 4, 1, 5))

mvars <- c("income", "home_value")

test_imp <- test %>%
  mutate_at(mvars,
            list( ~ {
              imp_mean <- train %>% pull(.) %>% mean(na.rm = TRUE)
              if_else(is.na(.), imp_mean, .)
            }))

lionel · September 13, 2019, 3:11pm

First let's see where the 3.83 comes from:

train %>% pull()
#> [1] 4 5 8 2 4 0

# Equivalent to (per magrittr syntax):
train %>% pull(.)
#> [1] 4 5 8 2 4 0

Your code is now equivalent to:

train %>% pull(.) %>% mean()
#> [1] 3.833333

As you found out in that April thread, your code used to work because we were search-and-replacing all instances of . by the name of the variable being mapped. This was a hack. We now use an approach much more similar to the map() family in purrr.
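As a rough illustration of the map()-like behaviour (a sketch that assumes a dplyr version from after this change, and reuses the data from the question): inside the lambda, `.` is bound to the column's values rather than its name, so a mutate_at() call gives the same result as the corresponding purrr::map_dfc() call. The names centered_dplyr and centered_purrr are made up for this example.

```r
library(dplyr)
library(purrr)

train <- tibble(income = c(10, 8, 7, 9, 4, 5),
                children = c(3, 5, 2, 7, 9, 10),
                home_value = c(4, 5, 8, 2, 4, 0))
mvars <- c("income", "home_value")

# Inside the lambda, `.` is the column vector itself, not its name.
centered_dplyr <- train %>% mutate_at(mvars, ~ . - mean(.))

# The same computation with purrr: `.x` is each column in turn.
centered_purrr <- map_dfc(train[mvars], ~ .x - mean(.x))

# Compare the two results column by column.
all.equal(as.data.frame(centered_dplyr[mvars]),
          as.data.frame(centered_purrr))
```

Because the lambda only ever sees the values, there is no column name left for pull() to pick up, which is why the original code falls back to the last column of train.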

It seems that you want to see both the mapped value and its name. In purrr terms, you want imap() instead of map(). Unfortunately we don't provide anything like this in mutate_at() at the moment.

One alternative might be:

library(purrr)

test[mvars] <- map2_dfc(test[mvars], mvars, ~ {
  # .x is the column from test, .y is that column's name
  imp_mean <- train %>% pull(.y) %>% mean(na.rm = TRUE)
  if_else(is.na(.x), imp_mean, .x)
})

test
#> # A tibble: 6 x 3
#>   income children home_value
#>    <dbl>    <dbl>      <dbl>
#> 1   4           3       3
#> 2   5           5       3.83
#> 3   2          10       3.83
#> 4   7.17        2       4
#> 5   8           4       1
#> 6   7.17       NA       5
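
A possible variant, sketched here on the assumption that purrr's imap_dfc() is available in your installed version: the imap() family passes each element together with its name, so the names do not need to be supplied separately. This version writes to a new object, test_imp (as in the question), and looks the training column up with train[[.y]] rather than pull().

```r
library(dplyr)
library(purrr)

train <- tibble(income = c(10, 8, 7, 9, 4, 5),
                children = c(3, 5, 2, 7, 9, 10),
                home_value = c(4, 5, 8, 2, 4, 0))
test <- tibble(income = c(4, 5, 2, NA, 8, NA),
               children = c(3, 5, 10, 2, 4, NA),
               home_value = c(3, NA, NA, 4, 1, 5))
mvars <- c("income", "home_value")

test_imp <- test
test_imp[mvars] <- imap_dfc(test[mvars], ~ {
  # .x is the column's values, .y is the column's name
  imp_mean <- mean(train[[.y]], na.rm = TRUE)
  if_else(is.na(.x), imp_mean, .x)
})

test_imp
```

Columns not listed in mvars (here, children) are left untouched, so its NA stays in place.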
