Manually calculated imputation is different from what step_knnimpute produces

nyk · September 16, 2020, 2:43am

I ran a small experiment to understand how step_knnimpute works. Along the way, I found that some of my manually calculated imputations differ from the values step_knnimpute produces.

Below is a simple example.

Any guidance to help me understand this would be much appreciated.

library(tidyverse)
library(tidymodels)
#> -- Attaching packages ---------------------------------------------------------------- tidymodels 0.1.1 --
#> v broom     0.7.0          v recipes   0.1.13    
#> v dials     0.0.8          v rsample   0.0.7     
#> v infer     0.5.3          v tune      0.1.1     
#> v modeldata 0.0.2          v workflows 0.1.2     
#> v parsnip   0.1.2.9000     v yardstick 0.0.7
#> -- Conflicts ------------------------------------------------------------------- tidymodels_conflicts() --
#> x scales::discard() masks purrr::discard()
#> x dplyr::filter()   masks stats::filter()
#> x recipes::fixed()  masks stringr::fixed()
#> x dplyr::lag()      masks stats::lag()
#> x yardstick::spec() masks readr::spec()
#> x recipes::step()   masks stats::step()
library(gower)
library(reprex)

samp <- tribble(
  ~y, ~x1, ~x2, ~x3,
  1, 0.2, 0.15, 0.4,
  0, 0.35, 0.1, 0.39,
  0, 0.55, 0.24, 0.36,
  1, 0.17, NA, 0.22,
  0, NA, 0.33, 0.12
)

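# Observation 5 (missing x1) is the row to impute; observations 1 to 4 are the candidate neighbors.
# Both data frames still contain the outcome y, so y contributes to the distances below.
tr <- samp[5, ]
te <- samp[1:4, ]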

gdis <- gower_dist(tr, te)
names(gdis) <- c("dis_with_obs1", "dis_with_obs2", "dis_with_obs3", "dis_with_obs4")

gdis
#> dis_with_obs1 dis_with_obs2 dis_with_obs3 dis_with_obs4 
#>     0.9275362     0.6547619     0.4161491     0.6785714

# With neighbors = 3, the three smallest distances above belong to observations 3, 2 and 4,
# so the manually imputed x1 for observation 5 is the mean of x1 over observations 2 to 4
mean(samp$x1[2:4])
#> [1] 0.3566667

# With a recipe, the result is different
rec <- recipe(y ~ x1 + x2 + x3, data = samp)

ratio_recipe <- rec %>%
  step_knnimpute(all_predictors(), neighbors = 3)

imputed <- prep(ratio_recipe) %>% juice()

# The imputed value equals the mean of x1 for observations 1, 3 and 4
imputed$x1[5]
#> [1] 0.3066667

Created on 2020-09-16 by the reprex package (v0.3.0)
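
One way to probe the discrepancy is sketched below. It rests on an assumption that nothing in this thread confirms: that the step measures distance using only the other predictors (x2 and x3), leaving out both the outcome y and the variable being imputed, whereas the manual calculation above passed every column, including y, to gower_dist. The names other_preds, target, candidates and nn are made up for this illustration. This is a guess that happens to reproduce the recipe's value on this toy data, not a description of the internals of step_knnimpute.

```r
library(gower)

# Same toy data as in the reprex, rebuilt as a base data.frame
# so that this sketch runs on its own.
samp <- data.frame(
  y  = c(1, 0, 0, 1, 0),
  x1 = c(0.2, 0.35, 0.55, 0.17, NA),
  x2 = c(0.15, 0.1, 0.24, NA, 0.33),
  x3 = c(0.4, 0.39, 0.36, 0.22, 0.12)
)

# ASSUMPTION (not confirmed in the thread): only the predictors other
# than the one being imputed enter the distance; y is left out.
other_preds <- c("x2", "x3")
target      <- samp[5, other_preds]    # row whose x1 is missing
candidates  <- samp[1:4, other_preds]  # rows with an observed x1

# Gower distance from observation 5 to observations 1 to 4.
d <- gower_dist(target, candidates)

# Indices of the three nearest candidates, closest first.
nn <- order(d)[1:3]
nn
#> [1] 4 3 1

# Mean of x1 over those neighbors.
mean(samp$x1[nn])
#> [1] 0.3066667
```

Under that assumption the nearest neighbors are observations 4, 3 and 1, and their mean x1 equals the 0.3066667 returned by the recipe. With y included, observation 1 is pushed away (its y differs from that of observation 5) and observation 2 is pulled closer, which gives the manual 0.3566667.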
