I ran a small experiment to understand how step_knnimpute works. In the process, I found that some of my manually calculated imputations differ from the values step_knnimpute returns.
Below is a simple example.
Any guidance that helps me understand the difference would be much appreciated.
library(tidyverse)
library(tidymodels)
#> -- Attaching packages ---------------------------------------------------------------- tidymodels 0.1.1 --
#> v broom 0.7.0 v recipes 0.1.13
#> v dials 0.0.8 v rsample 0.0.7
#> v infer 0.5.3 v tune 0.1.1
#> v modeldata 0.0.2 v workflows 0.1.2
#> v parsnip 0.1.2.9000 v yardstick 0.0.7
#> -- Conflicts ------------------------------------------------------------------- tidymodels_conflicts() --
#> x scales::discard() masks purrr::discard()
#> x dplyr::filter() masks stats::filter()
#> x recipes::fixed() masks stringr::fixed()
#> x dplyr::lag() masks stats::lag()
#> x yardstick::spec() masks readr::spec()
#> x recipes::step() masks stats::step()
library(gower)
library(reprex)
samp <- tribble(
  ~y, ~x1,  ~x2,  ~x3,
  1,  0.2,  0.15, 0.4,
  0,  0.35, 0.1,  0.39,
  0,  0.55, 0.24, 0.36,
  1,  0.17, NA,   0.22,
  0,  NA,   0.33, 0.12
)
tr <- samp[5, ]
te <- samp[1:4, ]
gdis <- gower_dist(tr, te)
names(gdis) <- c("dis_with_obs1", "dis_with_obs2", "dis_with_obs3", "dis_with_obs4")
gdis
#> dis_with_obs1 dis_with_obs2 dis_with_obs3 dis_with_obs4
#> 0.9275362 0.6547619 0.4161491 0.6785714
# if neighbors = 3, then the imputed value for x1 of observation 5 is
# the mean of x1 for observations 2 to 4
mean(samp$x1[2:4])
#> [1] 0.3566667
# if recipe is used, the result is different
rec <- recipe(y ~ x1 + x2 + x3, data = samp)
ratio_recipe <- rec %>%
  step_knnimpute(all_predictors(), neighbors = 3)
imputed <- prep(ratio_recipe) %>% juice()
# the imputed value is the average of x1 for observations 1, 3 and 4
imputed$x1[5]
#> [1] 0.3066667
Created on 2020-09-16 by the reprex package (v0.3.0)
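One possible explanation worth checking, sketched below: my manual gower_dist() call compares all four columns, including the outcome y, whereas the recipe may be measuring distance on the predictors only. The sketch assumes that step_knnimpute leaves out both y and the column being imputed (x1), so only x2 and x3 enter the distance. I have not confirmed this against the recipes source, and the names impute_with, donors and target are just illustrative.

```r
library(tibble)
library(gower)

# Same toy data as in the example above
samp <- tribble(
  ~y, ~x1,  ~x2,  ~x3,
  1,  0.2,  0.15, 0.4,
  0,  0.35, 0.1,  0.39,
  0,  0.55, 0.24, 0.36,
  1,  0.17, NA,   0.22,
  0,  NA,   0.33, 0.12
)

# Assumption: distances use only the other predictors (x2, x3);
# the outcome y and the column being imputed (x1) are left out
impute_with <- c("x2", "x3")
donors <- samp[1:4, impute_with]  # rows where x1 is observed
target <- samp[5, impute_with]    # row where x1 is missing

d <- gower_dist(target, donors)

# Indices of the three nearest donors, then the mean of their x1 values
nearest <- order(d)[1:3]
sort(nearest)
mean(samp$x1[nearest])
```

Under that assumption the three nearest donors are observations 1, 3 and 4, and the mean of their x1 values is about 0.3067, which matches what the recipe produced. That only shows the assumption is consistent with this one example, not that it is how the step is implemented.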