I am currently working on a project in which I'd like to use the tidymodels framework and more precisely the {recipes} package for my data preprocessing.
I need to cover various imputation steps. Among other things, I would like to use step_lowerimpute() to impute missing values for a feature with the minimum value in the training data.
However, I'm running into trouble here. While the training of the recipe including this step does work (i.e. the minimum value in the training data is being stored), applying the recipe to any data does not have any effect. Missing values won't be imputed by the recipe.
I don't know whether I'm just making a mistake somewhere, but I could not figure anything out, not even with the following very simple reprex:
library(recipes)
library(dplyr)
# data
set.seed(123)
df <- tibble(
  a = letters[1:10],
  b = rnorm(10),
  c = c(rep(1, 3), rep(2, 2), rep(NA, 2), rep(10, 3))
)
# recipe
rec <- recipe(~ ., data = df)
rec_imp <- rec %>%
  step_lowerimpute(c)
# trained recipe
rec_imp_trained <- rec_imp %>%
  prep()
# you can see that the training has worked
tidy(rec_imp_trained, number = 1)
# but it is not applied to the data
rec_imp_trained %>%
  juice()
# also does not work with bake() and new data
set.seed(123)
new_df <- tibble(
  a = sample(letters, 3),
  b = rnorm(3),
  c = c(sample(1:10, 2), NA)
)
rec_imp_trained %>%
  bake(new_df)
I'd be very glad to hear some feedback or get some help.
I think step_lowerimpute() does not do what you think it does (or even what I thought it did, and I have contributed to this function). Check out the description:
step_impute_lower creates a specification of a recipe step designed for cases where the non-negative numeric data cannot be measured below a known value. In these cases, one method for imputing the data is to substitute the truncated value by a random uniform number between zero and the truncation point.
It is not meant to impute missing data with the minimum value for that variable, but rather to impute any number less than or equal to the lower truncation boundary with a sample from a random uniform distribution.
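With that in mind, your code actually works: the three values of c equal to the training minimum (1) are replaced by random draws between 0 and 1, while the NAs are left alone.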
library(recipes)
library(dplyr)
# data
set.seed(123)
df <- tibble(
  a = letters[1:10],
  b = rnorm(10),
  c = c(rep(1, 3), rep(2, 2), rep(NA, 2), rep(10, 3))
)
# recipe
rec <- recipe(~ ., data = df)
rec_imp <-
  rec %>%
  step_lowerimpute(c)
# trained recipe
rec_imp_trained <-
  rec_imp %>%
  prep()
juice(rec_imp_trained)
#> # A tibble: 10 x 3
#>    a           b      c
#>    <fct>   <dbl>  <dbl>
#>  1 a     -0.560   0.709
#>  2 b     -0.230   0.544
#>  3 c      1.56    0.594
#>  4 d      0.0705  2
#>  5 e      0.129   2
#>  6 f      1.72   NA
#>  7 g      0.461  NA
#>  8 h     -1.27   10
#>  9 i     -0.687  10
#> 10 j     -0.446  10
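To make the behaviour concrete, here is a rough base-R sketch of the idea from the description above. It is not the actual recipes implementation; impute_lower_sketch() is a hypothetical helper I made up for illustration, and I am assuming the truncation point is simply the minimum of the training data.

```r
# Hypothetical helper, NOT part of recipes: mimics the idea behind the step.
impute_lower_sketch <- function(x, threshold) {
  # Positions at or below the truncation point; NAs are explicitly excluded.
  at_floor <- !is.na(x) & x <= threshold
  # Replace those positions with uniform draws between 0 and the threshold.
  x[at_floor] <- runif(sum(at_floor), min = 0, max = threshold)
  x
}

set.seed(123)
c_train <- c(rep(1, 3), rep(2, 2), rep(NA, 2), rep(10, 3))
# Assumption: the truncation point is the minimum seen in the training data.
threshold <- min(c_train, na.rm = TRUE)
imputed <- impute_lower_sketch(c_train, threshold)
imputed
```

The first three values end up somewhere between 0 and 1, everything above the threshold is unchanged, and the two NAs stay NA, which matches the pattern in the output above.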
Thank you very much for your reply!
Then I misunderstood what this function actually does.
Just one question regarding your suggestion for an alternative approach to my idea:
When adding your step_mutate() function together with replace_na() does it learn the minimum value from the training data and use it for new data, too?
Good question. I had actually never used step_mutate(), and the way I had it set up before, it would compute the minimum from whatever data it was being applied to, which isn't ideal. You can get around that by replacing min(c) with !!min(df$c, na.rm = TRUE), so that the minimum is forced to come from df and not from whatever c variable is found in the data mask.
library(recipes)
library(tidyr)
library(dplyr)
set.seed(123)
df <- tibble(
  a = letters[1:10],
  b = rnorm(10),
  c = c(rep(1, 3), rep(2, 2), rep(NA, 2), rep(10, 3))
)
rec <- recipe(~ ., data = df)
rec_imp <-
  rec %>%
  step_mutate(c = replace_na(c, !!min(df$c, na.rm = TRUE)))
rec_imp_trained <-
  rec_imp %>%
  prep()
juice(rec_imp_trained)
#> # A tibble: 10 x 3
#>    a           b     c
#>    <fct>   <dbl> <dbl>
#>  1 a     -0.560      1
#>  2 b     -0.230      1
#>  3 c      1.56       1
#>  4 d      0.0705     2
#>  5 e      0.129      2
#>  6 f      1.72       1
#>  7 g      0.461      1
#>  8 h     -1.27      10
#>  9 i     -0.687     10
#> 10 j     -0.446     10
df2 <- tibble(
  a = letters[1:10],
  b = rnorm(10),
  c = c(rep(5, 3), rep(2, 2), rep(NA, 2), rep(10, 3))
)
bake(rec_imp_trained, new_data = df2)
#> # A tibble: 10 x 3
#>    a           b     c
#>    <fct>   <dbl> <dbl>
#>  1 a      0.549      5
#>  2 b      0.238      5
#>  3 c     -1.05       5
#>  4 d      1.29       2
#>  5 e      0.826      2
#>  6 f     -0.0557     1
#>  7 g     -0.784      1
#>  8 h     -0.734     10
#>  9 i     -0.216     10
#> 10 j     -0.335     10
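If it helps to see why the !! matters, here is a small sketch outside of recipes, using plain dplyr::mutate() with rlang::expr(). The tibbles train and new and the names lazy_expr and fixed_expr are made up for this illustration. As far as I understand it, the same principle applies inside step_mutate(), but treat this as an analogy rather than a description of its internals.

```r
library(dplyr)
library(tidyr)
library(rlang)

# Toy data, invented for this illustration.
train <- tibble(c = c(1, 2, NA, 10))
new   <- tibble(c = c(5, NA, 7))

# Without injection: min(c) is evaluated later, on whatever data is mutated.
lazy_expr <- expr(replace_na(c, min(c, na.rm = TRUE)))

# With injection: !! evaluates min(train$c) right now and stores the constant.
fixed_expr <- expr(replace_na(c, !!min(train$c, na.rm = TRUE)))

lazy_result  <- mutate(new, c = !!lazy_expr)$c   # NA filled with min of `new`
fixed_result <- mutate(new, c = !!fixed_expr)$c  # NA filled with min of `train`

lazy_result
fixed_result
```

Printing fixed_expr shows replace_na(c, 1): the training minimum is already baked into the expression, so it no longer depends on the data it is later applied to.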