recipes::step_lowerimpute() does not work for imputation

DavidJesse · February 21, 2021, 5:22pm

Hi everyone

I am currently working on a project in which I'd like to use the tidymodels framework and more precisely the {recipes} package for my data preprocessing.
I need to cover various imputation steps. Among other things, I would like to use step_lowerimpute() to impute missing values for a feature with the minimum value in the training data.
However, I'm running into trouble here. While the training of the recipe including this step does work (i.e. the minimum value in the training data is being stored), applying the recipe to any data does not have any effect. Missing values won't be imputed by the recipe.
I don't know if I'm just making a mistake somewhere, but I could not figure anything out, not even with the following very simple reprex:

library(recipes)
library(dplyr)

# data
set.seed(123)
df <- tibble(
  a = letters[1:10],
  b = rnorm(10),
  c = c(rep(1, 3), rep(2, 2), rep(NA, 2), rep(10, 3))
) 

# recipe
rec <- recipe(~ ., data = df)

rec_imp <- rec %>%
  step_lowerimpute(c)

# trained recipe
rec_imp_trained <- rec_imp %>%
  prep()

# you can see that the training has worked
tidy(rec_imp_trained, number = 1)

# but it is not applied to the data
rec_imp_trained %>%
  juice()

# also does not work with bake() and new data
set.seed(123)
new_df <- tibble(
  a = sample(letters, 3),
  b = rnorm(3),
  c = c(sample(1:10, 2), NA)
)

rec_imp_trained %>%
  bake(new_df)

I'd be very glad to hear some feedback or get some help

Cheers
David

mattwarkentin · February 22, 2021, 4:32pm

Hi @DavidJesse,

I think step_lowerimpute() does not do what you think it does (or even what I thought it did, and I have contributed to this function). Check out the description:

step_impute_lower creates a specification of a recipe step designed for cases where the non-negative numeric data cannot be measured below a known value. In these cases, one method for imputing the data is to substitute the truncated value by a random uniform number between zero and the truncation point.

It is not meant to impute missing data with the minimum value for that variable. Instead, it replaces any value less than or equal to the lower truncation boundary with a draw from a uniform distribution between zero and that boundary. Missing values are left as they are.
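To make that concrete, here is a rough base R sketch of the idea as I read the description. It is not the recipes source code: impute_lower_sketch() is a made-up helper, and using the observed minimum as the truncation point is my assumption for this illustration.

```r
# Hypothetical helper illustrating the idea only; not the recipes implementation.
# Assumption: the truncation point defaults to the observed minimum of x.
impute_lower_sketch <- function(x, threshold = min(x, na.rm = TRUE)) {
  # Only non-missing values at or below the threshold are touched
  at_or_below <- !is.na(x) & x <= threshold
  # Replace them with uniform draws between zero and the threshold
  x[at_or_below] <- runif(sum(at_or_below), min = 0, max = threshold)
  x
}

set.seed(123)
x <- c(rep(1, 3), rep(2, 2), rep(NA, 2), rep(10, 3))
imputed <- impute_lower_sketch(x)
print(imputed)
```

The three 1s come back as random values between 0 and 1, the NAs stay NA, and everything above the threshold is unchanged.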

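With that in mind, your code actually works as designed: the three values of c that equal the minimum (1) are replaced with random draws, and the NAs stay NA...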

library(recipes)
library(dplyr)

# data
set.seed(123)
df <- tibble(
  a = letters[1:10],
  b = rnorm(10),
  c = c(rep(1, 3), rep(2, 2), rep(NA, 2), rep(10, 3))
) 

# recipe
rec <- recipe(~ ., data = df)

rec_imp <- 
  rec %>%
  step_lowerimpute(c)

# trained recipe
rec_imp_trained <- 
  rec_imp %>%
  prep()

juice(rec_imp_trained)
#> # A tibble: 10 x 3
#>    a           b      c
#>    <fct>   <dbl>  <dbl>
#>  1 a     -0.560   0.709
#>  2 b     -0.230   0.544
#>  3 c      1.56    0.594
#>  4 d      0.0705  2    
#>  5 e      0.129   2    
#>  6 f      1.72   NA    
#>  7 g      0.461  NA    
#>  8 h     -1.27   10    
#>  9 i     -0.687  10    
#> 10 j     -0.446  10

mattwarkentin · February 22, 2021, 4:37pm

As a quick follow-up, this probably does what you want...

library(recipes)
library(tidyr)
library(dplyr)

set.seed(123)
df <- tibble(
  a = letters[1:10],
  b = rnorm(10),
  c = c(rep(1, 3), rep(2, 2), rep(NA, 2), rep(10, 3))
) 

rec <- recipe(~ ., data = df)

rec_imp <- 
  rec %>%
  step_mutate(c = replace_na(c, min(c, na.rm = TRUE)))

rec_imp_trained <- 
  rec_imp %>%
  prep()

juice(rec_imp_trained)
#> # A tibble: 10 x 3
#>    a           b     c
#>    <fct>   <dbl> <dbl>
#>  1 a     -0.560      1
#>  2 b     -0.230      1
#>  3 c      1.56       1
#>  4 d      0.0705     2
#>  5 e      0.129      2
#>  6 f      1.72       1
#>  7 g      0.461      1
#>  8 h     -1.27      10
#>  9 i     -0.687     10
#> 10 j     -0.446     10

DavidJesse · February 22, 2021, 8:47pm

Hi @mattwarkentin,

thank you very much for your reply!
So I misunderstood what this function actually does.
Just one question regarding your suggested alternative:
When I add your step_mutate() step with replace_na(), does it learn the minimum value from the training data and use it for new data, too?

Cheers
David

mattwarkentin · February 22, 2021, 9:06pm

Hi @DavidJesse,

Good question. I had actually never used step_mutate() before, and the way I set it up above, it computes the minimum from whatever data the recipe is being applied to, which isn't ideal. You can get around that by replacing min(c, na.rm = TRUE) with !!min(df$c, na.rm = TRUE), so that the minimum is computed once from df when the step is defined and not from whatever c variable is found in the data mask.

library(recipes)
library(tidyr)
library(dplyr)

set.seed(123)
df <- tibble(
  a = letters[1:10],
  b = rnorm(10),
  c = c(rep(1, 3), rep(2, 2), rep(NA, 2), rep(10, 3))
) 

rec <- recipe(~ ., data = df)

rec_imp <- 
  rec %>%
  step_mutate(c = replace_na(c, !!min(df$c, na.rm = TRUE)))

rec_imp_trained <- 
  rec_imp %>%
  prep()

juice(rec_imp_trained)
#> # A tibble: 10 x 3
#>    a           b     c
#>    <fct>   <dbl> <dbl>
#>  1 a     -0.560      1
#>  2 b     -0.230      1
#>  3 c      1.56       1
#>  4 d      0.0705     2
#>  5 e      0.129      2
#>  6 f      1.72       1
#>  7 g      0.461      1
#>  8 h     -1.27      10
#>  9 i     -0.687     10
#> 10 j     -0.446     10

df2 <- tibble(
  a = letters[1:10],
  b = rnorm(10),
  c = c(rep(5, 3), rep(2, 2), rep(NA, 2), rep(10, 3))
) 
bake(rec_imp_trained, new_data = df2)
#> # A tibble: 10 x 3
#>    a           b     c
#>    <fct>   <dbl> <dbl>
#>  1 a      0.549      5
#>  2 b      0.238      5
#>  3 c     -1.05       5
#>  4 d      1.29       2
#>  5 e      0.826      2
#>  6 f     -0.0557     1
#>  7 g     -0.784      1
#>  8 h     -0.734     10
#>  9 i     -0.216     10
#> 10 j     -0.335     10
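If it helps to see why the !! version sticks to the training minimum, here is a small sketch using rlang::expr(), which uses the same tidy evaluation mechanism to capture an expression. The name train_c is just a stand-in for df$c that I made up for the illustration, and I am assuming step_mutate() captures its arguments in a comparable way.

```r
library(rlang)

# Stand-in for the training column df$c (hypothetical name)
train_c <- c(rep(1, 3), rep(2, 2), rep(NA, 2), rep(10, 3))

# Without !!: the call to min() is stored as-is and evaluated later,
# against whatever `c` exists in the data being processed
unfrozen <- expr(replace_na(c, min(c, na.rm = TRUE)))
print(unfrozen)

# With !!: min() is evaluated right now and only its result is stored
frozen <- expr(replace_na(c, !!min(train_c, na.rm = TRUE)))
print(frozen)
# prints: replace_na(c, 1)
```

The second expression carries the constant 1 rather than a call to min(), which is why the NAs in df2 above were filled with 1 and not with 2.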

mattwarkentin · February 22, 2021, 9:07pm

@Max Should there be a step_impute_minimum() and step_impute_maximum()? Especially since there is already imputation via mean, median, and mode.

What is your recommended approach for this type of imputation using the recipes API?
