step_percentile: new data outside the range of the training data

Hi, I'm wondering what the suggested course of action would be when a recipe that uses step_percentile encounters a new data value outside the range on which it was prepped.

Example:

library(dplyr)
library(recipes)

train_df <- tibble(
    a = 1:10,
    b = 10:1
)

rec <-
    train_df %>%
    recipe(a ~ b) %>%
    step_percentile(
        b,
        options = list(
            probs = seq(0, 1, by = 1/4)
        )
    ) %>%
    prep()

new_df <- tibble(a = c(1, 4, 5), b = c(0.99, 5, 10.01))

bake(rec, new_data = new_df)
#> # A tibble: 3 x 2
#>        b     a
#>    <dbl> <dbl>
#> 1 NA         1
#> 2  0.444     4
#> 3 NA         5

I understand why it returns NA, but I could see it being desirable to have values outside the range of the training data set to the highest/lowest quantile value. Since that isn't an option, would it simply be best to create a recipe step that caps the data to a pre-determined range?

Thanks!
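
For illustration, the capping idea from the question can be sketched in base R. This is only a sketch under stated assumptions: the names `b_range` and `capped_b` are made up here, and the clamping is done by hand on the new data before `bake()` rather than inside a recipe step.

```r
# Training values of the predictor (same as train_df$b in the example above)
train_b <- 10:1
b_range <- range(train_b)  # minimum and maximum seen during training

# New values, two of which fall just outside the training range
new_b <- c(0.99, 5, 10.01)

# Clamp: raise anything below the minimum, lower anything above the maximum
capped_b <- pmin(pmax(new_b, b_range[1]), b_range[2])
capped_b
#> [1]  1  5 10
```

After capping, every value lies inside the range the recipe was prepped on, so the percentile lookup no longer has out-of-range inputs.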

This is a very good question! I opened a pull request, "outside argument for step_percentile" (Pull Request #1075 in tidymodels/recipes on GitHub), which gives you an option for how values outside the training range should be handled. It will hopefully be merged soon, and we are planning a CRAN release in the next couple of days.

library(dplyr)
library(recipes)

train_df <- tibble(
  a = 1:10,
  b = 10:1
)

new_df <- tibble(a = c(1, 4, 5), b = c(0.99, 5, 10.01))

# Defaults to `outside = "none"`
train_df %>%
  recipe(a ~ b) %>%
  step_percentile(b, outside = "none") %>%
  prep() %>%
  bake(new_data = new_df)
#> # A tibble: 3 × 2
#>        b     a
#>    <dbl> <dbl>
#> 1 NA         1
#> 2  0.444     4
#> 3 NA         5

train_df %>%
  recipe(a ~ b) %>%
  step_percentile(b, outside = "both") %>%
  prep() %>%
  bake(new_data = new_df)
#> # A tibble: 3 × 2
#>       b     a
#>   <dbl> <dbl>
#> 1 0         1
#> 2 0.444     4
#> 3 1         5

train_df %>%
  recipe(a ~ b) %>%
  step_percentile(b, outside = "lower") %>%
  prep() %>%
  bake(new_data = new_df)
#> # A tibble: 3 × 2
#>        b     a
#>    <dbl> <dbl>
#> 1  0         1
#> 2  0.444     4
#> 3 NA         5

train_df %>%
  recipe(a ~ b) %>%
  step_percentile(b, outside = "upper") %>%
  prep() %>%
  bake(new_data = new_df)
#> # A tibble: 3 × 2
#>        b     a
#>    <dbl> <dbl>
#> 1 NA         1
#> 2  0.444     4
#> 3  1         5

Created on 2023-01-05 with reprex v2.0.2
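
To make the four outputs above easier to reason about, here is a rough base R sketch of the same idea: interpolate new values against the training quantiles, and choose what happens beyond the extremes. It is an illustration only and is not claimed to be how recipes implements the step; the variable names are made up, and `stats::approx()` with its `rule` argument is used as a stand-in for the `outside` option.

```r
# Training values and the probabilities at which quantiles are taken
train_b <- 10:1
probs <- seq(0, 1, by = 0.25)

# Reference quantiles of the training data: 1, 3.25, 5.5, 7.75, 10
ref <- unname(stats::quantile(train_b, probs = probs))

new_b <- c(0.99, 5, 10.01)

# rule = 1: values outside the reference range become NA (like "none")
pct_none <- stats::approx(x = ref, y = probs, xout = new_b, rule = 1)$y

# rule = 2: values outside are set to the nearest extreme, 0 or 1 (like "both")
pct_both <- stats::approx(x = ref, y = probs, xout = new_b, rule = 2)$y

# rule = c(2, 1): clamp only on the low side, NA on the high side (like "lower")
pct_lower <- stats::approx(x = ref, y = probs, xout = new_b, rule = c(2, 1))$y

round(pct_none, 3)   # NA, 0.444, NA
round(pct_both, 3)   # 0, 0.444, 1
round(pct_lower, 3)  # 0, 0.444, NA
```

The middle value is the same in every case because 5 lies inside the training range; only the treatment of 0.99 and 10.01 changes.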


Excellent! Thank you.
