How to specify a column to be unaffected in `recipes`?

jchou · February 4, 2019, 7:30pm

I am trying to learn how to use recipes to do an initial set of preprocessing steps, but I'm having a hard time figuring out how to define an 'id' (character) column which should NOT be processed or changed in any way.

Currently, I can prep a recipe using a training dataset, and the id column is changed from character to factor (undesired behavior, but not terrible). However, when I bake new datasets (like a validation or testing dataset), the ids are all converted to NA.

Is there a way to have id be completely unaffected by the recipe, so that it is not changed from character to factor and is not touched when new datasets are baked?

Sorry for not having a reproducible example, but here's the relevant code.

rec_obj <- recipe(x = df_train) %>%
  update_role(next_result, new_role = 'outcome') %>% # set the outcome variable
  update_role(id, new_role = "id variable") %>% # id is NOT a predictor, and should NOT be touched
  update_role(time_step, new_role = "timestep variable") %>% # time_step is NOT a predictor, and should NOT be touched
  update_role(-next_result, -id, -time_step, new_role = 'predictor') %>% # everything else is a predictor
  step_dummy(enchosp) %>% # this predictor is a factor and should be encoded with dummy variables
  step_center(pred1, pred2, pred3, pred4, pred5) %>% # center + scale the numeric predictors
  step_scale(pred1, pred2, pred3, pred4, pred5) %>%
  step_medianimpute(all_numeric()) # median impute missing numbers

rec_trained <- prep(rec_obj, training = df_train)

train_data    <- bake(rec_trained, new_data = df_train)
validate_data <- bake(rec_trained, new_data = df_validate)
test_data     <- bake(rec_trained, new_data = df_test)

Incidentally, the reason I want the ids to remain after preprocessing is that I subsequently need to do some heavy processing on the datasets, generating padded and windowed time-series data from each id and its own time steps. Only AFTER that time-series processing has occurred will I remove the id, before feeding the data into an LSTM neural network for modeling / testing.

mara · February 8, 2019, 1:42pm

Hi @jchou,

I'm gonna go ahead and move this into the Machine Learning and Modeling category, since I think you're more likely to get the help you need in that category. recipes is part of tidymodels, but this category is geared more toward the tidyverse packages for interactive data analysis. Not a big deal at all, just wanted to explain my rationale!

jchou · February 8, 2019, 2:09pm

Thank you.

In the meantime, as an incredibly ugly hacky work-around which is just horrible beyond belief, I took advantage of the fact that I know how to get recipes to ignore a numeric column.

The workaround is to convert the string id column into a factor and then coerce the factor into an integer...

df_train <- df_train %>% mutate(id = as.integer(as.factor(id)))

Then, after the recipe bakes (and leaves the now-numeric id column unaltered), I convert the id column back into a string so I can perform my padding / windowing processing on the dataset for LSTM training.

train_data <- bake(rec_trained, new_data = df_train) %>%
    mutate(id = as.character(id))

This code hurts to look at but it allows my workflow to proceed.

Still, I'd prefer it if there were some way to convince recipes to just ignore the character column in the first place...
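
One caveat with this round trip, sketched here in base R under stated assumptions: calling as.character() on the integer codes yields strings like "1" and "2" rather than the original ids, and codes computed separately for the training and test sets need not agree. A minimal sketch that keeps one shared lookup might look like the following. The names `id_levels`, `encode_id`, and `decode_id` are hypothetical helpers for illustration, not part of recipes.

```r
# Hypothetical helpers: map character ids to integer codes and back
# using one shared lookup table.
train_ids <- c("a", "b", "c")
test_ids  <- c("f", "g", "a")

# Build the lookup from every id that will ever be baked,
# so the codes agree across datasets.
id_levels <- unique(c(train_ids, test_ids))

encode_id <- function(ids, levels) match(ids, levels)  # character -> integer code
decode_id <- function(codes, levels) levels[codes]     # integer code -> character

train_codes <- encode_id(train_ids, id_levels)
test_codes  <- encode_id(test_ids, id_levels)

# Decoding restores the original strings exactly.
stopifnot(identical(decode_id(train_codes, id_levels), train_ids))
stopifnot(identical(decode_id(test_codes, id_levels), test_ids))
```

With a lookup like this, the integer column can pass through bake() and be decoded afterwards without losing the original id strings.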

Max · February 10, 2019, 1:52pm

If id is integer, then step_medianimpute(all_numeric()) will choose it for imputation. I can't tell why the missing values are produced without a reproducible example.

Note that there is a step_mutate function.

jchou · February 10, 2019, 3:01pm

Thank you, yes. I hadn't mentioned that in my hacky fix I had switched to step_medianimpute(all_predictors()). (Also, the ID would never be missing, so there would never be anything to impute, but your point is well taken.)

Here's a full reprex() showing how the character ID column gets converted to a factor, and then to NA when new data is baked.

The end goal would be to have the id column completely untouched in the baked test_data, and left as a string (not converted to a factor).

library(dplyr, warn.conflicts = FALSE)
library(recipes, warn.conflicts = FALSE)


data <- tibble(
  id = letters[1:12],
  output = rnorm(12, mean = 0),
  pred1 = rnorm(12, mean = 10),
  pred2 = rnorm(12, mean = 20),
  pred3 = factor(rep(c('f1', 'f2', 'f3'), 4))
)

data$pred1[c(1,6)] <- NA
data$pred2[c(2,7)] <- NA
data
#> # A tibble: 12 x 5
#>    id     output pred1 pred2 pred3
#>    <chr>   <dbl> <dbl> <dbl> <fct>
#>  1 a     -0.551  NA     20.7 f1   
#>  2 b     -0.940  10.8   NA   f2   
#>  3 c     -2.05   10.1   20.7 f3   
#>  4 d     -1.12   10.5   19.4 f1   
#>  5 e      0.718  10.5   20.6 f2   
#>  6 f      0.451  NA     21.8 f3   
#>  7 g      0.847  10.1   NA   f1   
#>  8 h     -0.524  11.4   22.1 f2   
#>  9 i     -0.171   9.19  19.4 f3   
#> 10 j     -0.268  11.9   19.7 f1   
#> 11 k      1.58    9.97  19.9 f2   
#> 12 l      0.0731  9.50  20.5 f3

df_train <- data[1:5,]
df_test <- data[6:10,]

rec_obj <- recipe(x = df_train) %>%
  update_role(output, new_role = 'outcome') %>%
  update_role(id, new_role = "id variable") %>%
  update_role(-output, -id, new_role = 'predictor') %>%
  step_dummy(pred3) %>%
  step_center(pred1, pred2) %>%
  step_scale(pred1, pred2) %>%
  step_medianimpute(all_predictors())

rec_obj
#> Data Recipe
#> 
#> Inputs:
#> 
#>         role #variables
#>  id variable          1
#>      outcome          1
#>    predictor          3
#> 
#> Operations:
#> 
#> Dummy variables from pred3
#> Centering for pred1, pred2
#> Scaling for pred1, pred2
#> Median Imputation for all_predictors()

rec_trained <- prep(rec_obj, training = df_train)
train_data    <- bake(rec_trained, new_data = df_train)
test_data     <- bake(rec_trained, new_data = df_test)

df_train
#> # A tibble: 5 x 5
#>   id    output pred1 pred2 pred3
#>   <chr>  <dbl> <dbl> <dbl> <fct>
#> 1 a     -0.551  NA    20.7 f1   
#> 2 b     -0.940  10.8  NA   f2   
#> 3 c     -2.05   10.1  20.7 f3   
#> 4 d     -1.12   10.5  19.4 f1   
#> 5 e      0.718  10.5  20.6 f2
train_data
#> # A tibble: 5 x 6
#>   id    output    pred1  pred2 pred3_f2 pred3_f3
#>   <fct>  <dbl>    <dbl>  <dbl>    <dbl>    <dbl>
#> 1 a     -0.551 -0.00194  0.563        0        0
#> 2 b     -0.940  1.23     0.466        1        0
#> 3 c     -2.05  -1.22     0.556        0        1
#> 4 d     -1.12   0.00188 -1.49         0        0
#> 5 e      0.718 -0.00575  0.375        1        0

df_test
#> # A tibble: 5 x 5
#>   id    output pred1 pred2 pred3
#>   <chr>  <dbl> <dbl> <dbl> <fct>
#> 1 f      0.451 NA     21.8 f3   
#> 2 g      0.847 10.1   NA   f1   
#> 3 h     -0.524 11.4   22.1 f2   
#> 4 i     -0.171  9.19  19.4 f3   
#> 5 j     -0.268 11.9   19.7 f1
test_data
#> # A tibble: 5 x 6
#>   id    output    pred1  pred2 pred3_f2 pred3_f3
#>   <fct>  <dbl>    <dbl>  <dbl>    <dbl>    <dbl>
#> 1 <NA>   0.451 -0.00194  2.06         0        1
#> 2 <NA>   0.847 -1.32     0.466        0        0
#> 3 <NA>  -0.524  3.42     2.64         1        0
#> 4 <NA>  -0.171 -4.77    -1.50         0        1
#> 5 <NA>  -0.268  5.67    -0.998        0        0

Max · February 10, 2019, 9:17pm

That does appear to be a bug. In the meantime, you might be able to use an extra option for prep to make it work:

library(dplyr, warn.conflicts = FALSE)
library(recipes, warn.conflicts = FALSE)

data <- tibble(
  id = letters[1:12],
  output = rnorm(12, mean = 0),
  pred1 = rnorm(12, mean = 10),
  pred2 = rnorm(12, mean = 20),
  pred3 = factor(rep(c('f1', 'f2', 'f3'), 4))
)

data$pred1[c(1,6)] <- NA
data$pred2[c(2,7)] <- NA
df_train <- data[1:5,]
df_test <- data[6:10,]

rec_obj <- recipe(x = df_train) %>%
  update_role(output, new_role = 'outcome') %>%
  update_role(id, new_role = "id variable") %>%
  update_role(-output, -id, new_role = 'predictor') %>%
  step_dummy(pred3) %>%
  step_center(pred1, pred2) %>%
  step_scale(pred1, pred2) %>%
  step_medianimpute(all_predictors())

rec_trained <- prep(rec_obj, training = df_train, strings_as_factors = FALSE)
train_data    <- bake(rec_trained, new_data = df_train)
test_data     <- bake(rec_trained, new_data = df_test)
test_data
#> # A tibble: 5 x 6
#>   id     output  pred1   pred2 pred3_f2 pred3_f3
#>   <chr>   <dbl>  <dbl>   <dbl>    <dbl>    <dbl>
#> 1 f      0.976   0.159  1.16          0        1
#> 2 g     -1.18   -1.21   0.223         0        0
#> 3 h     -0.623   0.152  0.0847        1        0
#> 4 i      0.0742  0.811  0.266         0        1
#> 5 j      0.810  -1.71  -3.69          0        0

Created on 2019-02-10 by the reprex package (v0.2.1)
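
The NA ids in the earlier output are consistent with how R handles factor levels: when a character column is converted to a factor using only the levels seen in training, any value outside those levels becomes NA. The sketch below reproduces that behavior in base R only. It illustrates the likely mechanism and is not the recipes source code; the variable names are assumptions.

```r
# ids seen when the recipe was prepped vs. ids in the new data
train_id <- c("a", "b", "c", "d", "e")
test_id  <- c("f", "g", "h", "i", "j")

# Converting new values with the training levels:
# none of them match, so every element becomes NA.
as_factor_with_train_levels <- factor(test_id, levels = unique(train_id))
print(as_factor_with_train_levels)

# Leaving the column as character (the effect seen above with
# strings_as_factors = FALSE) keeps the original values.
kept_as_character <- test_id
print(kept_as_character)
```

This is why an id column, whose values are by design different in every dataset, is a poor fit for factor conversion based on training levels.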

jchou · February 11, 2019, 7:25pm

Thank you, that works great.

I'm a little embarrassed I didn't come across that fix. I'm not even sure it should be considered a 'bug', as it's well-documented at https://cran.r-project.org/web/packages/recipes/recipes.pdf.

Thanks again!
