How are the steps in a recipe applied in tidymodels?

Hi,
When a recipe that includes, for example, step_center(), step_scale() and step_pca() is trained on a training set and then applied to a test set, are the predictors in the test set centered and scaled using the mean and standard deviation from the training set, or is everything computed from the test set alone?

And if it does not use the parameters from the training set, how can I make it use them?

Thanks in advance,
John


Hi @johnG,

When you use step_center() or step_scale() in a recipe (these are just two examples), the values needed to center (the mean) and scale (the SD) are estimated from the data you pass to prep(), which should be your training data. When the prepared recipe is then applied to your testing data with bake(), it centers and scales using the mean and SD estimated from the training data only (i.e. the test data is never used to estimate pre-processing values). The same holds for step_pca(): the loadings are estimated in prep() and reused in bake().


Further evidence...

library(tidymodels)

split <- initial_split(mtcars)
train <- training(split)
test <- testing(split)

# recipe() only records variable names and roles here; nothing is estimated until prep()
rec <- recipe(mpg ~ hp, data = mtcars)

rec <-
  rec %>% 
  step_center(all_predictors()) %>% 
  step_scale(all_predictors())

# Estimate the centering and scaling values from the training set only
prepd_rec <- prep(rec, training = train)

prepd_rec$steps[[1]]$means
#>       hp 
#> 147.2083
prepd_rec$steps[[2]]$sds
#>       hp 
#> 62.48546

mean(train$hp)
#> [1] 147.2083
sd(train$hp)
#> [1] 62.48546

# Uses mean and SD from train
bake(prepd_rec, new_data = test)
#> # A tibble: 8 x 2
#>        hp   mpg
#>     <dbl> <dbl>
#> 1 -0.868   22.8
#> 2  0.445   18.7
#> 3  0.525   17.3
#> 4 -1.32    33.9
#> 5 -0.804   21.5
#> 6  0.0447  15.2
#> 7 -1.30    27.3
#> 8  3.01    15

# Same transformation done by hand with the training mean and SD
test %>% 
  mutate(
    hp = (hp - mean(train$hp)) / sd(train$hp)
  ) %>% 
  as_tibble() %>% 
  select(hp, mpg)
#> # A tibble: 8 x 2
#>        hp   mpg
#>     <dbl> <dbl>
#> 1 -0.868   22.8
#> 2  0.445   18.7
#> 3  0.525   17.3
#> 4 -1.32    33.9
#> 5 -0.804   21.5
#> 6  0.0447  15.2
#> 7 -1.30    27.3
#> 8  3.01    15

Created on 2020-09-01 by the reprex package (v0.3.0)
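
The original question also mentioned step_pca(), which the reprex above does not cover. Below is a minimal sketch of the same check for PCA. It makes a few assumptions that are not in the thread: a fixed row split instead of initial_split() (so the result is reproducible), three arbitrarily chosen predictors, and that the fitted prcomp object is stored in the `res` element of the prepared step, which is an internal detail that may change between recipes versions. If those hold, the baked scores should match a manual projection of the test set using the training means, SDs and loadings.

```r
library(tidymodels)

# Fixed split for illustration only (the reprex above used initial_split())
train <- mtcars[1:24, ]
test  <- mtcars[25:32, ]

pca_rec <-
  recipe(mpg ~ hp + disp + wt, data = train) %>%
  step_center(all_predictors()) %>%
  step_scale(all_predictors()) %>%
  step_pca(all_predictors(), num_comp = 2)

# Means, SDs and PCA loadings are all estimated here, from train only
pca_prepd <- prep(pca_rec, training = train)

# Apply the stored estimates to the test set
baked <- bake(pca_prepd, new_data = test)

# Manual version: scale test with the *training* mean and SD ...
preds       <- c("hp", "disp", "wt")
train_means <- colMeans(train[, preds])
train_sds   <- apply(train[, preds], 2, sd)
test_scaled <- scale(as.matrix(test[, preds]),
                     center = train_means, scale = train_sds)

# ... then project onto the loadings stored in the prepared step
# (assumes the prcomp fit lives in `res`, see the note above)
loadings <- pca_prepd$steps[[3]]$res$rotation[, 1:2]
manual   <- test_scaled[, rownames(loadings)] %*% loadings

# Compare the two sets of scores, ignoring row and column names
all.equal(
  unname(as.matrix(baked[, c("PC1", "PC2")])),
  unname(manual)
)
```

If the comparison reports a difference, the first thing to check is whether the internal structure of the prepared step differs in your installed version; tidy(pca_prepd, number = 3) is the supported way to inspect the loadings.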
