Using predict with tidymodels recipes, workflows, and fitted models (parsnip) to score new data

This is the closest "Learn" topic I could find. Generally, it is best practice to "save" the imputation models and the scaling/centering parameters estimated from the training data set (these affect the parameters of any modeling algorithm) and to apply those same imputation models and parameters to the testing data or to future data being scored. The alternative, which this example seems to indicate, is re-estimating them for each and every data set that flows through the workflow and predict. I was able to do this previously in caret. Maybe I am missing something in tidymodels.

The learn topic below seems to indicate differently.

I'm not sure I understand your question. As far as I know, a trained workflow remembers the values from the training data. New data will be centered with the values from the training set.

This would be good news, but I need to have it confirmed.

Can confirm. All imputation (and any other preprocessing) is based off of the original training set.

Great news, and thanks Max. The other clarification is how they are to be saved. Is it done with the final "predict" or otherwise? And is it done strictly through YAML or by other methods?

The workflow object contains the preprocessing object (e.g. a recipe) and that stores all of the information used to encode/format/preprocess new data.

In some cases, the model function itself might do some of this. When that is the case, the model object would contain the training set statistics.

For example:

library(tidymodels)
tidymodels_prefer()
theme_set(theme_bw())

rec <- 
  recipe(mpg ~ ., data = mtcars) %>% 
  step_normalize(all_numeric_predictors(), id = "norm")

model_fit <-
  workflow() %>% 
  add_recipe(rec) %>% 
  add_model(linear_reg()) %>% 
  fit(data = mtcars)

# Get the "fitted" recipe: 
model_fit %>% 
  extract_recipe()
#> Recipe
#> 
#> Inputs:
#> 
#>       role #variables
#>    outcome          1
#>  predictor         10
#> 
#> Training data contained 32 data points and no missing data.
#> 
#> Operations:
#> 
#> Centering and scaling for cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb [trained]

# Get the training set means and sds
model_fit %>% 
  extract_recipe() %>% 
  tidy(id = "norm")
#> # A tibble: 20 × 4
#>    terms statistic   value id   
#>    <chr> <chr>       <dbl> <chr>
#>  1 cyl   mean        6.19  norm 
#>  2 disp  mean      231.    norm 
#>  3 hp    mean      147.    norm 
#>  4 drat  mean        3.60  norm 
#>  5 wt    mean        3.22  norm 
#>  6 qsec  mean       17.8   norm 
#>  7 vs    mean        0.438 norm 
#>  8 am    mean        0.406 norm 
#>  9 gear  mean        3.69  norm 
#> 10 carb  mean        2.81  norm 
#> 11 cyl   sd          1.79  norm 
#> 12 disp  sd        124.    norm 
#> 13 hp    sd         68.6   norm 
#> 14 drat  sd          0.535 norm 
#> 15 wt    sd          0.978 norm 
#> 16 qsec  sd          1.79  norm 
#> 17 vs    sd          0.504 norm 
#> 18 am    sd          0.499 norm 
#> 19 gear  sd          0.738 norm 
#> 20 carb  sd          1.62  norm

Created on 2022-03-21 by the reprex package (v2.0.1)

Thanks, clear enough, but I meant for processing future data, as in scoring new inputs at some later point. This would be outside of, and independent from, the initial workflow setup. It could be months after the original model was set up.

Yes, these values would be used for scoring future data.
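To make this concrete, here is a minimal sketch that checks the claim on mtcars. The train/future split and the object names (`train`, `future`, `wf_fit`, `baked`, `manual_hp`) are made up for illustration and are not from the example above. It bakes the trained recipe on rows the workflow never saw and compares one column against normalizing by hand with the training mean and sd:

```r
library(tidymodels)

# Hypothetical split: pretend the last 5 rows arrive later as "future" data
train  <- mtcars[1:27, ]
future <- mtcars[28:32, ]

wf_fit <-
  workflow() %>%
  add_recipe(
    recipe(mpg ~ ., data = train) %>%
      step_normalize(all_numeric_predictors(), id = "norm")
  ) %>%
  add_model(linear_reg()) %>%
  fit(data = train)

# Apply the trained recipe to the future rows
baked <-
  wf_fit %>%
  extract_recipe() %>%
  bake(new_data = future)

# Normalize hp by hand using the *training* mean and sd
manual_hp <- (future$hp - mean(train$hp)) / sd(train$hp)

# Compare the recipe's output with the hand calculation
all.equal(baked$hp, manual_hp)
```

If the two agree, the statistics were taken from the training rows only, not recomputed from the future rows.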

OK, I will look at it more. The real question is how and where to save them. I will look for that info.

You can save the workflow object itself. The preprocessing is in there and, when you use predict, it is all handled automatically. There are no extra steps.
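One way to do that, as a rough sketch: write the fitted workflow to disk with base R's `saveRDS()` and read it back with `readRDS()` when new data arrives. The file path and the names `wf_fit`, `wf_loaded`, and `new_cars` are placeholders chosen for illustration. This assumes the default `lm` engine; some other engines may need extra care to serialize properly.

```r
library(tidymodels)

wf_fit <-
  workflow() %>%
  add_recipe(
    recipe(mpg ~ ., data = mtcars) %>%
      step_normalize(all_numeric_predictors())
  ) %>%
  add_model(linear_reg()) %>%
  fit(data = mtcars)

# Save the fitted workflow (hypothetical location; use a durable path in practice)
path <- file.path(tempdir(), "mpg_workflow.rds")
saveRDS(wf_fit, path)

# Later, possibly in a fresh R session with tidymodels installed:
library(tidymodels)
wf_loaded <- readRDS(path)

# Hypothetical new inputs: predictors only, no outcome column
new_cars <- mtcars[1:3, ]
new_cars$mpg <- NULL

# The stored recipe is applied to new_cars before the model predicts
predict(wf_loaded, new_data = new_cars)
```

The reloaded object carries the trained recipe along with the model, so nothing else needs to be saved separately for this case.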

Very nice, thank you.
