Using predict with tidymodels recipes, workflows, and fitted models (parsnip) to score new data

TDR · March 18, 2022, 12:09pm

This is the closest "Learn" topic I could find. Generally, it is best practice to "save" the imputation models and the scaling/centering parameters estimated from the training set (these affect the parameters of any modeling algorithm) and to apply those same models and parameters to the test set or to future data being scored, rather than re-estimating them (as this example seems to indicate) for every data set that flows through the workflow and predict. I was able to do this previously in caret. Maybe I am missing something in tidymodels.

The learn topic below seems to indicate differently.

RMH · March 18, 2022, 1:19pm

I'm not sure I understand your question. As far as I know, a trained workflow remembers the values from the training data. New data will be centered with the values from the training set.
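
One way to check this is to bake new rows with the fitted recipe and compare the result against a manual calculation that uses the training mean and standard deviation. This is a minimal sketch, not an official recipe: the 24/8 row split of mtcars and the object names (train, new_data, wf_fit, baked, manual) are assumptions made for illustration.

```r
library(tidymodels)

# Hypothetical split: first 24 rows act as training data, the rest as "new" data
train <- mtcars[1:24, ]
new_data <- mtcars[25:32, ]

wf_fit <-
  workflow() %>%
  add_recipe(
    recipe(mpg ~ hp + wt, data = train) %>%
      step_normalize(all_numeric_predictors())
  ) %>%
  add_model(linear_reg()) %>%
  fit(data = train)

# Apply the fitted (trained) recipe to rows it has never seen
baked <- wf_fit %>% extract_recipe() %>% bake(new_data = new_data)

# Normalize the new rows by hand using the TRAINING mean and sd
manual <- (new_data$wt - mean(train$wt)) / sd(train$wt)

# Compare the recipe output with the manual calculation
all.equal(baked$wt, manual)
```

If the comparison holds, the new rows were scaled with the training statistics and not with their own mean and standard deviation.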

TDR · March 18, 2022, 1:48pm

This would be good news, but I need to have it confirmed.

Max · March 18, 2022, 3:10pm

Can confirm. All imputation (and any other preprocessing) is based off of the original training set.
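
The same check can be sketched for imputation. Here step_impute_mean stands in for whatever imputation step is actually used; the data split, the artificially introduced missing value, and the object names are assumptions for illustration only.

```r
library(tidymodels)

# Hypothetical split: first 24 rows act as training data, the rest as "new" data
train <- mtcars[1:24, ]
new_data <- mtcars[25:32, ]
new_data$hp[1] <- NA  # introduce a missing value in the new data only

rec <-
  recipe(mpg ~ hp + wt + disp, data = train) %>%
  step_impute_mean(hp, id = "impute")

wf_fit <-
  workflow() %>%
  add_recipe(rec) %>%
  add_model(linear_reg()) %>%
  fit(data = train)

# The imputation value stored in the fitted recipe
wf_fit %>% extract_recipe() %>% tidy(id = "impute")

# Apply the fitted recipe to the new data
baked <- wf_fit %>% extract_recipe() %>% bake(new_data = new_data)

# The filled-in value should match the training mean of hp
all.equal(baked$hp[1], mean(train$hp))
```

The missing hp value in the new data is filled from the value estimated on the training rows; the new rows play no part in estimating it.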

TDR · March 21, 2022, 2:53pm

Great news, and thanks Max. The other clarification is how they are to be saved. Is it done with the final "predict" call or some other way? And is it done strictly through YAML or by other methods?

Max · March 21, 2022, 4:44pm

The workflow object contains the preprocessing object (e.g. a recipe) and that stores all of the information used to encode/format/preprocess new data.

In some cases, the model function itself might do some of this. When that is the case, the model object would contain the training set statistics.

For example:

library(tidymodels)
tidymodels_prefer()
theme_set(theme_bw())

rec <- 
  recipe(mpg ~ ., data = mtcars) %>% 
  step_normalize(all_numeric_predictors(), id = "norm")

model_fit <-
  workflow() %>% 
  add_recipe(rec) %>% 
  add_model(linear_reg()) %>% 
  fit(data = mtcars)

# Get the "fitted" recipe: 
model_fit %>% 
  extract_recipe()
#> Recipe
#> 
#> Inputs:
#> 
#>       role #variables
#>    outcome          1
#>  predictor         10
#> 
#> Training data contained 32 data points and no missing data.
#> 
#> Operations:
#> 
#> Centering and scaling for cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb [trained]

# Get the training set means and sds
model_fit %>% 
  extract_recipe() %>% 
  tidy(id = "norm")
#> # A tibble: 20 × 4
#>    terms statistic   value id   
#>    <chr> <chr>       <dbl> <chr>
#>  1 cyl   mean        6.19  norm 
#>  2 disp  mean      231.    norm 
#>  3 hp    mean      147.    norm 
#>  4 drat  mean        3.60  norm 
#>  5 wt    mean        3.22  norm 
#>  6 qsec  mean       17.8   norm 
#>  7 vs    mean        0.438 norm 
#>  8 am    mean        0.406 norm 
#>  9 gear  mean        3.69  norm 
#> 10 carb  mean        2.81  norm 
#> 11 cyl   sd          1.79  norm 
#> 12 disp  sd        124.    norm 
#> 13 hp    sd         68.6   norm 
#> 14 drat  sd          0.535 norm 
#> 15 wt    sd          0.978 norm 
#> 16 qsec  sd          1.79  norm 
#> 17 vs    sd          0.504 norm 
#> 18 am    sd          0.499 norm 
#> 19 gear  sd          0.738 norm 
#> 20 carb  sd          1.62  norm

Created on 2022-03-21 by the reprex package (v2.0.1)

TDR · March 21, 2022, 5:52pm

Thanks, clear enough, but I meant for processing future data, as in scoring new inputs at some later date. This would be outside of and independent from the initial workflow setup. It could be months after the original model was set up.

Max · March 21, 2022, 6:23pm

Yes, these values would be used for scoring future data.

TDR · March 21, 2022, 6:52pm

Ok, I will look at it more. The real question is how and where to save them. I will look for that info.

Max · March 22, 2022, 10:51am

You can save the workflow object itself. The preprocessing is in there and, when you use predict, it is all handled automatically. There are no extra steps.
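
A minimal sketch of that round trip, assuming base R's saveRDS and readRDS are acceptable for storage. The file path, the future_cars data frame, and the other object names are hypothetical. The linear model used here serializes without trouble; some other engines hold references to external objects and may need extra handling that this sketch does not cover.

```r
library(tidymodels)

wf_fit <-
  workflow() %>%
  add_recipe(
    recipe(mpg ~ ., data = mtcars) %>%
      step_normalize(all_numeric_predictors())
  ) %>%
  add_model(linear_reg()) %>%
  fit(data = mtcars)

# Save the fitted workflow; the path is a hypothetical temporary file
path <- file.path(tempdir(), "mpg_workflow.rds")
saveRDS(wf_fit, path)

# Later, possibly in a new R session (with tidymodels loaded again):
reloaded <- readRDS(path)

# Hypothetical future data: predictors only, no outcome column
future_cars <- mtcars[1:3, names(mtcars) != "mpg"]

# predict applies the stored preprocessing before calling the model
preds_reloaded <- predict(reloaded, new_data = future_cars)
preds_original <- predict(wf_fit, new_data = future_cars)

# Compare predictions from the reloaded and the in-memory workflow
all.equal(preds_reloaded, preds_original)
```

The raw, unnormalized predictors go straight into predict; the reloaded workflow carries the training means and standard deviations with it.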

TDR · March 22, 2022, 11:49am

Very nice, thank you.
