Expected Behaviors for Tidymodels -> Plumber Deployments

mtanney · May 18, 2023, 8:22pm

Hi Posit Community.

Are the preprocessing steps from a recipe that is part of a tidymodels workflow applied to the request body of a plumber API? Based on my testing so far and the reprex below, I believe the answer is no, but I wanted to confirm that what I am seeing is expected behavior. My request body inputs may be null/missing at times for a variety of reasons, and I was hoping that the recipe part of the model would correct this on the fly in the plumber API.

In the event what I'm experiencing currently is expected behavior, what are common recommendations for imputing null/missing values for API requests so that the model generates a prediction response rather than a 500 error when a value is missing? Is the fix as easy as converting the request body to a data frame and prepping that data with the recipe from the modeling workflow before passing it to the predict() function?

# Load Libraries ----------------------------------------------------------
library(plumber)
library(tidymodels)
library(tidyverse)


# Construct Basic Model ---------------------------------------------------

# Load and split data
df = mtcars
train_df = df[1:25, ]
test_df = df[26:32, ]
train_df$disp[1:5] = NA
train_df$cyl[1:5] = NA

# Define Recipe
mod_rec = recipe(mpg ~ cyl + disp + hp, data = train_df) %>% 
          step_impute_median(all_numeric_predictors())
prep(mod_rec, verbose = TRUE)
#> oper 1 step impute median [training] 
#> The retained training set is ~ 0 Mb  in memory.
#> 
#> ── Recipe ──────────────────────────────────────────────────────────────────────
#> 
#> ── Inputs
#> Number of variables by role
#> outcome:   1
#> predictor: 3
#> 
#> ── Training information
#> Training data contained 25 data points and 5 incomplete rows.
#> 
#> ── Operations
#> • Median imputation for: cyl, disp, hp | Trained

# Define Model
tree_mod = decision_tree() %>% 
           set_mode("regression") %>% 
           set_engine("rpart")

# Define Workflow
tree_wkflow = workflow() %>% 
              add_recipe(mod_rec) %>% 
              add_model(tree_mod)

# Fit Model
mod1 = fit(tree_wkflow, train_df)
saveRDS(mod1, file = "cars.rds")


# API ---------------------------------------------------------------------

trained_mod = readRDS("cars.rds")

#* How many mpg should we expect?
#* @post /predict_mpg
function(req, res) {
  predict(trained_mod, new_data = as.data.frame(req$body))
}

# Update UI
#* @plumber
function(pr) {
  pr %>% pr_set_api_spec(yaml::read_yaml("cars_yml.yml"))
}

Created on 2023-05-18 with reprex v2.0.2

Session info

sessionInfo()
#> R version 4.2.2 (2022-10-31)
#> Platform: aarch64-apple-darwin20 (64-bit)
#> Running under: macOS Ventura 13.3.1
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#>  [1] forcats_0.5.1        stringr_1.5.0        readr_2.1.2         
#>  [4] tidyverse_1.3.2      yardstick_1.1.0.9001 workflowsets_1.0.0  
#>  [7] workflows_1.1.3      tune_1.1.1           tidyr_1.3.0         
#> [10] tibble_3.2.1         rsample_1.1.1        recipes_1.0.5       
#> [13] purrr_1.0.1          parsnip_1.1.0        modeldata_1.0.0     
#> [16] infer_1.0.2          ggplot2_3.4.2        dplyr_1.1.1         
#> [19] dials_1.2.0          scales_1.2.1         broom_1.0.0         
#> [22] tidymodels_1.0.0     plumber_1.2.1       
#> 
#> loaded via a namespace (and not attached):
#>  [1] fs_1.5.2            lubridate_1.9.2     httr_1.4.3         
#>  [4] DiceDesign_1.9      tools_4.2.2         backports_1.4.1    
#>  [7] utf8_1.2.3          R6_2.5.1            rpart_4.1.19       
#> [10] DBI_1.1.3           colorspace_2.1-0    nnet_7.3-18        
#> [13] withr_2.5.0         tidyselect_1.2.0    compiler_4.2.2     
#> [16] rvest_1.0.2         cli_3.6.1           swagger_3.33.1     
#> [19] xml2_1.3.3          digest_0.6.31       rmarkdown_2.14     
#> [22] webutils_1.1        pkgconfig_2.0.3     htmltools_0.5.3    
#> [25] parallelly_1.35.0   lhs_1.1.6           dbplyr_2.2.1       
#> [28] fastmap_1.1.0       highr_0.9           readxl_1.4.0       
#> [31] rlang_1.1.0         rstudioapi_0.13     generics_0.1.3     
#> [34] jsonlite_1.8.4      googlesheets4_1.0.0 magrittr_2.0.3     
#> [37] Matrix_1.5-1        Rcpp_1.0.10         munsell_0.5.0      
#> [40] fansi_1.0.4         GPfit_1.0-8         lifecycle_1.0.3    
#> [43] furrr_0.3.1         stringi_1.7.12      yaml_2.3.5         
#> [46] MASS_7.3-58.1       grid_4.2.2          parallel_4.2.2     
#> [49] listenv_0.9.0       promises_1.2.0.1    crayon_1.5.2       
#> [52] lattice_0.20-45     haven_2.5.0         splines_4.2.2      
#> [55] hms_1.1.1           knitr_1.39          pillar_1.9.0       
#> [58] future.apply_1.10.0 codetools_0.2-18    reprex_2.0.2       
#> [61] glue_1.6.2          evaluate_0.15       modelr_0.1.8       
#> [64] data.table_1.14.8   tzdb_0.3.0          vctrs_0.6.1        
#> [67] foreach_1.5.2       cellranger_1.1.0    gtable_0.3.3       
#> [70] future_1.32.0       assertthat_0.2.1    xfun_0.31          
#> [73] gower_1.0.1         prodlim_2023.03.31  later_1.3.0        
#> [76] googledrive_2.0.0   class_7.3-20        survival_3.4-0     
#> [79] gargle_1.2.0        timeDate_4022.108   iterators_1.0.14   
#> [82] hardhat_1.3.0.9000  lava_1.7.2.1        timechange_0.2.0   
#> [85] globals_0.16.2      ellipsis_0.3.2      ipred_0.9-14

mtanney · May 18, 2023, 8:30pm

I failed to include this in the original message, but here is the yaml file and a screenshot of the API when I try to pass it a missing value currently.

openapi: 3.0.3
info:
  description: Cars
  title: Cars Reprex
  version: "1.0.0"
paths:
  /predict_mpg:
    post:
      summary: "Predict MPG"
      responses:
        default:
          description: Default response.
      parameters: []
      requestBody:
        description: Car Attributes
        required: false
        content:
          application/json:
            schema:
              type: object
              properties:
                cyl:
                  type: number
                  example: 6
                disp:
                  type: number
                  example: 175
                hp:
                  type: number
                  example: 150

julia · May 22, 2023, 3:02pm

I believe what you need to do is pass in an explicit NA value, like this:

If you are interested in creating Plumber APIs for tidymodels workflows, you might be interested in using vetiver. You would set up your API like so:

library(vetiver)
library(plumber)
v <- vetiver_model(mod1, "cars-rpart")
pr() |> vetiver_api(v) |> pr_run()

mtanney · May 22, 2023, 4:16pm

Thanks for reviewing this code and the recommendation with vetiver, @julia. I appreciate the prompt reply.

I tried passing "NA", and unfortunately I still receive an error. I also receive an error when I pass NA only to the API. Both error screenshots are captured below.

Did you by chance alter the yaml and/or alter the data types? It seems odd to me that I wouldn't see a successful call in the same way that you did in the screenshot you submitted.

In the process of creating this reprex and tinkering with different options, I think I may have found a solution (h/t Tom Mock & his great post about the value of a reprex). I'll post my proposed solution in a few.
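
A likely reason the quoted "NA" fails: once the JSON body is parsed, a quoted value arrives in R as a character string, so the resulting column is character rather than numeric, and the fitted workflow expects numeric predictors. The base R sketch below illustrates this under the assumption that the parsed body looks like the list shown (R 4.0 or later, where strings are not converted to factors). The object names are made up for illustration.

```r
# Assumed shape of the parsed request body when "NA" is sent as a quoted string
body_quoted <- list(cyl = 6, disp = 175, hp = "NA")

# The quoted value produces a character column, not a numeric one
df_quoted <- as.data.frame(body_quoted)
class(df_quoted$hp)

# An explicit conversion turns the string into a real missing value
hp_num <- suppressWarnings(as.numeric(df_quoted$hp))
is.na(hp_num)
```

A bare, unquoted NA is not valid JSON, which would explain why that request fails as well; JSON's own missing value is null.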

mtanney · May 22, 2023, 4:46pm

Here's the alternate solution I described above. I have also added screenshots below to highlight the result when I pass the API 3 empty strings using this updated code. The print() statements confirm that the added recipe() logic is functioning as intended.

# Load Libraries ----------------------------------------------------------
library(plumber)
library(tidymodels)
library(tidyverse)


# Construct Basic Model ---------------------------------------------------

# Load and split data
df = mtcars
train_df = df[1:25, ]
test_df = df[26:32, ]
train_df$disp[1:5] = NA
train_df$cyl[1:5] = NA

# Define Recipe
mod_rec = recipe(mpg ~ cyl + disp + hp, data = train_df) %>% 
  step_impute_median(all_numeric_predictors())
prep(mod_rec, verbose = TRUE)
#> oper 1 step impute median [training] 
#> The retained training set is ~ 0 Mb  in memory.
#> 
#> ── Recipe ──────────────────────────────────────────────────────────────────────
#> 
#> ── Inputs
#> Number of variables by role
#> outcome:   1
#> predictor: 3
#> 
#> ── Training information
#> Training data contained 25 data points and 5 incomplete rows.
#> 
#> ── Operations
#> • Median imputation for: cyl, disp, hp | Trained


# Define Model
tree_mod = decision_tree() %>% 
  set_mode("regression") %>% 
  set_engine("rpart")

# Define Workflow
tree_wkflow = workflow() %>% 
  add_recipe(mod_rec) %>% 
  add_model(tree_mod)

# Fit Model
mod1 = fit(tree_wkflow, train_df)
saveRDS(mod1, file = "cars.rds")


# API ---------------------------------------------------------------------

trained_mod = readRDS("cars.rds")

#* How many mpg should we expect?
#* @post /predict_mpg
function(req, res) {
  df = as.data.frame(req$body)
  df[df == ""] = NA
  print(df)
 
  for (col in colnames(df)){
    df[[col]] = as.numeric(df[[col]])
  }
  
  my_rec = extract_recipe(trained_mod)
  ready_for_predict_df = bake(my_rec, df)
  print(ready_for_predict_df)
  print("***** END TEST *****")
  predict(trained_mod, new_data = ready_for_predict_df)
}

# Update UI
#* @plumber
function(pr) {
  pr %>% pr_set_api_spec(yaml::read_yaml("cars_yml.yml"))
}

Created on 2023-05-22 with reprex v2.0.2

Session info

sessionInfo()
#> R version 4.2.2 (2022-10-31)
#> Platform: aarch64-apple-darwin20 (64-bit)
#> Running under: macOS Ventura 13.4
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#>  [1] forcats_0.5.1        stringr_1.5.0        readr_2.1.2         
#>  [4] tidyverse_1.3.2      yardstick_1.1.0.9001 workflowsets_1.0.0  
#>  [7] workflows_1.1.3      tune_1.1.1           tidyr_1.3.0         
#> [10] tibble_3.2.1         rsample_1.1.1        recipes_1.0.5       
#> [13] purrr_1.0.1          parsnip_1.1.0        modeldata_1.0.0     
#> [16] infer_1.0.2          ggplot2_3.4.2        dplyr_1.1.1         
#> [19] dials_1.2.0          scales_1.2.1         broom_1.0.0         
#> [22] tidymodels_1.0.0     plumber_1.2.1       
#> 
#> loaded via a namespace (and not attached):
#>  [1] fs_1.5.2            lubridate_1.9.2     httr_1.4.3         
#>  [4] DiceDesign_1.9      tools_4.2.2         backports_1.4.1    
#>  [7] utf8_1.2.3          R6_2.5.1            rpart_4.1.19       
#> [10] DBI_1.1.3           colorspace_2.1-0    nnet_7.3-18        
#> [13] withr_2.5.0         tidyselect_1.2.0    compiler_4.2.2     
#> [16] rvest_1.0.2         cli_3.6.1           swagger_3.33.1     
#> [19] xml2_1.3.3          digest_0.6.31       rmarkdown_2.14     
#> [22] webutils_1.1        pkgconfig_2.0.3     htmltools_0.5.3    
#> [25] parallelly_1.35.0   lhs_1.1.6           dbplyr_2.2.1       
#> [28] fastmap_1.1.0       highr_0.9           readxl_1.4.0       
#> [31] rlang_1.1.0         rstudioapi_0.13     generics_0.1.3     
#> [34] jsonlite_1.8.4      googlesheets4_1.0.0 magrittr_2.0.3     
#> [37] Matrix_1.5-1        Rcpp_1.0.10         munsell_0.5.0      
#> [40] fansi_1.0.4         GPfit_1.0-8         lifecycle_1.0.3    
#> [43] furrr_0.3.1         stringi_1.7.12      yaml_2.3.5         
#> [46] MASS_7.3-58.1       grid_4.2.2          parallel_4.2.2     
#> [49] listenv_0.9.0       promises_1.2.0.1    crayon_1.5.2       
#> [52] lattice_0.20-45     haven_2.5.0         splines_4.2.2      
#> [55] hms_1.1.1           knitr_1.39          pillar_1.9.0       
#> [58] future.apply_1.10.0 codetools_0.2-18    reprex_2.0.2       
#> [61] glue_1.6.2          evaluate_0.15       modelr_0.1.8       
#> [64] data.table_1.14.8   tzdb_0.3.0          vctrs_0.6.1        
#> [67] foreach_1.5.2       cellranger_1.1.0    gtable_0.3.3       
#> [70] future_1.32.0       assertthat_0.2.1    xfun_0.31          
#> [73] gower_1.0.1         prodlim_2023.03.31  later_1.3.0        
#> [76] googledrive_2.0.0   class_7.3-20        survival_3.4-0     
#> [79] gargle_1.2.0        timeDate_4022.108   iterators_1.0.14   
#> [82] hardhat_1.3.0.9000  lava_1.7.2.1        timechange_0.2.0   
#> [85] globals_0.16.2      ellipsis_0.3.2      ipred_0.9-14

julia · May 24, 2023, 3:20am

I did use vetiver for the screenshot I showed you earlier. You may be interested in checking it out because it handles a lot of that checking automatically, so you don't need to call bake() yourself. In most cases you don't want to bake() before predicting: predict() on a fitted workflow already applies the recipe, so you can end up "double preprocessing" your data:

library(tidymodels)

df <- mtcars
train_df <- df[1:25, ]
test_df <- df[26:32, ]
train_df$disp[1:5] = NA
train_df$cyl[1:5] = NA

mod_rec <- recipe(mpg ~ cyl + disp + hp, data = train_df) |> 
    step_impute_median(all_numeric_predictors())

tree_spec <- decision_tree(mode = "regression")
tree_wkflow <- workflow(mod_rec, tree_spec)
mod1 <- fit(tree_wkflow, train_df)

## can predict on the original model
predict(mod1, tibble(cyl = 6, disp = 175, hp = NA))
#> # A tibble: 1 × 1
#>   .pred
#>   <dbl>
#> 1  16.6


library(vetiver)
#> 
#> Attaching package: 'vetiver'
#> The following object is masked from 'package:tune':
#> 
#>     load_pkgs
v <- vetiver_model(mod1, "cars-rpart")

## can predict on the vetiver model
predict(v, tibble(cyl = 6, disp = 175, hp = NA))
#> # A tibble: 1 × 1
#>   .pred
#>   <dbl>
#> 1  16.6


library(plumber)
pr() |> vetiver_api(v)
#> # Plumber router with 3 endpoints, 4 filters, and 1 sub-router.
#> # Use `pr_run()` on this object to start the API.
#> ├──[queryString]
#> ├──[body]
#> ├──[cookieParser]
#> ├──[sharedSecret]
#> ├──/logo
#> │  │ # Plumber static router serving from directory: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/vetiver
#> ├──/metadata (GET)
#> ├──/ping (GET)
#> └──/predict (POST)
## next pipe to `pr_run()` for local API

Created on 2023-05-23 with reprex v2.0.2

If I do pr() |> vetiver_api(v) |> pr_run() and then interact with the model via the docs, I see this:

If you want to know what vetiver is doing under the hood to convert/coerce the new data, you can look here to see the handler_predict() function and especially notice the vetiver_type_convert() function. That is an exported function so you could use it instead of bake() if it is important to your use case to code the API from scratch rather than use vetiver; that function will be more appropriate in more situations.
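
For anyone who does hand-write the handler, here is a minimal base R sketch of the coercion step only. It is not what vetiver does internally. `body_to_predictors()` is a hypothetical helper name, and the column names are taken from the recipe in this thread. The idea is to turn absent fields, JSON null, and empty strings into NA and leave the imputation to the workflow's own recipe.

```r
# Hypothetical helper (not part of plumber, tidymodels, or vetiver):
# coerce a parsed JSON body (a named list) into a one-row data frame of
# numeric predictors, using NA for anything absent, null, or empty.
body_to_predictors <- function(body, cols = c("cyl", "disp", "hp")) {
  values <- lapply(cols, function(col) {
    x <- unlist(body[[col]])
    # Absent fields and JSON null arrive as NULL; empty strings as ""
    if (length(x) == 0 || is.na(x[1]) || !nzchar(as.character(x[1]))) {
      return(NA_real_)
    }
    # Non-numeric strings also become NA instead of raising an error
    suppressWarnings(as.numeric(x[1]))
  })
  names(values) <- cols
  as.data.frame(values)
}

# Example: disp sent as an empty string, hp sent as null
new_data <- body_to_predictors(list(cyl = 6, disp = "", hp = NULL))
str(new_data)
```

Inside the handler this would be used as predict(trained_mod, new_data = body_to_predictors(req$body)), with no bake() call, so the median imputation from the recipe is applied exactly once.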

mtanney · May 24, 2023, 2:03pm

Thanks for the additional updates here, @julia. I wasn't exactly sure how you were able to generate a prediction using NA based on the reprex, but knowing that you used vetiver makes much more sense now. Sorry for missing that in your initial reply. I'll share this feedback with our data science team, and thanks again for your help.

julia · May 24, 2023, 3:06pm

Absolutely! Please do ask another question or open an issue if you run into problems as you look at how to use vetiver.
