Programmatically disable recipe steps for deployment

jsantiago · July 27, 2021, 2:54pm

I am using DVC to organise and run a pipeline that trains multiple models based on a YAML configuration. Every model goes through all stages in the pipeline, including the same preprocessing steps.
Now, I want to set which features each model uses in the configuration.
A low-friction solution is to use step_select() and keep only the features selected in the configuration. The downside is that the recipe still runs every step in production, even for models that use only a subset of the features, which is not very efficient.

So my question is: is there an expected/idiomatic way to disable steps after the recipe is created but before it's used in a workflow for tuning/training?
My current guess is to create every step with skip = TRUE by default, and then somehow switch it to skip = FALSE for the steps the model needs.
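
One caveat worth keeping in mind for the skip idea: in {recipes}, skip = TRUE only affects bake() on new data, and the step is still executed when prep() runs on the training set. A minimal sketch on mtcars (the column choice and the step id are just for illustration):

```r
# One derived feature, marked as skippable
rec <- recipes::recipe(am ~ wt + disp, data = mtcars)
rec <- recipes::step_mutate(
  rec,
  wt_per_disp = wt / disp,
  skip = TRUE,
  id = "wt_per_disp"
)

prepped <- recipes::prep(rec)

# bake() on new data skips the step, so the derived column is absent
cols_new_data <- names(recipes::bake(prepped, new_data = mtcars))

# the retained training set was processed by prep(), which ran the step
cols_training <- names(recipes::bake(prepped, new_data = NULL))

"wt_per_disp" %in% cols_new_data
"wt_per_disp" %in% cols_training
```

So a skipped step costs nothing at prediction time, but it does not shorten prep().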

jsantiago · July 28, 2021, 12:31pm

Here's a reprex. This works, but I'm wondering if there's a better way. What do you think?

# features for a specific model, comes from a config file
features <- c("gears_per_carb", "mpg", "disp")

# recipe defines all possible features the models use
unprepped_recipe_full <- recipes::recipe(
  am ~ .,
  data = mtcars
) %>%
  recipes::step_mutate(
    gears_per_carb = gear / carb,
    skip = TRUE,
    id = "gears_per_carb"
  ) %>%
  recipes::step_mutate(
    wt_per_disp = wt / disp,
    skip = TRUE,
    id = "wt_per_disp"
  )

recipe_var_info <- unprepped_recipe_full$var_info

# because training needs the target variable, it's easier to remove the unnecessary features than to step_select
# which would cause problems in prod (e.g. no label in prod data :D) 
unneeded_vars <- recipe_var_info$variable[
  recipe_var_info$role == "predictor" &
    !(recipe_var_info$variable %in% features)
]

unprepped_recipe <- unprepped_recipe_full %>%
  recipes::step_rm(
    dplyr::all_of(unneeded_vars)
  )

# just makes it easier to extract the steps if everything is named
step_ids <- purrr::map_chr(unprepped_recipe$steps, "id")
names(unprepped_recipe$steps) <- step_ids

# not a fan of mutation in place, but I see no other way of doing this here
for (feature in features) {
  message("Feature is ", feature)
  # steps are named by id, so a feature without its own step gives NULL
  step <- unprepped_recipe$steps[[feature]]

  if (is.null(step)) {
    message("No step with this id, skipping...")
  } else {
    message("Not skipping ", feature, "!")
    unprepped_recipe$steps[[feature]]$skip <- FALSE
  }
}

unprepped_recipe$steps[["gears_per_carb"]]$skip # FALSE
unprepped_recipe$steps[["wt_per_disp"]]$skip # TRUE

prepped_recipe <- recipes::prep(unprepped_recipe, strings_as_factors = FALSE)

recipes::bake(prepped_recipe, mtcars) # 4 cols: mpg, disp, gears_per_carb, am
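
If the in-place loop bothers you, the same toggle can be wrapped in a small function that returns a modified copy of the recipe. This is only a sketch: enable_steps() is a hypothetical helper, not part of {recipes}, and like the loop above it relies on the internal $steps list, which is an assumption about how recipe objects are laid out rather than a documented interface.

```r
# Hypothetical helper: set skip = FALSE on every step whose id is in `ids`
enable_steps <- function(rec, ids) {
  rec$steps <- lapply(rec$steps, function(step) {
    if (step$id %in% ids) {
      step$skip <- FALSE
    }
    step
  })
  rec
}

features <- c("gears_per_carb", "mpg", "disp")

rec <- recipes::recipe(am ~ ., data = mtcars)
rec <- recipes::step_mutate(
  rec, gears_per_carb = gear / carb, skip = TRUE, id = "gears_per_carb"
)
rec <- recipes::step_mutate(
  rec, wt_per_disp = wt / disp, skip = TRUE, id = "wt_per_disp"
)

rec_enabled <- enable_steps(rec, features)

# skip flag per step, in step order
skip_flags <- vapply(rec_enabled$steps, function(step) step$skip, logical(1))
skip_flags
```

The original rec is left untouched, so the full recipe can be reused for the next model's feature list.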

jsantiago · August 9, 2021, 10:13am

I figured out a nicer way to do this.
Create a registry like so:

step_feature <- function(step_fn, step_name, deps, args) {
  l <- list(
    list(
      step_fn = rlang::enexpr(step_fn),
      deps = deps,
      args = list(rlang::enexpr(args))
    )
  )

  names(l) <- step_name
  names(l[[1]]$args) <- step_name

  l
}

step_mutate_feat <- purrr::partial(step_feature, step_fn = recipes::step_mutate)

feature_registry <- c(
  step_mutate_feat("hour", "created_at", lubridate::hour(created_at)),
  ....
)

Then select only the features you need and reduce over them, adding one step call per feature:

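build_recipe_call_factory <- function(feature_registry) {
  # returns a reducer that wraps the recipe (call) in one more step call
  function(recipe, feature) {
    rlang::call2(
      feature_registry[[feature]]$step_fn,
      recipe,
      !!!feature_registry[[feature]]$args
    )
  }
}

build_recipe_call <- build_recipe_call_factory(feature_registry)

recipe_call <- feature_and_deps_names_used %>%
  purrr::reduce(
    build_recipe_call,
    .init = unprepped_recipe_init
  )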


unprepped_recipe_full <- eval(recipe_call)

Finally, add a step_rm to remove the features you don't need:

unprepped_recipe <- unprepped_recipe_full %>%
  recipes::step_rm(
    dplyr::all_of(unneeded_recipe_vars)
  )

This way a single pipeline can train models with different preprocessing (in the form of {recipes} steps).
I'm not sure yet whether there is any performance hit in production from having many small step_mutate calls instead of a single large one.

This is far from a reprex. I can make one if anyone shows interest.
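
In the meantime, here is a minimal self-contained sketch of the same idea on mtcars. It is a simplification, not the code above: the registry is just a named list of quoted expressions (no deps handling), every feature uses step_mutate, and the feature names are made up for illustration.

```r
# Simplified, hypothetical registry: one quoted expression per derived feature
feature_registry <- list(
  gears_per_carb = rlang::expr(gear / carb),
  wt_per_disp    = rlang::expr(wt / disp)
)

# features for a specific model, would come from the config file
features <- c("gears_per_carb", "mpg", "disp")

# only derived features that this model actually asks for
derived <- intersect(names(feature_registry), features)

# add one step_mutate per requested derived feature
rec_full <- purrr::reduce(
  derived,
  function(rec, name) {
    recipes::step_mutate(rec, !!name := !!feature_registry[[name]], id = name)
  },
  .init = recipes::recipe(am ~ ., data = mtcars)
)

# drop the original predictors the model does not use
var_info <- summary(rec_full)
unneeded_vars <- var_info$variable[
  var_info$role == "predictor" & !(var_info$variable %in% features)
]
rec <- recipes::step_rm(rec_full, dplyr::all_of(unneeded_vars))

baked <- recipes::bake(recipes::prep(rec), new_data = mtcars)
names(baked)
```

Because steps for unused features are never added, neither prep() nor bake() spends time on them, which is the difference from the skip-based version.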
