"Safety net" to deal with system crushes: How to break down a `workflowsets` procedure by writing to as many `.rds` files on the go?

I'm using workflowsets to compare several models. The data is relatively large (> 500K rows) and the entire procedure is expected to take several days to complete. Unfortunately, the machine I'm running this on is (1) remote and (2) prone to crashing. Since the unstable machine is a given, one strategy to work around the instability is to break up the fitted workflows into individual .rds files. My hope was that, once I had one .rds file per completed workflow, I could load all the .rds files and bind them back together after the fact.
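
For illustration, this is roughly what I had in mind (all_workflows, concrete_folds, and grid_ctrl are defined in the reproducible example further down; the file names and the read-back step are just my own sketch):

library(tidymodels)

# one .rds per workflow: run each row of the workflow set on its own
for (i in seq_len(nrow(all_workflows))) {
  one_res <-
    all_workflows[i, ] %>%
    workflow_map(
      seed = 1503,
      resamples = concrete_folds,
      grid = 25,
      control = grid_ctrl,
      verbose = TRUE
    )
  saveRDS(one_res, paste0("wflow_", all_workflows$wflow_id[i], ".rds"))
}

# after the fact: read the pieces back and bind them into one fitted set
# (bind_rows() is how the example below combines workflow sets, so I assume
#  it also works once the result column is filled)
grid_results <-
  list.files(pattern = "^wflow_.*\\.rds$") %>%
  lapply(readRDS) %>%
  bind_rows()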

Sadly, it turns out that one .rds file per workflow is still too large a unit, as I can't even get a single workflow to complete. I therefore need a "safety net" that writes smaller .rds files more frequently, so that if the machine crashes at any moment I can resume without losing the computation done up to that point.

My question is: how can I program around workflowsets so that it writes many small .rds files as it goes, which I can then reassemble at the end?

In my real-life situation, I have 2 recipes × 5 model specs = 10 workflows, with 10-fold cross-validation and a tuning grid of 25 candidates (so on the order of 10 workflows × 10 folds × 25 candidates = 2,500 model fits).

To anchor my question in a reproducible example, consider the following code, which I took almost as-is from www.tmwr.org/workflow-sets.html.

library(tidymodels)
library(rules)
library(baguette)

tidymodels_prefer()
data(concrete, package = "modeldata")

concrete <- 
  concrete %>% 
  group_by(across(-compressive_strength)) %>% 
  summarize(compressive_strength = mean(compressive_strength),
            .groups = "drop")

set.seed(1501)
concrete_split <- initial_split(concrete, strata = compressive_strength)
concrete_train <- training(concrete_split)
concrete_test  <- testing(concrete_split)

set.seed(1502)
concrete_folds <- 
  vfold_cv(concrete_train, strata = compressive_strength, repeats = 5)

normalized_rec <- 
  recipe(compressive_strength ~ ., data = concrete_train) %>% 
  step_normalize(all_predictors()) 

poly_recipe <- 
  normalized_rec %>% 
  step_poly(all_predictors()) %>% 
  step_interact(~ all_predictors():all_predictors())


linear_reg_spec <- 
  linear_reg(penalty = tune(), mixture = tune()) %>% 
  set_engine("glmnet")

nnet_spec <- 
  mlp(hidden_units = tune(), penalty = tune(), epochs = tune()) %>% 
  set_engine("nnet", MaxNWts = 2600) %>% 
  set_mode("regression")

mars_spec <- 
  mars(prod_degree = tune()) %>%  #<- use GCV to choose terms
  set_engine("earth") %>% 
  set_mode("regression")

svm_r_spec <- 
  svm_rbf(cost = tune(), rbf_sigma = tune()) %>% 
  set_engine("kernlab") %>% 
  set_mode("regression")

svm_p_spec <- 
  svm_poly(cost = tune(), degree = tune()) %>% 
  set_engine("kernlab") %>% 
  set_mode("regression")

knn_spec <- 
  nearest_neighbor(neighbors = tune(), dist_power = tune(), weight_func = tune()) %>% 
  set_engine("kknn") %>% 
  set_mode("regression")

cart_spec <- 
  decision_tree(cost_complexity = tune(), min_n = tune()) %>% 
  set_engine("rpart") %>% 
  set_mode("regression")

bag_cart_spec <- 
  bag_tree() %>% 
  set_engine("rpart", times = 50L) %>% 
  set_mode("regression")

rf_spec <- 
  rand_forest(mtry = tune(), min_n = tune(), trees = 1000) %>% 
  set_engine("ranger") %>% 
  set_mode("regression")

xgb_spec <- 
  boost_tree(tree_depth = tune(), learn_rate = tune(), loss_reduction = tune(), 
             min_n = tune(), sample_size = tune(), trees = tune()) %>% 
  set_engine("xgboost") %>% 
  set_mode("regression")

cubist_spec <- 
  cubist_rules(committees = tune(), neighbors = tune()) %>% 
  set_engine("Cubist") 

nnet_param <- 
  nnet_spec %>% 
  parameters() %>% 
  update(hidden_units = hidden_units(c(1, 27)))

normalized <- 
  workflow_set(
    preproc = list(normalized = normalized_rec), 
    models = list(SVM_radial = svm_r_spec, SVM_poly = svm_p_spec, 
                  KNN = knn_spec, neural_network = nnet_spec)
  ) %>%
  option_add(param_info = nnet_param, id = "normalized_neural_network")

model_vars <- 
  workflow_variables(outcomes = compressive_strength, 
                     predictors = everything())

no_pre_proc <- 
  workflow_set(
    preproc = list(simple = model_vars), 
    models = list(MARS = mars_spec, CART = cart_spec, CART_bagged = bag_cart_spec,
                  RF = rf_spec, boosting = xgb_spec, Cubist = cubist_spec)
  )

with_features <- 
  workflow_set(
    preproc = list(full_quad = poly_recipe), 
    models = list(linear_reg = linear_reg_spec, KNN = knn_spec)
  )

all_workflows <- 
  bind_rows(no_pre_proc, normalized, with_features) %>% 
  # Make the workflow ID's a little more simple: 
  mutate(wflow_id = gsub("(simple_)|(normalized_)", "", wflow_id))

grid_ctrl <-
  control_grid(
    save_pred = TRUE,
    parallel_over = "everything",
    save_workflow = TRUE
  )

grid_results <-
  all_workflows %>%
  workflow_map(
    seed = 1503,
    resamples = concrete_folds,
    grid = 25,
    control = grid_ctrl, verbose = TRUE
  )

If we focus on just the final piece of that code:

  • we have the all_workflows object

        > all_workflows
        ## # A workflow set/tibble: 12 x 4
        ##    wflow_id             info             option    result    
        ##    <chr>                <list>           <list>    <list>    
        ##  1 MARS                 <tibble [1 x 4]> <opts[0]> <list [0]>
        ##  2 CART                 <tibble [1 x 4]> <opts[0]> <list [0]>
        ##  3 CART_bagged          <tibble [1 x 4]> <opts[0]> <list [0]>
        ##  4 RF                   <tibble [1 x 4]> <opts[0]> <list [0]>
        ##  5 boosting             <tibble [1 x 4]> <opts[0]> <list [0]>
        ##  6 Cubist               <tibble [1 x 4]> <opts[0]> <list [0]>
        ##  7 SVM_radial           <tibble [1 x 4]> <opts[0]> <list [0]>
        ##  8 SVM_poly             <tibble [1 x 4]> <opts[0]> <list [0]>
        ##  9 KNN                  <tibble [1 x 4]> <opts[0]> <list [0]>
        ## 10 neural_network       <tibble [1 x 4]> <opts[1]> <list [0]>
        ## 11 full_quad_linear_reg <tibble [1 x 4]> <opts[0]> <list [0]>
        ## 12 full_quad_KNN        <tibble [1 x 4]> <opts[0]> <list [0]>
    
  • all_workflows is then passed to the heavy-lifting step, workflow_map():

    grid_results <-
      all_workflows %>%
      workflow_map(
        seed = 1503,
        resamples = concrete_folds,
        grid = 25,
        control = grid_ctrl, verbose = TRUE
      )
    

If I run it just like this, it hangs for many hours, then fails, and I get nothing. Worse, even if I do one row at a time and save the result to .rds, it also fails and I still get nothing:

fitted_wflow_1 <- 
  all_workflows[1, ] %>% ## my intent was to manually change this each time: to all_workflows[2, ], etc...
  workflow_map(
    seed = 1503,
    resamples = concrete_folds,
    grid = 25,
    control = grid_ctrl, verbose = TRUE
  ) 
  
saveRDS(fitted_wflow_1, "fitted_wflow_1.rds") ## sadly I never get to here because `fitted_wflow_1` never gets created

Bottom line: how can I split up all_workflows into pieces smaller than one .rds per row? How fine-grained can the .rds pieces get while still being recombinable at the end?
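
Concretely, I imagine a workflow × fold checkpoint, something like the sketch below. It uses existing tidymodels functions (extract_workflow(), parameters(), grid_latin_hypercube(), manual_rset(), tune_grid()), but the checkpointing scheme and file names are just my guess, and the special parameter handling in the example above (the nnet_param update, mtry for the random forest) is glossed over:

library(tidymodels)

# one .rds per workflow per fold: extract each workflow from the set and
# tune it on a single resample at a time
for (id in all_workflows$wflow_id) {
  wflow <- extract_workflow(all_workflows, id = id)

  # one shared grid per workflow so every fold sees the same candidates
  # (workflows with data-dependent ranges, e.g. mtry, would need finalize()
  #  on the parameter set first)
  set.seed(1503)
  grid_df <- grid_latin_hypercube(parameters(wflow), size = 25)

  for (j in seq_len(nrow(concrete_folds))) {
    out_file <- paste0(id, "_fold_", j, ".rds")
    if (file.exists(out_file)) next   # resume after a crash: skip finished pieces

    one_fold <- manual_rset(
      concrete_folds$splits[j],
      paste(concrete_folds$id[j], concrete_folds$id2[j])
    )

    fold_res <- tune_grid(
      wflow,
      resamples = one_fold,
      grid      = grid_df,
      control   = grid_ctrl
    )
    saveRDS(fold_res, out_file)
  }
}

If that is legitimate, I would then read the pieces back, stack collect_metrics(..., summarize = FALSE) across folds, and average per candidate myself, since there would no longer be a single tune_grid() result per workflow. Is that a sane way to recombine them?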

Unless it seg faults or something equally catastrophic, the workflow set should not fail if the model does. Other than that, I don't think that you can save-as-you-go unless you build the package yourself and add a saveRDS().

@Max, thanks. Maybe not saving on the go, then, but what about manually splitting each model across different folds or tuning-grid chunks? Then I could save each little piece to an R object -> .rds and possibly combine them after the fact. I have little intuition for whether this even makes sense, though...
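
For the tuning-grid version of that idea, I was picturing something like this rough sketch (one workflow at a time, a fixed grid split into chunks; the chunking, chunk count, and file names are made up, and the same caveat about data-dependent parameter ranges applies):

library(tidymodels)

wflow <- extract_workflow(all_workflows, id = "boosting")

# one fixed grid for this workflow, split into small chunks of ~5 candidates
set.seed(1503)
full_grid <- grid_latin_hypercube(parameters(wflow), size = 25)
chunks    <- split(full_grid, rep(1:5, length.out = nrow(full_grid)))

for (k in seq_along(chunks)) {
  out_file <- paste0("boosting_grid_chunk_", k, ".rds")
  if (file.exists(out_file)) next   # resume: skip chunks that already finished

  chunk_res <- tune_grid(
    wflow,
    resamples = concrete_folds,
    grid      = chunks[[k]],
    control   = grid_ctrl
  )
  saveRDS(chunk_res, out_file)
}

# after the fact: stack the per-fold metrics from every chunk and summarize
# by the parameter columns (not .config, which restarts in each chunk)
all_metrics <-
  list.files(pattern = "^boosting_grid_chunk_.*\\.rds$") %>%
  lapply(function(f) collect_metrics(readRDS(f), summarize = FALSE)) %>%
  bind_rows()

Would comparing candidates across chunks like that be legitimate, given that each chunk is resampled on the same folds?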

It's not the workflow set that's failing; the entire R session is being terminated for IT reasons I cannot control.
