Extend Parsnip rand_forest with rotationForest from mananshah99/rotationforest

Hey all -

Have been working on extending parsnip with the rotationForest implementation on GitHub by mananshah99 (https://github.com/mananshah99/rotationforest) - there's another implementation on CRAN, I think, but it has some weird implementation details. This is mostly an exercise to get a feel for how extending parsnip works, and I almost have it: the fit and predict methods work just fine, but as soon as I try to pass the model into tune_grid() I get the following error:

Error: Problem with `mutate()` input `object`.
x Error when calling rotationForest(): Error in BuildModel(xdf, ydf, npredictor, ...) : 
  argument "xdf" is missing, with no default

ℹ Input `object` is `purrr::map(call_info, eval_call_info)`.

Digging under the hood with debug() and :::, the failure seems to be in the call to parameters(), which works fine on a model with no tune() parameters:

Collection of 0 parameters for tuning

[1] id             parameter type object class  
<0 rows> (or 0-length row.names)

But it errors when tuning parameters are present. I think my problem is that I need to wrap the fit method for rotationForest so it returns values that play nicely, but after reading over the vignettes I still can't get my head around how to do so. I'm sure I'm only about three metres off the track, so any advice would be much appreciated! Setup code below:

library(tidymodels)
library(rotationForest)

# Setup -------------------------------------------------------------------

set_model_mode(model = "rand_forest", mode = "classification")
set_model_engine(
  "rand_forest",
  mode = "classification",
  eng = "rotationForest"
)
set_dependency("rand_forest", eng = "rotationForest", pkg = "rotationForest")
set_model_arg(
  model = "rand_forest",
  eng = "rotationForest",
  parsnip = "mtry",
  original = "npredictor",
  func = list(pkg = "rotationForest", fun = "rotationForest"),
  has_submodel = FALSE
)

set_model_arg(
  model = "rand_forest",
  eng = "rotationForest",
  parsnip = "trees",
  original = "ntree",
  func = list(pkg = "rotationForest", fun = "rotationForest"),
  has_submodel = FALSE
)

set_fit(
  model = "rand_forest",
  eng = "rotationForest",
  mode = "classification",
  value = list(
    interface = "data.frame",
    protect = c("xdf", "ydf"),
    func = c(pkg = "rotationForest", fun = "rotationForest"),
    defaults = list()
  )
)

class_info <-
  list(
    pre = NULL,
    post = NULL,
    func = c(fun = "predict"),
    args =
      list(
        rotationForestObject = quote(object$fit),
        dependent = quote(new_data)
      )
  )

set_pred(
  model = "rand_forest",
  eng = "rotationForest",
  mode = "classification",
  type = "class",
  value = class_info
)

set_encoding(
  model = "rand_forest",
  mode = "classification",
  eng = "rotationForest",
  options = list(
    predictor_indicators = "none",
    compute_intercept = FALSE,
    remove_intercept = FALSE
  )
)


# Testing -----------------------------------------------------------------

data("two_class_dat", package = "modeldata")
set.seed(4622)
example_split <- initial_split(two_class_dat, prop = 0.7)
example_train <- training(example_split)
example_test <- testing(example_split)
bs_train <- bootstraps(example_train)

### this works just fine:
model <- rand_forest(trees = 100, mtry = 1, mode = "classification") %>%
  set_engine(engine = "rotationForest")
rot_for_fit <- model %>%
  fit(Class ~ ., data = example_train)
predict(rot_for_fit, new_data = example_train)
## this call works just fine as well:
parameters(model)

## but when we try something more complicated
recipe_rot <- recipe(Class ~ ., data = example_train) %>%
  step_normalize(all_predictors())
model_grid <- rand_forest(trees = tune(), mtry = tune(), mode = "classification") %>%
  set_engine(engine = "rotationForest")
wf <- workflow() %>%
  add_recipe(recipe_rot) %>%
  add_model(model_grid)
grid_rot <- grid_random(finalize(mtry(), x = example_train), trees(), size = 3)
## it fails here:
tune_grid(wf, resamples = bs_train, grid = grid_rot)
## this call also fails:
parameters(model_grid)


I think that the main issue is that the function/package declarations should reference where the parameter functions are found (as opposed to the model fit function). I would change these to be:

set_model_arg(
  model = "rand_forest",
  eng = "rotationForest",
  parsnip = "mtry",
  original = "npredictor",
  func = list(pkg = "dials", fun = "mtry"),# <- changed this
  has_submodel = FALSE
)

set_model_arg(
  model = "rand_forest",
  eng = "rotationForest",
  parsnip = "trees",
  original = "ntree",
  func = list(pkg = "dials", fun = "trees"), # <- changed this
  has_submodel = FALSE
)
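With the declarations pointing at the dials functions (you may need a fresh session so the earlier definitions don't linger), the call that was failing should now resolve both tuning parameters:

## should now report a collection of 2 parameters instead of erroring
parameters(model_grid)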
A potential second issue is that tune_grid() will try to compute class probabilities because the default metric set includes the area under the ROC curve. You should probably set that too:

prob_info <-
  list(
    pre = NULL,
    # The predict method returns a matrix so add a post-processor
    post = function(x, object) { 
      tibble::as_tibble(x)
    },
    func = c(fun = "predict"),
    args =
      list(
        rotationForestObject = quote(object$fit),
        dependent = quote(new_data),
        prob = TRUE
      )
  )

set_pred(
  model = "rand_forest",
  eng = "rotationForest",
  mode = "classification",
  type = "prob",
  value = prob_info
)
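A quick way to check the prob method behaves, using the rot_for_fit and example_test objects from above. Note that the tibble returned by the post-processor should have columns named for the outcome's factor levels so that parsnip can format them as .pred_* columns:

## sanity check on the probability predictions
predict(rot_for_fit, new_data = example_test, type = "prob")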

I still had some issues when testing this out. For three different data sets, I had the error:

Error in svd(x, nu = 0, nv = k): a dimension is zero

during model fitting.

I tried it on the example data set from the package and it worked (albeit very slowly):

fpath <- system.file("extdata", "balance-scale.data", package="rotationForest")
data <- read.table(fpath, sep = ",", header = FALSE)
data.dependent <- data[,-1]
data.response <- data[,1]
data.response <- as.factor(data.response)
total <- data.frame(data.response, data.dependent)
groups <- sample(rep(1:10, times = ceiling(nrow(total) / 19)), size = nrow(total), replace = TRUE)
data.train <- total[!groups %in% 1,]
data.test <- total[groups %in% 1,]


set.seed(4622)
bs_train <- bootstraps(data.train, times = 3)

recipe_rot <- recipe(data.response ~ ., data = data.train) %>%
  step_normalize(all_predictors())
model_grid <- rand_forest(trees = tune(), mtry = tune(), mode = "classification") %>%
  set_engine(engine = "rotationForest")
wf <- workflow() %>%
  add_recipe(recipe_rot) %>%
  add_model(model_grid)
parameters(wf)
grid_rot <- grid_random(mtry(c(1, 4)), trees(), size = 3)
res <- tune_grid(wf, resamples = bs_train, grid = grid_rot)
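From there, the usual tune helpers apply if you want to look at how it did:

## summarize resampled performance and pick out the best configurations
collect_metrics(res)
show_best(res, metric = "roc_auc")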

Thanks Max! The arguments required for set_model_arg were a bit mysterious to me (I think a lot of the vignettes use "foo" and "bar"?) but that makes a lot more sense!

Thanks Max! That worked perfectly. Really appreciate having you around to answer questions like this!

I noticed the svd() problem as well - I thought it might be to do with trying to fit on a too-small dataset, but the fit() method has no problem even fitting on slice_sample(df_train, prop = .1). For some reason, the internal rotationForest::BuildModel function doesn't handle the data frame it gets from tune_grid() properly, and tries to pass an empty data frame to prcomp(). I'll keep poking at it.
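In case it helps anyone following along, this is roughly how I'm poking at it (assuming BuildModel is the unexported function named in the original error message):

## step into the internal fit routine to inspect what reaches prcomp()
debugonce(rotationForest:::BuildModel)
fit(model, Class ~ ., data = example_train)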

I figured it out, I think! It's partially a fault of the rotationForest package, partially a fault of me being silly.

  1. rotationForest (this implementation, anyway) doesn't have any error handling for when npredictor > ncol(xdf). So in the two_class_dat example, which only has two predictors, if you pass npredictor = 3 it'll fail with:
Error in svd(x, nu = 0, nv = k) : a dimension is zero
(There's a minimal repro sketch after this list.)
  2. I made a mistake as well - finalize(mtry(), data) should be given data that has already been juice()'d down to the predictors. If you use the original data frame, it'll include the outcome column, and so grid_random(finalize(mtry(), data)) can propose mtry() values larger than the number of predictors that actually exist (corrected call after this list).
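For the first point, here's a minimal sketch of the failure (argument names taken from the registration above; two_class_dat has only the two predictors A and B):

## npredictor > ncol(xdf), so an empty slice ends up in prcomp()/svd()
rotationForest(
  xdf = example_train[, c("A", "B")],
  ydf = example_train$Class,
  npredictor = 3
)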
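And for the second point, finalizing on just the predictors fixes the grid, e.g. via the prepped recipe:

## finalize mtry() against the predictors only, not the whole data frame
preds <- recipe_rot %>% prep() %>% juice(all_predictors())
grid_rot <- grid_random(finalize(mtry(), x = preds), trees(), size = 3)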

I'm sure rpart:: would handle this gracefully, but it looks like maybe the way rotationForest builds up the bagged subset to pass to prcomp() does something weird before it even makes it to rpart(), and so it tries to svd() an empty matrix.

Anyway, I hope this is useful for someone else in the deep future with the same problems!

NB I've created a package wrapper for this here:

Also hopefully useful as a minimal example of how to extend an existing parsnip model with a new engine.
