Issue with tidymodels workflows and fitting xgboost models

I'm having some trouble using tidymodels workflows to fit a tuned xgboost model with cross-validation. When I check the .notes column, I see quite a few errors. The reprex below mimics my data.

Am I missing something obvious?

library(doParallel)
#> Warning: package 'doParallel' was built under R version 4.0.2
#> Loading required package: foreach
#> Warning: package 'foreach' was built under R version 4.0.2
#> Loading required package: iterators
#> Warning: package 'iterators' was built under R version 4.0.2
#> Loading required package: parallel
library(tidymodels)
#> Warning: package 'tidymodels' was built under R version 4.0.2
#> -- Attaching packages ---------------------------------------------------------------------- tidymodels 0.1.1 --
#> v broom     0.7.0      v recipes   0.1.13
#> v dials     0.0.8      v rsample   0.0.7 
#> v dplyr     1.0.0      v tibble    3.0.3 
#> v ggplot2   3.3.2      v tidyr     1.1.0 
#> v infer     0.5.3      v tune      0.1.1 
#> v modeldata 0.0.2      v workflows 0.1.2 
#> v parsnip   0.1.2      v yardstick 0.0.7 
#> v purrr     0.3.4
#> Warning: package 'broom' was built under R version 4.0.2
#> Warning: package 'dials' was built under R version 4.0.2
#> Warning: package 'scales' was built under R version 4.0.2
#> Warning: package 'dplyr' was built under R version 4.0.2
#> Warning: package 'ggplot2' was built under R version 4.0.2
#> Warning: package 'infer' was built under R version 4.0.2
#> Warning: package 'modeldata' was built under R version 4.0.2
#> Warning: package 'parsnip' was built under R version 4.0.2
#> Warning: package 'purrr' was built under R version 4.0.2
#> Warning: package 'recipes' was built under R version 4.0.2
#> Warning: package 'rsample' was built under R version 4.0.2
#> Warning: package 'tibble' was built under R version 4.0.2
#> Warning: package 'tidyr' was built under R version 4.0.2
#> Warning: package 'tune' was built under R version 4.0.2
#> Warning: package 'workflows' was built under R version 4.0.2
#> Warning: package 'yardstick' was built under R version 4.0.2
#> -- Conflicts ------------------------------------------------------------------------- tidymodels_conflicts() --
#> x purrr::accumulate() masks foreach::accumulate()
#> x purrr::discard()    masks scales::discard()
#> x dplyr::filter()     masks stats::filter()
#> x dplyr::lag()        masks stats::lag()
#> x recipes::step()     masks stats::step()
#> x purrr::when()       masks foreach::when()
library(tidyverse)
#> Warning: package 'tidyverse' was built under R version 4.0.2
#> Warning: package 'readr' was built under R version 4.0.2
#> Warning: package 'stringr' was built under R version 4.0.2
#> Warning: package 'forcats' was built under R version 4.0.2

set.seed(3434)
data <- tibble(outcome = rnorm(3000, 100, 15),
               pred_1 = rnorm(3000, 20, 10),
               pred_2 = sample(c("lev1", "lev2", "lev3"), 
                               size = 3000, 
                               replace = TRUE),
               pred_3 = sample(c("lev1", "lev2", "lev3"), 
                               size = 3000, 
                               replace = TRUE),
               pred_4 = sample(c("lev1", "lev2", "lev3"), 
                               size = 3000, 
                               replace = TRUE))

data <- mutate_if(data, is.character, factor) 

data_split <- initial_split(data, 
                            prop = .75, 
                            strata = outcome) 

training <- training(data_split) 
testing <- testing(data_split)

my_recipe <- recipe(outcome ~ ., data = training) %>%  
  step_nzv(all_nominal()) # remove near-zero-variance predictors

xgb_spec <- boost_tree(trees = 200, 
                       tree_depth = tune(), # number of splits
                       mtry = tune(),       # number of predictors sampled at each split
                       learn_rate = tune()) %>% # step size
  set_engine("xgboost") %>% 
  set_mode("regression")

xgb_grid <- grid_latin_hypercube(
  tree_depth(),
  finalize(mtry(), training), # upper range depends on the number of predictors
  learn_rate(),
  size = 6
)

xgb_wf <- workflow() %>%
  add_recipe(my_recipe) %>% 
  add_model(xgb_spec) 

xgb_folds <- vfold_cv(training, strata = outcome, v = 10)

registerDoParallel()

xgb_res <- tune_grid(
  object = xgb_wf,
  grid = xgb_grid,
  resamples = xgb_folds,
  control = control_grid(save_pred = TRUE)
)
#> Warning: All models failed in tune_grid(). See the `.notes` column.
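The underlying error messages can be inspected directly. A minimal sketch (the indexing assumes the first resample recorded notes):

xgb_res$.notes[[1]] # per-resample tibble of error/warning messages
# more recent versions of tune also offer collect_notes(xgb_res)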

Hi cengelhardt,

I'm not sure of some of the internal mechanics of tidymodels, but doesn't xgboost require categorical predictors to be either dummy or one-hot encoded as the final step in my_recipe?

That's what I thought as well!

A few notes:

  1. Interestingly, @julia had a recent blog post in which none of the factors were converted to dummy variables before model fitting: https://juliasilge.com/blog/xgboost-tune-volleyball/. In the post she states, "we don't need to worry about the factors," so I didn't worry much about them. Indeed, I can replicate her results by re-running her code (I made some small adjustments to speed up model fitting, but the factors remained without issue). Update: factors do seem to need to be handled with the recipe interface, but not with the formula interface, which is what she used.

  2. For my personal data, which has a continuous outcome, the factors do need to be converted before model fitting. I tried this yesterday, but the models still failed. The reason (or one potential reason), as I discovered this morning, is that the data need to be in a plain data frame (not a tibble), or else the models do not run. If I leave my data in a tibble, I receive the following error: model 6/6: Error: y should be one of the following classes: 'data.frame', 'matrix', 'factor'. If I run as.data.frame() on my data (the only change) before fitting the models, I have no issues. This was an unexpected finding.

  3. For the reprex above, as you note, converting the factors to dummy variables in the recipe allows the models to run without issue. xgboost handles the tibbles just fine.

I am still somewhat confused. Perhaps there is some oddity about the data I am working with (e.g., my data were read in from a SAS file using haven), or some nuance in how regression vs classification models are fit with xgboost.



I think it sounds like you generally understand this right, but just for the sake of maybe over-explaining :laughing: yes, xgboost is one of the kinds of models that can't handle factor data on its own. That means we need to preprocess the factors into numeric data somehow. There are a couple of ways to do this.

  • One way is to use R's formula interface. This is what I showed in my blog post. It is easy to use but doesn't give you much flexibility for other kinds of preprocessing steps. To do this, switch out your workflow:
xgb_wf <- workflow() %>%
  add_formula(outcome ~ .) %>% 
  add_model(xgb_spec) 
  • The other way to preprocess the factors is to do it yourself using a recipe, which is less familiar to many R users but gives you many more options for other preprocessing alongside it, including removing near-zero-variance predictors. To do this, keep the workflow as you have it, but change your recipe:
my_recipe <- recipe(outcome ~ ., data = training) %>%
  step_nzv(all_nominal()) %>%
  step_dummy(all_nominal())

The step_dummy() function creates dummy or indicator variables from nominal (factor or character) variables.
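By default, step_dummy() makes one fewer indicator column than the number of factor levels (traditional dummy coding). If you want true one-hot encoding instead, step_dummy() has a one_hot argument; a minimal sketch of that variant:

my_recipe <- recipe(outcome ~ ., data = training) %>%
  step_nzv(all_nominal()) %>%
  step_dummy(all_nominal(), one_hot = TRUE) # one indicator column per level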


For the issue you discuss in number 2, @cengelhardt, that sounds a bit concerning to me. If you are able to create a reprex that demonstrates the problem (even after using either the formula interface or step_dummy() as I outlined above), it would be super helpful if you could post it to GitHub as an issue so we can dig into it.

Thank you for your reply and helpful explanation, @julia! This makes sense.

Issue #2 seems to be related to the file being read in, rather than to the formula interface or the recipe interface.

More specifically, if I read in my character-compressed SAS file with haven, both the formula interface and the recipe interface fail unless I first convert the tibble to a data frame with as.data.frame(). If I do this conversion, both interfaces (recipe and formula) work. If I don't, I receive the error mentioned above: Error: y should be one of the following classes: 'data.frame', 'matrix', 'factor'.

If, however, I read in the SAS file, immediately write it back out with readr::write_csv(), and then read it in again with readr::read_csv(), both the recipe interface and the formula interface work just fine with the tibble. I do not need to convert to a vanilla data frame first.

I've verified that the data frame values are identical across the two files. Perhaps an attribute issue?
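One quick way to check is to look at the column-level attributes. A minimal sketch (my_tibble is a stand-in for the haven-imported data):

# haven typically stores a SAS variable label as a "label" attribute on each column
purrr::map(my_tibble, attributes)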

Based on these results, would you consider this to be an issue within the scope of tidymodels/parsnip? If you think it is, I'd be happy to file an issue and reprex on GitHub.

Aaaaaaah yep, I bet it's the attributes, i.e. the label attributes on each variable in the output.

What you want is somehow to get a tibble (or maybe a data frame) without attributes. I don't think writing to and reading from CSV is going to be the fastest way to do that. Converting via as.data.frame() will probably be fine in most circumstances.

This is what we do in tune when we want a new "bare" (i.e. no attributes) tibble:

We use two steps:

  • vctrs::new_data_frame(), which results in a bare data.frame
  • tibble::new_tibble(), which turns that result back into a tibble

At the end of that we have a tibble with no attributes.
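A minimal sketch of what that two-step construction might look like (the exact internals in tune may differ):

new_bare_tibble <- function(x) {
  # strip object-level attributes, leaving a bare data.frame
  x <- vctrs::new_data_frame(x)
  # reattach the tibble class without restoring any other attributes
  tibble::new_tibble(x, nrow = nrow(x))
}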

@julia very cool! I will test this.

I also dug into haven a bit more and came across haven::zap_label(), which drops the label attributes.

You're right -- writing to CSV just to read it back in again is not a great long-term solution for my colleagues who use SAS and are weighing the merits of R :joy:

Thanks for your help!


@julia new_bare_tibble() doesn't seem to strip the attributes from my tibbles. For example, attributes(my_tibble) still returns attributes after calling new_bare_tibble(). Can you provide an example of how this function should work? Perhaps there is a better way to test it.

Of note, haven::zap_label() does seem to solve my initial problem. Thanks!


Hmmm, I am realizing now that the new_bare_tibble() approach we use in tune removes attributes that belong to the whole object, but not attributes at the column/variable level. (I should have checked before mentioning it!)

I am glad that haven::zap_labels() :zap: looks like a solution.

No worries! Was fun to try!

To close the loop on this issue, the solution was to use haven::zap_label(), which removes the variable labels, not haven::zap_labels(), which removes the value labels. At issue here seemed to be the variable labels rather than the value labels.
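For anyone landing here later, a minimal sketch of the fix (the file path is hypothetical):

raw <- haven::read_sas("my_data.sas7bdat")
clean <- haven::zap_label(raw) # drops variable labels, the culprit here
# haven::zap_labels(raw) would instead strip value labels from labelled vectors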

Thanks again, @julia! Hope you have a great rest of your weekend! :slightly_smiling_face:


As a side note, I prefer not to use doParallel and instead rely on xgboost's built-in parallel processing. This can be controlled via the nthread engine parameter, for example set_engine("xgboost", nthread = 8) (see the sketch below).
One benefit over using a parallel backend is that the verbose output from tune_grid() becomes more informative. Per https://tune.tidymodels.org/articles/extras/optimizations.html, "almost all of the logging provided by tune_grid() will not be seen when running in parallel."
I would be glad to hear your opinion about which approach is better.
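A minimal sketch of the engine-level approach with the spec from above (the thread count is hypothetical):

xgb_spec <- boost_tree(trees = 200,
                       tree_depth = tune(),
                       mtry = tune(),
                       learn_rate = tune()) %>%
  set_engine("xgboost", nthread = 8) %>% # let xgboost parallelize within each fit
  set_mode("regression")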

If you are resampling the model, using the xgboost parallelism is fairly sub-optimal. You want to parallelize the longest loop, which in this case is the resampling loop. There are some benchmarks in the article linked above; see the last figure in that post.
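For reference, a minimal sketch of resample-level parallelism with an explicit backend (the worker count is hypothetical):

library(doParallel)
cl <- parallel::makePSOCKcluster(4) # one worker per physical core is a common choice
registerDoParallel(cl)
xgb_res <- tune_grid(xgb_wf, resamples = xgb_folds, grid = xgb_grid)
parallel::stopCluster(cl)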

