tidymodels xgboost tuning - automatically optimizing a sequence of hyperparameters in turn?

This is somewhat inspired by what people commonly do on Kaggle and by optuna's LightGBMTunerCV function (in Python). For something like xgboost, a pretty sensible approach is to start with reasonable default hyperparameters and then tune the hyperparameters one at a time (assessed by cross-validation, probably with a very high maximum number of trees plus early stopping), in an order that starts with the ones whose best value depends least on the choices for the other hyperparameters. E.g. an order like this: first the depth of the trees, then how many records you subsample per tree, then the minimum child weight, and then how much you subsample predictor columns. You might even do this with a high learning rate and then, for the final hyperparameter values, re-do the CV assessment to find a good number of trees before the final refit. That gets you to reasonably good hyperparameters reasonably fast.

Now, my question is how I would automate this with tune. The first sweep over the first hyperparameter is easy to set up, but is there a built-in function (or one I should use to define a generic strategy) that then easily lets me select the "best" value from that first sweep as the fixed value for that hyperparameter before I sweep over the next one?

I.e. let's say I tried tree depths from 2 to 32 (while keeping all other hyperparameters at my chosen defaults) and 12 seems like a really good value; now I want to change the default tree depth to 12 and try different values for the proportion of records I subsample for each tree (e.g. between 0.2 and 1.0).

I.e. this is really an iterative sequence of one-dimensional grid searches, with some basic if-then-else logic for setting up the next grid search from the results of the previous one, but ideally it would all fit into the tune framework. Does a solution for this already exist? I did not spot a function that seems intended for this, but perhaps I overlooked it.
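For concreteness, this is roughly the manual loop I would write myself if nothing built-in exists. It is only a sketch: train_data, the outcome y, the starting defaults, and the candidate grids are placeholders you would adapt to your data, and I am assuming a plain boost_tree()/xgboost workflow.

    library(tidymodels)

    # Parameters to be tuned are marked tune(); trees is fixed high here and
    # gets revisited at the end (or handled via early stopping) as described above.
    spec <- boost_tree(
      trees = 1000, learn_rate = 0.1,
      tree_depth = tune(), sample_size = tune(),
      min_n = tune(), mtry = tune()
    ) %>%
      set_engine("xgboost") %>%
      set_mode("regression")

    wf <- workflow() %>%
      add_formula(y ~ .) %>%
      add_model(spec)

    folds <- vfold_cv(train_data, v = 5)

    # Starting defaults; each stage overwrites one entry with the CV winner.
    current <- list(tree_depth = 6, sample_size = 0.8, min_n = 1, mtry = 10)

    # Candidate values per stage, in the order I want to tune them.
    stages <- list(
      tree_depth  = seq(2, 32, by = 2),
      sample_size = seq(0.2, 1.0, by = 0.1),
      min_n       = c(1, 5, 10, 20, 50),
      mtry        = c(3, 5, 10, 20)
    )

    for (param in names(stages)) {
      # Build a grid that varies only `param`, holding the rest at their current values.
      grid <- as_tibble(current)[rep(1, length(stages[[param]])), ]
      grid[[param]] <- stages[[param]]

      res <- tune_grid(wf, resamples = folds, grid = grid,
                       metrics = metric_set(rmse))

      current[[param]] <- select_best(res, metric = "rmse")[[param]]
    }

    final_wf <- finalize_workflow(wf, as_tibble(current))

The if-then-else logic I mentioned would slot into the loop (e.g. adjusting the next stage's candidates based on the previous winner), which is exactly the part I was hoping tune already had a convention for.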

I don't think that this is a great idea for a host of reasons.

  • There is a large body of literature in experimental design on how "one-factor-at-a-time" approaches are inefficient and less likely to find optimal results.
  • Doing the optimization stage-wise reduces the benefits of parallel processing and the sub-model trick. In the link for the latter, we did an experiment with C5 boosting and found:

Between the submodel optimization trick and parallel processing, there was a total speed-up of 282-fold over the most basic grid search code.

  • These models are not especially hard to tune. While there are a lot of tuning parameters, we often find a plateau where multiple candidate points have nearly the same, near-optimal performance.

  • On top of these benefits, the same approach with racing can be even faster, especially if you are evaluating a lot of candidate models.

I'm not saying that they are wrong, but I think a more basic, efficient approach works really well.
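For example, a joint search with racing is only a few lines on top of a standard tune_grid() setup. This is just a sketch, assuming the workflow wf (with everything marked tune()) and the resamples folds from your post, plus the finetune package and a registered parallel backend:

    library(tidymodels)
    library(finetune)
    library(doParallel)

    # Parallel backend: candidate/resample combinations can run concurrently.
    registerDoParallel()

    set.seed(1)
    race_res <- tune_race_anova(
      wf,                 # workflow with all parameters marked tune()
      resamples = folds,
      grid = 50,          # one space-filling grid over all parameters jointly
      metrics = metric_set(rmse),
      control = control_race(verbose_elim = TRUE)
    )

    final_wf <- finalize_workflow(wf, select_best(race_res, metric = "rmse"))

Racing discards clearly inferior candidates after an initial set of resamples, so most of the 50 candidates never get evaluated on every fold.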


You are right in general. However, it may be the case that for xgboost/LightGBM one-factor-at-a-time works decently, if one chooses the order of factors well (possibly also because multiple combinations of hyperparameter values can be decent). It will definitely not be optimal, but one usually gets something pretty decent very fast. In my experience on Kaggle, what I got this way within, say, 30 minutes was only a little worse than hours of full Bayesian optimization (saving state before the 8-hour limit is up, reloading for another 8-hour run, and so on). A little worse matters a lot on Kaggle, so I would not use it as a final Kaggle solution when I have plenty of time, but it has been my starting point when I don't.

It's an interesting question how this stacks up against the alternatives you mention. Am I right in thinking that all those tricks can also be applied to one-parameter-at-a-time tuning?

  • The sub-model trick can mostly be used for the number of trees with xgboost (I suppose one could try a post-hoc L1-/L2-penalization of the end nodes or something, but I don't think that's available out of the box), and that is already effectively part of what I described: early stopping, unless one sets the patience too low, does essentially that for the number of trees, right? (A rough sketch of that early-stopping setup is below this list.)
  • Parallel processing can also be used for one-parameter-at-a-time tuning, across the whole slice along one parameter (i.e. all candidate values for all cross-validation folds can run in parallel). So as long as I don't have an absurdly powerful system, I can keep all cores busy nearly all the time (say I'm trying 20 parameter values across 5 CV folds, but at work it's hard for me to get more than 16 CPUs at once, and privately I have fewer).
  • Racing is a cool idea, but I think it could also be used within one-parameter-at-a-time tuning. In fact, I've wondered whether it could be used in Bayesian optimization as well (missing fold results are simply a missing-data problem).
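For reference, the early-stopping setup I mean in the first bullet is roughly this (a sketch; the specific numbers are placeholders, and I am going by the parsnip xgboost engine arguments as I understand them):

    library(tidymodels)

    # Fit many trees, but stop once the held-out metric has not improved for
    # stop_iter rounds; `validation` is the fraction held out for that check.
    spec_es <- boost_tree(
      trees = 5000,
      learn_rate = 0.1,
      stop_iter = 50
    ) %>%
      set_engine("xgboost", validation = 0.2) %>%
      set_mode("regression")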

So I assume that, as long as I apply the same tricks to the one-parameter-at-a-time approach, it might still be faster (but with a lower performance ceiling). A good empirical study of that might be useful, I suppose.
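Concretely, I would expect swapping tune_grid() for finetune::tune_race_anova() inside the per-parameter loop from my first post to be nearly a drop-in change, roughly like this (same placeholder names as before; again just a sketch):

    library(finetune)
    doParallel::registerDoParallel()   # the slice (values x folds) keeps all cores busy

    res <- tune_race_anova(
      wf,
      resamples = folds,
      grid = grid,                     # one varying column, the rest held fixed
      metrics = metric_set(rmse),
      control = control_race()
    )
    current[[param]] <- select_best(res, metric = "rmse")[[param]]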
