xgboost accuracy changes with expand.grid vs specified parameters

Thank you in advance for the read. I have been working with a data set at work using xgboost (via caret), setting my seed for reproducibility and tuning the parameters. When I tune over a full expand.grid, I get higher accuracy from the model (and better predictions on my test set) than when I pass the same winning parameters (taken from model$bestTune) as a single-row expand.grid with no sequences. I've done my best to generate a reproducible example but am having a hard time doing so, which leads me to think the cause may be that my model is overfit. Please note that in my real-world model, I've already shrunk the expand.grid to a more focused size (in case that is someone's suggestion). I've also removed the seed to see how stable the model accuracy is, and it is quite variable: 76% on the test set is the highest I've seen, and 6 other models give 61%-73%.

Any ideas on why this is? In my real-world work, accuracy on the test set drops from 76% to about 71% with this one change. The test set is 20% of the data (n = 167).

In case it helps, the grid search is:
max_depth = c(3, 4, 5),
nrounds = seq(from = 25, to = 95, by = 10),
eta = c(0.025, 0.05, 0.1),
gamma = 0,
colsample_bytree = c(0.6,0.8),
min_child_weight = 1,
subsample = 1

The best tune is:
max_depth = 3,
nrounds = 65,
eta = 0.1,
gamma = 0,
colsample_bytree = 0.6,
min_child_weight = 1,
subsample = 1

Since I can't come up with a reprex that actually shows the problem (I tried three different times and got stable results), I am asking this in a more theoretical sense rather than "how do I make this code work."


A few basic questions:

How much data are there in training and testing? Is it classification and, if so, what are the frequencies of each class?

What were the details of resampling?

I did an 80/20 split of the data

  • training set: 674 observations
  • test set: 167 observations

I am looking at classification with frequencies close to 50/50 (and I did split the training and test set on the OUTCOME to preserve frequency of observation):
A tibble: 2 x 3
  OUTCOME protocols   pct
1 FALSE         394 0.468
2 TRUE          447 0.532

As far as resampling, I am using 3-fold cross-validation (method = "cv", number = 3).

My grid, control, and train call look like this:

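library(caret)   # train(), trainControl(); method = "xgbTree" also needs the xgboost package
library(dplyr)   # %>% and select()

# Single-row grid: the best tune reported by the full grid search
tunegrid <- expand.grid(max_depth = 3,
                        nrounds = 65,
                        eta = 0.1,
                        gamma = 0,
                        colsample_bytree = 0.6,
                        min_child_weight = 1,
                        subsample = 1)

# method = "cv" with number = 3 is a single round of 3-fold cross-validation
tunecontrol <- caret::trainControl(
  method = "cv",
  number = 3,
  classProbs = TRUE,
  returnData = TRUE,
  verboseIter = FALSE)

# Seed set immediately before train() so the resampling folds are reproducible
set.seed(2624)
xgtree_model <- train(x = train_baked %>% dplyr::select(-OUTCOME),
                      y = train_baked$OUTCOME,
                      trControl = tunecontrol,
                      tuneGrid = tunegrid,
                      method = "xgbTree")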

I was mainly wondering if the noise around the results (due to data size and other factors) makes it seem like the results are worse. That's not really the case.

Reading your post, though, it seems like you are optimizing on the test set (emphasis mine).

That's not a great idea for a lot of reasons. My main thought is that you are comparing resampling results to your test set results and that may not be a fair comparison.

If anything, using a single parameter combination on your test set greatly increases the risk of overfitting, since you don't have any independent way of validating your choice.
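One way to see why choosing parameters by test-set accuracy is risky is a small simulation. This is only a sketch under simplifying assumptions: all 144 combinations in the grid from the question (3 x 8 x 3 x 2) are given the same true accuracy of 70%, and their test-set results are treated as independent. Neither is true of real models, which share data and are correlated, and the 70% is an arbitrary illustrative value, so the size of the effect should not be taken literally.

```r
set.seed(1)                      # arbitrary seed so the sketch is repeatable
n_test   <- 167                  # test-set size from the question
n_combos <- 3 * 8 * 3 * 2        # max_depth x nrounds x eta x colsample_bytree = 144
true_acc <- 0.70                 # assumed: every candidate is equally good

# For each of 2000 simulated "tuning runs", score all candidates on the
# test set and keep the best observed accuracy.
best_acc <- replicate(2000, {
  observed <- rbinom(n_combos, size = n_test, prob = true_acc) / n_test
  max(observed)
})

print(mean(best_acc))             # average accuracy of the "winner"
print(mean(best_acc) - true_acc)  # optimism caused purely by selection
```

In this setup the winning candidate looks several points better than its true 70% even though no candidate is actually better, which is why a score used for selection can't also serve as the final performance estimate.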

Yeah, it definitely isn't ideal. The data set is small, which makes it hard to carve out two held-out sets: one to optimize parameters and another for validation. Looking at the accuracy of the fit model, it fluctuates about as much as the test-set accuracy does and is at a similar level (which is why I wasn't thinking overfitting in the first place).
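For a sense of scale on that variability, here is a minimal sketch that treats test-set accuracy as a binomial proportion and uses a normal approximation. The helper name acc_ci is made up for illustration; the 76% and n = 167 are the figures quoted earlier in the thread.

```r
# Approximate confidence interval for an accuracy measured on n test cases.
# acc_ci is a hypothetical helper (not part of caret or xgboost).
acc_ci <- function(acc, n, level = 0.95) {
  z  <- qnorm(1 - (1 - level) / 2)     # about 1.96 for a 95% interval
  se <- sqrt(acc * (1 - acc) / n)      # binomial standard error
  c(lower = acc - z * se, upper = acc + z * se)
}

ci <- acc_ci(acc = 0.76, n = 167)
print(round(ci, 3))
```

Under these assumptions the interval runs from roughly 0.70 to 0.82, so a move from 76% to 71% is within what a test set of this size could produce by chance alone.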

Thanks for the feedback. I'll go back to the drawing board and see if I can better parse out this data set for fit, test, and validation.
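A minimal sketch of one such three-way split in base R, stratified on the outcome. The 60/20/20 proportions, the helper name split_labels, and the toy data frame (built only to match the class counts above) are assumptions for illustration; with caret loaded, createDataPartition() can do the stratified sampling instead.

```r
set.seed(2624)

# Toy stand-in for the real data: 394 FALSE and 447 TRUE outcomes.
dat <- data.frame(id = 1:841,
                  OUTCOME = rep(c(FALSE, TRUE), times = c(394, 447)))

# Hypothetical helper: label n rows as 60% fit, 20% validation, 20% test,
# then shuffle the labels so assignment within the class is random.
split_labels <- function(n) {
  labs <- cut(seq_len(n) / n, breaks = c(0, 0.6, 0.8, 1),
              labels = c("fit", "validation", "test"))
  as.character(sample(labs))
}

# Apply the helper separately within each outcome class, so every
# partition keeps roughly the same class balance.
dat$part <- ave(as.character(dat$OUTCOME), dat$OUTCOME,
                FUN = function(x) split_labels(length(x)))

print(table(dat$part, dat$OUTCOME))
```

The validation partition is then the one used for choosing parameters, and the test partition is touched only once at the end. With a data set this small, each partition's estimate will still be noisy.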
