My XGBoost model always predicts the minority class

I apologize in advance for not having a good reprex available…

I’ve been using tidymodels for machine learning at my workplace. Most of the tasks involve classification on heavily imbalanced datasets. For some reason, the XGBoost models I create always seem to predict the minority class exclusively, and with very high probability. In one model I had about 4,000 churn cases and 260,000 non-churn cases in the dataset before a stratified split. Still, the model predicted every single case in the test set to be churn, with a probability of over 90% in almost every case. I would have understood if it had predicted every case to be non-churn, but not this. When I use a random forest instead, I get the opposite result: every single case is predicted to be non-churn, which makes more sense.
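To put the numbers above in perspective, here is a small base R calculation of the class balance. It uses only the approximate counts quoted in the post (about 4,000 churn and 260,000 non-churn); the variable names are made up for illustration.

```r
# Approximate class counts quoted in the post (before the split)
n_churn <- 4000
n_non_churn <- 260000

# Share of churn cases in the data (the base rate)
base_rate <- n_churn / (n_churn + n_non_churn)
round(base_rate, 4)  # 0.0152

# Accuracy of the two trivial classifiers on data with this balance
acc_all_churn <- base_rate          # predict "churn" for everyone
acc_all_non_churn <- 1 - base_rate  # predict "non-churn" for everyone
```

With a base rate of roughly 1.5%, predicted churn probabilities above 90% for nearly every case are far from what the data would suggest.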

Is there something in particular I have to do when preparing a dataset to get more sensible predictions? I’ve tried coding the outcome as churn = 1 and non-churn = 0, and also as churn = “Churn” and non-churn = “Non-churn”. In both cases the outcome variable was converted to a factor. All other string variables, if any, are also converted to factors.
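One thing that may be worth checking is the order of the outcome's factor levels, since yardstick treats the first level as the event of interest by default (`event_level = "first"`). The sketch below uses base R on a hypothetical toy data frame `df_toy`; only the `churn` column name and its labels come from the post, and this is a diagnostic check, not a confirmed cause of the problem.

```r
# Hypothetical toy data standing in for the real `df`
set.seed(1)
df_toy <- data.frame(
  churn = factor(
    sample(c("Churn", "Non-churn"), size = 1000, replace = TRUE,
           prob = c(0.015, 0.985)),
    levels = c("Churn", "Non-churn")  # explicit level order
  )
)

# Order of the factor levels; the first one is the default "event"
churn_levels <- levels(df_toy$churn)
churn_levels

# Class counts and proportions
class_counts <- table(df_toy$churn)
class_props <- prop.table(class_counts)
class_counts
round(class_props, 3)
```

If the outcome is coded as a factor of 0 and 1, the levels sort as "0", "1", so "0" (non-churn) would be the first level unless the order is set explicitly.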

The code I’ve used has been modified, or straight up copied, versions of the excellent tutorials put out by Julia Silge.

Thanks in advance for any advice on what I’m doing wrong.

Below is an example of my latest code:

# Load tidymodels (rsample, recipes, parsnip, workflows, tune, yardstick)
library(tidymodels)

# Split train/test
set.seed(42)
churn_split <- initial_split(df, strata = churn)
churn_train <- training(churn_split)
churn_test <- testing(churn_split)

# Cross validation
set.seed(234)
churn_folds <- vfold_cv(churn_train, strata = churn)
churn_folds

# Recipe
xgboost_recipe <- 
  recipe(formula = churn ~ ., data = churn_train) %>%
  update_role(comp_id, new_role = "ID") %>%
  step_novel(all_nominal(), -all_outcomes()) %>% 
  step_dummy(all_nominal(), -all_outcomes(), one_hot = TRUE) %>% 
  step_zv(all_predictors())

xgboost_recipe

# Model specification
xgboost_spec <- 
  boost_tree() %>% 
  set_mode("classification") %>% 
  set_engine("xgboost") 

xgboost_spec

# Workflow specification
xgboost_workflow <- 
  workflow() %>% 
  add_recipe(xgboost_recipe) %>% 
  add_model(xgboost_spec) 

xgboost_workflow


start_time <- Sys.time()
doParallel::registerDoParallel()

set.seed(24584)
xgboost_res <- fit_resamples(
  xgboost_workflow,
  resamples = churn_folds,
  control = control_resamples(save_pred = TRUE)
)

end_time <- Sys.time()
end_time - start_time

xgboost_res

collect_metrics(xgboost_res)
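Because the predictions are saved with `save_pred = TRUE`, they can be inspected directly instead of relying only on the summary metrics. The sketch below assumes the yardstick package is installed and uses a hypothetical toy data frame `preds_toy` that mimics the columns `collect_predictions(xgboost_res)` would return for an outcome with levels "Churn" and "Non-churn"; the values are invented to reproduce the symptom described, not taken from the real model.

```r
library(yardstick)

# Hypothetical stand-in for collect_predictions(xgboost_res)
lv <- c("Churn", "Non-churn")
preds_toy <- data.frame(
  churn       = factor(c("Churn", rep("Non-churn", 5)), levels = lv),
  .pred_class = factor(rep("Churn", 6), levels = lv),
  .pred_Churn = c(0.97, 0.93, 0.95, 0.91, 0.96, 0.94)
)

# Confusion matrix: rows are predictions, columns are the truth
cm <- conf_mat(preds_toy, truth = churn, estimate = .pred_class)
cm

# Mean predicted churn probability within each true class
mean_prob_by_class <- tapply(preds_toy$.pred_Churn, preds_toy$churn, mean)
mean_prob_by_class
```

On the real results, the same calls would start from `collect_predictions(xgboost_res)`. If the mean predicted probability is equally high in both true classes, the model is not separating them at all.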
