I apologize in advance for not having a good reprex available…
I’ve been trying to use tidymodels for machine learning at my workplace. Most of the tasks involve classification on heavily imbalanced datasets. For some reason, the XGBoost models I create always seem to predict the minority class exclusively, and always with a very high probability. In one model I had about 4,000 churn cases and 260,000 non-churn cases in the dataset before a stratified split. Still, the model predicted every single case in the test set to be churn, with a probability of over 90% in almost every case. I would have understood if it had predicted every case to be non-churn, but not this. When I use a random forest instead, I get the opposite result: every single case is predicted to be non-churn, which makes more sense.
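To put the imbalance in numbers, here is a minimal base R sketch using the rounded counts quoted above. The variable names are illustrative assumptions and the counts are the approximate figures from the post, not exact values.

```r
# Approximate class counts from the post (rounded, not exact)
n_churn     <- 4000
n_non_churn <- 260000
n_total     <- n_churn + n_non_churn

# Share of churn cases in the full dataset
prevalence <- n_churn / n_total

# Accuracy of two trivial classifiers on data with this class balance
acc_all_non_churn <- n_non_churn / n_total  # always predict non-churn
acc_all_churn     <- n_churn / n_total      # always predict churn

round(c(prevalence        = prevalence,
        acc_all_non_churn = acc_all_non_churn,
        acc_all_churn     = acc_all_churn), 4)
```

With roughly 1.5% churn, a model that always predicts non-churn would be right about 98.5% of the time, while one that always predicts churn would be right only about 1.5% of the time. That is why the all-churn result looks so odd.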
Is there something in particular I have to do when preparing a dataset to get more sensible predictions? I’ve tried coding the outcome as churn = 1 and non-churn = 0, and also as churn = “Churn” and non-churn = “Non-churn”. In both cases the outcome variable was converted to a factor. All other string variables, if any, were also converted to factors.
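The two outcome encodings described above can be sketched in base R as follows. The toy vector is a made-up assumption, not the real data. The point is only to show which factor level ends up first under each encoding. That may be worth checking, because yardstick metrics treat the first level as the event of interest by default.

```r
# Toy outcome vector (hypothetical values, not the real data): 1 = churn
churn_raw <- c(0, 0, 1, 0, 0, 0, 1, 0)

# Encoding 1: numeric 0/1 converted to a factor
churn_num <- factor(churn_raw)

# Encoding 2: labelled strings converted to a factor
churn_chr <- factor(ifelse(churn_raw == 1, "Churn", "Non-churn"))

# factor() sorts levels, so the first level differs between the encodings
levels(churn_num)  # "0" "1"              -> non-churn is first
levels(churn_chr)  # "Churn" "Non-churn"  -> churn is first

# The level order can also be set explicitly instead of relying on sorting
churn_explicit <- factor(ifelse(churn_raw == 1, "Churn", "Non-churn"),
                         levels = c("Churn", "Non-churn"))

# Class counts per level
table(churn_explicit)
```

Setting `levels` explicitly makes the positive class unambiguous regardless of how the labels happen to sort.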
The code I’ve used is a modified, or straight-up copied, version of the excellent tutorials put out by Julia Silge.
Thanks in advance for any advice on what I’m doing wrong.
Below is an example of my latest code:
library(tidymodels)

# Split train/test
set.seed(42)
churn_split <- initial_split(df, strata = churn)
churn_train <- training(churn_split)
churn_test <- testing(churn_split)

# Cross validation
set.seed(234)
churn_folds <- vfold_cv(churn_train, strata = churn)
churn_folds

# Recipe
xgboost_recipe <-
  recipe(formula = churn ~ ., data = churn_train) %>%
  update_role(comp_id, new_role = "ID") %>%
  step_novel(all_nominal(), -all_outcomes()) %>%
  step_dummy(all_nominal(), -all_outcomes(), one_hot = TRUE) %>%
  step_zv(all_predictors())
xgboost_recipe

# Model specification
xgboost_spec <-
  boost_tree() %>%
  set_mode("classification") %>%
  set_engine("xgboost")
xgboost_spec

# Workflow specification
xgboost_workflow <-
  workflow() %>%
  add_recipe(xgboost_recipe) %>%
  add_model(xgboost_spec)
xgboost_workflow

# Fit the workflow on the resamples and time it
start_time <- Sys.time()
doParallel::registerDoParallel()
set.seed(24584)
xgboost_res <- fit_resamples(
  xgboost_workflow,
  resamples = churn_folds,
  control = control_resamples(save_pred = TRUE)
)
end_time <- Sys.time()
end_time - start_time

xgboost_res
collect_metrics(xgboost_res)