Classification problem - questions about current metrics output and how to improve model

ML_Rookie_2021 · April 2, 2021, 1:49pm

Hello all. I need some help with a classification problem.

SOME CONTEXT:
The goal is to predict whether a customer will subscribe to a term deposit. The target variable "y" is either "yes" or "no". From a business standpoint, I believe the "yes" class is the one I should care about most. The dataset is highly imbalanced, with no = 89% and yes = 11%. Note that the "duration" variable should be excluded, since duration is only known AFTER the outcome y is known.

MY QUESTIONS:

I'm a bit confused by the sensitivity and specificity metrics from the model. When I output the confusion matrix, the "sensitivity" appears to be the proportion of "no" that were classified as "no", while the "specificity" appears to be the proportion of "yes" that were classified as "yes". Shouldn't this be flipped? Shouldn't the sensitivity be the proportion of "yes" that were classified as "yes"?
Given the current output, I believe I should be interested in the specificity. Right now the highest specificity I can achieve is about 70%. While my code below is for the logistic model, I have also tried KNN and Random Forest. Those models have higher overall accuracy, but they perform worse than the logistic model in terms of specificity. Are there additional steps I can take to improve my model's specificity? One thing I thought of was interactions; however, I'm not sure how to look for them in a logistic regression.
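
On the first question, a minimal sketch with made-up data that may explain the flip. yardstick treats the first factor level as the "event" by default, and "no" sorts before "yes", so sensitivity ends up describing the "no" class. The toy vectors below are hypothetical and are not taken from the bank dataset.

```r
library(yardstick)

# Hypothetical toy data: 3 true "no" and 2 true "yes"
toy <- data.frame(
  truth = factor(c("no", "no", "no", "yes", "yes"), levels = c("no", "yes")),
  pred  = factor(c("no", "no", "yes", "yes", "no"), levels = c("no", "yes"))
)

# Default: the first level ("no") is the event,
# so sensitivity = share of "no" classified as "no" = 2/3
sens_no_as_event <- sens(toy, truth = truth, estimate = pred)

# Treat the second level ("yes") as the event instead,
# so sensitivity = share of "yes" classified as "yes" = 1/2
sens_yes_as_event <- sens(toy, truth = truth, estimate = pred,
                          event_level = "second")

# With the default, specificity describes the "yes" class (1/2 here)
spec_no_as_event <- spec(toy, truth = truth, estimate = pred)

sens_no_as_event
sens_yes_as_event
spec_no_as_event
```

In the tuning code, the same effect could probably be had by releveling y so that "yes" comes first (for example with relevel(y, ref = "yes")) before splitting. Recent versions of tune also appear to accept event_level = "second" in control_grid(), though that is worth checking against the installed version.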

REPRODUCIBLE CODE:
The code below does the following: load data, split into train/test sets, create CV folds, create recipe, create workflow, create tuning grid, train model, collect metrics, print confusion matrix.

# Packages (themis and glmnet must also be installed; both are used below)

library(tidyverse)
library(tidymodels)

# Import Data ##################################################################

temp <- tempfile()
url <- 
  "https://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank-additional.zip"
download.file(url, temp)
# Read the CSV straight from the downloaded archive; no separate unzip step is needed
data <- read.csv(unz(temp, "bank-additional/bank-additional-full.csv"), 
                 header = TRUE,
                 sep = ";")

data <- 
  data %>% 
  as_tibble()


# Class Imbalance
data %>% count(y) %>% 
  mutate(prop = n/sum(n))

# MODELING #####################################################################

# Mutate chr to fct
data <- 
  data %>% 
  mutate_if(is.character, as.factor)

# Split data into training and testing
set.seed(125)
split <- 
  initial_split(data, prop = 0.80, strata = y)

train_data <- 
  training(split)

test_data <- 
  testing(split)


# Cross Validation Spec (folds come from the training set only, stratified on y)
set.seed(138)
cv <- vfold_cv(train_data, v = 10, strata = y)

# Recipe (left unprepped; the workflow preps it within each resample)
recipe_logistic <- 
  recipe(y ~ ., data = train_data) %>% 
  step_rm(duration) %>%
  step_zv(all_predictors()) %>%
  step_YeoJohnson(all_numeric()) %>% 
  step_normalize(all_numeric()) %>% 
  step_dummy(all_nominal(), -all_outcomes()) %>% 
  themis::step_upsample(y)


# Model Spec
logistic_model <- 
  logistic_reg(penalty = tune(), mixture = 1) %>% 
  set_engine("glmnet") %>% 
  set_mode("classification")

# Workflow
workflow <- 
  workflow() %>% 
  add_model(logistic_model) %>% 
  add_recipe(recipe_logistic)


# Logistic Regression Grid
grid <- 
  tibble(penalty = 10^seq(-4, -2, 
                          length.out = 20))


# Train Model 
set.seed(194)
results <- 
  workflow %>% 
  tune_grid(cv,
       grid    = grid,
       control = control_grid(save_pred = TRUE),
       metrics = metric_set(roc_auc, accuracy, sensitivity, specificity))

# Collect Metrics
results %>% 
  show_best(metric = "roc_auc") %>% 
  arrange(desc(mean))


# Best Parameters (metric named explicitly to match show_best above)
best_param <- 
  results %>% 
  select_best(metric = "roc_auc")

# Logistic Regression Confusion Matrix
results %>% 
  collect_predictions(parameters = best_param) %>% 
  conf_mat(y, .pred_class) %>% 
  autoplot(type = "heatmap")+
  labs(title = "Confusion Matrix")
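
On the interactions idea from the second question, one possible approach is recipes::step_interact(), added after step_dummy() so that dummy columns can take part in the interaction terms. This is only a sketch on made-up data: the toy data frame is hypothetical, and the choice of age, campaign, and contact as interacting columns is an assumption for illustration, not a finding about the bank data.

```r
library(recipes)

# Hypothetical toy data with column names borrowed from the bank dataset
set.seed(1)
toy <- data.frame(
  y        = factor(sample(c("no", "yes"), 50, replace = TRUE)),
  age      = rnorm(50, mean = 40, sd = 10),
  campaign = rpois(50, lambda = 2),
  contact  = factor(sample(c("cellular", "telephone"), 50, replace = TRUE))
)

rec_interact <- 
  recipe(y ~ ., data = toy) %>% 
  # Dummy-encode factors first so they can be used in interaction terms
  step_dummy(all_nominal(), -all_outcomes()) %>% 
  # Numeric x numeric, and numeric x every dummy column created from contact
  step_interact(terms = ~ age:campaign + age:starts_with("contact_"))

# Prep and bake on the toy data just to inspect the generated columns
baked <- rec_interact %>% prep() %>% bake(new_data = NULL)
names(baked)
```

The new columns are named with an "_x_" separator by default. With the lasso penalty already being tuned (mixture = 1), interaction terms that do not help may simply be shrunk to zero, which could make this a cheap way to screen a handful of candidates.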

Thanks in advance and happy to answer any additional questions.
