Hello all. I need some help with a classification problem.
SOME CONTEXT:
The goal is to predict whether a customer will subscribe to a term deposit. The target variable "y" is either "yes" or "no". From a business standpoint, I believe the "yes" class is the one I should care about most. The dataset is highly imbalanced, with no = 89% and yes = 11%. Note that the "duration" variable should be excluded, since duration is only known AFTER the outcome y is known.
MY QUESTIONS:
- I'm a bit confused by the sensitivity and specificity metrics reported for the model. In the confusion matrix output, the "sensitivity" appears to be the proportion of "no" cases that were classified as "no", while the "specificity" appears to be the proportion of "yes" cases that were classified as "yes". Shouldn't this be flipped, so that the sensitivity is the proportion of "yes" cases that were classified as "yes"?
- Given the current output, I believe the metric I should focus on is the one reported as specificity. Right now the highest specificity I can achieve is about 70%. The code below is for the logistic model, but I have also tried KNN and Random Forest. Those models have higher overall accuracy but perform worse than the logistic model in terms of specificity. Are there additional steps I can take to improve my model's specificity? One thing I thought of was interactions, but I'm not sure how to look for them in a logistic regression.
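On the first question, a quick toy check may help pin down what is being reported. This is only a sketch on made-up labels (not the bank data). It assumes that yardstick treats the first factor level as the event unless `event_level` says otherwise, which here would be "no" because the levels sort alphabetically:

```r
library(yardstick)

# Hypothetical toy labels, only to illustrate which level counts as the event
truth <- factor(c("no", "no", "no", "yes", "yes"), levels = c("no", "yes"))
pred  <- factor(c("no", "no", "yes", "yes", "no"), levels = c("no", "yes"))

# The first level is "no"
levels(truth)

# Default (event_level = "first"): sensitivity is computed for "no"
sens_default <- sens_vec(truth, pred)

# Treat the second level ("yes") as the event instead
sens_yes <- sens_vec(truth, pred, event_level = "second")

print(c(default = sens_default, yes_as_event = sens_yes))
```

If that assumption holds, the same `event_level = "second"` argument (or releveling y so that "yes" comes first) would make the reported sensitivity refer to the "yes" class.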
REPRODUCIBLE CODE:
The code below does the following: load the data, split it into train/test sets, create CV folds, create a recipe, create a workflow, create a tuning grid, train the model, collect metrics, and print the confusion matrix.
# Packages
library(tidyverse)
library(tidymodels)
# Import Data ##################################################################
temp <- tempfile()
url <-
  "https://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank-additional.zip"
# mode = "wb" keeps the zip archive intact on Windows
download.file(url, temp, mode = "wb")
# Read the CSV directly from the downloaded archive (no separate unzip() call needed)
data <- read.csv(unz(temp, "bank-additional/bank-additional-full.csv"),
                 header = TRUE,
                 sep = ";")
data <-
  data %>%
  as_tibble()
# Class Imbalance
data %>%
  count(y) %>%
  mutate(prop = n / sum(n))
# MODELING #####################################################################
# Mutate chr to fct
data <-
  data %>%
  mutate_if(is.character, as.factor)
# Split data into training and testing
set.seed(125)
split <-
  initial_split(data, prop = 0.80, strata = y)
train_data <-
  training(split)
test_data <-
  testing(split)
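# Cross Validation Spec
# Folds are built from the training set only, so the test set stays untouched;
# strata = y keeps the yes/no ratio similar across folds
set.seed(138)
cv <- vfold_cv(train_data, v = 10, strata = y)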
# Recipe
# Left unprepped: the workflow preps the recipe itself within each resample
recipe_logistic <-
  recipe(y ~ ., data = train_data) %>%
  step_rm(duration) %>%
  step_zv(all_predictors()) %>%
  step_YeoJohnson(all_numeric()) %>%
  step_normalize(all_numeric()) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  themis::step_upsample(y)
# Model Spec
logistic_model <-
  logistic_reg(penalty = tune(), mixture = 1) %>%
  set_engine("glmnet") %>%
  set_mode("classification")
# Workflow
workflow <-
  workflow() %>%
  add_model(logistic_model) %>%
  add_recipe(recipe_logistic)
# Logistic Regression Grid
grid <-
  tibble(penalty = 10^seq(-4, -2,
                          length.out = 20))
# Train Model
set.seed(194)
results <-
  workflow %>%
  tune_grid(cv,
            grid = grid,
            control = control_grid(save_pred = TRUE),
            metrics = metric_set(roc_auc, accuracy, sensitivity, specificity))
# Collect Metrics
results %>%
  show_best(metric = "roc_auc") %>%
  arrange(desc(mean))
# Best Parameters (selected on the same metric shown above)
best_param <-
  results %>%
  select_best(metric = "roc_auc")
# Logistic Regression Confusion Matrix
results %>%
  collect_predictions(parameters = best_param) %>%
  conf_mat(y, .pred_class) %>%
  autoplot(type = "heatmap") +
  labs(title = "Confusion Matrix")
Thanks in advance and happy to answer any additional questions.