Not sure if this is the right forum for this kind of question but here goes.
I have a data frame with two columns: churned and auto_renew. churned is 1 if a user left us and 0 otherwise. auto_renew is also boolean: True if a user's account is on auto renew, False otherwise.
Before doing any modeling, here is the data grouped by churned and auto renew:
| Auto_Renew | NotChurn | Churn | Rate |
|------------|----------|--------|------|
| False | 280335 | 219241 | 0.44 |
| True | 1314651 | 185773 | 0.12 |
So, I can see from this that being on auto renew (True) is associated with less churn.
I created a classification model using logistic regression with one feature: churn ~ auto_renewTRUE
Given the table above, I expected the coefficient (the log odds ratio) to be negative, i.e. an odds ratio below 1, since being on auto renew should lower the probability of churn. However...
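To make that expectation concrete, here is a minimal sketch that computes the odds ratio of churn directly from the counts in the table. The data frame and variable names (`counts`, `odds_churn`, `odds_ratio`, `log_odds_ratio`) are made up for illustration and are not part of the modelling code.

```r
# Counts copied from the grouped table above
counts <- data.frame(
  auto_renew = c(FALSE, TRUE),
  not_churn  = c(280335, 1314651),
  churn      = c(219241, 185773)
)

# Odds of churning within each auto-renew group
counts$odds_churn <- counts$churn / counts$not_churn

# Odds ratio for auto_renew = TRUE relative to FALSE, and its log.
# The log is what a logistic regression of churn on auto_renew
# would be expected to estimate as the coefficient.
odds_true      <- counts$odds_churn[counts$auto_renew]
odds_false     <- counts$odds_churn[!counts$auto_renew]
odds_ratio     <- odds_true / odds_false
log_odds_ratio <- log(odds_ratio)

print(round(c(odds_ratio = odds_ratio, log_odds_ratio = log_odds_ratio), 3))
```

With these counts the odds ratio comes out at roughly 0.18 and its log at roughly -1.71, so this is the size and sign I would expect for the coefficient if the model is predicting churn.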
Using caret
    library(caret)
    library(dplyr)

    # Combine several caret summary functions into one set of metrics
    my_summary <- function(data, lev = NULL, model = NULL) {
      a1 <- defaultSummary(data, lev, model)
      b1 <- twoClassSummary(data, lev, model) # regular ROC AUC
      c1 <- prSummary(data, lev, model)       # precision-recall AUC
      c(a1, b1, c1)
    }

    ## tuning & parameters
    set.seed(123)
    train_control <- trainControl(
      method = "cv",          # cross validation
      number = 5,             # 5 folds
      savePredictions = TRUE,
      verboseIter = TRUE,
      classProbs = TRUE,      # will use these for model plots later
      summaryFunction = my_summary
    )

    # just a single predictor variable: the auto renew flag
    sink_model <- train(
      x = training_data %>% select(auto_renewal_flag), # TRUE if on auto renew, FALSE otherwise
      y = target_churned,     # X1 if churned, X0 otherwise
      trControl = train_control,
      method = "glm",         # logistic regression
      family = "binomial",
      metric = "AUC"
    )
Here is a summary of the model
    > summary(sink_model)

    Call:
    NULL

    Deviance Residuals:
        Min       1Q   Median       3Q      Max
    -2.0440   0.5141   0.5141   0.5141   1.0750

    Coefficients:
                      Estimate Std. Error z value Pr(>|z|)
    (Intercept)       0.245814   0.002851   86.22   <2e-16 ***
    auto_renewal_flag 1.710987   0.003778  452.90   <2e-16 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    (Dispersion parameter for binomial family taken to be 1)

        Null deviance: 2015433  on 1999999  degrees of freedom
    Residual deviance: 1808753  on 1999998  degrees of freedom
    AIC: 1808757

    Number of Fisher Scoring iterations: 4
If I calculate the odds ratio for this coefficient, exp(1.710987), I get about 5.5.
If I'm understanding this correctly, the model is telling me that when an account is on auto renew (auto_renew = TRUE), the odds of the target being churn are 5.5 times higher than when auto_renew is FALSE. This is the opposite of what I expected.
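As a sanity check on that reading, the reported coefficients can be turned back into fitted probabilities and set next to the rates from the table. This is only a sketch using the numbers printed above; the variable names (`intercept`, `slope`, `p_event`, `comparison`) are made up for illustration.

```r
# Coefficients copied from the summary() output above
intercept <- 0.245814
slope     <- 1.710987

# The inverse logit turns log-odds into the probability of whichever
# outcome level the model treats as the event
p_event <- c(
  plogis(intercept),         # auto renew = FALSE
  plogis(intercept + slope)  # auto renew = TRUE
)

# Churn rates from the grouped table, for comparison
churn_rate <- c(
  219241 / (280335 + 219241),   # auto renew = FALSE
  185773 / (1314651 + 185773)   # auto renew = TRUE
)

comparison <- data.frame(
  auto_renew     = c(FALSE, TRUE),
  p_event        = round(p_event, 3),
  churn_rate     = round(churn_rate, 3),
  not_churn_rate = round(1 - churn_rate, 3)
)
print(comparison)
```

The fitted probabilities come out near 0.56 and 0.88, which can be compared against the churn_rate and not_churn_rate columns. For a factor response, glm() models the probability of any level other than the first, so checking levels(target_churned) may help confirm which outcome the coefficient refers to.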
One thing I noticed when caret finished fitting the model was this message:
"There were missing values in resampled performance measures. Aggregating results"
Not sure what this means. Note there are no missing values/NAs in the data.
Here is how the training data have been encoded by caret:

    > sink_model$trainingData %>% glimpse()
    Observations: 2,000,000
    Variables: 2
    $ auto_renewal_flag <lgl> TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE,…
    $ .outcome          <fct> X0, X0, X1, X0, X0, X0, X1, X0, X0, X0, X0, X0, X0, X0, X…
Have I interpreted my model correctly? How can a logistic regression with a single feature associate being on auto renew with a higher probability of churn?