(Logistic Regression) Unexpected odds ratio when looking at a table of input data before modeling

Not sure if this is the right forum for this kind of question but here goes.

I have a data frame with 2 columns: churned & auto_renew. churned is 1 if a user left us, 0 otherwise. auto_renew is also boolean: True if a user's account is on auto renew, False otherwise.

Before doing any modeling, here is the data grouped by churned and auto renew:

| Auto_Renew | NotChurn | Churn  | Rate |
|------------|----------|--------|------|
| False      | 280335   | 219241 | 0.44 |
| True       | 1314651  | 185773 | 0.12 |

So, I can see from this that being on auto renew (True) is associated with less churn.

I created a classification model using logistic regression with one feature: churn ~ auto_renewTRUE

Given the table above, I expected the coefficient for this feature to be negative (i.e. an odds ratio below 1), since being on auto renew should lower the probability of churn. However...
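As a rough check of that expectation, the odds ratio implied by the table can be computed by hand. This is only a sketch using the counts from the table above; the variable names are made up for illustration.

```r
# Counts copied from the table above (names are illustrative only)
not_churn <- c(no_auto_renew = 280335, auto_renew = 1314651)
churn     <- c(no_auto_renew = 219241, auto_renew = 185773)

# Odds of churn within each auto renew group
odds_churn <- churn / not_churn

# Odds ratio of churn: auto renew TRUE relative to auto renew FALSE
or_churn <- unname(odds_churn["auto_renew"] / odds_churn["no_auto_renew"])

# The matching logistic regression coefficient is the log of that ratio
log_or_churn <- log(or_churn)

print(round(or_churn, 3))      # roughly 0.181
print(round(log_or_churn, 3))  # roughly -1.711
```

So if the model were estimating the probability of churn, a coefficient of about -1.71 (odds ratio of about 0.18) would be the expected result.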

Using caret

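library(caret)  # train(), trainControl() and the summary functions
library(dplyr)  # %>% and select()
# prSummary() also needs the MLmetrics package to be installed

# Combine caret's built-in summaries into one set of metrics
my_summary <- function(data, lev = NULL, model = NULL) {
  a1 <- defaultSummary(data, lev, model)
  b1 <- twoClassSummary(data, lev, model) # Regular ROC AUC
  c1 <- prSummary(data, lev, model) # precision recall AUC
  c(a1, b1, c1)
}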

## tuning & parameters
set.seed(123)
train_control <- trainControl(
  method = "cv", # cross validation
  number = 5, # 5 folds
  savePredictions = TRUE,
  verboseIter = TRUE, 
  classProbs = TRUE, # will use these for model plots later
  summaryFunction = my_summary
)

# just single predictor variable auto_renew
sink_model = train(
  x = training_data %>% select(auto_renewal), # TRUE if on auto renew, FALSE otherwise
  y = target_churned, # X1 if churned, X0 otherwise
  trControl = train_control,
  method = "glm", # logistic regression
  family = "binomial",
  metric = "AUC"
)

Here is a summary of the model

> summary(sink_model)

Call:
NULL

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.0440   0.5141   0.5141   0.5141   1.0750  

Coefficients:
                  Estimate Std. Error z value Pr(>|z|)    
(Intercept)       0.245814   0.002851   86.22   <2e-16 ***
auto_renewal_flag 1.710987   0.003778  452.90   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2015433  on 1999999  degrees of freedom
Residual deviance: 1808753  on 1999998  degrees of freedom
AIC: 1808757

Number of Fisher Scoring iterations: 4

If I calculate the odds ratio for this coefficient, exp(1.710987), I get 5.5.

If I'm understanding this correctly, the model is telling me that when a user is on auto renew (auto_renew = TRUE), the odds of the target variable being churn are 5.5 times higher than when auto_renew is FALSE. This is the opposite of what I expected.
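One way to sanity-check this is to work out, from the table alone, what a logistic regression would estimate if the outcome being modeled were "not churned" rather than "churned". A minimal sketch, with the counts copied from the table and variable names chosen only for illustration:

```r
# Counts copied from the table at the top of the post
not_churn_false <- 280335;  churn_false <- 219241  # auto renew FALSE
not_churn_true  <- 1314651; churn_true  <- 185773  # auto renew TRUE

# Log-odds of NOT churning in the baseline (auto renew FALSE) group
intercept_not_churn <- log(not_churn_false / churn_false)

# Log odds ratio of NOT churning, auto renew TRUE relative to FALSE
coef_not_churn <- log((not_churn_true / churn_true) /
                      (not_churn_false / churn_false))

print(round(c(intercept = intercept_not_churn, coef = coef_not_churn), 4))
print(round(exp(coef_not_churn), 2))
```

Both values come out close to the (Intercept) and auto_renewal_flag estimates in the summary above, which would be consistent with the fitted model describing the odds of not churning.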

One thing I noticed when caret finished fitting the model was this message:
"There were missing values in resampled performance measures. Aggregating results"

Not sure what this means. Note there are no missing values/NAs in the data.

Here is how the training data have been encoded by caret

> sink_model$trainingData %>% glimpse()
Observations: 2,000,000
Variables: 2
$ auto_renewal_flag <lgl> TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE,…
$ .outcome          <fct> X0, X0, X1, X0, X0, X0, X1, X0, X0, X0, X0, X0, X0, X0, X…

Have I interpreted my model correctly? How can it be that this logistic regression with a single feature associates being on auto renew with a higher probability of churn?

It is probably related to the fact that glm models the probability of the second level of the outcome factor variable (X1 maybe?).
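This behaviour can be reproduced on a small synthetic data set. For a factor response, R's binomial family treats the first level as "failure" and any other level as "success", so the level order decides the sign of the coefficient. The data and names below are invented for illustration and only loosely mimic the churn rates in the question.

```r
set.seed(1)

# Synthetic data: churn is rarer when auto renew is TRUE
n <- 2000
auto_renew <- rep(c(TRUE, FALSE), each = n / 2)
churned <- rbinom(n, size = 1, prob = ifelse(auto_renew, 0.12, 0.44))

# Same outcome, two different level orders
y_x1_first <- factor(ifelse(churned == 1, "X1", "X0"), levels = c("X1", "X0"))
y_x0_first <- factor(y_x1_first, levels = c("X0", "X1"))

# glm models the probability of the level that is NOT first
fit_x1_first <- glm(y_x1_first ~ auto_renew, family = binomial)  # models P(X0)
fit_x0_first <- glm(y_x0_first ~ auto_renew, family = binomial)  # models P(X1)

coef_x1_first <- unname(coef(fit_x1_first)["auto_renewTRUE"])
coef_x0_first <- unname(coef(fit_x0_first)["auto_renewTRUE"])

print(coef_x1_first)  # positive: auto renew raises the odds of X0 (no churn)
print(coef_x0_first)  # negative: auto renew lowers the odds of X1 (churn)
```

The two coefficients have the same magnitude and opposite signs, so exponentiating one gives the reciprocal of the other odds ratio.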

As for the warning, we don't have the raw data so it is hard to say. It usually implies that the performance estimates failed for some reason. For example, the model may have predicted all samples as a single class, so something like sensitivity or specificity could not be computed.

Hi Max.

When you say second level, do you mean e.g.
levels(target_churned)
[1] "X1" "X0"

Would the model be predicting towards X0 in this case? When I preprocessed the data I deliberately set the levels of "target_churn" with levels = c(X1, X0), in the belief that this ensured that the first level would be X1, which I thought meant that caret would treat this first level as the True class.

Here is a sample of data if it helps to diagnose.

On this sample, the following code chunks should be able to replicate the issue that I'm seeing:

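library(dplyr)  # group_by(), summarise(), mutate(), select(), %>%
library(tidyr)  # spread()

# Churn counts and churn rate by auto renew flag
example_sample %>% 
  group_by(auto_renewal_flag, target_churned) %>% 
  summarise(size = n()) %>% 
  spread(target_churned, size) %>% 
  mutate(Rate = round(X1 / (X0 + X1), 2)) %>% 
  select(auto_renewal_flag, X0, X1, Rate)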

Per the table this produces, auto renew flag = TRUE is associated with less churn.

Then a model:

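library(caret)  # train(), trainControl(), summary functions; prSummary() also needs MLmetrics installed

# Combine caret's built-in summaries into one set of metrics
my_summary <- function(data, lev = NULL, model = NULL) {
  a1 <- defaultSummary(data, lev, model)
  b1 <- twoClassSummary(data, lev, model) # Regular ROC AUC
  c1 <- prSummary(data, lev, model) # precision recall AUC
  c(a1, b1, c1)
}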

## tuning & parameters
set.seed(123)
train_control <- trainControl(
  method = "cv", # cross validation
  number = 5, # 5 folds
  savePredictions = TRUE,
  verboseIter = TRUE, 
  classProbs = TRUE, # will use these for model plots later
  summaryFunction = my_summary
)

sink_model = train(
  x = example_sample %>% select(auto_renewal_flag),
  y = example_sample$target_churned,
  trControl = train_control,
  method = "glm", # logistic regression
  family = "binomial",
  metric = "AUC"
)

Then summary(sink_model):

> summary(sink_model)

Call:
NULL

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.0123   0.5322   0.5322   0.5322   1.0674  

Coefficients:
                      Estimate Std. Error z value Pr(>|z|)    
(Intercept)            0.26448    0.04027   6.568 5.11e-11 ***
auto_renewal_flagTRUE  1.61854    0.05279  30.661  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 10223.2  on 9999  degrees of freedom
Residual deviance:  9282.1  on 9998  degrees of freedom
AIC: 9286.1

Number of Fisher Scoring iterations: 4

And the exp of the coefficient:
exp(1.61854) = ~5

And if it's important:
levels(example_sample$target_churned)
[1] "X1" "X0"

It's this piece which is really confusing me: exp(1.61854) = ~5. My understanding is that it means that the odds of churn actually increase when auto renewal is True.

Any help or guidance very much welcome.

I reordered the levels per your comment and this appears to have solved the issue. glm models the probability of the second factor level, which was X0 in my case, not X1.
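A minimal sketch of that reordering on a toy factor (the values are made up; only the level names match the thread):

```r
# Toy outcome with X1 as the first level, as in the original setup
target_churned <- factor(c("X1", "X0", "X0", "X1", "X0"),
                         levels = c("X1", "X0"))
print(levels(target_churned))  # "X1" "X0"

# Move X0 to the front so that X1 becomes the second level,
# i.e. the level whose probability glm models
releveled <- relevel(target_churned, ref = "X0")
print(levels(releveled))       # "X0" "X1"

# Only the level order changes; the observations themselves do not
print(all(as.character(releveled) == as.character(target_churned)))
```

One caveat, as far as I know: caret's twoClassSummary and prSummary treat the first level as the event of interest, so after this change metrics such as sensitivity, precision and recall may describe X0 rather than X1. It is worth checking which class they refer to.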
