How to model with a binary response when I want to find best threshold for a predictor


#1

A very statisticy and open question(sorry!).

My data looks like this:

library(dplyr)

mtcars %>% 
  select(vs, wt, carb) %>% 
  mutate(wt = cut(wt, seq(0, 4, 0.1))) %>% 
  head(10) 
#>    vs        wt carb
#> 1   0 (2.6,2.7]    4
#> 2   0 (2.8,2.9]    4
#> 3   1 (2.3,2.4]    1
#> 4   1 (3.2,3.3]    1
#> 5   0 (3.4,3.5]    2
#> 6   1 (3.4,3.5]    1
#> 7   0 (3.5,3.6]    4
#> 8   1 (3.1,3.2]    2
#> 9   1 (3.1,3.2]    2
#> 10  1 (3.4,3.5]    4

Created on 2018-12-20 by the reprex package (v0.2.1)

I want to predict vs for every level of carb, while finding the optimal threshold for wt at each level of carb. How should I approach this?


#2

Can you comment on why you're dichotomizing the wt variable? Are you trying to find the best places to cut wt, or are you trying to find the best probability threshold within each group of carb?


#3

I agree with Alex. Dichotomizing continuous variables is a terrible idea, statistically speaking (there's even a great blog named about this issue):

https://livefreeordichotomize.com

but for some reasons doctors & nutritionists seem to be in love with it.

Are you sure you must dichotomize? If so, why? Also, can you explain the meaning of the various variables, the way they were measured (one measure for subject? Multiple measured for subjects? Measures are temporally correlated or can we assume them as independent?), and the hypothesis you want to test? Also, how do you define "optimal" for the wt threshold?


#4

FWIW this is mtcars, so it's (and, though no need to dichotomize, carburetors definitely aren't continuous…thank god shivers):

A data frame with 32 observations on 11 (numeric) variables.

[, 1] mpg Miles/(US) gallon
[, 2] cyl Number of cylinders
[, 3] disp Displacement (cu.in.)
[, 4] hp Gross horsepower
[, 5] drat Rear axle ratio
[, 6] wt Weight (1000 lbs)
[, 7] qsec 1/4 mile time
[, 8] vs Engine (0 = V-shaped, 1 = straight)
[, 9] am Transmission (0 = automatic, 1 = manual)
[,10] gear Number of forward gears
[,11] carb Number of carburetors

The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).


#5

Thanks Mara! Actually the variable being dichotomized is wt, which is continuous. But I should have read the name of the dataset with more attention before replying! Instead I saw carb and I thought:

  • carb = carbohydrates
  • wt = initial weight in 10s of lbs
  • vs = an indicator variable indicating the success of a diet/treatment

I have a crazy imagination! :stuck_out_tongue_winking_eye:

Ok, I take back all my questions concerning the dataset, but two questions still stand:

  • Why dichotomize wt?
  • What does it mean to find the optimal threshold for wt? If I were interested in predicting the probability of vs=1 conditional on wt & carb, I'd just naively fit a logistic regression (as long as data are not linearly separable)

#6

Thank you for taking the time! I'll add a bit more context.

A sample of the actual data:

tibble::tribble(
  ~cancer, ~pirads,              ~psad,
        1,     "5",   1.25862068965517,
        1,     "5",  0.903921568627451,
        1,     "3", 0.0791208791208791,
        0,     "3",  0.171428571428571,
        1,     "4",  0.272222222222222,
        0,     "4",  0.133783783783784,
        0,     "4",  0.103333333333333,
        0,     "3",                0.1,
        1,     "5",  0.744444444444444,
        0,     "5", 0.0476190476190476,
        1,     "4",  0.221739130434783,
        1,     "5",  0.205555555555556,
        0,     "3", 0.0394736842105263,
        1,     "5",   2.30769230769231,
        1,     "5",  0.116279069767442,
        0,     "5",   2.49117647058824,
        1,     "4",  0.220833333333333,
        0,     "5",  0.157142857142857,
        0,     "5", 0.0788732394366197,
        1,     "5",  0.186486486486486
  )

pirads - Likert score for prostate cancer on MR imaging.
psad - PSA density (PSA / prostate volume)

Every row represents one subject.

Not sure if I should, just seemed like a good idea at the time..:slight_smile:

I think I'll have to describe this with terminology I understand, hopefully it'll be easier for you to translate this into statistics than the other way around...

The PI-RADS score is (depending on local preferences) used to decide if a patient should be biopsied or not. Simplified: score 1-2 no biopsy, score 3-5 biopsy.
In this dataset the proportion of cancer for each score is approx 5% for score 1-2(combined), 30% for 3, 60% for 4 and 90% for 5.

In a clinical setting you'd use a threshold of 0.15 for PSA density regardless of PI-RADS score, if it's over this you're very likely to get a bunch of needles up your bum. I'd like to see how the "predictive power" for each level of PI-RADS changes for each 0.01 step increase in PSA density.

Also:

This is true. I couldn't find the source, but there was an assertion that this comes from old "habits" and the fact that we just can't remember 1000 different factors when making a prediction. E.g. TNM staging. Hopefully our AI overlords will help us out with this in the future.

Also also:
Thank you for this, @mara. Getting my data into the interwebs has been an everlasting pain since I started using R this summer!


#7

:fearful: prostate cancer datasets give me the willies, but I'll try to answer nonetheless:

# in answer to https://community.rstudio.com/t/how-to-model-with-a-binary-response-when-i-want-to-find-best-threshold-for-a-predictor/20314/
if (!require(dplyr)) {
  install.packages("dplyr")
  library(dplyr)
}
#> Loading required package: dplyr
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
if (!require(ggplot2)) {
  install.packages("ggplot2")
  library(ggplot2)
}
#> Loading required package: ggplot2

psad_sample <- tribble(
  ~cancer, ~pirads,        ~psad,
  1,     "5",   1.25862068965517,
  1,     "5",  0.903921568627451,
  1,     "3", 0.0791208791208791,
  0,     "3",  0.171428571428571,
  1,     "4",  0.272222222222222,
  0,     "4",  0.133783783783784,
  0,     "4",  0.103333333333333,
  0,     "3",                0.1,
  1,     "5",  0.744444444444444,
  0,     "5", 0.0476190476190476,
  1,     "4",  0.221739130434783,
  1,     "5",  0.205555555555556,
  0,     "3", 0.0394736842105263,
  1,     "5",   2.30769230769231,
  1,     "5",  0.116279069767442,
  0,     "5",   2.49117647058824,
  1,     "4",  0.220833333333333,
  0,     "5",  0.157142857142857,
  0,     "5", 0.0788732394366197,
  1,     "5",  0.186486486486486
)

p <- ggplot(psad_sample, aes(x = pirads, y = psad, color = factor(cancer))) +
  geom_point()
print(p)


psad_model <- glm(cancer ~ psad, family = binomial(), data = psad_sample)
pirads_model <- glm(cancer ~ pirads, family = binomial(), data = psad_sample)
print(summary(psad_model))
#> 
#> Call:
#> glm(formula = cancer ~ psad, family = binomial(), data = psad_sample)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -1.7019  -1.1870   0.8513   1.1432   1.1723  
#> 
#> Coefficients:
#>             Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -0.02619    0.54989  -0.048    0.962
#> psad         0.48436    0.70888   0.683    0.494
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 27.526  on 19  degrees of freedom
#> Residual deviance: 27.010  on 18  degrees of freedom
#> AIC: 31.01
#> 
#> Number of Fisher Scoring iterations: 4
print(summary(pirads_model))
#> 
#> Call:
#> glm(formula = cancer ~ pirads, family = binomial(), data = psad_sample)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -1.4224  -1.3537   0.9508   0.9508   1.6651  
#> 
#> Coefficients:
#>             Estimate Std. Error z value Pr(>|z|)
#> (Intercept)   -1.099      1.155  -0.951    0.341
#> pirads4        1.504      1.472   1.022    0.307
#> pirads5        1.658      1.314   1.262    0.207
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 27.526  on 19  degrees of freedom
#> Residual deviance: 25.649  on 17  degrees of freedom
#> AIC: 31.649
#> 
#> Number of Fisher Scoring iterations: 4

glm_model <- glm(cancer ~ pirads + psad, family = binomial(), data = psad_sample)
glm_model_with_interactions <- glm(cancer ~ pirads * psad, family = binomial(), data = psad_sample)
#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
print(anova(psad_model, glm_model, glm_model_with_interactions, test ="Chisq"))
#> Analysis of Deviance Table
#> 
#> Model 1: cancer ~ psad
#> Model 2: cancer ~ pirads + psad
#> Model 3: cancer ~ pirads * psad
#>   Resid. Df Resid. Dev Df Deviance Pr(>Chi)  
#> 1        18     27.010                       
#> 2        16     25.543  2   1.4669  0.48026  
#> 3        14     18.655  2   6.8883  0.03193 *
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • Looking at your sample data set, data are not linearly separable. Good.
  • fitting a pirads-only model and a psad-only model, the residual deviance of the pirads-only model is actually lower than the deviance of the psad-only model. Even worse, they both are not significant. Uhhhh, not exactly comforting, given that (as far as I know, and I know very little about this stuff) testing for PSA is a widely used screening method, but I'm not a doctor and the dataset is very small, so hopefully this doesn't come as a surprise to you.
  • We fit three nested models: psad-only, psad & pirads (the usual multiple (generalized) linear model), and psad, pirads and interactions, and we run an asymptotic likelihood-ratio test (I'm sure there are more advanced exact tests for Binomial GLMs, but I don't know about them, or you could build your own exact test using permutations). The results are actually quite interesting: controlling for pirads too, once you control for psad, is useless (test is not significant), so it looks like once you take into account the information about psad, adding the information about pirads gives an improvement which is not distinguishable from what you would get by random chance. But, if you include interactions, the difference is significant (p-value = 0.03193). Note that this is an ANOVA test for nested models, so (at least asymptotically, when the distribution of the LR statistics approaches a \chi^2), it's accounting for multiple comparisons. However, with a more flexible model, your dataset now may becomes separable, or close to. As a matter of fact, the warning #> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred seems to indicate problems, and if you look at the standard errors they are actually huge, suggesting numerical and/or statistical issues with this model
> summary(glm_model_with_interactions)

Call:
glm(formula = cancer ~ pirads * psad, family = binomial(), data = psad_sample)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-1.55818  -0.78669   0.00006   0.94348   1.60254  

Coefficients:
               Estimate Std. Error z value Pr(>|z|)
(Intercept)  -7.787e-03  2.633e+00  -0.003    0.998
pirads4      -7.441e+01  1.537e+04  -0.005    0.996
pirads5       4.364e-01  2.763e+00   0.158    0.875
psad         -1.203e+01  2.797e+01  -0.430    0.667
pirads4:psad  4.324e+02  8.135e+04   0.005    0.996
pirads5:psad  1.221e+01  2.798e+01   0.436    0.663

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 27.526  on 19  degrees of freedom
Residual deviance: 18.655  on 14  degrees of freedom
AIC: 30.655

Number of Fisher Scoring iterations: 18

Thus, the logistic regression fit may not be reliable anymore. In other words, don't draw any conclusions based on this elementary analysis, but repeat it on your full dataset. If you still get these numerical issues, then I suggest you post a new question here or on Cross Validated on how to deal with fitting issues for the logistic regression model. Or, you could change model/add regularization to bypass the issue.


Conclusions and future work

  • My naive impression is that psad & pirads contain somewhat the same information about cancer, so once one of either variables is in the model, adding the other one doesn't increase the predictive accuracy of the model by much
  • Including interactions seems to change this fact, but before drawing any solid conclusions you need to run the model with interactions and verify if it achieved perfect separation or not. I don't think so, because the residual deviance is still relatively large, but you may want to check.
  • Of course, all this is very tentative and nothing can be really said until you rerun the analysis on the full data set
  • Finally, if (big IF!) the results from this initial analysis are confirmed by an analysis on the full dataset, the residual deviance is still very large, i.e., even the model with interactions does a crappy job of predicting cancer. You may want to try some different models (e.g., SVMs or XGBoost), but if you do so, be sure to use cross-validation to avoid being fooled by overfitting. How large is your full data set?

Let me know if this helped, and how the results change on the complete dataset.


#8

Thank you for this! It's absurdly helpful seing how someone that actually knows what they're doing would approach this..:slight_smile: Given the date and the (not-so-surprising) fact that I'll have to read up on some of this it might be a few days before I'm able to reply properly, but...I'll be back.

Happy holidays!


#9

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.