Generating a p-value for odds ratio

Hi,

In Section 5.6 of the book Feature Engineering and Selection by Max Kuhn and Kjell Johnson there is an explanation of how to create a volcano plot by first calculating the odds ratio and associated p-value for each categorical variable, and then calculating the false discovery rate.

I am trying to follow along with a much simpler data set but can't seem to create the p-values needed for the y-axis of the plot in Figure 5.4. I have managed to create the log odds ratio as defined by Dr Julia Silge and Dr David Robinson in their book Text Mining with R:

\text{log odds ratio} = \ln{\left(\frac{\left[\frac{n+1}{\text{total}+1}\right]_\text{good}}{\left[\frac{n+1}{\text{total}+1}\right]_\text{bad}}\right)}
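
For example, with made-up counts of 30 good out of 100 and 10 bad out of 100, this works out to ln((31/101) / (11/101)) ≈ 1.04.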

Could someone point me in the right direction? Below is my current toy example.


options(scipen = 999)
library(tidyverse)
library(janitor)

url="http://freakonometrics.free.fr/german_credit.csv"
mydf = read.csv(url, header = TRUE, sep = ",") %>% 
  clean_names() %>% 
  mutate(tgt = as.character(creditability),
         payment_status_of_previous_credit = as.character(payment_status_of_previous_credit))

# Here we build up the log ratios. 
# I just randomly picked a column for explanatory purposes.
credit_ratios <- mydf %>%
  count(payment_status_of_previous_credit, tgt) %>%
  spread(tgt, n, fill = 0) %>%
  mutate_if(is.numeric, list(~(. + 1) / (sum(.) + 1))) %>%
  rename(previous_credit_status = 1, bad = 2, good = 3) %>% 
  mutate(logratio = log(bad / good)) %>%
  arrange(desc(logratio))

# sorting by abs(logratio) shows which levels are closest to an even
# good/bad split; previous credit status 2 is roughly 50/50
credit_ratios %>% 
  arrange(abs(logratio))
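
One direction I wondered about: could I get a per-level p-value by testing each level against all the others with fisher.test(), which returns both an odds ratio and a p-value for a 2x2 table? A rough sketch of what I mean (the one-vs-rest framing is my own assumption, not from the book):

# Sketch: for each level build a 2x2 table (this level vs. the rest, good vs. bad)
# and run Fisher's exact test, keeping the odds ratio and p-value
level_tests <- mydf %>%
  distinct(payment_status_of_previous_credit) %>%
  pull(payment_status_of_previous_credit) %>%
  map_dfr(function(lvl) {
    tab <- table(mydf$payment_status_of_previous_credit == lvl, mydf$tgt)
    ft <- fisher.test(tab)
    tibble(level = lvl, odds_ratio = unname(ft$estimate), p_value = ft$p.value)
  })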

Thanks very much for your time

Here is how you would get p-values:

> glm(creditability ~ payment_status_of_previous_credit, data = mydf, family = "binomial") -> a
> logLik(a)
'log Lik.' -580.6307 (df=5)
> summary(a)

Call:
glm(formula = creditability ~ payment_status_of_previous_credit, 
    family = "binomial", data = mydf)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.8805  -1.0579   0.6117   0.8764   1.4006  

Coefficients:
                                   Estimate Std. Error z value
(Intercept)                         -0.5108     0.3266  -1.564
payment_status_of_previous_credit1   0.2231     0.4359   0.512
payment_status_of_previous_credit2   1.2698     0.3396   3.739
payment_status_of_previous_credit3   1.2730     0.3988   3.192
payment_status_of_previous_credit4   2.0919     0.3616   5.784
                                        Pr(>|z|)    
(Intercept)                             0.117799    
payment_status_of_previous_credit1      0.608703    
payment_status_of_previous_credit2      0.000185 ***
payment_status_of_previous_credit3      0.001413 ** 
payment_status_of_previous_credit4 0.00000000728 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1221.7  on 999  degrees of freedom
Residual deviance: 1161.3  on 995  degrees of freedom
AIC: 1171.3

Number of Fisher Scoring iterations: 4

But the volcano plot cited in the first reference doesn't use these directly: the x-axis is the odds ratio and the y-axis is -log10(FDR).
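
A sketch of how both axes could come straight out of that fit, assuming the tidyverse session from your snippet is still loaded and using broom's column names (exp() of a log-odds coefficient gives the odds ratio, and p.adjust() gives the FDR):

library(broom)
library(dplyr)
library(ggplot2)

# Sketch: odds ratios (x-axis) and FDR-adjusted p-values (y-axis) from the fit above
plot_df <- tidy(a) %>%
  mutate(odds_ratio = exp(estimate),              # odds ratio from log-odds
         FDR = p.adjust(p.value, method = "fdr")) # FDR-adjusted p-value

ggplot(plot_df, aes(x = odds_ratio, y = -log10(FDR))) +
  geom_point()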

Hi @technocrat

Thank you for your reply.

Based on your advice I have created the following. Does it look correct to you?
I think I might need some sort of penalized logistic regression to deal with things like factor levels with only one observation, but the below is just to get an idea (see the sketch after the code).


options(scipen = 999)
library(tidyverse)
library(janitor)
library(tidymodels)

url="http://freakonometrics.free.fr/german_credit.csv"
mydf = read.csv(url, header = TRUE, sep = ",") %>% 
  clean_names() %>% 
  mutate(tgt = as.character(creditability),
         payment_status_of_previous_credit = as.character(payment_status_of_previous_credit))

# Fit a logistic regression classifying tgt from payment_status_of_previous_credit
logit_formula = glm(formula = as.factor(tgt) ~ payment_status_of_previous_credit, 
                    data = mydf, family = binomial)

# We then pull the coefficients using broom
coefs <- broom::tidy(logit_formula) %>% 
  janitor::clean_names()

# Count the number of people per factor level, keyed by the coefficient name
diff_terms <- mydf %>% 
  mutate(term = str_c("payment_status_of_previous_credit", payment_status_of_previous_credit)) %>% 
  group_by(term) %>% 
  summarise(ppl = n())

# Next we calculate the FDR, but I don't apply any filtering because there aren't enough data points
coefs %>% 
  mutate(FDR = p.adjust(p_value, method = "fdr")) %>% 
  inner_join(diff_terms) %>% 
  ggplot(aes(x=estimate, y=-log10(FDR), size = ppl)) +
  geom_point(alpha = 0.3) +
  theme_minimal() 
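
For the penalized idea I mentioned above, I think Firth's bias-reduced logistic regression would be one option. A minimal sketch, assuming the logistf package (I haven't validated it against this data):

# Sketch: Firth's penalized logistic regression, which also reports
# penalized-likelihood p-values for each coefficient
library(logistf)
firth_fit <- logistf(creditability ~ payment_status_of_previous_credit, data = mydf)
summary(firth_fit)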

The code reproduces, though I'm not sure what the FDR-adjusted p-values contribute; credit history 1 is still high, the others are still quite low, and only their relative magnitudes have changed.

My only suggestion is to tie off the logistic model with confidence intervals and a goodness-of-fit check:

options(scipen = 999)
library(tidyverse)
library(janitor)
#> 
#> Attaching package: 'janitor'
#> The following objects are masked from 'package:stats':
#> 
#>     chisq.test, fisher.test
library(tidymodels)

url="http://freakonometrics.free.fr/german_credit.csv"
mydf = read.csv(url, header = TRUE, sep = ",") %>% 
  clean_names() %>% 
  mutate(tgt = as.character(creditability),
         payment_status_of_previous_credit = as.character(payment_status_of_previous_credit))

# Fit a logistic regression classifying tgt from payment_status_of_previous_credit
logit_formula = glm(formula = as.factor(tgt) ~ payment_status_of_previous_credit, 
                    data = mydf, family = binomial)

# We then pull the coefficients using broom
coefs <- broom::tidy(logit_formula) %>% 
  janitor::clean_names()

# Count the number of people per factor level, keyed by the coefficient name
diff_terms <- mydf %>% 
  mutate(term = str_c("payment_status_of_previous_credit", payment_status_of_previous_credit)) %>% 
  group_by(term) %>% 
  summarise(ppl = n())
#> `summarise()` ungrouping output (override with `.groups` argument)

# Next we calculate the FDR, but I don't apply any filtering because there aren't enough data points
coefs %>% 
  mutate(FDR = p.adjust(p_value, method = "fdr")) %>% 
  inner_join(diff_terms) %>% 
  ggplot(aes(x=estimate, y=-log10(FDR), size = ppl)) +
  geom_point(alpha = 0.3) +
  theme_minimal() 
#> Joining, by = "term"

library(MASS)
#> 
#> Attaching package: 'MASS'
#> The following object is masked from 'package:dplyr':
#> 
#>     select
(waldscore <- confint.default(logit_formula))
#>                                         2.5 %    97.5 %
#> (Intercept)                        -1.1509472 0.1292959
#> payment_status_of_previous_credit1 -0.6311849 1.0774720
#> payment_status_of_previous_credit2  0.6041268 1.9354829
#> payment_status_of_previous_credit3  0.4913169 2.0546144
#> payment_status_of_previous_credit4  1.3830670 2.8006611
(vmscore <- confint(logit_formula))
#> Waiting for profiling to be done...
#>                                         2.5 %    97.5 %
#> (Intercept)                        -1.1738117 0.1176409
#> payment_status_of_previous_credit1 -0.6286570 1.0876744
#> payment_status_of_previous_credit2  0.6152303 1.9566827
#> payment_status_of_previous_credit3  0.5034365 2.0736977
#> payment_status_of_previous_credit4  1.3948185 2.8208648
library(profileModel)
(raoscore <- profConfint(profileModel(logit_formula, quantile=qchisq(0.95, 1), objective = "RaoScoreStatistic", X = model.matrix(logit_formula))))
#> Preliminary iteration ..... Done
#> 
#> Profiling for parameter (Intercept) ... Done
#> Profiling for parameter payment_status_of_previous_credit1 ... Done
#> Profiling for parameter payment_status_of_previous_credit2 ... Done
#> Profiling for parameter payment_status_of_previous_credit3 ... Done
#> Profiling for parameter payment_status_of_previous_credit4 ... Done
#>                                         Lower     Upper
#> (Intercept)                        -1.1404745 0.1188233
#> payment_status_of_previous_credit1 -0.6240158 1.0695355
#> payment_status_of_previous_credit2  0.6134541 1.9256681
#> payment_status_of_previous_credit3  0.4979180 2.0476886
#> payment_status_of_previous_credit4  1.3909966 2.7922873
#> attr(,"profileModel object")
#> profileModel(logit_formula, quantile = qchisq(0.95, 1), objective = "RaoScoreStatistic", 
#>     X = model.matrix(logit_formula))

vcov(logit_formula)
#>                                    (Intercept)
#> (Intercept)                          0.1066667
#> payment_status_of_previous_credit1  -0.1066667
#> payment_status_of_previous_credit2  -0.1066667
#> payment_status_of_previous_credit3  -0.1066667
#> payment_status_of_previous_credit4  -0.1066667
#>                                    payment_status_of_previous_credit1
#> (Intercept)                                                -0.1066667
#> payment_status_of_previous_credit1                          0.1900000
#> payment_status_of_previous_credit2                          0.1066667
#> payment_status_of_previous_credit3                          0.1066667
#> payment_status_of_previous_credit4                          0.1066667
#>                                    payment_status_of_previous_credit2
#> (Intercept)                                                -0.1066667
#> payment_status_of_previous_credit1                          0.1066667
#> payment_status_of_previous_credit2                          0.1153539
#> payment_status_of_previous_credit3                          0.1066667
#> payment_status_of_previous_credit4                          0.1066667
#>                                    payment_status_of_previous_credit3
#> (Intercept)                                                -0.1066667
#> payment_status_of_previous_credit1                          0.1066667
#> payment_status_of_previous_credit2                          0.1066667
#> payment_status_of_previous_credit3                          0.1590476
#> payment_status_of_previous_credit4                          0.1066667
#>                                    payment_status_of_previous_credit4
#> (Intercept)                                                -0.1066667
#> payment_status_of_previous_credit1                          0.1066667
#> payment_status_of_previous_credit2                          0.1066667
#> payment_status_of_previous_credit3                          0.1066667
#> payment_status_of_previous_credit4                          0.1307819
library(ResourceSelection)
#> ResourceSelection 0.3-5   2019-07-22
hoslem.test(logit_formula$y, fitted(logit_formula), g=10)
#> 
#>  Hosmer and Lemeshow goodness of fit (GOF) test
#> 
#> data:  logit_formula$y, fitted(logit_formula)
#> X-squared = 0.000000000000000000000025938, df = 8, p-value = 1

Created on 2020-10-29 by the reprex package (v0.3.0.9001)
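
If you want any of those intervals on the odds-ratio scale rather than the log-odds scale, exponentiating the bounds is enough, e.g. exp(waldscore).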


Thank you very much for your help @technocrat

Have a nice weekend

