I am trying to follow along with a much simpler data set, but I can't seem to create the p-value needed for the y-axis of the plot in Figure 5.4. I have managed to create the log odds ratio as defined by Dr Julia Silge and Dr David Robinson in their book Text Mining with R.
Could someone point me in the right direction? Below is my current toy example.
options(scipen = 999)
library(tidyverse)
library(janitor)
url = "http://freakonometrics.free.fr/german_credit.csv"
mydf = read.csv(url, header = TRUE, sep = ",") %>%
  clean_names() %>%
  mutate(tgt = as.character(creditability),
         payment_status_of_previous_credit = as.character(payment_status_of_previous_credit))
# Here we build up the log ratios.
# I just randomly picked a column for explanatory purposes.
credit_ratios <- mydf %>%
  count(payment_status_of_previous_credit, tgt) %>%
  spread(tgt, n, fill = 0) %>%
  mutate_if(is.numeric, list(~(. + 1) / (sum(.) + 1))) %>%
  rename(previous_credit_status = 1, bad = 2, good = 3) %>%
  mutate(logratio = log(bad / good)) %>%
  arrange(desc(logratio))
# A customer whose previous credit status is 2 is about equally
# likely to be good or bad.
credit_ratios %>%
  arrange(abs(logratio))
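As a quick sanity check (my own illustration, not from the book), the same Laplace-smoothed log odds ratio that the mutate_if() step computes column-wise can be worked out by hand for a single factor level from toy counts:

```r
# Toy counts for one factor level versus the column totals
n_bad  <- 30;  total_bad  <- 300
n_good <- 40;  total_good <- 700

# Smoothed log odds ratio: log of (n + 1) / (column total + 1) for bad,
# minus the same quantity for good -- mirroring (. + 1) / (sum(.) + 1)
log_or <- log((n_bad + 1) / (total_bad + 1)) -
  log((n_good + 1) / (total_good + 1))
log_or
```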
Based on your advice I have created the following. Does it look correct to you?
I think I would need to use some sort of penalized logistic regression to deal with factor levels that contain only one observation, but the code below is just to get an idea.
options(scipen = 999)
library(tidyverse)
library(janitor)
library(tidymodels)
url = "http://freakonometrics.free.fr/german_credit.csv"
mydf = read.csv(url, header = TRUE, sep = ",") %>%
  clean_names() %>%
  mutate(tgt = as.character(creditability),
         payment_status_of_previous_credit = as.character(payment_status_of_previous_credit))
# Fit a logistic regression classifying tgt from payment_status_of_previous_credit
logit_model = glm(formula = as.factor(tgt) ~ payment_status_of_previous_credit,
                  data = mydf, family = binomial)
# Pull the coefficients using broom
coefficients <- broom::tidy(logit_model) %>%
  janitor::clean_names()
# Count the number of people per group, labelled to match the model terms
diff_terms <- mydf %>%
  mutate(term = str_c("payment_status_of_previous_credit",
                      payment_status_of_previous_credit)) %>%
  group_by(term) %>%
  summarise(ppl = n())
# Calculate the FDR-adjusted p-values; I don't apply any filtering
# because there aren't enough data points
coefficients %>%
  mutate(FDR = p.adjust(p_value, method = "fdr")) %>%
  inner_join(diff_terms, by = "term") %>%
  ggplot(aes(x = estimate, y = -log10(FDR), size = ppl)) +
  geom_point(alpha = 0.3) +
  theme_minimal()
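On the penalized regression point: one option I would consider (an assumption on my part, not something from the book) is Firth's bias-reduced logistic regression via the brglm2 package, which keeps the familiar glm() interface and copes better with sparse factor levels:

```r
library(brglm2)  # assumed installed; supplies the "brglmFit" fitting method

# Same model as above, but fitted by Firth-type penalized likelihood
logit_firth <- glm(as.factor(tgt) ~ payment_status_of_previous_credit,
                   data = mydf, family = binomial,
                   method = "brglmFit")
broom::tidy(logit_firth)
```

glmnet (a lasso/ridge penalty) would be another route, but it does not produce per-term p-values, so Firth's method fits this plot more naturally.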
The code reproduces, though I'm not sure what the FDR-adjusted p-values contribute; credit history 1 is still high, the others are still quite low, and only their relative magnitudes have changed.
My only suggestion is to tidy up the logistic model:
options(scipen = 999)
library(tidyverse)
library(janitor)
#>
#> Attaching package: 'janitor'
#> The following objects are masked from 'package:stats':
#>
#> chisq.test, fisher.test
library(tidymodels)
url = "http://freakonometrics.free.fr/german_credit.csv"
mydf = read.csv(url, header = TRUE, sep = ",") %>%
  clean_names() %>%
  mutate(tgt = as.character(creditability),
         payment_status_of_previous_credit = as.character(payment_status_of_previous_credit))
# Fit a logistic regression classifying tgt from payment_status_of_previous_credit
logit_model = glm(formula = as.factor(tgt) ~ payment_status_of_previous_credit,
                  data = mydf, family = binomial)
# Pull the coefficients using broom
coefficients <- broom::tidy(logit_model) %>%
  janitor::clean_names()
# Count the number of people per group, labelled to match the model terms
diff_terms <- mydf %>%
  mutate(term = str_c("payment_status_of_previous_credit",
                      payment_status_of_previous_credit)) %>%
  group_by(term) %>%
  summarise(ppl = n())
#> `summarise()` ungrouping output (override with `.groups` argument)
# Calculate the FDR-adjusted p-values; no filtering is applied
# because there aren't enough data points
coefficients %>%
  mutate(FDR = p.adjust(p_value, method = "fdr")) %>%
  inner_join(diff_terms, by = "term") %>%
  ggplot(aes(x = estimate, y = -log10(FDR), size = ppl)) +
  geom_point(alpha = 0.3) +
  theme_minimal()
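To illustrate why the adjustment only changes relative magnitudes: the Benjamini-Hochberg ("fdr") method rescales p-values but never reorders them, so the points on the volcano plot move vertically while keeping their ranking. A standalone sketch:

```r
# BH adjustment: each p-value is scaled by n / rank, then a running
# minimum from the largest p-value downward enforces monotonicity.
p <- c(0.001, 0.01, 0.04, 0.20)
p.adjust(p, method = "fdr")
# roughly 0.004, 0.020, 0.053, 0.200 -- same ordering as the raw p-values
```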