# Risk calculation

I need help, I just got lost and I don't know how to conduct the statistical analysis. I have a group of patients taking the specific medication and in some of them the discoloration on hard palate mucosa occurs. I want to check how high is the risk of developing the discoloration based on the median dose and median timing of the therapy, is this possible? 0- means there is no discoloration, 1 is consistent with discoloration

Gender Age Dose MG Discoloration Time in months
M 62 780800 0 64
M 45 1777200 0 146
M 71 565200 0 47
F 70 213200 0 33
F 57 2446800 1 25
M 76 389200 0 32
M 69 608800 0 50
F 69 640000 1 173
M 53 308400 0 26
F 55 754800 1 64
M 69 281000 0 25
M 74 339400 0 52
F 81 264400 0 43
F 63 446400 0 157
F 70 736000 0 61
M 68 315300 0 26
F 67 1319100 0 113
F 74 942400 0 78
F 69 1057600 0 87
M 66 1512400 1 123
F 70 1126400 1 93
M 46 438000 0 43
F 58 366400 0 31
F 69 528400 0 44
F 53 296400 0 25
M 50 1032000 0 85
M 62 2452000 1 202
F 60 1304500 1 113
F 60 1372400 0 113
F 65 851200 0 70
M 47 254000 0 21
M 39 404800 0 34
F 81 2236300 0 30
F 81 1207100 1 184
F 54 1557100 0 123
F 63 2931200 0 227
F 76 368600 0 29
F 71 2646600 1 238
F 66 2117300 1 162
M 61 668000 0 55

You can regress Discoloration on Dose and time in months. This is called a linear probability model and gives you a formula which predicts the probability of discoloration for any values of Dose and time in month that you would like.

Or instead of a linear probability model you can use what's called a logit or a probit, which also gives a probability prediction based on a nonlinear model.

Call: glm(formula = Data$Discoloration ~ Data$Dose, family = binomial)

Coefficients:
(Intercept) Data$Dose -2.811e+00 1.488e-06 Degrees of Freedom: 39 Total (i.e. Null); 38 Residual Null Deviance: 44.99 Residual Deviance: 36.1 AIC: 40.1 How can I get a p-value from this...? And how to read the coefficients? You can get the p-values from result <- glm(formula = Data$Discoloration ~ Data\$Dose, family = binomial)
summary(result)


Here's a link on interpreting the results, which is not straightforward, Introduction to R.

The short answer is that these results provide weak evidence for the occurrence of discoloration as affected by the other variables.

d <- data.frame(
Gender = as.factor(c(
"M", "M", "M", "F", "F", "M",
"M", "F", "M", "F", "M", "M", "F", "F", "F", "M", "F", "F",
"F", "M", "F", "M", "F", "F", "F", "M", "M", "F", "F",
"F", "M", "M", "F", "F", "F", "F", "F", "F", "F", "M"
)),
Age = c(
62, 45, 71, 70, 57, 76,
69, 69, 53, 55, 69, 74, 81, 63, 70, 68, 67, 74,
69, 66, 70, 46, 58, 69, 53, 50, 62, 60, 60,
65, 47, 39, 81, 81, 54, 63, 76, 71, 66, 61
),
Dose.MG = c(
780800, 1777200, 565200,
213200, 2446800, 389200, 608800, 640000, 308400,
754800, 281000, 339400, 264400, 446400, 736000, 315300,
1319100, 942400, 1057600, 1512400, 1126400, 438000,
366400, 528400, 296400, 1032000, 2452000, 1304500,
1372400, 851200, 254000, 404800, 2236300, 1207100,
1557100, 2931200, 368600, 2646600, 2117300,
668000
),
Discooration = c(
0, 0, 0, 0, 1, 0, 0, 1,
0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,
0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0,
0, 1, 1, 0
),
Time.in.months = c(
64, 146, 47, 33, 25, 32,
50, 173, 26, 64, 25, 52, 43, 157, 61, 26, 113,
78, 87, 123, 93, 43, 31, 44, 25, 85, 202, 113,
113, 70, 21, 34, 30, 184, 123, 227, 29, 238,
162, 55
)
)
full_fit <- glm(Discooration ~.,d,family=binomial)
summary(full_fit)
#>
#> Call:
#> glm(formula = Discooration ~ ., family = binomial, data = d)
#>
#> Coefficients:
#>                  Estimate Std. Error z value Pr(>|z|)
#> (Intercept)    -4.214e+00  3.707e+00  -1.137   0.2557
#> GenderM        -7.634e-01  1.102e+00  -0.693   0.4883
#> Age             1.487e-02  5.151e-02   0.289   0.7728
#> Dose.MG         9.029e-07  6.926e-07   1.304   0.1924
#> Time.in.months  1.389e-02  8.434e-03   1.647   0.0995 .
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#>     Null deviance: 44.987  on 39  degrees of freedom
#> Residual deviance: 32.259  on 35  degrees of freedom
#> AIC: 42.259
#>
#> Number of Fisher Scoring iterations: 5


Created on 2023-11-19 with reprex v2.0.2

The logistic regression model summary can be interpreted as follows:

## Interpretation of Coefficients

The coefficients (Estimate) represent the change in the log-odds of the outcome for a one-unit increase in the predictor variable, assuming all other variables are held constant.

• GenderM: The coefficient of -0.7634 suggests that being male (GenderM) decreases the log-odds of discoloration by 0.7634 units compared to the reference category (presumably female), assuming all other variables are held constant.
• Age: The coefficient of 0.01487 suggests that for each additional year of age, the log-odds of discoloration increase by 0.01487 units, assuming all other variables are held constant.
• Dose.MG: The coefficient of 0.0000009029 suggests that for each additional milligram of the dose, the log-odds of discoloration increase by 0.0000009029 units, assuming all other variables are held constant.
• Time.in.months: The coefficient of 0.01389 suggests that for each additional month, the log-odds of discoloration increase by 0.01389 units, assuming all other variables are held constant.

## Significance of Coefficients

The Pr(>|z|) column represents the p-values associated with the coefficients. If the p-value is less than a certain significance level (commonly 0.05), this indicates that the predictor variable has a statistically significant relationship with the response variable in the model. The term "significant" should be understood to indicate a degree of confidence that the test statistics are not simply a result of random variation.

• GenderM: The p-value of 0.4883 suggests that gender is not statistically significant in predicting discoloration at the 0.05 significance level.
• Age: The p-value of 0.7728 suggests that age is not statistically significant in predicting discoloration at the 0.05 significance level.
• Dose.MG: The p-value of 0.1924 suggests that the dose is not statistically significant in predicting discoloration at the 0.05 significance level.
• Time.in.months: The p-value of 0.0995 suggests that time in months is not statistically significant in predicting discoloration at the 0.05 significance level, but it is significant at the 0.1 level.

## Model Fit Statistics

• Null deviance: This is a measure of how well the response variable is predicted by a model that includes only the intercept (no predictors). The null deviance in your model is 44.987 on 39 degrees of freedom.
• Residual deviance: This is a measure of how well the response variable is predicted by the model. The residual deviance in your model is 32.259 on 35 degrees of freedom. The decrease in deviance when going from the null model to your model suggests that your model is an improvement over the null model.
• AIC: The Akaike Information Criterion (AIC) is a measure of the relative quality of statistical models. Lower values are better. Your model has an AIC of 42.259. Without another model to compare it to, it's hard to interpret this value in isolation.

Following along with what @technocrat says, the problem is that the dose and the time in months are pretty highly correlated, making it very difficult to tell which is responsible for the discoloration. If you had a lot more data you would probably get more clear results.

Ok, thank you so much! This is extremely helpful! So basically I cannot say if any specific dose of the medication or duration of the therapy increases the risk of getting the discoloration. But the higher the dose and longer therapy time the chances of getting a discoloration increase. Am I right? This is all data that I have. I'm just wondering cause I'm the absolute beginner in R.

That's probably right, but I don't think that you have statistically significant evidence for the claim.

In thinking about a statistical test, it's good to start with choosing \alpha, conventionally shorthanded as the significance level based on how good one wants to feel that one's on to something.

Take the "significant at the 95% level* idea. It's hard to get anything published that doesn't pass the threshold of 1 - \alpha = 0.95. I think of it as passing the laugh test. Why? Because it means that there is a 1 chance in 20 that the result is due solely to random variation. But is that really good? Nineteen other researchers could duplicate your analysis on 19 different groups of patients similarly consituted and it's possible that none of them might get results as good.

Imagine a drinking party. There are four five chamber revolvers on the table. One chamber in one of the revolvers has a bullet. Pick it up, put it to your temple and pull the trigger because there's only 1 chance in 20 that you will die? Probably not unless you are a character in a 19th century Russian novel.

On the other hand, imagine that there is a full auditorium of tables full of revolvers, 10,000 of them and still only 1 bullet. For riches beyond reckoning, pick one and pull the trigger. Figure the odds and consider if your assessment differs.

There's a second important think to think about. Any statistical test represents reducing a collection of observations to a measure that a level of confidence can be put on. In this case, if you begin, as you always should, by choosing \alpha and you choose 0.05, there's not much to encourage further inquiry along this line other than possibly the idea that the time variable might hold some interest. If possible, it could be worth revisiting the subject looking just at that one variable, using

short_fit <- glm(Discooration ~Time.in.months,d,family=binomial)
summary(short_fit)


You will find that things look different when other variables are discarded. This brings up a critical idea in statistics. A statistical result or model without a theoretical explanation is only a curiosity. Think about what @startz says about the relationship between the time and dose variable. Maybe there is some gender bimorphism involved in the response, also.