Getting odds ratio instead of odds with vcd:odds()

Apparently, I didn't pay enough attention in categorical analysis class. I'm headed for calculating odds, odds ratios, and eventually (but not here) the Cochran-Mantel-Haenszel Test. My problem is that the odds() function does not work as expected.

Some definitions:
Odds is a way of expressing probabilities, with \text{odds} = \frac{\pi}{1-\pi}. That means p=\frac{1}{2} gives odds of 1, i.e. odds are "even", "50-50".
An odds ratio \theta is exactly what it sounds like: the ratio of two odds. It measures the odds that an outcome will occur under a condition compared to the odds of the outcome occurring without the condition. For example, some joint injuries as young adults increase the chances of developing arthritis as an older adult. The odds ratio for this situation would be written

\theta=\frac{\text{odds of arthritis given injury}}{\text{odds of arthritis without injury}}

So I'm using the Whickham data set, which looked at long-term survival based on smoking. Researchers recorded a person's age and yes/no smoke levels. Twenty years later, they found the same people and (after doing proper epidemiological adjustments, etc.) recorded whether they were alive or dead. It's in the mosaicData package. I'm going to create a factor variable that groups the ages 18–34, 35–64, and 64+, and store it in WhickTable.

library(mosaicData) ## Use mosaicData::Whickham to avoid loading packages
Whickham$AgeRange <- cut(Whickham$age, c(18, 35, 65, Inf), include.lowest = TRUE, labels = c("18 – 34", "35 – 64", "65+"))
WhickTable <- xtabs( ~ smoker + outcome + AgeRange, data=Whickham)


The 18–34 group in WhickTable is

, , AgeRange = 18 – 34

outcome
No    236    7
Yes   182    5


To compute the odds for being Alive in the 18–34 group, calculate \frac{\text{Number present}}{\text{Number absent}} in the group:
The odds of being Alive + No Smoke is \frac{36}{7} = 33.7143.
The odds of being Alive + Smoke is \frac{182}{5} = 36.4. (I know this seems counterintuitive, but this is also a great Simpson's Paradox example).
Therefore the odds ratio is

\theta_{\text{18–34}}=\frac{33.7143}{36.4}=0.926217

and for the epidemiologists our there, the log odds ratio:

\log{\theta_{\text{18–34}}}=\log0.926217=-0.0766

...and we could do that for each of the three groups. Note that the odds are in the 30s and the odds ratio is 0.926217.

Tons of good categorical stuff is in Michael Friendly's vcd package, including a function to compute the stuff above

library(vcd)
oddsratio(WhickTable, log=FALSE)
# odds ratios for smoker and outcome by AgeRange
#
#  18 – 34   35 – 64       65+
#0.9262166 1.2647636 1.1559140


Huzzah! We calculated the odds ratio correctly for the 18–34 group!
According to the vcd::lodds() documentation, I should be able to calculate the odds (33.7143 and 36.4) using the odds() command.

> odds(WhickTable, log=FALSE)
# odds for smoker by outcome, AgeRange
#
#       AgeRange
#outcome  18 – 34   35 – 64      65+
#  Alive 1.296703 0.9450980 4.166667
#  Dead  1.400000 0.7472527 3.604651


Nope. maybe I should try the lodds() command, intended for calculating log odds, but set log = FALSE

lodds(WhickTable, log=FALSE)
# odds for smoker by outcome, AgeRange
#
#       AgeRange
#outcome  18 – 34   35 – 64      65+
#  Alive 1.296703 0.9450980 4.166667
#  Dead  1.400000 0.7472527 3.604651


Looks like odds() and lodds() do the same thing.

Any idea why odds() isn't giving me the non-logged odds? I'll bet something's staring me in the face, right there in the docs, but I'm not seeing it.

I think this is the reason:

The odds are calculated based on the following two tables:

> xtabs(formula = ~ (smoker + AgeRange),
+       subset = (outcome == "Alive"),
+       data = dataset)
AgeRange
smoker 18 – 34 35 – 64 65+
No      236     241  25
Yes     182     255   6


and

> xtabs(formula = ~ (smoker + AgeRange),
+       subset = (outcome == "Dead"),
+       data = dataset)
AgeRange
smoker 18 – 34 35 – 64 65+
No        7      68 155
Yes       5      91  43


To get what you want, you can try reordering like follows:

> vcd::odds(x = xtabs(formula = ~ (outcome + smoker + AgeRange),
+                     data = dataset))
odds for outcome by smoker, AgeRange

AgeRange
smoker  18 – 34  35 – 64       65+
No  33.71429 3.544118 0.1612903
Yes 36.40000 2.802198 0.1395349


Here, dataset in my code is same as Whickham. Hope this helps.

PS

I failed to get the Simpson's paradox reference in this case. Here, we're not combining groups, right?

Your solution certainly looks good, although I’m now away from that computer and can’t run it.

Simpson’s paradox will arise when you combine all the age groups together. If I remember correctly, each group will show something similar to this one—death rates are lower for the smoke group when cut into three groups, but are not when aggregated.

It better work out that way—I’ve snaked this data through the chapter, illustrating other things, so that this final punch line of Simpson’s can end the narrative.

...and while I have you: I can do a CMH test on smoke vs survival with age as my third variable, right?

That is, my CMH test would test H₀: θ₁₈₋₃₄= θ₃₅₋₆₄ = θ₆₅₊. That the odds ratios for the three tables is the same. That's my plan, but as you can tell it's been a long time since dealing with 2×2 contingency tables. Seems like there's more tests than usual that are simply last names (McNemer, Cochran–Mantel–Haenszel, Breslow-Day...)

This makes sense. I misinterpreted your question. Sorry.

I don't know about CMH test, as I skipped the Categorical Analysis paper last year for no particular reason (and I still regret it ). So, I shouldn't answer this question, but after taking a quick look at the Wikipedia page, I think you can do this.

Categorical Analysis seems to sit on its own and is easy to brush by. If my knowledge went a little deeper and more theoretical, I'm sure I'd see this as just another case of the Generalized Linear Model and it would all be easy. Unfortunately, that's not where I am.

(and I'm sure the CMH will work here. As I've said in other posts, this is my first time writing a book alone, and although it's my fifth, not having another person to bounce things off of makes the experience completely new.)

@Yarnabrina and others,

I seem to have cornered myself again. I found the original paper that discussed this data set (Ignoring a Covariate: An Example of Simpson's Paradox) and it splits the data into smaller groups to illustrate Simpson's Paradox: 18–24, 25–34, 35–44, 45–54, 55–64, 84–75, and 75+. I changed my code to match these cuts.

library(mosaicData) # or mosaicData::Whickham
Whickham$AgeRange <- cut(Whickham$age, c(18, 25, 35, 45, 55, 65, 75, Inf), include.lowest = TRUE, labels = c("18–24", "25–34", "35–44", "45–54", "55–64", "65–74", "75+"))
WhickTableSOA <- table(Whickham$smoker, Whickham$outcome, Whickham$AgeRange) # SOA for$smoker, $outcome, and$AgeRange
addmargins(WhickTableSOA) # ...to show the grand total of smokers
#  <snip>
#, ,  = Sum
#
#
#     Alive Dead  Sum
#  No    502  230  732
#  Yes   443  139  582
#  Sum   945  369 1314
# These numbers match with those int he article, so I have the same data


All this really shows is that all these marginal totals are the same as those in the paper. That's not really important, and I haven't checked all 1314 data points to see if they are identical.In fact, it looks like the numbers don't match exactly, but that's fine. What I don't understand is simply how the odds are calculated below.

Unless I'm really confused, I do exactly as above; so odds for 18–24 should be (85÷1=85) and (67÷2=33.5). Odds for 25–34 should be (151÷6=251.7) and (115÷3=38.3). But the odds() function isn't reporting those numbers. Above, we divided the "Alive" values by the "Dead" values to produce odds for three categories. Now that I've split into more than three categories, nothing should change (should it?). Yet it has. If you look at the corresponding parts of these tables, you'll see that the odds are not the same when calculated as above.

The reprex() is by hand, since executing reprex() with the output on the clipboard causes an error.

# This is what we had above (after removing whitespace etc.)
library(vcd)
# Odds are calculated using these two tables
xtabs( ~ smoker + AgeRange, subset = (outcome == "Alive"), data = Whickham)
#      AgeRange
#smoker 18–24 25–34 35–44 45–54 55–64 65–74 75+
#   No     85   151    99    70    72    25   0
#   Yes    67   115    97    98    60     6   0
xtabs( ~ smoker + AgeRange, subset = (outcome == "Dead"), data = Whickham)
#      AgeRange
#smoker 18–24 25–34 35–44 45–54 55–64 65–74 75+
#   No      1     6     7    15    46    93  62
#   Yes     2     3    15    25    51    31  12

# The following is easier to read when you see that the xtabs()
# for odds, oddsratio, and mantelhaen.test() are identical.

odds(xtabs(~ (outcome + smoker + AgeRange), data = Whickham))
# odds for outcome by smoker, AgeRange
#
#      AgeRange
#smoker 18–24    25–34     35–44    45–54    55–64     65–74   75+
#    No    57 23.30769 13.266667 4.548387 1.559140 0.2727273 0.008
#   Yes    27 33.00000  6.290323 3.862745 1.174757 0.2063492 0.040

oddsratio(xtabs(~ (outcome + smoker + AgeRange), data = Whickham), log = FALSE)
# odds ratios for outcome and smoker by AgeRange
#
#    18–24     25–34     35–44     45–54     55–64     65–74       75+
#2.1111111 0.7062937 2.1090598 1.1775012 1.3272016 1.3216783 0.2000000

mantelhaen.test(xtabs(~ (outcome + smoker + AgeRange), data = Whickham)))
#
#	Mantel-Haenszel chi-squared test with continuity correction
#
#data:  xtabs(~(outcome + smoker + AgeRange), data = Whickham)
#Mantel-Haenszel X-squared = 2.7639, df = 1, p-value = 0.09641
#alternative hypothesis: true common odds ratio is not equal to 1
#95 percent confidence interval:
# 0.9640376 1.9058102
#sample estimates:
#common odds ratio
#          1.35546


Since the odds don't seem to be calculating correctly, I'm not expecting to get the rest to work out correctly either, but I'm including everything for completeness.

If it matters, the error I get when I try to execute reprex() is

>reprex()
Removing leading prompts from reprex source.
Rendering reprex...
Error in parse(text = x, keep.source = TRUE) :
<text>:23:8: unexpected numeric constant
22:       AgeRange
23: smoker 18


I believe you're looking for something like this:

dataset <- within(data = mosaicData::Whickham,
expr = {
AgeRange <- cut(x = age,
breaks = c(18, 25, 35, 45, 55, 65, 75, Inf),
include.lowest = TRUE,
labels = c("18–24", "25–34", "35–44", "45–54", "55–64", "65–74", "75+"))
})

OAS_table <- xtabs(formula = ( ~ (outcome + AgeRange + smoker)),
data = dataset)
OAS_table
#> , , smoker = No
#>
#>        AgeRange
#> outcome 18–24 25–34 35–44 45–54 55–64 65–74 75+
#>   Alive    85   151    99    70    72    25   0
#>   Dead      1     6     7    15    46    93  62
#>
#> , , smoker = Yes
#>
#>        AgeRange
#> outcome 18–24 25–34 35–44 45–54 55–64 65–74 75+
#>   Alive    67   115    97    98    60     6   0
#>   Dead      2     3    15    25    51    31  12

# odds for being Alive for non-smokers (or smokers) of particular age groups
vcd::odds(x = OAS_table,
correct = FALSE) # avoiding continuity correction
#>  odds for outcome by AgeRange, smoker
#>
#>         smoker
#> AgeRange         No        Yes
#>    18–24 85.0000000 33.5000000
#>    25–34 25.1666667 38.3333333
#>    35–44 14.1428571  6.4666667
#>    45–54  4.6666667  3.9200000
#>    55–64  1.5652174  1.1764706
#>    65–74  0.2688172  0.1935484
#>    75+    0.0000000  0.0000000


Created on 2019-04-22 by the reprex package (v0.2.1)

Changes:

1. Once again, I reordered. I noted that you're interested in the odds of being Alive among non-smokers (or, smokers) of an age group. So, I placed outcome at the first. I put smoker at the end, but in this case, ordering of smoker and AgeRange doesn't matter (obviously that's expected (at least to me, but I'm not confident because of my lack of knowledge in CDA), because you're sort of grouping by both of them).

2. I added correct = FALSE to avoid the continuity correction, and vcd::odds are using it as TRUE for this case since there's two empty cells corresponding to the last age group. Here's from the documentation:

correct
logical or numeric. Should a continuity correction be applied before computing odds? If TRUE, 0.5 is added to all cells; if numeric (or an array conforming to the data) that value is added to all cells. By default, this not employed unless there are any zero cells in the table, but this correction is often recommended to reduce bias when some frequencies are small (Fleiss, 1981).

Once again, I'll emphasize that I do not know how necessary this correction is.

Hope this helps.

Looks wonderful! I thought I tried the term switching, but that Yates continuity correction slipped by me (even though it’s in the output title!)

Why did you need the expr{} block in the first line?

I’m guessing the expr{} is because within() requires it. I would not have thought to use within()—what made you think to?

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.