How to statistically model a question of type "check all that apply"? [stats::binom.test(), rstatix::pairwise_mcnemar_test(), or otherwise?]

I have questionnaire data of the kind "check all options that apply", and I wonder how to analyze it properly such that:

  • I account for by-subject nature of the data
  • I'll be able to compare different options against each other
  • I get estimates
  • I get confidence intervals

I've found that Cochran's Q and McNemar's chi-squared tests might be good options. However, I'm not sure what a proper analysis that achieves all of the requirements listed above would look like.

To anchor my question with data, please consider the following reproducible example.


I have distributed a questionnaire in which I asked people a single question:
Here are some common types of food; please mark all the types you like (all that apply):
☐ broccoli
☐ pizza
☐ pie
☐ spinach
☐ sausage
☐ orange
☐ blueberries muffin
☐ chocolate cake
☐ pasta
☐ fried potatoes
☐ burrito
☐ mac and cheese

The data looks like this:

# just run the following to simulate the data
library(tidyverse)
library(ids)

food_table <-
  tibble::tribble(
                ~food, ~prop,
           "broccoli", 0.482,
              "pizza",  0.82,
                "pie", 0.299,
            "spinach", 0.234,
            "sausage",   0.4,
             "orange",  0.05,
 "blueberries_muffin", 0.269,
     "chocolate_cake", 0.063,
              "pasta", 0.025,
     "fried_potatoes", 0.122,
            "burrito", 0.117,
     "mac_and_cheese", 0.032
 )

rowwise_data <-
  food_table %>%
  mutate(binomial_data = map(.x = prop, .f = ~rbinom(n = 3000, size = 1, prob = .x))) %>%
  select(-prop) %>%
  pivot_wider(names_from = food, values_from = binomial_data) %>%
  unnest(cols = everything()) %>%
  mutate(ids = uuid(n = nrow(.)), .before = everything())

rowwise_data
#> # A tibble: 3,000 x 13
#>    ids              broccoli pizza   pie spinach sausage orange blueberries_muf~
#>    <chr>               <int> <int> <int>   <int>   <int>  <int>            <int>
#>  1 02d95100-719e-4~        0     1     0       0       1      0                0
#>  2 fd1f1759-721d-4~        0     0     0       1       1      0                0
#>  3 fa33ea76-6149-4~        0     1     1       0       0      0                0
#>  4 6664d209-766e-4~        0     1     0       0       0      0                0
#>  5 db9b3466-d932-4~        1     1     1       1       1      0                0
#>  6 22edb5dc-4519-4~        1     1     0       0       1      0                0
#>  7 3f05a385-0dcc-4~        0     1     0       1       1      0                0
#>  8 909f2f1d-c37b-4~        0     1     0       0       0      0                0
#>  9 3160fbcf-b8d0-4~        0     1     1       0       1      0                1
#> 10 ce204abf-3960-4~        1     1     0       0       0      0                0
#> # ... with 2,990 more rows, and 5 more variables: chocolate_cake <int>,
#> #   pasta <int>, fried_potatoes <int>, burrito <int>, mac_and_cheese <int>

Created on 2022-01-27 by the reprex package (v2.0.1.9000)

We can see that each row pertains to one individual who answered the questionnaire. There is one column per option, with values 0 or 1 denoting whether the person checked that option or not.
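
A quick summary of the simulated data, just to look at the raw per-option proportions (this only describes the data; it is not yet a model):

# observed proportion of respondents who checked each option
rowwise_data %>%
  summarise(across(-ids, mean))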

How to statistically model this data?

option A -- using `binom.test()`
One option would be to run `stats::binom.test()` on each column. Doing so gives estimates and confidence intervals for each *option* separately, regardless of the pairing across survey respondents (i.e., per option, pooled over all respondents). Here's an example of such an analysis:

separate_binom_test <-
  rowwise_data %>%
  pivot_longer(-ids) %>%
  group_by(name) %>%
  nest() %>%
  mutate(n_of_success = map_int(.x = data, .f = ~ sum(.x$value)),
         n_of_trials = map_int(.x = data, .f = ~ length(.x$value)),
         test_result = map2(.x = n_of_success, .y = n_of_trials, .f = ~binom.test(.x, .y) %>% broom::tidy())) %>%
  select(name, test_result) %>%
  unnest_wider(test_result)

We can visualize this analysis:

separate_binom_test %>%
  ggplot(aes(x = name, y = estimate, fill = as.factor(name))) +
  geom_bar(stat = "identity") +
  geom_errorbar(aes(ymin = conf.low, ymax = conf.high), width = 0.2) +
  geom_text(aes(label = scales::percent(estimate, 2)), vjust = 2) +
  coord_flip() +
  theme_minimal()

Created on 2022-01-27 by the reprex package (v2.0.1.9000)

We got:
✓ estimates
✓ confidence intervals

We didn't get:
✗ accounting for by-subject nature of the data
✗ being able to compare different options against each other (well, kind of: we can eyeball the CIs, but those aren't real contrasts...)

option B -- using Cochran's Q / McNemar Chi-squared tests
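For a given pair of options, McNemar's test looks at the paired 2x2 table of responses; the discordant cells (respondents who checked one option but not the other) are what the test compares. Just to illustrate what that table is, using two of the simulated columns:

# the paired 2x2 table for broccoli vs. sausage;
# the off-diagonal (discordant) cells are what McNemar's test compares
table(broccoli = rowwise_data$broccoli, sausage = rowwise_data$sausage)

Running all pairwise comparisons with rstatix:
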
library(rstatix)

rowwise_data %>%
  pivot_longer(-ids) %>%
  mutate(across(value, as.factor)) %>%
  pairwise_mcnemar_test(value ~ name | ids)
#> # A tibble: 66 x 6
#>    group1             group2                 p     p.adj p.adj.signif method    
#>  * <chr>              <chr>              <dbl>     <dbl> <chr>        <chr>     
#>  1 blueberries_muffin broccoli       6.98e- 73 4.61e- 71 ****         McNemar t~
#>  2 blueberries_muffin burrito        7.76e- 44 5.12e- 42 ****         McNemar t~
#>  3 blueberries_muffin chocolate_cake 7.55e- 99 4.98e- 97 ****         McNemar t~
#>  4 blueberries_muffin fried_potatoes 5.43e- 41 3.58e- 39 ****         McNemar t~
#>  5 blueberries_muffin mac_and_cheese 3.8 e-132 2.51e-130 ****         McNemar t~
#>  6 blueberries_muffin orange         1.08e- 96 7.13e- 95 ****         McNemar t~
#>  7 blueberries_muffin pasta          7.79e-132 5.14e-130 ****         McNemar t~
#>  8 blueberries_muffin pie            1.37e-  3 9.04e-  2 ns           McNemar t~
#>  9 blueberries_muffin pizza          2.43e-316 1.60e-314 ****         McNemar t~
#> 10 blueberries_muffin sausage        1.23e- 28 8.12e- 27 ****         McNemar t~
#> # ... with 56 more rows

Created on 2022-01-27 by the reprex package (v2.0.1.9000)
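
For completeness, the omnibus question (do the options differ at all?) would be Cochran's Q test, mentioned at the top but not shown; a minimal sketch, assuming rstatix's `cochran_qtest()`:

rowwise_data %>%
  pivot_longer(-ids) %>%
  mutate(across(value, as.factor)) %>%
  cochran_qtest(value ~ name | ids)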

In contrast to option A, here we got:
✓ accounting for by-subject nature of the data
✓ being able to compare different options against each other
but not
✗ estimates
✗ confidence intervals (but see the sketch below)
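
That said, an estimate and a confidence interval for any given paired comparison can be computed directly. Here is a minimal sketch (not part of either option above); it uses a Wald interval for the paired difference in proportions, which is the quantity McNemar's test assesses:

# paired difference in proportions with a Wald confidence interval;
# x and y are 0/1 vectors from the same respondents
paired_prop_diff <- function(x, y, conf.level = 0.95) {
  n    <- length(x)
  p10  <- mean(x == 1 & y == 0)  # checked x but not y
  p01  <- mean(x == 0 & y == 1)  # checked y but not x
  diff <- p10 - p01              # equals mean(x) - mean(y)
  se   <- sqrt((p10 + p01 - diff^2) / n)
  z    <- qnorm(1 - (1 - conf.level) / 2)
  tibble(estimate = diff, conf.low = diff - z * se, conf.high = diff + z * se)
}

paired_prop_diff(rowwise_data$broccoli, rowwise_data$sausage)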

In addition, with this option B, I'm not sure how to visualize the data.
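
One possibility is a tile plot of the adjusted p-values (just a sketch; it stores the test results in an object, `mcnemar_results`, which the code above does not do):

mcnemar_results <-
  rowwise_data %>%
  pivot_longer(-ids) %>%
  mutate(across(value, as.factor)) %>%
  pairwise_mcnemar_test(value ~ name | ids)

mcnemar_results %>%
  ggplot(aes(x = group1, y = group2, fill = -log10(p.adj))) +
  geom_tile() +
  geom_text(aes(label = p.adj.signif), size = 3) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))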


Bottom line

Options A and B offer different (competing?) ways to model the data. I don't know how to combine the two, or whether it's statistically okay to do so. I feel more comfortable with option B, but then again I don't get estimates and CIs, and so I'm limited in how I can visualize the results :confused:

Any idea how to reconcile the two into a unified analysis?
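
For reference, here is a minimal sketch of one possible direction (assuming a mixed-effects logistic regression is appropriate here; it uses lme4 and emmeans, neither of which appears above): model the long-format 0/1 responses with a by-respondent random intercept, then extract per-option estimates, CIs, and pairwise contrasts.

library(lme4)
library(emmeans)

# one row per respondent x option, with the 0/1 response
long_data <-
  rowwise_data %>%
  pivot_longer(-ids, names_to = "food", values_to = "checked")

# the by-respondent random intercept accounts for the paired nature of the data
fit <- glmer(checked ~ food + (1 | ids), data = long_data, family = binomial)

# per-option estimated probabilities with confidence intervals
emmeans(fit, ~ food, type = "response")

# pairwise contrasts between options (on the log-odds scale), with adjusted p-values
pairs(emmeans(fit, ~ food))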

Thanks!

> food_table[,2] |> sum()
[1] 2.913

So, on average, respondents chose around three food items. One thing that might be interesting is whether the food choices are correlated.

The plot below is washed out because the data is random, so the correlations cluster around zero.

I'm suggesting that it's premature to select a model before formulating questions, such as "Do spinach respondents also choose dessert?" Looking at the correlations provides a way into such questions: for example, how does the odds ratio of choosing pizza change with the presence of another carb-heavy dish and a fatty dish?

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(ggcorrplot)
#> Loading required package: ggplot2
library(ids)
library(purrr)
library(tidyr)

food_table <-
  tibble::tribble(
    ~food, ~prop,
    "broccoli", 0.482,
    "pizza",  0.82,
    "pie", 0.299,
    "spinach", 0.234,
    "sausage",   0.4,
    "orange",  0.05,
    "blueberries_muffin", 0.269,
    "chocolate_cake", 0.063,
    "pasta", 0.025,
    "fried_potatoes", 0.122,
    "burrito", 0.117,
    "mac_and_cheese", 0.032
  )

rowwise_data <-
  food_table %>%
  mutate(binomial_data = map(.x = prop, .f = ~rbinom(n = 3000, size = 1, prob = .x))) %>%
  select(-prop) %>%
  pivot_wider(names_from = food, values_from = binomial_data) %>%
  unnest(cols = everything()) %>%
  mutate(ids = uuid(n = nrow(.)), .before = everything())

rowwise_data
#> # A tibble: 3,000 × 13
#>    ids              broccoli pizza   pie spinach sausage orange blueberries_muf…
#>    <chr>               <int> <int> <int>   <int>   <int>  <int>            <int>
#>  1 e2dac30f-047c-4…        0     1     1       0       1      0                0
#>  2 9d96697e-1843-4…        1     1     0       1       0      0                0
#>  3 ce78c6d5-1982-4…        1     1     0       0       0      0                0
#>  4 d7490bca-9743-4…        1     1     0       0       1      0                0
#>  5 d541fe6a-4737-4…        0     1     0       1       0      0                1
#>  6 84bd6454-72c7-4…        1     1     1       0       0      0                0
#>  7 eb70af64-48c7-4…        1     1     0       0       0      0                1
#>  8 066f2035-2398-4…        1     1     0       1       0      0                0
#>  9 6f865ff4-d1cb-4…        0     1     1       0       1      0                0
#> 10 4e9b33bd-3684-4…        0     0     0       0       0      0                0
#> # … with 2,990 more rows, and 5 more variables: chocolate_cake <int>,
#> #   pasta <int>, fried_potatoes <int>, burrito <int>, mac_and_cheese <int>

rw <- rowwise_data[2:13]  # just the 0/1 food indicator columns

# pairwise correlations between the food indicators, shown as a correlogram
model.matrix(~ 0 + ., data = rw) %>%
  cor(use = "pairwise.complete.obs") %>%
  ggcorrplot(show.diag = FALSE, type = "lower", lab = TRUE, lab_size = 2,
             colors = c("#6D9EC1", "white", "#E46726"))
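
As a rough sketch of the odds-ratio question above (the specific predictors here are only illustrative, and with independently simulated data the estimates will hover around 1):

# odds of choosing pizza as a function of choosing another carb-heavy dish and a fatty dish
glm(pizza ~ mac_and_cheese + sausage, data = rowwise_data, family = binomial) %>%
  broom::tidy(exponentiate = TRUE, conf.int = TRUE)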
