# Regression/Goodness-of-fit with summarized data

Hi there, just to start off, I am new to R and am essentially trying to learn it by jumping in with a project. I will simplify the idea of my project (this is not my actual project, just a stand-in for simplicity). I surveyed almost 400,000 people and asked whether they drink tea or not. In addition to tea drinking, I recorded several demographics such as country, gender, ethnicity, and several other characteristics. All of the 400,000 responses are binned and placed into summary tables. I would like to figure out which of these demographics are associated with tea drinking. An example of some of the data is below. How would I go about fitting a regression or goodness-of-fit model based on the summary tables? Thanks in advance for any help.

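| Location | Tea Use: Yes | Tea Use: No |
|---|---|---|
| Australia | 40 | 388 |
| London | 68719 | 150043 |
| United States | 20573 | 141608 |

| Gender | Tea Use: Yes | Tea Use: No |
|---|---|---|
| Male | 49376 | 164396 |
| Female | 44176 | 141602 |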

It will be worth your while to get Friendly, M. & Meyer, D. (2016). *Discrete Data Analysis with R: Visualization and Modeling Techniques for Categorical and Count Data*. Boca Raton, FL: Chapman & Hall/CRC, after reviewing the {vcdExtra} website and the tutorial vignette in that package.

Is it possible to use `glm(Location ~ ., family = binomial(link = "logit"), data = Location)`, for example, on this binned data? If I try to use the function as written here, I get the error `Error in eval(family$initialize) : y values must be 0 <= y <= 1`.

Your formula says you are building a model to predict Location, which seems wrong.


I want to figure out which Location is most associated with Tea use, as in the example above.

Finding the highest proportion of Yes out of all responses (Yes + No) requires only arithmetic and sorting.

# Summary counts of tea use by location
(dat_1 <- data.frame(
  stringsAsFactors = FALSE,
  Location = c("Australia", "Canada", "London", "United States"),
  Yes = c(40L, 4219L, 68719L, 20573L),
  No = c(388L, 13959L, 150043L, 141608L)
))

library(tidyverse)

# Proportion of tea drinkers per location, sorted from highest to lowest
(dat_2 <- mutate(dat_1,
                 prop = Yes / (No + Yes)) |>
  arrange(desc(prop)))

# Top of the list
slice_max(dat_2, order_by = prop)
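If you also want to check whether the differences between those proportions are larger than chance alone would produce, one possible follow-up (not part of the reply above) is a chi-squared test of independence on the same counts. This is only a minimal sketch with base R's `chisq.test()`; it re-creates `dat_1` so it runs on its own, and the names `counts` and `test` are just illustrative.

```r
dat_1 <- data.frame(
  Location = c("Australia", "Canada", "London", "United States"),
  Yes = c(40L, 4219L, 68719L, 20573L),
  No  = c(388L, 13959L, 150043L, 141608L)
)

# Build a 4 x 2 matrix of counts with the locations as row names
counts <- as.matrix(dat_1[, c("Yes", "No")])
rownames(counts) <- dat_1$Location

# Chi-squared test of independence between Location and tea use
test <- chisq.test(counts)
test

# Standardized residuals show which cells depart most from independence
round(test$stdres, 1)
```

With counts this large, almost any difference will come out statistically significant, so the proportions and residuals are probably more informative than the p-value itself.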


Thanks for the reply and for helping out. Now, let's say I wanted to compare multiple characteristics (Location, Gender, Race, etc.) to find out which are most associated with Tea use. Would you still work only with the proportions of Yes/No?

(1) If you are trying to explain tea use, tea use is the dependent variable and Location, etc. are the independent variables. (See @nirgrahamuk's comment.)

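(2) The logit model as you have written it expects the left-hand side to be a zero/one outcome (or, for grouped data, a proportion with weights or a two-column matrix of success and failure counts), not a category like Location.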

(3) Since you have the individual data, why not use it rather than aggregating first?
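To make points (1) and (2) concrete, here is a rough sketch of how the summary table could be fed to `glm()` with tea use on the left-hand side. It relies on the documented option of giving a binomial `glm()` a two-column matrix of successes and failures as the response. The data frame is the `dat_1` example from earlier in the thread, re-created so the block runs on its own; `fit` is just an illustrative name.

```r
dat_1 <- data.frame(
  Location = c("Australia", "Canada", "London", "United States"),
  Yes = c(40L, 4219L, 68719L, 20573L),
  No  = c(388L, 13959L, 150043L, 141608L)
)

# Grouped-binomial logistic regression: the response is a two-column
# matrix of (tea drinkers, non-drinkers) for each row of the summary table
fit <- glm(cbind(Yes, No) ~ Location,
           family = binomial(link = "logit"),
           data = dat_1)
summary(fit)

# Odds ratios relative to the baseline level (Australia, first alphabetically)
exp(coef(fit))

# With one parameter per location the model is saturated, so the fitted
# probabilities simply reproduce the observed proportions
cbind(dat_1, fitted = fitted(fit))
```

Note that separate one-way tables (Location by tea use, Gender by tea use) only support one predictor at a time. To model several characteristics jointly you would need either the individual-level responses, as suggested in (3), or a cross-classified table with one row per combination of Location, Gender, and so on, with its own Yes and No counts.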
