Regression/Goodness-of-fit with summarized data

Hi there. To start off, I am new to R and am essentially trying to learn it by jumping in with a project. I will simplify the idea of my project (this is not my actual project, just a stand-in used for simplicity). I asked almost 400,000 individual people whether they drink tea or not. In addition, I recorded several demographics such as what country they are in, gender, ethnicity, and several other characteristics. All of the 400,000 responses are binned and placed into a summary table. I would like to figure out which of these demographics are associated with tea drinking. An example of some data is below. How would I go about fitting a regression or goodness-of-fit model based on the summary table? Thanks in advance for any help.

                  Tea Use
Location          Yes      No
Australia          40     388
Canada           4219   13959
London          68719  150043
United States   20573  141608

                  Tea Use
Gender            Yes      No
Male            49376  164396
Female          44176  141602

It will be worth your while to get Friendly, M. & Meyer, D. (2016). Discrete Data Analysis with R: Visualization and Modeling Techniques for Categorical and Count Data. Boca Raton, FL: Chapman & Hall/CRC, after reviewing the {vcdExtra} website and the tutorial vignette in that package.
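As a first look before diving into the book, a chi-squared test of independence can be run directly on the summary counts with base R. This is only a minimal sketch: the matrix name `tea_by_location` is made up for illustration, and the counts are copied from the Location table in the question.

```r
# Location-by-tea-use counts copied from the summary table in the question
# (tea_by_location is a hypothetical name used only for this sketch)
tea_by_location <- matrix(
  c(   40,    388,
     4219,  13959,
    68719, 150043,
    20573, 141608),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    Location = c("Australia", "Canada", "London", "United States"),
    Tea      = c("Yes", "No")
  )
)

# Pearson chi-squared test of independence between Location and tea use
test <- chisq.test(tea_by_location)
test

# Pearson residuals: a positive value in the "Yes" column means more tea
# drinkers in that location than independence would predict
round(test$residuals, 1)
```

With counts this large, almost any difference will come out as statistically significant, so the residuals (or the proportions themselves) are likely to be more informative than the p-value.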

Is it possible to use glm(Location ~ ., family = binomial(link = "logit"), data = Location), for example, on this binned data? If I try to use the function as written here, I get the error "Error in eval(family$initialize) : y values must be 0 <= y <= 1".

Your formula says you are building a model to predict Location, which seems wrong.


I want to figure out which Location is most strongly associated with tea use, as in the example above.

Finding the location with the highest proportion of Yes responses requires only arithmetic and sorting.

(dat_1 <- data.frame(
  stringsAsFactors = FALSE,
          Location = c("Australia", "Canada", "London", "United States"),
               Yes = c(40L, 4219L, 68719L, 20573L),
                No = c(388L, 13959L, 150043L, 141608L)
))

library(tidyverse)

(dat_2 <- mutate(dat_1,
                 prop = Yes/(No+Yes)) |> 
    arrange(desc(prop)))

#top of the list
slice_max(dat_2,order_by = prop)

Answer: London.


Thanks for the reply and for helping out. Now let's say I wanted to compare multiple characteristics (Location, Gender, Race, etc.) to find out which are most strongly associated with tea use. Would you still deal only with the proportions of Yes/No?

Three comments:

(1) If you are trying to explain tea use, tea is the dependent variable and Location, etc. are independent variables. (See @nirgrahamuk's comment.)

(2) The logit that you are using is for data where the left-hand side variable is zero or one, not a proportion.

(3) Since you have the individual data, why not use it rather than aggregating first?
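If only the summary table is available, one option is to give `glm()` a two-column response of successes and failures, which the binomial family accepts. The sketch below assumes the `dat_1` layout from the earlier reply (one row per Location with `Yes` and `No` counts); `fit` is just an illustrative name.

```r
# Summary counts, same layout as dat_1 in the earlier reply
dat_1 <- data.frame(
  Location = c("Australia", "Canada", "London", "United States"),
  Yes      = c(40L, 4219L, 68719L, 20573L),
  No       = c(388L, 13959L, 150043L, 141608L)
)

# Tea use is the response: cbind(successes, failures) on the left-hand side,
# Location on the right-hand side
fit <- glm(cbind(Yes, No) ~ Location,
           family = binomial(link = "logit"),
           data = dat_1)
summary(fit)

# With one parameter per location the model is saturated, so the fitted
# probabilities simply reproduce the observed proportions
cbind(dat_1,
      fitted   = fitted(fit),
      observed = with(dat_1, Yes / (Yes + No)))
```

The coefficients are log-odds relative to the first level (Australia here, by alphabetical order). To put several characteristics into one model, the counts would need to be cross-tabulated (for example, Yes/No for every Location by Gender combination); separate one-way tables like those in the question do not carry that joint information.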


Thank you. I appreciate your comments.
The more I think about it, the more I think that just using proportions is probably more realistic, since my data set covers the entire population.

I was initially thinking of it in terms of zero or one, though: if a person in Australia drinks tea, that would be 1, and if they did not, that would be 0. All of those were then summed into the total number of 1's and 0's, which is how the table was made.
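That zero-or-one view can be kept even with aggregated data, without expanding back to 400,000 rows. A rough sketch, assuming the same Location counts as above: put one row per Location and outcome, code the outcome as 1 or 0, and pass the cell counts as frequency weights. The names `long`, `tea`, `n`, `fit_long`, and `fit_agg` are made up for this illustration.

```r
# Summary counts from the question
dat_1 <- data.frame(
  Location = c("Australia", "Canada", "London", "United States"),
  Yes      = c(40L, 4219L, 68719L, 20573L),
  No       = c(388L, 13959L, 150043L, 141608L)
)

# One row per (Location, outcome) cell: tea = 1 for "Yes", 0 for "No",
# and n = how many people fall in that cell
long <- data.frame(
  Location = rep(dat_1$Location, times = 2),
  tea      = rep(c(1L, 0L), each = nrow(dat_1)),
  n        = c(dat_1$Yes, dat_1$No)
)

# Logistic regression on the 0/1 outcome, weighting each row by its count
fit_long <- glm(tea ~ Location, family = binomial, weights = n, data = long)

# Same model fitted to the aggregated successes/failures for comparison
fit_agg <- glm(cbind(Yes, No) ~ Location, family = binomial, data = dat_1)

# The two sets of coefficient estimates should agree up to numerical tolerance
cbind(weighted_01 = coef(fit_long), aggregated = coef(fit_agg))
```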

It would have been possible to use the individual data, but the data needed a lot of work before it was usable, so it was aggregated as it was obtained. Each individual had a unique identifier, and that identifier had to be used to figure out the location, gender, race, etc. A separate identifier tied to the individual then had to be used to find out whether they used tea.
