Regression/Goodness-of-fit with summarized data

Vosyn · December 2, 2022, 5:30am

Hi there. To start off, I am new to R and am essentially trying to learn it by jumping in with a project. I will simplify the idea of my project (this is not my actual project, just an example for simplicity). I asked almost 400,000 individual people whether they drink tea or not. In addition, I recorded several other demographics such as country, gender, ethnicity, and several other characteristics. All of the 400,000 responses are binned and placed into summary tables. I would like to figure out which demographics of these individuals are associated with tea drinking. An example of some data is below. How would I go about doing a regression or goodness-of-fit model based on the summary tables? Thanks in advance for any help.

                Tea Use

Location	Yes	No
Australia	40	388
Canada	4219	13959
London	68719	150043
United States	20573	141608

                Tea Use

Gender	Yes	No
Male	49376	164396
Female	44176	141602

technocrat · December 2, 2022, 9:23am

It will be worth your while to get Friendly, M. & Meyer, D. (2016), Discrete Data Analysis with R: Visualization and Modeling Techniques for Categorical and Count Data, Boca Raton, FL: Chapman & Hall/CRC, after reviewing the {vcdExtra} website and the tutorial vignette in that package.

Vosyn · December 7, 2022, 5:07pm

Is it possible to use glm(Location ~ ., family = binomial(link = "logit"), data = Location), for example, on this binned data? If I try the function as written here, I get the error "Error in eval(family$initialize) : y values must be 0 <= y <= 1".

nirgrahamuk · December 7, 2022, 5:10pm

Your formula puts Location on the left-hand side, which says you are building a model to predict Location. That seems wrong.

Vosyn · December 7, 2022, 5:14pm

I want to figure out which Location is most associated with Tea use, as in the example above.

nirgrahamuk · December 7, 2022, 5:25pm

Finding the location with the highest proportion of Yes answers requires only arithmetic and sorting.

(dat_1 <- data.frame(
  stringsAsFactors = FALSE,
          Location = c("Australia", "Canada", "London", "United States"),
               Yes = c(40L, 4219L, 68719L, 20573L),
                No = c(388L, 13959L, 150043L, 141608L)
))

library(tidyverse)

(dat_2 <- mutate(dat_1,
                 prop = Yes/(No+Yes)) |> 
    arrange(desc(prop)))

# Top of the list: the row with the highest proportion of Yes
slice_max(dat_2,order_by = prop)

Answer: London, where about 31% answered Yes.

Vosyn · December 7, 2022, 5:30pm

Thanks for the reply and for helping out. Now let's say I wanted to compare multiple different characteristics (Location, Gender, Race, etc.) to find out which are most associated with Tea use. Would you still deal only with the proportions of Yes/No?

startz · December 7, 2022, 6:14pm

Three comments:

(1) if you are trying to explain tea use, tea is the dependent variable and Location, etc. are independent variables. (See @nirgrahamuk's comment.)

(2) The logit that you are using is for data where the left-hand side variable is zero or one, not a proportion.

(3) Since you have the individual data, why not use it rather than aggregating first?
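
A note on point (2) for readers working from summary tables: R's glm() with family = binomial also accepts aggregated counts when the response is a two-column matrix of successes and failures, cbind(Yes, No). Below is a minimal sketch using the Location table from the first post. The object names dat and fit are illustrative, and the reference level is simply the first one alphabetically (Australia).

```r
# Summary counts copied from the Location table in the first post
dat <- data.frame(
  Location = c("Australia", "Canada", "London", "United States"),
  Yes = c(40L, 4219L, 68719L, 20573L),
  No  = c(388L, 13959L, 150043L, 141608L)
)

# Two-column response: successes (Yes) and failures (No) for each row
fit <- glm(cbind(Yes, No) ~ Location,
           family = binomial(link = "logit"),
           data = dat)

# Coefficients are log-odds of tea use relative to Australia
summary(fit)

# With one parameter per location the model is saturated, so the
# fitted probabilities reproduce the observed proportions
data.frame(
  Location = dat$Location,
  observed = dat$Yes / (dat$Yes + dat$No),
  fitted   = unname(fitted(fit))
)

# Likelihood-ratio test of whether tea use differs across locations
anova(fit, test = "Chisq")
```

Because the tables in the first post are separate summaries for each characteristic, this can only be fitted one characteristic at a time. A model with Location and Gender together would need counts cross-classified by both.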

Vosyn · December 7, 2022, 6:46pm

Thank you. I appreciate your comments.
The more I think about it, I am thinking just doing proportions is probably more realistic since I have the entire population for my data set.

I was initially thinking of it in terms of zero or one though: if a person in Australia drinks tea, that would be 1, and if they did not, that would be 0. Those values were then summed into the total number of 1s and 0s, which is how the table was made.
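
For estimating a logistic regression, the 0/1 view and the summary table carry the same information. Below is a small sketch, using only the Australia and Canada rows to keep it quick. It expands the counts back into one row per person and compares the individual-level fit with the aggregated fit. All object names (agg, ind, fit_ind, fit_agg) are illustrative.

```r
# Aggregated counts for two locations from the first post
agg <- data.frame(
  Location = c("Australia", "Canada"),
  Yes = c(40L, 4219L),
  No  = c(388L, 13959L)
)

# One row per person: all Yes rows first (tea = 1), then all No rows (tea = 0)
ind <- data.frame(
  Location = rep(rep(agg$Location, times = 2), times = c(agg$Yes, agg$No)),
  tea      = rep(c(1L, 0L), times = c(sum(agg$Yes), sum(agg$No)))
)

# Individual-level logistic regression on the 0/1 outcome
fit_ind <- glm(tea ~ Location, family = binomial, data = ind)

# Aggregated logistic regression on the Yes/No counts
fit_agg <- glm(cbind(Yes, No) ~ Location, family = binomial, data = agg)

# Compare coefficients; they are expected to agree up to numerical tolerance
all.equal(coef(fit_ind), coef(fit_agg), tolerance = 1e-5)
```

The residual deviance reported by the two fits differs, because it is measured against a different saturated model in each case, but the coefficient estimates are the same.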

It would have been possible to use the individual data, but the data needed a lot of work before it was usable, so it was aggregated as it was obtained. Each individual had a unique identifier, which had to be used to figure out the location, gender, race, etc. A separate identifier tied to the individual then had to be used to find out if they used tea.
