Run a glm over various subgroups of data & creating a new column with those values?

Hello!

I have this piece of code that runs a glm and generates the "midpoint" of a subject's responses [coded as trochiac/iambic, 0 or 1] to a list of numeric stimuli, saves the midpoint and also prints the result in the console.

glm.1 <- glm(coderesponse~stimulus, family = binomial(link="logit"), data=data)
midpoint <- -glm.1$coefficients[1]/glm.1$coefficients[2]

cat(sprintf("file : %s\nmidpoint : %.2f",datafile,midpoint))

At the moment, this code runs over the entire dataframe. I was wondering how to start modifying this code so that I could run it over various subgroups within my main dataframe and create a new column with those values for each subgroup?

For each subject, I would like to generate the midpoint value for each block (1-8) within each stimtype "bd", "nm" and "nm". That midpoint value would be the new value in the column for each block within each stimtype. We eventually want to aggregate the values of each block to be reduced to one row containing the midpoint value (rather than keeping all of the rows with the same value).

dummy version of my main dataframe:

subject <- c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2)
stimulus <- c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 1, 1, 1, 1, 1, 1)
block <- c(3, 3, 3, 7, 7, 7, 4, 4, 4, 8, 8, 8, 1, 1, 1, 5, 5, 5, 2, 2, 2, 6, 6, 6, 3, 3, 3, 7, 7, 7, 4, 4, 4, 8, 8, 8, 2, 2, 2, 6, 6, 6)
blockprocedure <- c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1)
stimtype <- c('bd', 'nd', 'nm', 'bd', 'nd', 'nm', 'bd', 'nd', 'nm', 'bd', 'nd', 'nm', 'bd', 'nd', 'nm', 'bd', 'nd', 'nm', 'bd', 'nd', 'nm', 'bd', 'nd', 'nm', 'bd', 'nd', 'nm', 'bd', 'nd', 'nm', 'bd', 'nd', 'nm', 'bd', 'nd', 'nm', 'bd', 'nd', 'nm', 'bd', 'nd', 'nm')
blocktype <- c('mouth', 'mouth', 'mouth', 'nose', 'nose', 'nose', 'mouth', 'mouth', 'mouth', 'nose', 'nose', 'nose', 'mouth', 'mouth', 'mouth', 'nose', 'nose', 'nose', 'mouth', 'mouth', 'mouth', 'nose', 'nose', 'nose', 'mouth', 'mouth', 'mouth', 'nose', 'nose', 'nose', 'mouth', 'mouth', 'mouth', 'nose', 'nose', 'nose', 'mouth', 'mouth', 'mouth', 'nose', 'nose', 'nose')
coderesponse <- c(1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1)

dummy = data.frame(subject, stimulus, block, stimtype, blockprocedure, blocktype, coderesponse)

If anyone could provide some starting points/solutions, that would be great!

using dummy as provided, getting the midpoint for nose or mouth

library(tidyverse)

 (grps <- unique(dummy$blocktype))
grouped_mids <- map_dbl(grps,
    ~{
      glm.temp <- glm(coderesponse~stimulus, family = binomial(link="logit"), data=dummy %>% filter(blocktype==.x))
      -glm.temp$coefficients[1]/glm.temp$coefficients[2]
    })

names(grouped_mids) <- grps
grouped_mids

Thank you for providing this!

From what I understand, this code finds and stores the midpoint value for all nose or mouth rows across both subjects, correct? I'm trying to write this value to a new column (e.g. dummy$midpoint) but since the stored value includes 2 values (one for nose, one for mouth) while the dataframe has 42 rows, I'm running into errors.

I was also wondering how I could modify this to further subset across subjects & blocks? I had tried adding filters for both and using unique for multiple columns, but I don't think that was the right approach.

Thank you again!

you want the values to be repeated/duplicated to fill the space?

perhaps you have significantly more data than the sample you shared, but if not you dont seem to have enough data for each possible pairing of subjects and blocks


(grps1 <- unique(dummy$blocktype))
(grps2 <- unique(dummy$stimulus))

(groups_to_do <- expand_grid(grps1,grps2))
grouped_mids <- map2_dbl(groups_to_do$grps1,groups_to_do$grps2,
    ~{
      glm.temp <- glm(coderesponse~stimulus, family = binomial(link="logit"), data=dummy %>% filter(blocktype==.x,
                                                                                                    stimulus==.y))
      -glm.temp$coefficients[1]/glm.temp$coefficients[2]
    })

names(grouped_mids) <- grps
grouped_mids

essentially yes or create two separate dataframes (one with the new column filled with repeated data and one the new column with the aggregated data) as there may be further data analysis to be made on the full dataset

ah yes, our dataset has 8000+ rows [approx 480 rows per subject], so the midpoint would be actually be calculated over 20 rows per "block" for each stimtype.

I tried running the code [on both the dummy & main dataframe), but I receive "NA" values for the midpoint.

well, can you run a manual glm, on a filtered data for just subject 1 and block mouth ?
I assume you have insufficient data for that and would get NA.
Therefore you might need to reevaluate your requirement to get models on each combination of subject and block?

yes I can run the manual glm & the midpoint on the subsetted data from my main dataframe:
(if i'm understanding you correctly)

## essentially want to run this code over each stimtype (bd, nd, nm) & blocks (1-8) for each subject as subsetted below
data = subset(data, subject =="5" & stimtype == "bd" & block == "2")

glm.1 <- glm(coderesponse~stimulus, family = binomial(link="logit"), data=data)
midpoint <- -glm.1$coefficients[1]/glm.1$coefficients[2]

cat(sprintf("file : %s\nmidpoint : %.2f",datafile,midpoint))
midpoint : 20.27

do you receive NA values for some midpoints , or all midpoints ?
it may be that its impossible to help you without my accessing your data, or a representative sample of your data. but, it could simply be that what you might wish to do is not meaningfull statistically. Consider....

in your example data we have

subset(dummy, blocktype=='mouth' & stimulus==1)

   subject stimulus block stimtype blockprocedure blocktype coderesponse
1        1        1     3       bd              1     mouth            1
2        1        1     3       nd              1     mouth            1
3        1        1     3       nm              1     mouth            1
37       2        1     2       bd              1     mouth            0
38       2        1     2       nd              1     mouth            1
39       2        1     2       nm              1     mouth            0

here stimulus is always 1, so beyond an intercept a second coefficient cant be found to make a relationship to coderesponse. i.e. one of the 2 coefficients is NA, so some 'midpoint' value of the two coefficients will be NA