Exclude values based on GLM's generated plot/curve

jubejube · November 27, 2020, 9:29am

have this function that generates a midpoint/decision boundary over certain subgroups of my dataframe -- I've tried creating a dummy dataframe, denoted as "df" below to work with.

## glm that generates a midpoint/decision boundary, wrapped into a function

get_midpoint = function(data){
      glm.1 = glm(coderesponse~stimulus, family = binomial(link="logit"), data=data, na.action = na.exclude)
      rtn = -glm.1$coefficients[1]/glm.1$coefficients[2]
      rtn
}

## a mini dummy dataframe 

subject <- c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2)
stimulus = c(1, 5, 50, 35, 23, 2, 4, 22, 15, 6, 20, 40, 45, 10, 37, 43, 48, 7, 19, 21, 29, 49, 26, 11, 36, 30, 39, 41, 16, 37, 1, 5, 50, 35, 23, 2, 4, 22, 15, 6, 20, 40, 45, 10, 37, 43, 48, 7, 19, 21, 29, 49, 26, 11, 36, 30, 39, 41, 16, 37)
stimtype <- c('bd', 'nd', 'nm', 'bd', 'nd', 'nm', 'bd', 'nd', 'nm', 'bd', 'nd', 'nm', 'bd', 'nd', 'nm', 'bd', 'nd', 'nm', 'bd', 'nd', 'nm', 'bd', 'nd', 'nm', 'bd', 'nd', 'nm', 'bd', 'nd', 'nm', 'bd', 'nd', 'nm', 'bd', 'nd', 'nm', 'bd', 'nd', 'nm', 'bd', 'nd', 'nm', 'bd', 'nd', 'nm', 'bd', 'nd', 'nm', 'bd', 'nd', 'nm', 'bd', 'nd', 'nm', 'bd', 'nd', 'nm', 'bd', 'nd', 'nm')
blocktype <- c('mouth', 'mouth', 'mouth', 'nose', 'nose', 'nose', 'mouth', 'mouth', 'mouth', 'nose', 'nose', 'nose', 'mouth', 'mouth', 'mouth', 'nose', 'nose', 'nose', 'mouth', 'mouth', 'mouth', 'nose', 'nose', 'nose', 'mouth', 'mouth', 'mouth', 'nose', 'nose', 'nose', 'mouth', 'mouth', 'mouth', 'nose', 'nose', 'nose', 'mouth', 'mouth', 'mouth', 'nose', 'nose', 'nose', 'mouth', 'mouth', 'mouth', 'nose', 'nose', 'nose', 'mouth', 'mouth', 'mouth', 'nose', 'nose', 'nose', 'mouth', 'mouth', 'mouth', 'nose', 'nose', 'nose')
coderesponse <- c(1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0)

df = data.frame(subject, stimulus, stimtype, blocktype, coderesponse)

## running the following function over defined subgroups of ~80 rows each [for the real data]
## but for the dummy dataframe, only runs over ~5 rows per subgroup

df = df %>% 
  nest(data=-c(subject, stimtype, blocktype)) %>%
  mutate(midpoint=map_dbl(df, get_midpoint)) %>%
  unnest()

## basic code that plots and creates a curve based on a single glm result...?

plot(df$stimulus,df$coderesponse,xlab="stimulus",ylab="Probability of d responses")
curve(predict(glm.1,data.frame(stimulus=x),type="response"),add=TRUE)

I was wondering if it's possible to turn the plot + curve pieces of code into a function that can also be run over the same subgroups? Also, as an extension of the initial question, is it possible then to exclude glm values if their corresponding curve does not meet a certain criteria e.g. whatever makes the curve look flat or not as "S" shaped?

I hope this makes sense -- I'm still learning about modelling/plotting in R. Thank you for any advice in advance!

nirgrahamuk · November 27, 2020, 4:13pm

When I first read your post, I thought perhaps you had most of a solution, and wanted help with a detail, i.e. the last line.. but then I tried your example code, and at first I thought it only failed because of differences in variable names that you use, stimulus vs stim, but then I realised that the logic isnt implemented as you suggest it might be, there is nothing to group by 5's etc...

perhaps you should look into the slider package for applying functions over windows of rows

jubejube · November 27, 2020, 7:32pm

Ah so sorry, I've edited the variable names in the code so it should be working now to create a midpoint number for every subgroup of 'stimtype and blocktype (e.g. nd & mouth)'.

I'm having trouble with the second part -- modifying the code that plots and creates a curve to work with the same set of data and possibly eliminate values that do not "fit" the curve.

May I ask what the "slider" package does?

Thanks!

nirgrahamuk · November 27, 2020, 9:31pm

but this gives errors even now

df %>% 
  nest(data=-c(subject, stimtype, blocktype)) %>%
  mutate(midpoint=map_dbl(df, get_midpoint)) %>%
  unnest()

 Error: Problem with `mutate()` input `midpoint`.
x numeric 'envir' arg not of length one
i Input `midpoint` is `map_dbl(df, get_midpoint)`

My thought is that perhaps you meant to do :

df = df %>% 
  nest(data=-c(subject, stimtype, blocktype)) %>%
  mutate(midpoint=map_dbl(data, get_midpoint)) %>%
  unnest(data)

but notice the warnings

Problem with `mutate()` input `midpoint`.
i glm.fit: fitted probabilities numerically 0 or 1 occurred

I would think this would call into doubt the whole enterprise, as you couldnt trust such coefficients... unless perhaps your 80 size groups avoids this ?

What you are doing seems to have a loess type of feel, are you sure you woudnt rather just use that ?

jubejube · November 27, 2020, 10:23pm

Hmm, that's odd. I don't seem to receive those errors when I run the code over the 80 size groups! I do get the errors with the dummy group, perhaps it's because the stimulus and response values were inputted at random so the coefficients become kind of buggy?

The method we are using now has been adapted from a previous study, therefore I am not familiar with the loess approach -- how would I go about that?

nirgrahamuk · November 27, 2020, 10:34pm


subject <- c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2)
stimulus = c(1, 5, 50, 35, 23, 2, 4, 22, 15, 6, 20, 40, 45, 10, 37, 43, 48, 7, 19, 21, 29, 49, 26, 11, 36, 30, 39, 41, 16, 37, 1, 5, 50, 35, 23, 2, 4, 22, 15, 6, 20, 40, 45, 10, 37, 43, 48, 7, 19, 21, 29, 49, 26, 11, 36, 30, 39, 41, 16, 37)
stimtype <- c('bd', 'nd', 'nm', 'bd', 'nd', 'nm', 'bd', 'nd', 'nm', 'bd', 'nd', 'nm', 'bd', 'nd', 'nm', 'bd', 'nd', 'nm', 'bd', 'nd', 'nm', 'bd', 'nd', 'nm', 'bd', 'nd', 'nm', 'bd', 'nd', 'nm', 'bd', 'nd', 'nm', 'bd', 'nd', 'nm', 'bd', 'nd', 'nm', 'bd', 'nd', 'nm', 'bd', 'nd', 'nm', 'bd', 'nd', 'nm', 'bd', 'nd', 'nm', 'bd', 'nd', 'nm', 'bd', 'nd', 'nm', 'bd', 'nd', 'nm')
blocktype <- c('mouth', 'mouth', 'mouth', 'nose', 'nose', 'nose', 'mouth', 'mouth', 'mouth', 'nose', 'nose', 'nose', 'mouth', 'mouth', 'mouth', 'nose', 'nose', 'nose', 'mouth', 'mouth', 'mouth', 'nose', 'nose', 'nose', 'mouth', 'mouth', 'mouth', 'nose', 'nose', 'nose', 'mouth', 'mouth', 'mouth', 'nose', 'nose', 'nose', 'mouth', 'mouth', 'mouth', 'nose', 'nose', 'nose', 'mouth', 'mouth', 'mouth', 'nose', 'nose', 'nose', 'mouth', 'mouth', 'mouth', 'nose', 'nose', 'nose', 'mouth', 'mouth', 'mouth', 'nose', 'nose', 'nose')
coderesponse <- c(1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0)

df = data.frame(subject, stimulus, stimtype, blocktype, coderesponse)

## running the following function over defined subgroups of ~80 rows each [for the real data]
## but for the dummy dataframe, only runs over ~5 rows per subgroup
library(tidyverse)

ggplot(data=df,
       aes(y=coderesponse,
           x=stimulus,
           color=stimtype)) + geom_point() +
  geom_smooth()

# span	
# the parameter α which controls the degree of smoothing.
ggplot(data=df,
       aes(y=coderesponse,
           x=stimulus,
           color=stimtype)) + geom_point() +
  geom_smooth(span=1.5 )

span .75 default

span 1.5

jubejube · November 27, 2020, 11:41pm

Ah, thank you!

edit: I found out that we actually had computed loess plots for the raw data. For the coded data, we wanted to enter it into a logistic regression to generate each participant's exact boundary. From then on, we've only been excluding values that were negative numbers or out of range, but we wanted to explore any finer-grain ways of excluding values. Therefore, we've been looking at possibly excluding via curves/plots generated.

To better word our concerns...

We're wondering how to predict and plot all of the possible curves generated from each subgroup at once?
If it's even possible to possible to exclude/filter based on the shape of the curves predicted from each estimated boundary?

Thanks for any help or insight that anyone can provide!

system · December 18, 2020, 11:41pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.