How to print the proportion of values in a list where P < 0.05?

Hello all,

I have this function within a loop:

data2 <- anova(lm(y~x1+groupcategory))$"Pr(>F)"

Which collects the P values from the anova I have run above.

Print(data2) gives me the P-value for each iteration in a list.

However, I want to calculate the proportion of values in this list that have a value of P <0.05.

Does anybody have any ideas?

If you are storing the p values in a vector called pValues, you can find the proportion < 0.05 with

sum(pValues < 0.05)/length(pValues)

The comparison pValues < 0.05 returns either TRUE or FALSE where TRUE = 1 and FALSE = 0, so the sum is equal to the count of TRUE values.

Hey FJCC,

If I want to enter that function once I have completed x number of iterations and thus produced the full list of P-values, do I place that formula outside the loop parentheses {}.

Yes, you would do the calculation outside of the loop. That means you have to store all of the p values. The code

data2 <- anova(lm(y~x1+groupcategory))$"Pr(>F)"

only stores a single value. If you need help filling a vector with values, I would want to see your code for the loop before I suggest how to do that.

Could that be something like df =
df = rbind(df, data.frame(data2))
so on each iteration it adds the P-value (stored in a dataframe) into the data frame df
Or in other words, binds the two data frames?

d = data.frame()
rbind(d,data.frame(data2))

or more like this

Yes, you could store the results in a data frame. The formula I provided would have to be modified.

sum(d$data2 < 0.05)/nrow(d)

If you know how many values you will get, it is better to prepare a vector to receive them but the data frame approach will work.

Edit: Be sure to define

d = data.frame()

outside of the loop.

Instead of taking sum first and then dividing by length, mean seems to be a better option.

1 Like

Hey FJCC,

Here is my command overall:

install.packages("ggplot2")
library(ggplot2)
install.packages("tibble")
library(tibble)

coef1 <- 5
coef2 <- -0.03
coef3 <- 10
coef4 <- -0.05
distances <- c(60,90,135,202.5,303.75,455.625)
x1 <- distances
z <- coef1 + coef2x1
z1 <- coef3 + coef4
x1
n_trials <- 60
x <- 1:29
i <- 1:1000
groupcategory = c(1,1,1,1,1,1,2,2,2,2,2,2)
d = data.frame()

for (i in x) {
for (i in i){
pr = 1/(1+exp(-z))
y = rbinom(z,n_trials,pr)
df0 = data.frame(x1, y)
pr = 1/(1 + exp(-z1))
y = rbinom(z,n_trials,pr1)
df1 = data.frame(x1, y)
df5 = rbind(df1,df0)
df6 = cbind(df5,groupcategory)
data2 <- aov(y~x1+groupcategory+x1:groupcategory,data=df6)
data4 <- summary(data2)
data5 <- data4$"Pr(>F)"[3]
data6 <- data.frame(data5)
rbind(data6,d)
}
data3=sum(d$data6<0.05)/nrow(d)
print(data3)
}

So in other words,

  1. I have a sample size of 1 to 29.
  2. I am looping through each sample size 1000 times.
  3. I am working out the proportion of P-values < 0.05 over the 1000 iterations.
  4. I am doing step 3 for each sample size.

Thus: N = 1, Proportion of P<0.05= ...., N=2, Proportion of P<0.05= ...., N=3, Proportion of P<0.05= ......,

When I print the ANOVA table I get:

             Df Sum Sq Mean Sq F value   Pr(>F)    

x1 1 6281 6281 43.017 0.000177 ***
groupcategory 1 114 114 0.781 0.402535
x1:groupcategory 1 32 32 0.223 0.649708
Residuals 8 1168 146

Signif. codes: 0 ‘’ 0.001 ‘’ 0.01 ‘’ 0.05 ‘.’ 0.1 ‘ ’ 1

And I want to select only the 3rd P value (0.649708)...

But the function data5 <- data4$"Pr(>F)"[3] gives me 'NULL' ?!

And I'm not sure why

To answer the immediate question, you want to generate 29 values of

sum(d$data6<0.05)/nrow(d)

I would then define a vector with 29 elements before the two for loops

FracP <- vector(mode = "numeric", length = 29)

You can then store the calculated value with

FracP[i] <- sum(d$data6<0.05)/nrow(d)

putting that where you currently store that value in the variable data3.

It is not a good idea to use the variable i as the index of both for loops. That is confusing at best and I expect it will cause weird loop behavior.

I do not see where you use x, your sample size, anywhere.

Concerning getting the desired P value, data4 is a list with one element that is a data frame. Try using,

data4[[1]]["x1:groupcategory", "Pr(>F)"]

FracP[i] <- sum(d$data6<0.05)/nrow(d)

The purpose of the [I] is to...? Or should it just be FracP <- sum(d$data6<0.05)/nrow(d)

since FracP <- vector(mode="numeric",length=29)

Thanks so much!

Oh I see, so the FracP[I] allows us to store the proportion value as an index in the vector FracP.

I wrote: x <- 1:29 in the first paragraph - I wanted to iterate through 29 values, and so I stated for (I in x) as I wanted the loop to apply to each numerical value in 1 to 29.

I have modelled this simulation on a binomial distribution, where y = rbinom(e, y, z) is the number of correct responses across Y trials. The number of trials per subject is 60.
e refers to the number of observations per subject, which in this case is 6. Therefore y is the number of correct responses across 60 trials per subject. Z is the probability of success per trial which I have defined in the first paragraph.

Therefore, I want to run the loop when considering 2 subjects, 3 subjects, 4 subjects, 5 subjects until I get to 29 subjects.

I have appended my R script below:

coef1 <- 5
coef2 <- -0.03
coef3 <- 10
coef4 <- -0.05
distances <- c(60,90,135,202.5,303.75,455.625)
x1 <- distances
z <- coef1 + coef2x1
z1 <- coef3 + coef4
x1
n_trials <- 60
x <- 1:29
i <- 1:1000
e <- 6
groupcategory = c(1,1,1,1,1,1,2,2,2,2,2,2)
d = data.frame(x=rep(0,1000)
FracP <- vector(mode="numeric",length=29)

for (i in x) {
for (j in i){
pr = 1/(1+exp(-z))
y = rbinom(e,n_trials,pr)
df0 = data.frame(x1, y)
pr = 1/(1 + exp(-z1))
y = rbinom(e,n_trials,pr1)
df1 = data.frame(x1, y)
df5 = rbind(df1,df0)
df6 = cbind(df5,groupcategory)
data2 <- aov(y~x1+groupcategory+x1:groupcategory,data=df6)
data4 <- summary(data2)
data5 <- data4[[1]]["x1:groupcategory","Pr(>F)"]
data6 <- data.frame(data5)
cbind(data6,d)
}
FracP[i] <- sum(d$data6<0.05)/nrow(d)
}

If a for loop starts with

for (j in i)

and i is just a number, the for loop will run only once. Do you mean

for (j in 1:i)

yes! Oops!

And I changed the first paragraph again by amending the d=data.frame command

d = data.frame(x=rep(0,1000))

as I want to put the 1000 iterations into d

would that be the same for 1:x

so at the end of the command, the object 'FracP' should the list of proportions for each N number.

i <- 1:1000

and so for (j in 1:i) will complete the loop 1000x times?

Now I got tangled up. I missed that you defined i as 1:1000 and as the index of the outside for loop. Change one of those so that there is no conflict.

OneTo1000 <- 1:1000
OneTo29 <- 1:29
.
.
.
for ( i in OneTo29) {
  for (j in OneTo1000) {
.
.

Do not prefiill the data frame with zeros. You are using rbind to build the data frame and your data will be stacked under all of those zeros.

library(tidyverse)

coef1 <- 5
coef2 <- -0.03
coef3 <- 5
coef4 <- -0.03
distances <- c(60,90,135,202.5,303.75,455.625)
x1 <- distances
z <- coef1 + coef2*x1
z1 <- coef3 + coef4*x1
n_trials <- 60
oneto29 <- 29
oneto1000 <- 10
e <- 6 
groupcategory = c(1,1,1,1,1,1,2,2,2,2,2,2)
d = NULL
FracP <- vector(mode="numeric",length=29)
pr = 1/(1+exp(-z))
pr1 = 1/(1+exp(-z1))

for (i in 1:oneto29) {
  for (j in 1:oneto1000){
    df <- c()
    for (k in 1:oneto29){
      rbind(df, data.frame(x1 = rep(distances, 2),
                     y = c(rbinom(e,n_trials,pr), rbinom(e,n_trials,pr1)),
                     groupcategory = groupcategory, id = c(rep(k*2-1, 6), rep(k*2, 6))))
    }
    df_aov <- aov(y~x1*groupcategory+Error(id/(x1*groupcategory)),data=df)
    df_aov_sum <- summary(df_aov)
    pvalue <- df_aov_sum[[5]]
    pvalue[[1]]["groupcategory","Pr(>F)"]
    d = rbind(d,data.frame(pvalue))
  }

  count <- plyr::ldply(d,function(c) sum(c<=0.05))
  FracP[i] <- count$V1/oneto1000 
  d <- NULL
}

This is my renewed script at the moment. In other words, a vector of twenty-nine P-values will be printed at the end. With each P-value corresponding to a particular sample size. The first P-value corresponding to 1 subject, the 2nd P-value corresponding to 2-subjects and so on. The y =rbinom(x,y,z) command only considers the proportion of correct trials for a single subject, and so I manipulated the formula above to consider 2-subjects, then 3-subjects, on each iteration!

Does that look right?