cor.test between paired samples, stuck after several iteration on for lop

statistics

#1

I have a matrix (1580 x 37), I need to compute cor.test between each row.

n = nrow(filtered)
c =ncol(filtered)

cor.table = NULL
x = NULL
y = NULL
cor.val = NULL
cor.sig = NULL
library(psych)

for  (i in 1 : (n-1)) {
  x_name = rownames(filtered)[i]
  x_exps = filtered[i, ]	
  
  for (j in (i+1) : n) {
    y_name = rownames(filtered)[j]
    y_exps = filtered[j, ]
    
    output = cor.test(x_exps,y_exps)
    
    x = c(x, x_name)
    y = c(y, y_name)
    cor.val = c(cor.val, output$estimate)
    cor.sig = c(cor.sig, output$p.value)

  }
}
adjusted_p_cor <- p.adjust(cor.sig, method = "fdr", n = length(cor.sig))
cor.table = data.frame (x, y, cor.val, adjusted_p_cor)

For a small matrix the code works fine and return me correct results:

x y cor.val adjusted_p_cor
sample_1 Sample_2 -0.46 0.008
sample_1 Sample_3 0.82 5.50

and so on ....

For a big matrix, calculation is slow and R stuck after several hours.

I will appreciate if anyone let me know alternative solutions.


#2

It looks like your code grows the output vectors each time through the loop, rather than allocating vectors of the required size up front. This is the second circle of hell in the R Inferno and is one factor that's probably slowing down the processing.

Instead, you could declare objects of the required size up front. For example:

x = rep(NA_real_, ncol(combn(1580,2)))

Here's another approach that uses pmap to reduce the amount of code needed for the iteration:

library(tidyverse)

# Fake data
set.seed(2)
dd = matrix(rnorm(1580*37), ncol=37)

# Get row pairs
row.pairs = t(combn(1580, 2)) %>% as.data.frame()

# Run cor.test on every pair of rows and store in a list
cor.list =  row.pairs %>% 
  pmap(~ cor.test(dd[.x,],dd[.y, ]))

# Name list elements for the row pairs that were tested
names(cor.list) = apply(row.pairs, 1, paste, collapse="-")

Or, if you want to return a data frame with just the estimate and the p-value:

cor.df = map2_df(row.pairs[,1], row.pairs[,2],
                   function(a, b) {
                     ct = cor.test(dd[a, ], dd[b, ])
                     data.frame(Sample_1=a, 
                                Sample_2=b,
                                estimate=ct$estimate,
                                p.value=ct$p.value)
                   })

You're doing more than 1.2 million tests, so it's probably going to take a while in any case. On my new Macbook Pro the pmap approach took about 2.5 minutes.

Regardless of the how this is coded, does it make analytical sense to run so many significance tests? It seems likely that you'd have serious multiple testing issues, a lot of false positives, and a high risk of many Type M and Type S errors.


#3

Thank you so much for the reply.
I'm not statistician, I read about multiple testing issues and I tried to avoid this issue applying false discovery rate that is used as an alternative to the Bonferroni.

adjusted_p_cor <- p.adjust(cor.sig, method = "fdr", n = length(cor.sig))

Multiple comparisons

I'm not sure what i did but that is what I understood.
Do you think is it good way to solve the issue?


#4

I didn't notice those last two rows of your code where you do the multiple comparison correction. I'm not an expert in this area and I would suggest asking a question on CrossValidated to get expert advice.

In my experience, multiple comparison corrections like Bonferroni are the "standard" approach. However, the number of comparisons you're making is unusually huge and the number of observations (37) is small, so the results will be very noisy. Also, although significance tests have traditionally been a standard "gatekeeper" for admission of statistical results, there is somewhat of a reformation going on in the statistics profession regarding the appropriate role and use of significance tests, since such tests are so frequently abused. See here and here for example.


#5

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.