Correlation for multiple comparisons

Bine · February 7, 2023, 9:06am

Dear all,

I wonder if you have an idea how I could do the following:

I have table with e.g. 100 individuals and their real & predicted gene expression for gene X:

Gene X:
Individual | predicted | real
1                   | 45                 | 34
2                   |  42                | 19
.....                   ......               .......

Then I calculate the correlation between predicted and real with cor.test(combined$Real Expression,combined$Predicted Expression, method="kendall").

This I did individually for a few genes.

But now I would like to do it for all 9000 genes and I dont want to run this individually 9000 times.
Has anyone an idea how I could do this?

I have a table which looks like this:

All genes

Individual | predicted  gene x| real gene x | predicted gene y | real gene y | predicted gene z | real gene z
1                   | 45                 | 34        | 34                 |   23                             | 34                  | 24                              | 23                                 
2                   |  42                | 19        | 23                 |    12                            |27                   |67                               |65 
.....                   ......               .......    ....                        ...                                 ...                      ...                                   ...



Thank you & Best Regards,
Bine

technocrat · February 7, 2023, 10:22am

It can be done with patience (I estimate it would take around 3-4 years on my machine). Calculating the test statistics pairwise on 9,000 variables involves applying cor.test 40,495,500 times: g_1,g_2 \dots g_1,g_{9000} then g_2, g_3 \dots g_2,g_{9000} \dots g_{8999},g_{9000}.

And then there's all that review to be done.

# fake some data for two-gene case
fakes <- matrix(sample(19:45,10000,replace = TRUE),nrow = 100, ncol = 2)
#> Warning in matrix(sample(19:45, 10000, replace = TRUE), nrow = 100, ncol = 2):
#> data length differs from size of matrix: [10000 != 100 x 2]
head(fakes)
#>      [,1] [,2]
#> [1,]   40   42
#> [2,]   30   37
#> [3,]   21   29
#> [4,]   32   24
#> [5,]   24   23
#> [6,]   34   28
# an individual record won't work
cor.test(fakes[1,1],fakes[1,2])
#> Error in cor.test.default(fakes[1, 1], fakes[1, 2]): not enough finite observations
# test is designed to operate this way
cor.test(fakes[,1],fakes[,2])
#> 
#>  Pearson's product-moment correlation
#> 
#> data:  fakes[, 1] and fakes[, 2]
#> t = -0.54353, df = 98, p-value = 0.588
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#>  -0.2485638  0.1431372
#> sample estimates:
#>         cor 
#> -0.05482221
# or subsets
cor.test(fakes[1:1000,1],fakes[1:1000,2])
#> Error in fakes[1:1000, 1]: subscript out of bounds

# rerun with fakes of dim 100, 9000
fakes <- matrix(sample(19:45,100000,replace = TRUE),nrow = 100, ncol = 9000)
cor.test(fakes[,1],fakes[,2])
#> 
#>  Pearson's product-moment correlation
#> 
#> data:  fakes[, 1] and fakes[, 2]
#> t = -0.39724, df = 98, p-value = 0.6921
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#>  -0.2346647  0.1575643
#> sample estimates:
#>         cor 
#> -0.04009468
cor.test(fakes[,1],fakes[,3]) # etc
#> 
#>  Pearson's product-moment correlation
#> 
#> data:  fakes[, 1] and fakes[, 3]
#> t = 1.193, df = 98, p-value = 0.2357
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#>  -0.07861747  0.30880901
#> sample estimates:
#>       cor 
#> 0.1196482

# cor.test on all genes against each other gene
# pairwise 
pairs <- combn(9000,2)
dim(pairs)
#> [1]        2 40495500
pairs[1:2,1:3]
#>      [,1] [,2] [,3]
#> [1,]    1    1    1
#> [2,]    2    3    4

# it can be done, this way

mass_cor_test <- function(x) cor.test(fakes[,pairs[1,x]],fakes[,pairs[2,x]])

result <- microbenchmark::microbenchmark(
  l <- sapply(pairs[1:4],mass_cor_test), times = 10
)
#> Warning in microbenchmark::microbenchmark(l <- sapply(pairs[1:4],
#> mass_cor_test), : less accurate nanosecond times to avoid potential integer
#> overflows
l
#>             [,1]                                           
#> statistic   -0.3972365                                     
#> parameter   98                                             
#> p.value     0.6920569                                      
#> estimate    -0.04009468                                    
#> null.value  0                                              
#> alternative "two.sided"                                    
#> method      "Pearson's product-moment correlation"         
#> data.name   "fakes[, pairs[1, x]] and fakes[, pairs[2, x]]"
#> conf.int    numeric,2                                      
#>             [,2]                                           
#> statistic   1.193027                                       
#> parameter   98                                             
#> p.value     0.235739                                       
#> estimate    0.1196482                                      
#> null.value  0                                              
#> alternative "two.sided"                                    
#> method      "Pearson's product-moment correlation"         
#> data.name   "fakes[, pairs[1, x]] and fakes[, pairs[2, x]]"
#> conf.int    numeric,2                                      
#>             [,3]                                           
#> statistic   -0.3972365                                     
#> parameter   98                                             
#> p.value     0.6920569                                      
#> estimate    -0.04009468                                    
#> null.value  0                                              
#> alternative "two.sided"                                    
#> method      "Pearson's product-moment correlation"         
#> data.name   "fakes[, pairs[1, x]] and fakes[, pairs[2, x]]"
#> conf.int    numeric,2                                      
#>             [,4]                                           
#> statistic   -1.143017                                      
#> parameter   98                                             
#> p.value     0.255816                                       
#> estimate    -0.1147002                                     
#> null.value  0                                              
#> alternative "two.sided"                                    
#> method      "Pearson's product-moment correlation"         
#> data.name   "fakes[, pairs[1, x]] and fakes[, pairs[2, x]]"
#> conf.int    numeric,2
result
#> Unit: microseconds
#>                                    expr     min      lq     mean   median
#>  l <- sapply(pairs[1:4], mass_cor_test) 209.797 213.897 1236.396 216.1725
#>       uq      max neval
#>  231.158 10390.96    10

^{Created on 2023-02-07 with reprex v2.0.2}

nirgrahamuk · February 7, 2023, 6:08pm

I think your timing shows your machine could likely do them all in 3.5 hours ?

1/(1236.396/10^6) # mean of 808 iterations per second

((choose(9000,2)/4 #each mass_cor_test that was timed did 4 pairs
  )/808 # iters per second
)/3600 # seconds in an hour

#3.5 hours ?

technocrat · February 8, 2023, 7:56am

What's an order of magnitude or two among friends?

Bine · February 8, 2023, 9:40am

Thank you very much.

My table is in a bit different format.
In your case it looks like as if I am doing it for one individual:
E.g. Gene 1 predicted against gene 1 real [1,] against [,1]:

#> [,1] [,2] [,3]
#> [1,] 1 1 1
#> [2,] 2 3 4

In my case there is further complexity because I have predicted vs. real for 300 individuals:

Individual | predicted  gene x| real gene x | predicted gene y | real gene y | predicted gene z | real gene z
1                   | 45                 | 34        | 34                 |   23                             | 34                  | 24                              | 23                                 
2                   |  42                | 19        | 23                 |    12                            |27                   |67                               |65 
.....                   ......               .......    ....                        ...                                 ...                      ...                                   ...

Any idea?

Thank you!!

nirgrahamuk · February 8, 2023, 11:15am

it seems you only want to compare corresponding gene info, this is only 9000 comparisons, and will take seconds not hours.
Here I simulate some data, and show that if you have your data in the same structure; you can apply this approach to get all the numbers.

library(tidyverse)
num_genes <- 9000
num_entries_per_gene <- 100 
fakes <- matrix(sample(19:45,num_genes*num_entries_per_gene*2,replace = TRUE),
                nrow = num_entries_per_gene, ncol = num_genes*2) 
colnames(fakes) <- paste0(c("pred","real"),rep(seq_len(num_genes),each=2))
fake_df <- tibble(as.data.frame(fakes))
# look at the fake data to understand the layout 
fake_df

#obviously instead of splitting this fake data you would split your real data in a similar fashion.
(pred_frame <- select(fake_df,starts_with("pred")))
(real_frame <- select(fake_df,starts_with("real")))
# understand the layout of the split frames


# we arent comparing every predicted gene with every real gene measure
#pairs <- combn(num_genes,2)
# rather we are comparing every predicted gene with its corresponding real gene measure
(pairs <- seq_len(num_genes))

results <- map(pairs, \(p_){cor.test(pred_frame[[p_]],real_frame[[p_]])})

# got a list of 9000 correlation results in a couple of seconds.

# i.e for gene pair 179 see the inputs
pred_frame[,179]
real_frame[,179]
# see the result
results[[179]]

Bine · February 8, 2023, 3:09pm

Amazing. Yes, this is exactly what I want to do.
I am just trying it right now with your test data before applying to my data.

It gives me this error with your test data:

> results <- map(pairs, \(p_){cor.test(pred_frame[[p_]],real_frame[[p_]])})
Error: unexpected input in "results <- map(pairs, \"

Probably a typo somewhere?

jrkrideau · February 8, 2023, 3:23pm

I am not a statistician but this many multiple correlations strikes me as analogous to the the infamous multiple t-test. Multiple comparisons problem - Wikipedia

It can be done but should it?

nirgrahamuk · February 8, 2023, 3:24pm

The syntax works in R 4.2.2
If you want to try on an older R version you could do the more verbose version

results <- map(pairs, function(p_){cor.test(pred_frame[[p_]],real_frame[[p_]])})

Bine · February 8, 2023, 3:53pm

This is working now. Thanks so much!!

Bine · February 8, 2023, 3:56pm

Good point.. I guess i should use something like an adjusted p-value, not nominal p-value.

jrkrideau · February 8, 2023, 4:15pm

That is my thought. As I say I am not a statistician but something like one-way ANOVA make sense? Otherwise maybe a visit to your organization's stats consulting service?

Bine · February 9, 2023, 9:36am

I have one more question, if that's ok.
I wonder how I can see which gene I am comparing between predicted and real expression?
Currrently it is only giving the gene pair number, but it would be good to know the gene name.

Thank you.
Bine

nirgrahamuk · February 9, 2023, 9:40am

names(pred_frame)[179]
names(real_frame)[179]

Bine · February 9, 2023, 9:45am

Awesome, thank you so much. This was extremely helpful.

The final task I have to do is putting all results[[1]] in one table.
Put I doubt it is possible to fetch the p-value and the tau correlation value for each of the 9000 genes?

Kendall's rank correlation tau

data: pred_frame[[p_]] and real_frame[[p_]]
z = -0.4714, p-value = 0.6374
alternative hypothesis: true tau is not equal to 0
sample estimates:
tau
-0.1271283

Sorry, this is really my last question!!!

technocrat · February 9, 2023, 10:32am

# Assume that the data represent actual/predicted
actual <- c(44.4, 45.9, 41.9, 53.3, 44.7, 44.1, 50.7, 45.2, 60.1)
predicted <- c(2.6, 3.1, 2.5, 5.0, 3.6, 4.0, 5.2, 2.8, 3.8)
# inspect the return object of class htest
a_result <- cor.test(actual,predicted, method = "kendall", alternative = "greater")
a_result$p.value
#> [1] 0.05971947
a_result$estimate
#>       tau 
#> 0.4444444

nirgrahamuk · February 9, 2023, 11:55am

results <- map_dfr(pairs, function(p_){
  ct <- cor.test(pred_frame[[p_]],
                       real_frame[[p_]]) 
    data.frame(predname=names(pred_frame)[[p_]],
               realame=names(real_frame)[[p_]],
               p.value=ct$p.value,
               estimate=ct$estimate)})

Bine · February 9, 2023, 1:46pm

Thanks so much, that's exactly what I was looking for!!!
I thought I wouldnt get this solved, so happy! You are a star!

system · February 16, 2023, 1:47pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.