Correlation matrix

amare · October 7, 2020, 2:05pm

Hello,
I would like some help computing correlations between two data sets.
Both data sets have the same number of observations. I want to retain only correlation coefficients higher than 0.5 in the final matrix.
Here is a simulated data set and the code I tried for the correlation.
Thank you so much!

A <- data.frame(rnorm(10000), 
                rnorm(10000),
                rnorm(10000), 
                rnorm(10000),
                rnorm(10000))
row.names(A) <- paste0("G_", 1:10000)
colnames(A) <- paste0("M_", 1:5)

B <- data.frame(rnorm(10000), 
                rnorm(10000),
                rnorm(10000), 
                rnorm(10000),
                rnorm(10000))
row.names(B) <- paste0("g_", 1:10000)
colnames(B) <- paste0("I_", 1:5)
cor.ge.AB <- cor(t(A),t(B))

mattwarkentin · October 7, 2020, 2:57pm

I'm not totally sure I understand what you're wanting to do, but if you're trying to evaluate the correlation amongst both sets of predictors in A and B, and then keep only correlations larger than 0.5, this should do it:

A <- data.frame(rnorm(10000), 
                rnorm(10000),
                rnorm(10000), 
                rnorm(10000),
                rnorm(10000))
row.names(A) <- paste0("G_", 1:10000)
colnames(A) <- paste0("M_", 1:5)

B <- data.frame(rnorm(10000), 
                rnorm(10000),
                rnorm(10000), 
                rnorm(10000),
                rnorm(10000))
row.names(B) <- paste0("g_", 1:10000)
colnames(B) <- paste0("I_", 1:5)

library(dplyr)
library(corrr)

bind_cols(A, B) %>%
  correlate() %>% 
  stretch(remove.dups = TRUE)
#> 
#> Correlation method: 'pearson'
#> Missing treated using: 'pairwise.complete.obs'
#> # A tibble: 55 x 3
#>    x     y             r
#>    <chr> <chr>     <dbl>
#>  1 M_1   M_1   NA       
#>  2 M_1   M_2   -0.000700
#>  3 M_1   M_3   -0.00288 
#>  4 M_1   M_4    0.000876
#>  5 M_1   M_5   -0.0122  
#>  6 M_1   I_1    0.00616 
#>  7 M_1   I_2   -0.0131  
#>  8 M_1   I_3   -0.00361 
#>  9 M_1   I_4   -0.00454 
#> 10 M_1   I_5   -0.00332 
#> # … with 45 more rows

bind_cols(A, B) %>%
  correlate() %>% 
  stretch(remove.dups = TRUE) %>% 
  filter(r > 0.5)
#> 
#> Correlation method: 'pearson'
#> Missing treated using: 'pairwise.complete.obs'
#> # A tibble: 0 x 3
#> # … with 3 variables: x <chr>, y <chr>, r <dbl>

Created on 2020-10-07 by the reprex package (v0.3.0)
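
Note that `filter(r > 0.5)` keeps only strong positive correlations. If strong negative correlations matter too, the filter can be applied to the absolute value instead. The following is a minimal base-R sketch of that idea, restricted to A-versus-B pairs only. The seed, the smaller sample size, the planted negative relationship between `M_1` and `I_1`, and the names `cross`, `long`, and `strong` are all assumptions made for illustration and are not part of the original example.

```r
set.seed(1)  # assumption: seed added only so the sketch is reproducible

# Smaller simulated data with the same column naming as above
A <- as.data.frame(matrix(rnorm(100 * 5), ncol = 5))
colnames(A) <- paste0("M_", 1:5)
B <- as.data.frame(matrix(rnorm(100 * 5), ncol = 5))
colnames(B) <- paste0("I_", 1:5)

# Assumption: plant one strong negative relationship so the filter has a hit
B$I_1 <- -A$M_1 + rnorm(100, sd = 0.5)

# 5 x 5 cross-correlation matrix: rows are columns of A, columns are columns of B
cross <- cor(A, B)

# Long format, one row per (x, y) pair
long <- as.data.frame(as.table(cross), responseName = "r",
                      stringsAsFactors = FALSE)
names(long)[1:2] <- c("x", "y")

# Keep pairs whose correlation is strong in either direction
strong <- long[abs(long$r) > 0.5, ]
strong
```

With `r > 0.5` the planted `M_1` vs. `I_1` pair would be dropped, because its correlation is strongly negative.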

amare · October 7, 2020, 4:29pm

@mattwarkentin, thank you for your answer. However, what I want to do is compute correlations between the observations of data set A and the observations of data set B. Let's say the observations in data set A are genes of tissue A and the observations in data set B are genes of tissue B. So I want the correlation between the genes of tissue A and the genes of tissue B.
Best,
Amare

mattwarkentin · October 7, 2020, 4:51pm

My understanding is that you want to correlate column 1 of A and column 1 of B, and so forth. I think you'll find that the approach I showed before achieves exactly this, plus it provides all other pairwise correlations, which you can choose to ignore.

If you truly only want the column-wise correlations, see # Column-wise correlations below. However, note that the 5 column-wise values are included in the full table. For example, the pairwise correlation of M_1 vs. I_1 is row 6 in the table, and the first value in the vector of column-wise correlations.

A <- data.frame(rnorm(10000), 
                rnorm(10000),
                rnorm(10000), 
                rnorm(10000),
                rnorm(10000))
row.names(A) <- paste0("G_", 1:10000)
colnames(A) <- paste0("M_", 1:5)

B <- data.frame(rnorm(10000), 
                rnorm(10000),
                rnorm(10000), 
                rnorm(10000),
                rnorm(10000))
row.names(B) <- paste0("g_", 1:10000)
colnames(B) <- paste0("I_", 1:5)

library(dplyr)
library(tidyr)
library(corrr)

# Pair-wise correlations
bind_cols(A, B) %>%
  correlate() %>% 
  stretch(remove.dups = TRUE)
#> 
#> Correlation method: 'pearson'
#> Missing treated using: 'pairwise.complete.obs'
#> # A tibble: 55 x 3
#>    x     y             r
#>    <chr> <chr>     <dbl>
#>  1 M_1   M_1   NA       
#>  2 M_1   M_2    0.00328 
#>  3 M_1   M_3    0.00822 
#>  4 M_1   M_4   -0.00157 
#>  5 M_1   M_5   -0.0137  
#>  6 M_1   I_1    0.00345 
#>  7 M_1   I_2    0.00611 
#>  8 M_1   I_3   -0.00776 
#>  9 M_1   I_4    0.000908
#> 10 M_1   I_5   -0.0144  
#> # … with 45 more rows

# Column-wise correlations
purrr::map2_dbl(A, B, cor) %>% 
  setNames(nm = paste0(names(A), " vs. ", names(B)))
#>  M_1 vs. I_1  M_2 vs. I_2  M_3 vs. I_3  M_4 vs. I_4  M_5 vs. I_5 
#>  0.003447392 -0.011436215  0.003282041 -0.002950399  0.018201852

amare · October 7, 2020, 5:09pm

Okay, let me put it this way.
Let me transpose the data sets as follows, so that the genes (rows) are now columns.
a <- t(A)
b <- t(B)
Now I want the correlation between a and b.
Thanks!

mattwarkentin · October 7, 2020, 6:35pm

The fastest route to getting good help is to be very specific about the problem you're trying to solve. I am still not completely clear on what you want to achieve, but it seems like:

A and B are datasets that contain as many rows as genes and as many columns as samples.
The rows in A and B represent a common set of genes, but measured in different tissues.
The columns represent measurements in the same 5 samples in both A and B.
You want the correlation among the set of genes in A and B.

If these assumptions are true:

library(tidyverse)

A <- data.frame(rnorm(10), 
                rnorm(10),
                rnorm(10), 
                rnorm(10),
                rnorm(10))
row.names(A) <- paste0("G_", 1:10)
colnames(A) <- paste0("I_", 1:5)

B <- data.frame(rnorm(10), 
                rnorm(10),
                rnorm(10), 
                rnorm(10),
                rnorm(10))
row.names(B) <- paste0("g_", 1:10)
colnames(B) <- paste0("I_", 1:5)

At <- as_tibble(t(A))
Bt <- as_tibble(t(B))

bind_cols(At, Bt) %>% 
  corrr::correlate()
#> 
#> Correlation method: 'pearson'
#> Missing treated using: 'pairwise.complete.obs'
#> # A tibble: 20 x 21
#>    rowname      G_1      G_2     G_3     G_4     G_5     G_6     G_7     G_8
#>    <chr>      <dbl>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
#>  1 G_1     NA       -0.505   -0.144  -0.794   0.276  -0.140   0.814  -0.554 
#>  2 G_2     -0.505   NA        0.225   0.139  -0.695  -0.589  -0.0743  0.592 
#>  3 G_3     -0.144    0.225   NA       0.530  -0.236  -0.372  -0.187   0.228 
#>  4 G_4     -0.794    0.139    0.530  NA      -0.216   0.338  -0.747   0.530 
#>  5 G_5      0.276   -0.695   -0.236  -0.216  NA       0.0704 -0.302  -0.921 
#>  6 G_6     -0.140   -0.589   -0.372   0.338   0.0704 NA      -0.142   0.204 
#>  7 G_7      0.814   -0.0743  -0.187  -0.747  -0.302  -0.142  NA      -0.0194
#>  8 G_8     -0.554    0.592    0.228   0.530  -0.921   0.204  -0.0194 NA     
#>  9 G_9      0.711   -0.0301   0.396  -0.551   0.190  -0.731   0.506  -0.509 
#> 10 G_10    -0.00838 -0.109   -0.376   0.0681 -0.541   0.724   0.368   0.613 
#> 11 g_1     -0.607    0.104    0.103   0.743  -0.558   0.682  -0.287   0.820 
#> 12 g_2      0.660   -0.753    0.364  -0.0812  0.364   0.199   0.316  -0.405 
#> 13 g_3      0.287   -0.616   -0.208  -0.263   0.992  -0.0516 -0.288  -0.946 
#> 14 g_4      0.656   -0.598   -0.167  -0.288  -0.106   0.544   0.695   0.0363
#> 15 g_5     -0.0578   0.00274 -0.630  -0.420   0.595  -0.230  -0.274  -0.600 
#> 16 g_6     -0.553    0.157   -0.0420  0.260   0.524  -0.255  -0.822  -0.361 
#> 17 g_7     -0.887    0.694   -0.175   0.447  -0.411   0.0124 -0.552   0.576 
#> 18 g_8      0.158    0.0808  -0.931  -0.664   0.0582  0.0485  0.309  -0.159 
#> 19 g_9     -0.408    0.382    0.716   0.437   0.142  -0.622  -0.595  -0.108 
#> 20 g_10    -0.498   -0.118   -0.718   0.210   0.0735  0.692  -0.395   0.190 
#> # … with 12 more variables: G_9 <dbl>, G_10 <dbl>, g_1 <dbl>, g_2 <dbl>,
#> #   g_3 <dbl>, g_4 <dbl>, g_5 <dbl>, g_6 <dbl>, g_7 <dbl>, g_8 <dbl>,
#> #   g_9 <dbl>, g_10 <dbl>

# Same as bottom-left quadrant of the above result
cor(At, Bt) %>%
  t() %>% 
  as_tibble()
#> # A tibble: 10 x 10
#>        G_1      G_2     G_3     G_4     G_5     G_6    G_7     G_8     G_9
#>      <dbl>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>  <dbl>   <dbl>   <dbl>
#>  1 -0.607   0.104    0.103   0.743  -0.558   0.682  -0.287  0.820  -0.791 
#>  2  0.660  -0.753    0.364  -0.0812  0.364   0.199   0.316 -0.405   0.486 
#>  3  0.287  -0.616   -0.208  -0.263   0.992  -0.0516 -0.288 -0.946   0.270 
#>  4  0.656  -0.598   -0.167  -0.288  -0.106   0.544   0.695  0.0363  0.0713
#>  5 -0.0578  0.00274 -0.630  -0.420   0.595  -0.230  -0.274 -0.600  -0.0443
#>  6 -0.553   0.157   -0.0420  0.260   0.524  -0.255  -0.822 -0.361  -0.173 
#>  7 -0.887   0.694   -0.175   0.447  -0.411   0.0124 -0.552  0.576  -0.690 
#>  8  0.158   0.0808  -0.931  -0.664   0.0582  0.0485  0.309 -0.159  -0.216 
#>  9 -0.408   0.382    0.716   0.437   0.142  -0.622  -0.595 -0.108   0.312 
#> 10 -0.498  -0.118   -0.718   0.210   0.0735  0.692  -0.395  0.190  -0.911 
#> # … with 1 more variable: G_10 <dbl>
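
The same cross block can also be obtained without the tidyverse, which keeps the gene names as row and column names (the `as_tibble()` call above drops them). A minimal base-R sketch, assuming the same 10-gene, 5-sample layout; the seed and the name `cross` are illustrative only.

```r
set.seed(123)  # assumption: seed added only so the sketch is reproducible

# Genes in rows, samples in columns, as in the example above
A <- as.data.frame(matrix(rnorm(10 * 5), nrow = 10,
                          dimnames = list(paste0("G_", 1:10), paste0("I_", 1:5))))
B <- as.data.frame(matrix(rnorm(10 * 5), nrow = 10,
                          dimnames = list(paste0("g_", 1:10), paste0("I_", 1:5))))

# Transposing puts genes in columns, so cor() correlates gene with gene.
# Rows of the result are genes of A, columns are genes of B.
cross <- cor(t(A), t(B))
dim(cross)

# A single entry is the correlation of one gene from A with one gene from B
# across the 5 samples
cross["G_1", "g_2"]
cor(unlist(A["G_1", ]), unlist(B["g_2", ]))
```

Here `cross` is oriented with the genes of A in rows, so it is the transpose of the tibble printed above.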

amare · October 7, 2020, 7:11pm

@mattwarkentin, yes, exactly! Thank you so much! Can you show how to filter for values above some threshold? Let's say I want to keep only those genes that are highly correlated (absolute correlation greater than or equal to 0.6).
Best,
Amare
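
One way this threshold filter might be done, sketched in base R under the assumptions of the small simulated example above (10 genes, 5 samples). The seed and the names `cross`, `hits`, and `strong` are illustrative and do not come from the replies in this thread.

```r
set.seed(2020)  # assumption: seed added only so the sketch is reproducible

# Genes in rows, samples in columns
A <- as.data.frame(matrix(rnorm(10 * 5), nrow = 10,
                          dimnames = list(paste0("G_", 1:10), paste0("I_", 1:5))))
B <- as.data.frame(matrix(rnorm(10 * 5), nrow = 10,
                          dimnames = list(paste0("g_", 1:10), paste0("I_", 1:5))))

# Gene-by-gene cross-correlation: rows are genes of A, columns are genes of B
cross <- cor(t(A), t(B))

# Row/column positions of all entries with |r| >= 0.6
hits <- which(abs(cross) >= 0.6, arr.ind = TRUE)

# One row per retained gene pair, strongest correlations first
strong <- data.frame(
  gene_A = rownames(cross)[hits[, "row"]],
  gene_B = colnames(cross)[hits[, "col"]],
  r      = cross[hits],
  row.names = NULL
)
strong <- strong[order(-abs(strong$r)), ]
strong
```

With only 5 samples per gene, large correlations arise by chance even in random data, so a threshold such as 0.6 is best read alongside the sample size.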
