Hello,
I would like some help with computing correlations between two data sets.
Both data sets have the same number of observations. I want to retain only correlation coefficients higher than 0.5 in the final matrix.
Here I created a simulated data set and the code I tried for the correlation.
Thank you so much!
I'm not totally sure I understand what you're wanting to do, but if you're trying to evaluate the correlation amongst both sets of predictors in A and B, and then keep only correlations larger than 0.5, this should do it:
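A minimal base-R sketch of that idea follows. It is only an illustration under stated assumptions: the data are simulated with 100 observations instead of 10000, `I_1` is deliberately built from `M_1` so that at least one pair passes the threshold, and the names `pairs` and `strong` are made up for this example.

```r
# Small simulated data (assumption: 100 observations, 5 variables per set)
set.seed(1)
A <- as.data.frame(matrix(rnorm(500), ncol = 5,
                          dimnames = list(NULL, paste0("M_", 1:5))))
B <- as.data.frame(matrix(rnorm(500), ncol = 5,
                          dimnames = list(NULL, paste0("I_", 1:5))))

# Assumption for illustration: make I_1 depend on M_1 so one pair is correlated
B$I_1 <- A$M_1 + rnorm(100, sd = 0.5)

# Correlation matrix over all 10 columns of A and B
r <- cor(cbind(A, B))

# Long format with one row per unique pair (upper triangle only)
idx <- which(upper.tri(r), arr.ind = TRUE)
pairs <- data.frame(x = rownames(r)[idx[, "row"]],
                    y = colnames(r)[idx[, "col"]],
                    r = r[idx])

# Keep only the pairs whose correlation is larger than 0.5
strong <- pairs[pairs$r > 0.5, ]
strong
```

To keep strong negative correlations as well, the filter could use `abs(pairs$r) > 0.5` instead.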
@mattwarkentin, thank you for your answer. However, what I want to do is compute correlations between the observations of data set A and the observations of data set B. Let's say the observations in data set A are genes of tissue A and the observations in data set B are genes of tissue B. So I want the correlation between the genes of tissue A and the genes of tissue B.
Best,
Amare
My understanding is that you want to correlate column 1 of A with column 1 of B, and so forth. I think you'll find that the approach I showed before achieves exactly this, plus it provides all other pairwise correlations, which you can choose to ignore.
If you truly only want the column-wise correlations, see # Column-wise correlations below. However, note that the 5 column-wise values are included in the full table. For example, the pairwise correlation of M_1 vs. I_1 is row 6 in the table, and the first value in the vector of column-wise correlations.
A <- data.frame(rnorm(10000),
                rnorm(10000),
                rnorm(10000),
                rnorm(10000),
                rnorm(10000))
row.names(A) <- paste0("G_", 1:10000)
colnames(A) <- paste0("M_", 1:5)
B <- data.frame(rnorm(10000),
                rnorm(10000),
                rnorm(10000),
                rnorm(10000),
                rnorm(10000))
row.names(B) <- paste0("g_", 1:10000)
colnames(B) <- paste0("I_", 1:5)
library(dplyr)
library(tidyr)
library(corrr)
# Pair-wise correlations
bind_cols(A, B) %>%
  correlate() %>%
  stretch(remove.dups = TRUE)
#>
#> Correlation method: 'pearson'
#> Missing treated using: 'pairwise.complete.obs'
#> # A tibble: 55 x 3
#> x y r
#> <chr> <chr> <dbl>
#> 1 M_1 M_1 NA
#> 2 M_1 M_2 0.00328
#> 3 M_1 M_3 0.00822
#> 4 M_1 M_4 -0.00157
#> 5 M_1 M_5 -0.0137
#> 6 M_1 I_1 0.00345
#> 7 M_1 I_2 0.00611
#> 8 M_1 I_3 -0.00776
#> 9 M_1 I_4 0.000908
#> 10 M_1 I_5 -0.0144
#> # … with 45 more rows
# Column-wise correlations
purrr::map2_dbl(A, B, cor) %>%
  setNames(nm = paste0(names(A), " vs. ", names(B)))
#> M_1 vs. I_1 M_2 vs. I_2 M_3 vs. I_3 M_4 vs. I_4 M_5 vs. I_5
#> 0.003447392 -0.011436215 0.003282041 -0.002950399 0.018201852
Okay, let me put it this way.
Let me transpose the data sets as follows, so that the genes (rows) become columns.
a <- t(A)
b <- t(B)
Now I want the correlation between a and b.
Thanks!
The fastest route to getting good help is to be very specific about the problem you're trying to solve. I am still not completely clear on what you want to achieve, but it seems like:
A and B are datasets that contain as many rows as genes and as many columns as samples.
The rows in A and B represent a common set of genes, but measured in different tissues.
The columns represent measurements in the same 5 samples in both A and B.
You want the correlation among the set of genes in A and B.
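If that reading is right, a rough base-R sketch could look like the following. It assumes a reduced set of 50 genes (the name `n_genes` and the result name `gene_cor` are made up here) so the output stays small; with all 10000 genes the result would be a 10000 x 10000 matrix, roughly 800 MB of doubles.

```r
# Simulated data: genes in rows, 5 samples in columns
set.seed(1)
n_genes <- 50  # assumption: reduced from 10000 to keep the result small
A <- matrix(rnorm(n_genes * 5), ncol = 5,
            dimnames = list(paste0("G_", 1:n_genes), paste0("M_", 1:5)))
B <- matrix(rnorm(n_genes * 5), ncol = 5,
            dimnames = list(paste0("g_", 1:n_genes), paste0("I_", 1:5)))

# cor() correlates columns, so transpose to put the genes in columns.
# Rows of the result are genes of A, columns are genes of B.
gene_cor <- cor(t(A), t(B))

dim(gene_cor)
gene_cor[1:3, 1:3]
```

Each entry is computed from only 5 paired values, so large correlations can easily appear by chance in data like this.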
@mattwarkentin, yes, exactly! Thank you so much! Can you show how to filter for values above some threshold? Let's say I want to keep only the highly correlated genes (absolute correlation greater than or equal to 0.6).
Best,
Amare
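One way to filter by such a threshold might look like the base-R sketch below. It is an illustration only: the data are simulated with 50 genes instead of 10000, and the names `gene_cor`, `threshold`, `high` and `filtered` are made up for this example.

```r
# Simulated data: genes in rows, 5 samples in columns
set.seed(1)
n_genes <- 50  # assumption: reduced from 10000 to keep the result small
A <- matrix(rnorm(n_genes * 5), ncol = 5,
            dimnames = list(paste0("G_", 1:n_genes), paste0("M_", 1:5)))
B <- matrix(rnorm(n_genes * 5), ncol = 5,
            dimnames = list(paste0("g_", 1:n_genes), paste0("I_", 1:5)))

# Gene-by-gene correlations: rows are genes of A, columns are genes of B
gene_cor <- cor(t(A), t(B))

# Positions where the absolute correlation reaches the threshold
threshold <- 0.6
keep <- which(abs(gene_cor) >= threshold, arr.ind = TRUE)

# Long table of the retained gene pairs, strongest first
high <- data.frame(gene_A = rownames(gene_cor)[keep[, "row"]],
                   gene_B = colnames(gene_cor)[keep[, "col"]],
                   r = gene_cor[keep])
high <- high[order(-abs(high$r)), ]
head(high)

# Alternative: keep the matrix shape and blank out everything below the threshold
filtered <- gene_cor
filtered[abs(filtered) < threshold] <- NA
```

The long table is usually easier to inspect, while the masked matrix keeps the gene-by-gene layout for anything that expects a matrix.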