Spearman correlation for a huge gene dataset

Hi guys,
I'm trying to speed up the computation of Spearman correlations between the 15,000 genes in my dataset, but I don't understand whether assigning cor(brca_mutations, method="spearman") to a variable gives me the correct result. This is my dataset:

The columns of the dataset are the patient codes, so how can I create a matrix of Spearman correlations between all genes? Thanks so much.

# spearman cor for iris for only Petal.Width and Petal.Length
cor(iris$Petal.Width,iris$Petal.Length,method="spearman")

library(dplyr)   # provides select_if()
library(slider)  # provides slide_dfr()

# find all combinations of the numeric variables
names_to_do <- names(select_if(iris, is.numeric))

(combinations_to_do <- t(combn(x = names_to_do, m = 2)))

# having found all combinations, calculate and aggregate their correlations
slide_dfr(combinations_to_do,
      ~data.frame(var1 = .[1],
                  var2 = .[2],
                  cor_spear = cor(iris[[.[1]]],iris[[.[2]]],method="spearman")))

@nirgrahamuk, thanks for replying to this post! But I have a question for you. I changed the last piece of code like this:

slide_dfr(combinations_to_do,
~data.frame(var1 = .[1],
var2 = .[2],
cor_spear = cor(brca_expressions[[.[1]]], brca_expressions[[.[2]]], method="spearman")))

I got this error:

Error in cor(brca_expressions[[.[1]]], brca_expressions[[.[2]]], method = "spearman") :
supply both 'x' and 'y' or a matrix-like 'x'

I don't know how to fix it. Remember that my genes are in the rows of my data frame brca!
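A minimal sketch of why this error can appear, assuming brca_expressions is a data frame with gene names as row names (the toy data and the name toy_expr below are made up for illustration). On a data frame, `[[` looks up a column, so a gene name that only exists as a row name returns NULL, and cor() then receives no usable x or y:

```r
# Toy stand-in for brca_expressions (hypothetical data):
# genes are row names, patients are columns.
set.seed(1)
toy_expr <- data.frame(
  patient_1 = rnorm(3),
  patient_2 = rnorm(3),
  patient_3 = rnorm(3),
  patient_4 = rnorm(3),
  row.names = c("A1BG", "NAT2", "TP53")
)

# `[[` selects a column by name; "A1BG" is a row name, so this is NULL.
is.null(toy_expr[["A1BG"]])

# Passing two NULLs to cor() triggers the error from the post.
err <- tryCatch(
  cor(toy_expr[["A1BG"]], toy_expr[["NAT2"]], method = "spearman"),
  error = function(e) conditionMessage(e)
)
err

# Selecting rows instead (and flattening them to plain vectors) works.
pair_cor <- cor(unlist(toy_expr["A1BG", ]),
                unlist(toy_expr["NAT2", ]),
                method = "spearman")
pair_cor
```

Row-by-row lookups like the last call work, but they are slow for millions of pairs.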

Hi @Jeremy98-alt,

Run this code and inspect the output:

m = matrix(data = rnorm(200), nrow = 20, ncol = 10,
           dimnames = list(paste0("gene_", 1:20),
                           paste0("patient_", 1:10)))
cor(x = m, method = "spearman")
cor(x = t(m), method = "spearman")

Hope it helps :slightly_smiling_face:

I want to create a correlation matrix between the genes' expression values across patients. In this case I can't do exactly that.

But thanks for replying to this post!

You were being advised to transpose your data with the t() function so that its structure is like the one in my example.

Run the code, inspect the output and you'll see I gave you exactly that :+1:
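To make the difference between the two calls concrete, here is a small sketch using the same kind of toy matrix (the names patient_cor and gene_cor are just labels chosen for this illustration). cor() correlates columns, so with genes in rows the untransposed call compares patients, and the transposed call compares genes:

```r
set.seed(42)
m <- matrix(data = rnorm(200), nrow = 20, ncol = 10,
            dimnames = list(paste0("gene_", 1:20),
                            paste0("patient_", 1:10)))

# cor() works column-wise: patients are the columns here.
patient_cor <- cor(x = m, method = "spearman")
dim(patient_cor)   # 10 x 10, patient vs patient

# After transposing, genes are the columns.
gene_cor <- cor(x = t(m), method = "spearman")
dim(gene_cor)      # 20 x 20, gene vs gene

# One entry of the gene matrix equals the pairwise call on two rows of m.
gene_cor["gene_1", "gene_2"]
cor(m["gene_1", ], m["gene_2", ], method = "spearman")
```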

I don't understand this instruction:
cor_spear = cor(iris[[.[1]]],iris[[.[2]]]

In my case I have brca_expressions, and I need to pick up every gene that appears in each row of combinations_to_do. For example:

if a row contains the genes A1B3 and X3423, I want to pick the corresponding rows in brca_expressions and compute the correlation between those two sets of values.

I don't understand what you don't understand...
Could you provide some sample data?

With respect, I can't parse this.

OK, sorry for my bad English... @nirgrahamuk

So, with your code I created a matrix of combinations between genes:

Then, for each row of the gene combinations matrix, I'm trying to pick:

the data corresponding to that gene's row in brca_expressions:

In this case I don't understand how to do that with your code. I need to pick the rows A1BG and NAT2 and compute the Spearman correlation.

You are using:

cor_spear = cor(iris[[.[1]]],iris[[.[2]]],method="spearman")))

but I need to replace iris[[.[1]]] and iris[[.[2]]] with those rows of brca_expressions!

What about when you transpose your dataset with the t() function?
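A hedged sketch of how the transposed approach can still produce the per-pair table, assuming genes are row names of a data frame (toy_expr and its values are invented for illustration). The full gene-by-gene matrix is computed once, and each row of combinations_to_do is then looked up in it by name:

```r
# Toy stand-in for brca_expressions (hypothetical data).
set.seed(7)
genes    <- c("A1BG", "NAT2", "TP53", "BRCA1")
patients <- paste0("patient_", 1:10)
toy_expr <- as.data.frame(
  matrix(rnorm(40), nrow = 4, dimnames = list(genes, patients))
)

# One call gives every gene-vs-gene Spearman correlation.
gene_cor <- cor(t(as.matrix(toy_expr)), method = "spearman")

# All unordered gene pairs, one pair per row.
combinations_to_do <- t(combn(rownames(toy_expr), 2))

# A two-column character matrix indexes gene_cor by (row name, column name),
# so no per-pair cor() call is needed.
pair_table <- data.frame(
  var1      = combinations_to_do[, 1],
  var2      = combinations_to_do[, 2],
  cor_spear = gene_cor[combinations_to_do]
)
pair_table
```

The pair A1BG and NAT2 is simply gene_cor["A1BG", "NAT2"].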

The combinations_to_do matrix I create contains 400 million rows.

That is surprising; I expected only 112 million.
(Google calculator: 15000 choose 2.)

You'd need to find all pairs among 28,000 possibilities to get close to 400 million, I would have thought... Very puzzling.

Yes @nirgrahamuk, sorry, haha. brca_expressions has 400 million combinations because that dataset has around 20,000 genes, whereas brca_mutations has around 15,000 genes. Anyway, how should I change your code to fit my request?
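The pair counts mentioned in the thread can be checked directly in R with choose(). This is only arithmetic on the gene counts quoted in the posts. The last line is a guess at where 400 million could come from (all ordered pairs of 20,000 genes, including each gene with itself), not something the thread confirms:

```r
# Unordered pairs of distinct genes: n choose 2.
pairs_15k <- choose(15000, 2)   # 112,492,500
pairs_20k <- choose(20000, 2)   # 199,990,000
pairs_28k <- choose(28000, 2)   # 391,986,000

# Assumption: 400 million may instead be every ordered pair, self-pairs included.
ordered_20k <- 20000^2          # 400,000,000

format(c(pairs_15k, pairs_20k, pairs_28k, ordered_20k), big.mark = ",")
```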

I don't understand your request.
Can you provide some 'small' example data to demonstrate your issue?
You can take your actual data and use dplyr verbs like filter(), select(), and slice() to reduce it in various ways and construct a transferable example dataset, which you can then share with dput(). Then phrase your question in relation to that, please.

My main problem is that, out of 400 million rows, I can process 100 thousand rows in one hour, haha, so you can see that is far too slow. I want to understand whether I can use your code to speed up the process!

For each row of the combinations matrix that I created with your code, I want to pick the corresponding gene rows in brca_expressions.

Look at the images!

Seems unlikely. Maybe there's a little inefficiency here from using data.frames when matrices would do, etc., but I'm skeptical that we can produce R code that's 10 times faster. And 10 times faster would be what, 16 days of runtime, compared to 166...
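The runtime figures above follow from the numbers quoted earlier in the thread (400 million pairs at about 100 thousand pairs per hour). A quick check, with variable names chosen only for this illustration:

```r
n_pairs        <- 400e6   # pairs to process, as stated in the thread
pairs_per_hour <- 100e3   # observed throughput, as stated in the thread

hours_needed <- n_pairs / pairs_per_hour   # 4000 hours
days_needed  <- hours_needed / 24          # about 166.7 days
days_if_10x  <- days_needed / 10           # about 16.7 days

c(hours = hours_needed, days = days_needed, days_10x_faster = days_if_10x)
```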

Oh god... I would need a data center, no? Like Azure or Amazon AWS.

cor(x = t(m), method = "spearman") is the solution :sweat_smile: Thanks a lot.
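One practical point when applying this to the full data: the result is a dense genes-by-genes matrix of doubles. A rough size estimate, assuming about 20,000 genes as mentioned for brca_expressions and 8 bytes per value (it ignores any temporary copies R makes while computing):

```r
n_genes <- 20000                    # approximate gene count from the thread

bytes_needed <- n_genes^2 * 8       # one double per gene pair
gib_needed   <- bytes_needed / 1024^3

gib_needed                          # roughly 3 GiB for the result alone

# Sanity check of the 8-bytes-per-value assumption on a small matrix:
small <- matrix(0, nrow = 200, ncol = 200)
as.numeric(object.size(small))      # a little over 200 * 200 * 8 bytes
```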

Which I literally wrote 1:1 in post 4 in this thread :+1: