Correlation Matrix for all variables of lm model

Dear all,

I have been trying for a while now to find a way to create a correlation matrix of all the variables in my lm model. So far I have only found a way to do it with two variables. For my lmer model I got one automatically from summary(model), but for my lm model summary(model) does not show it.

Thank you for any idea!!

If you have a data frame (say it's called sample_df) in which all columns are numeric, then this will give you a correlation matrix:

cor(sample_df)

If there are non-numeric columns as well, select the columns you need (or drop the ones you don't) before passing the data frame to cor().
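One way to do that selection in base R is sketched below. It uses the built-in iris data as a stand-in for sample_df, which is an assumption made only for illustration; the names numeric_cols and cor_mat are also illustrative.

```r
# Sketch with iris, which has four numeric columns and one factor (Species).
numeric_cols <- sapply(iris, is.numeric)   # TRUE for each numeric column

# Keep only the numeric columns and compute their correlation matrix.
cor_mat <- cor(iris[, numeric_cols])
round(cor_mat, 2)
```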

Hope this helps.


head(mtcars)

lm1 <- lm(mpg ~ cyl + hp + gear,data=mtcars)

(inputnames <- intersect(names(mtcars),
                        names(lm1$coefficients)))

cor(subset(mtcars,select=inputnames))

Thank you, unfortunately it gives me this:

cor(subset(df,select=inputnames))
<0 x 0 matrix>

My dataframe has numeric and non-numeric variables. Is that a problem?

Thanks so much!
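A likely explanation for the empty result, assuming the predictors in the model are factors or character columns: lm names the coefficients of a categorical predictor by pasting the variable name and the level together, so none of the coefficient names match a column name and the intersection is empty. The sketch below uses iris as a stand-in for the poster's data, and the object names are illustrative.

```r
# Sketch using iris, where Species is a factor (a stand-in for the poster's data).
fit <- lm(Petal.Length ~ Species, data = iris)

# Coefficient names combine the variable name with the factor level.
names(coef(fit))

# None of them equals a column name, so the intersection is empty.
inputnames <- intersect(names(iris), names(coef(fit)))
inputnames

# The model's term labels give the predictor names regardless of their type.
predictors <- attr(terms(fit), "term.labels")
predictors
```

Selecting zero columns and passing the result to cor() is what produces the empty matrix.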

How would you think about the correlation of non-numeric variables?


I had non-numeric variables in my lmer model, and summary(model) showed me a correlation matrix that included the non-numeric variables as well. I am not sure how it is done, but it calculated it somehow.

(1) Correlation of non-numeric variables doesn't make any sense.
(2) When I ran the code you posted, it ran fine with no error messages and no non-numeric variables. It's likely the problem is in something else you're doing. Try rm(list=ls()) and then re-running.


(1) In my lmer model I got correlations for non-numeric variables, and they made sense, e.g.:

                 Batch 1    Under 50    Under 70    Sample Site 1
Batch 1             x         0.45        0.45          0.8
Under 50           0.3          x         0.67          0.4
......

These are all categorical, non-numeric variables. Non-numeric variables can also be correlated, so I don't understand what you mean.

(2) I didn't post my code. @nirgrahamuk gave an example with numeric variables.

Sorry, you're right about the code of course. I was looking at @nirgrahamuk's code.

Non-numeric variables can be correlated according to their order if an order is somehow specified, but the variables themselves cannot be correlated.

Perhaps post the code that gives the results you showed. Even better, a reprex.


The results are just an example of what summary(model) prints for my linear mixed model:

model <- lmer(Expression ~ Batch + AGE.Group + Sample.Site + Gender + (1|ID), data = df)

summary(model) then gives me a nice correlation matrix for all variables, as in my example above.

Now with a "basic" linear regression model

model <- lm(Expression ~ Batch + AGE.Group + Sample.Site + Gender, data = df)

summary(model) does not give me a correlation matrix, and I just can't find a way to get one since I have non-numeric, categorical variables.

I'm not sure whether you should do this, but you can take the integer representation of the factors and compute the correlations from that.

library(dplyr)

head(iris)

lm1 <- lm(Petal.Length ~ Sepal.Length + Sepal.Width + Petal.Width + Species, data = iris)

# names of the model's input variables (a factor appears once, not once per level)
(inputnames <- intersect(names(iris),
                         names(dummy.coef(lm1))))

# replace factors by their integer codes, then compute rank correlations
numiris <- mutate(subset(iris, select = inputnames), across(everything(), as.numeric))
cor(numiris, method = "spearman")
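To make clear what the as.numeric conversion does to a factor, here is a small sketch on iris$Species (the name species_codes is illustrative). Each level is replaced by its position in levels(), so the resulting numbers carry an order only if the level order is meaningful.

```r
# Integer codes follow the order of the factor levels.
levels(iris$Species)

species_codes <- as.numeric(iris$Species)

# Cross-tabulate the original levels against their codes.
table(iris$Species, species_codes)
```

These codes are a single column per factor, which is different from the per-level dummy variables a model builds internally.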

I believe that what you are seeing is the correlation matrix of the estimated coefficients, not the correlation of the variables. I could be wrong though.
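If that reading is right, the comparable matrix for an lm fit can be obtained from the coefficient covariance matrix. The sketch below uses iris as a stand-in for the poster's data; fit, coef_cor and summary_cor are illustrative names. It relies on vcov(), cov2cor() and the correlation argument of summary() for lm objects, all from base R's stats package.

```r
# Sketch: correlation of the estimated coefficients of an lm fit.
fit <- lm(Petal.Length ~ Sepal.Length + Sepal.Width + Petal.Width + Species,
          data = iris)

# Convert the coefficient covariance matrix into a correlation matrix.
coef_cor <- cov2cor(vcov(fit))
round(coef_cor, 2)

# summary() can compute the same matrix when asked to.
summary_cor <- summary(fit, correlation = TRUE)$correlation
```

Note that this matrix has one row per coefficient (including the intercept and one per non-reference factor level), not one per variable.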

Thank you @nirgrahamuk! I will try it later!

@startz I just checked again what it says under summary of my lmer model.
It says "Correlation of Fixed Effects". I thought that meant the correlation of the variables?!

I believe it is the correlation of the coefficients, though I'm not sure. Here's what the documentation says:

correlation
(logical) for vcov, indicates whether the correlation matrix as well as the variance-covariance matrix is desired; for summary.merMod, indicates whether the correlation matrix should be computed and stored along with the covariance; for print.summary.merMod, indicates whether the correlation matrix of the fixed-effects parameters should be printed. In the latter case, when NULL (the default), the correlation matrix is printed when it has been computed by summary(.), and when p <= 20.

Non-numeric variables don't have correlations. But the dummy variables used to estimate fixed effects are numeric. If a variable takes on only two values, you get one dummy variable for it, and the correlations of those dummy variables can be checked. If a variable takes on more than two values, you get more than one dummy for it. The correlations of the dummies can be calculated, but one would have to be very careful thinking about their meanings.
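As a rough sketch of that idea, model.matrix() returns the numeric design matrix that lm actually fits, with factors already expanded into dummy columns. The example uses iris as a stand-in for the poster's data, and the names fit, X and dummy_cor are illustrative.

```r
# Sketch: correlations among the columns of the design matrix.
fit <- lm(Petal.Length ~ Sepal.Length + Sepal.Width + Petal.Width + Species,
          data = iris)

# The design matrix holds the numeric predictors plus one dummy column
# per non-reference level of each factor.
X <- model.matrix(fit)
colnames(X)

# Drop the constant intercept column before computing correlations.
dummy_cor <- cor(X[, colnames(X) != "(Intercept)"])
round(dummy_cor, 2)
```

Dummies that come from the same factor are negatively correlated by construction (an observation cannot be in two levels at once), which is one reason the entries need careful interpretation.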


Thank you @startz
The point about dummy variables makes sense: I cannot calculate the correlation of the non-numeric variables themselves, but I can calculate it once I have dummy variables.
I am not quite sure what @nirgrahamuk's solution does. It somehow converts non-numeric variables to numeric ones, but these are not dummy variables?!
Maybe to solve my issue I need to check how to create dummy variables in R, and then I should be able to run this as posted by @nirgrahamuk:

lm1 <- lm(mpg ~ cyl + hp + gear,data=mtcars)

(inputnames <- intersect(names(mtcars),
                        names(lm1$coefficients)))

cor(subset(mtcars,select=inputnames))

Thank you all!

You might want to take a look at the correlationfunnel package. It works well with categorical variables.
Speed Up Exploratory Data Analysis (EDA) with the Correlation Funnel • correlationfunnel (business-science.github.io)


Thank you. Nice package.
I ran it, but unfortunately it does not seem to work for me, because I need to use my gene expression as the target, and that is not categorical. It looks like it wants one level of a categorical variable as the target.

I have found that the corrr package is a really nice package for correlation.