Grouped correlation for more than two variables

romanb333 · December 25, 2017, 6:21pm

Hi,

My name is Roman and I'm new to this group. I'm working with the mtcars dataset and was wondering how to get correlations for multiple variables by group using tidyverse functions. For example, I can get correlations for two variables like below, but I don't know how to do it for more than two or even all the variables in the dataset.

I'd like to be able to see correlations for any number of selected variables by group, e.g. the correlations among mpg, wt, and disp grouped by cyl.

mtcars %>% 
  group_by(cyl) %>% 
  summarise(cor = cor(mpg, wt))

#> # A tibble: 3 x 2
#>     cyl        cor
#>   <dbl>      <dbl>
#> 1     4 -0.7131848
#> 2     6 -0.6815498
#> 3     8 -0.6503580

Thanks all for your time and help.

jrlewi · December 25, 2017, 6:27pm

Hi Roman - can you provide the movies data.frame? Actually, just a subset of it would help. Please see https://www.tidyverse.org/help/ for tips on making reproducible examples that will help the community answer your question.

romanb333 · December 25, 2017, 8:01pm

Hi John - thanks for your reply. I updated my question to reference the mtcars data. The movie data I asked about is no longer relevant. Sorry for the confusion.

jrlewi · December 25, 2017, 8:51pm

How about this? It makes use of the map functions; for a good overview, see chapter 21.5 of the 'R for Data Science' book (http://r4ds.had.co.nz/). They are very similar to the apply/sapply/lapply family of functions, if you are familiar with those - essentially, they iteratively apply a function to each element of a list.
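As a minimal sketch of that idea (the list `x` is just a made-up stand-in for the per-group data frames used further down), map() and lapply() give the same result on a plain list, and map_dbl() is the variant that returns a numeric vector instead of a list:

```r
library(purrr)

# Hypothetical toy list; each element plays the role of one group
x <- list(a = 1:3, b = 4:6)

# map() calls mean() once per element and always returns a list
via_map <- map(x, mean)

# Base R equivalent
via_lapply <- lapply(x, mean)

# map_dbl() simplifies the result to a named numeric vector: a = 2, b = 5
means <- map_dbl(x, mean)
means
```

The code in this post follows the same pattern, with split() producing the list and cor() as the function applied to each element.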

suppressPackageStartupMessages(library(tidyverse))

# All correlations for each level of cyl
all <- mtcars %>%
  split(.$cyl) %>%
  map(cor)
#> Warning in .f(.x[[i]], ...): the standard deviation is zero
#> Warning in .f(.x[[i]], ...): the standard deviation is zero
#> Warning in .f(.x[[i]], ...): the standard deviation is zero
all[[1]][1:3, 1:3]
#>             mpg cyl       disp
#> mpg   1.0000000  NA -0.8052361
#> cyl          NA   1         NA
#> disp -0.8052361  NA  1.0000000

# cyl is constant within each split, so its standard deviation is zero and
# every correlation involving it is NA; drop it before calling cor()

all_2 <- mtcars %>%
  split(.$cyl) %>%
  map(select, -cyl) %>%
  map(cor)
#> Warning in .f(.x[[i]], ...): the standard deviation is zero
all_2[[1]][1:3, 1:3]
#>             mpg       disp         hp
#> mpg   1.0000000 -0.8052361 -0.5235034
#> disp -0.8052361  1.0000000  0.4346051
#> hp   -0.5235034  0.4346051  1.0000000

# similarly, for just the variables you care about ---
vars_keep <- names(mtcars)[c(1, 3, 4)]  # mpg, disp, hp
some <- mtcars %>%
  split(.$cyl) %>%
  map(select, all_of(vars_keep)) %>%  # all_of() avoids the external-vector warning in newer dplyr
  map(cor)
some[[1]]
#>             mpg       disp         hp
#> mpg   1.0000000 -0.8052361 -0.5235034
#> disp -0.8052361  1.0000000  0.4346051
#> hp   -0.5235034  0.4346051  1.0000000
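If only a handful of pairs are needed, the two-variable summarise() call from the original question can probably just be extended with one column per pair. This is a rough sketch rather than a general solution: the column names (mpg_wt and so on) are made up for illustration, and it does not scale well beyond a few variables, which is where the split/map approach above is more convenient.

```r
library(dplyr)

# One row per cyl group, one column per variable pair
# (mpg_wt, mpg_disp, wt_disp are hypothetical labels)
pairwise <- mtcars %>%
  group_by(cyl) %>%
  summarise(
    mpg_wt   = cor(mpg, wt),
    mpg_disp = cor(mpg, disp),
    wt_disp  = cor(wt, disp)
  )

pairwise
```

The result stays a regular tibble, so it can be filtered or joined like any other grouped summary.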

romanb333 · December 25, 2017, 9:37pm

Exactly what I was looking for, thanks so much for the quick reply.

A follow-up - how would you recommend converting the resulting list into something I could visualize, e.g. a correlation matrix plot or something in ggplot?

Thanks again.

jrlewi · December 25, 2017, 10:02pm

How about...

suppressPackageStartupMessages(library(tidyverse))

vars_keep <- names(mtcars)[c(1, 3, 4)]
some <- mtcars %>%
  split(.$cyl) %>%
  map(select, all_of(vars_keep)) %>%
  map(cor)

# melt() on a list of matrices returns Var1, Var2, value and L1 (the list element name)
df <- some %>% reshape2::melt() %>% rename(cyl = L1)
ggplot(df, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile() +
  facet_wrap(~cyl, nrow = 1)

Note - there is likely a way to do this with tidyr functions instead of reshape2::melt. I am still more familiar with reshape2 but I will think about a tidyr version (unless someone beats me to it). For now - it's time for pie!

John
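For reference, one possible tidyr version of the reshaping step is sketched below. It is not from the thread: it assumes tidyr 1.0.0 or later (for pivot_longer()), and the helper name cor_to_df is made up for illustration. The column names Var1, Var2 and value are kept only so the same ggplot call can be reused.

```r
suppressPackageStartupMessages(library(tidyverse))

vars_keep <- c("mpg", "disp", "hp")

# Hypothetical helper: correlation matrix of one group as a data frame,
# with the row names moved into a regular column
cor_to_df <- function(d) {
  m <- cor(d[, vars_keep])
  rownames_to_column(as.data.frame(m), var = "Var1")
}

df_tidy <- mtcars %>%
  split(.$cyl) %>%
  map(cor_to_df) %>%
  bind_rows(.id = "cyl") %>%  # list names (4, 6, 8) become the cyl column
  pivot_longer(all_of(vars_keep), names_to = "Var2", values_to = "value")

p <- ggplot(df_tidy, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile() +
  facet_wrap(~cyl, nrow = 1)
p
```

One difference from the melt() version: cyl, Var1 and Var2 come out as character columns rather than factors, so the axis order is alphabetical unless they are converted with factor().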

romanb333 · December 25, 2017, 10:14pm

This is awesome, you rock!!!