My name is Roman and I'm new to this group. I'm working with the mtcars dataset and was wondering how to get correlations for multiple variables by group using tidyverse functions. For example, I can get correlations for two variables like below, but I don't know how to do it for more than two or even all the variables in the dataset.
I'd like to be able to see correlations for any number of selected variables by group, e.g. the correlations between mpg, wt, and disp grouped by cyl.
Hi Roman - can you provide the movies data.frame? Actually, just a subset of it would help. Please see https://www.tidyverse.org/help/ for tips on making reproducible examples that will help the community answer your question.
Hi John - thanks for your reply. I updated my question to reference the mtcars data. The movie data I asked about is no longer relevant. Sorry for the confusion.
How about this? It makes use of the map functions; for a good overview see chapter 21.5 of the 'R for Data Science' book (http://r4ds.had.co.nz/). They are very similar to the apply/sapply/lapply family of functions, if you are familiar with those: essentially, they iteratively apply a function to each element of a list.
suppressPackageStartupMessages(library(tidyverse))
# All correlations for each level of cyl
all <- mtcars %>%
  split(.$cyl) %>%
  map(cor)
#> Warning in .f(.x[[i]], ...): the standard deviation is zero
#> Warning in .f(.x[[i]], ...): the standard deviation is zero
#> Warning in .f(.x[[i]], ...): the standard deviation is zero
all[[1]][1:3, 1:3]
#>             mpg cyl       disp
#> mpg   1.0000000  NA -0.8052361
#> cyl          NA   1         NA
#> disp -0.8052361  NA  1.0000000
# dropping cyl to avoid cor(cyl, other_variable) within each split
all_2 <- mtcars %>%
  split(.$cyl) %>%
  map(select, -c(cyl)) %>%
  map(cor)
#> Warning in .f(.x[[i]], ...): the standard deviation is zero
all_2[[1]][1:3, 1:3]
#>             mpg       disp         hp
#> mpg   1.0000000 -0.8052361 -0.5235034
#> disp -0.8052361  1.0000000  0.4346051
#> hp   -0.5235034  0.4346051  1.0000000
# similarly for just vars you care about calculating the correlation for ---
vars_keep <- names(mtcars)[c(1, 3, 4)]
some <- mtcars %>%
  split(.$cyl) %>%
  # all_of() makes it explicit that vars_keep is a character vector of column names
  map(select, all_of(vars_keep)) %>%
  map(cor)
some[[1]]
#>             mpg       disp         hp
#> mpg   1.0000000 -0.8052361 -0.5235034
#> disp -0.8052361  1.0000000  0.4346051
#> hp   -0.5235034  0.4346051  1.0000000
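The same pattern should work when the variables are picked by name rather than by position, for example the mpg, wt, and disp combination from the original question. A minimal sketch, assuming a tidyverse version where all_of() is available; vars_named and by_name are just illustrative names:

```r
suppressPackageStartupMessages(library(tidyverse))

# Columns to correlate, chosen by name (illustrative choice from the question)
vars_named <- c("mpg", "wt", "disp")

# Split by cyl, keep only the named columns, then compute one
# correlation matrix per group
by_name <- mtcars %>%
  split(.$cyl) %>%
  map(~ cor(select(.x, all_of(vars_named))))

# Correlation matrix for the 4-cylinder cars
by_name[["4"]]
```

The result is a named list with one 3 x 3 matrix per cyl level, so by_name[["6"]] and by_name[["8"]] give the other two groups.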
Exactly what I was looking for, thanks so much for the quick reply.
A follow up - how would you recommend going about converting the resulting lists into something I could put into a visual? i.e. either a correlation matrix or something in ggplot.
Note - there is likely a way to do this with tidyr functions instead of reshape2::melt. I am still more familiar with reshape2 but I will think about a tidyr version (unless someone beats me to it). For now - it's time for pie!
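For what it's worth, here is one way a tidyr version might look. This is only a sketch, not necessarily the best approach: the names cor_list, cor_long, and matrix_to_long are made up for illustration, and it assumes tidyr 1.0 or later for pivot_longer(). It turns each correlation matrix into a long data frame, stacks the groups, and draws a faceted heatmap with ggplot2.

```r
suppressPackageStartupMessages(library(tidyverse))

vars_keep <- c("mpg", "disp", "hp")

# One correlation matrix per cyl level, same idea as the earlier answer
cor_list <- mtcars %>%
  split(.$cyl) %>%
  map(~ cor(select(.x, all_of(vars_keep))))

# Hypothetical helper: reshape a correlation matrix to long form,
# one row per (var1, var2) pair with the correlation in column r
matrix_to_long <- function(m) {
  m %>%
    as.data.frame() %>%
    rownames_to_column("var1") %>%
    pivot_longer(-var1, names_to = "var2", values_to = "r")
}

# Stack the groups; the list names (cyl levels) go into a "cyl" column
cor_long <- cor_list %>%
  map(matrix_to_long) %>%
  bind_rows(.id = "cyl")

# Heatmap of the correlations, one panel per cyl level
p <- ggplot(cor_long, aes(var1, var2, fill = r)) +
  geom_tile() +
  geom_text(aes(label = round(r, 2))) +
  scale_fill_gradient2(limits = c(-1, 1)) +
  facet_wrap(~ cyl)

print(p)
```

Keeping everything in one long data frame means the same cor_long object can feed other geoms or a summary table without reshaping again.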