To explore the relationship between multiple categorical variables (factors) and a metric variable, I would like to calculate various descriptive statistics (mean, median, IQR etc.) of the metric variable for each factor level of each variable (e.g. multiple factors of an experiment vs. the outcome variable).
My question: what is the best (most straightforward, most elegant) way to do this in (tidyverse) R? Or is there a well-maintained package for this?
Here are two approaches with a reprex (I use the diamonds data, with cut and color as categorical variables and carat as the metric variable):
My first approach uses group_by and summarize, copied and pasted once per factor, and then combines the results with bind_rows:
library(tidyverse)

summary_cut <- diamonds %>%
  group_by(cut) %>%
  summarize(
    name = "cut",
    carat_mean = mean(carat),
    carat_median = median(carat)) %>%
  select(name, value = cut, everything()) %>%
  mutate(value = as.character(value))

summary_color <- diamonds %>%
  group_by(color) %>%
  summarize(
    name = "color",
    carat_mean = mean(carat),
    carat_median = median(carat)) %>%
  select(name, value = color, everything()) %>%
  mutate(value = as.character(value))

bind_rows(summary_cut, summary_color)
#> # A tibble: 12 x 4
#>    name  value     carat_mean carat_median
#>    <chr> <chr>          <dbl>        <dbl>
#>  1 cut   Fair           1.05          1
#>  2 cut   Good           0.849         0.82
#>  3 cut   Very Good      0.806         0.71
#>  4 cut   Premium        0.892         0.86
#>  5 cut   Ideal          0.703         0.54
#>  6 color D              0.658         0.53
#>  7 color E              0.658         0.53
#>  8 color F              0.737         0.7
#>  9 color G              0.771         0.7
#> 10 color H              0.912         0.9
#> 11 color I              1.03          1
#> 12 color J              1.16          1.11
Created on 2020-04-13 by the reprex package (v0.3.0)
My second, more flexible approach avoids the copy + paste: it uses map with a formula that calls group_by and summarize for each factor, and then combines the results with bind_rows:
library(tidyverse)

list("cut", "color") %>%
  map(~ diamonds %>%
    group_by_at(.x) %>%
    summarize(name = .x,
              carat_mean = mean(carat),
              carat_median = median(carat)) %>%
    select(name, value = !!.x, everything()) %>%
    mutate(value = as.character(value))) %>%
  bind_rows()
#> # A tibble: 12 x 4
#>    name  value     carat_mean carat_median
#>    <chr> <chr>          <dbl>        <dbl>
#>  1 cut   Fair           1.05          1
#>  2 cut   Good           0.849         0.82
#>  3 cut   Very Good      0.806         0.71
#>  4 cut   Premium        0.892         0.86
#>  5 cut   Ideal          0.703         0.54
#>  6 color D              0.658         0.53
#>  7 color E              0.658         0.53
#>  8 color F              0.737         0.7
#>  9 color G              0.771         0.7
#> 10 color H              0.912         0.9
#> 11 color I              1.03          1
#> 12 color J              1.16          1.11
Created on 2020-04-13 by the reprex package (v0.3.0)
However, are there better ways to do this? For beginners, the second approach might be too complicated because it requires knowledge of purrr::map() and the !! operator.
Or is there a well-maintained package for this type of analysis (e.g. something like skimr, but able to summarize by the levels of several factors)?
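For comparison, here is a minimal sketch of one possible direction that needs neither purrr::map() nor the !! operator: reshape the factor columns into name/value pairs first, then group once. It is only a sketch under assumptions: it relies on across() and the .groups argument from dplyr >= 1.0.0 and on pivot_longer() from tidyr >= 1.0.0, and the object name summary_long is made up for illustration.

```r
library(dplyr)
library(tidyr)
library(ggplot2)  # only needed for the diamonds data

summary_long <- diamonds %>%
  # cut and color are ordered factors with different levels,
  # so convert them to character before stacking them in one column
  mutate(across(c(cut, color), as.character)) %>%
  # one row per (observation, factor): name holds "cut"/"color", value holds the level
  pivot_longer(c(cut, color), names_to = "name", values_to = "value") %>%
  group_by(name, value) %>%
  summarize(
    carat_mean = mean(carat),
    carat_median = median(carat),
    .groups = "drop"
  )

summary_long
```

This should give the same 12 rows and 4 columns as the two approaches above, but sorted alphabetically by name and value rather than by factor-level order, because the levels are converted to character before grouping.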