# Descriptives of a metric variable for each level of multiple categorical variables (factors): best practice?

To explore the relationship between multiple categorical variables (factors) and a metric variable, I would like to calculate various descriptive statistics (mean, median, IQR etc.) of the metric variable for each factor level of each variable (e.g. multiple factors of an experiment vs. the outcome variable).

My question: what is the best (most straightforward, elegant) way to do this in (tidyverse) R? Or is there a well maintained package to do this?

Two approaches with a reprex (I use the `diamonds` data, with `cut` and `color` as categorical variables and `carat` as the metric variable):

My first approach uses `group_by` and `summarize`, copied and pasted for each factor, followed by `bind_rows`:

``````r
library(tidyverse)

# Summary of carat for each level of cut
summary_cut <- diamonds %>%
  group_by(cut) %>%
  summarize(
    name = "cut",
    carat_mean = mean(carat),
    carat_median = median(carat)
  ) %>%
  select(name, value = cut, everything()) %>%
  # convert the factor to character so both summaries can be stacked
  mutate(value = as.character(value))

# Same summary for each level of color
summary_color <- diamonds %>%
  group_by(color) %>%
  summarize(
    name = "color",
    carat_mean = mean(carat),
    carat_median = median(carat)
  ) %>%
  select(name, value = color, everything()) %>%
  mutate(value = as.character(value))

bind_rows(summary_cut, summary_color)
#> # A tibble: 12 x 4
#>    name  value     carat_mean carat_median
#>    <chr> <chr>          <dbl>        <dbl>
#>  1 cut   Fair           1.05          1
#>  2 cut   Good           0.849         0.82
#>  3 cut   Very Good      0.806         0.71
#>  4 cut   Premium        0.892         0.86
#>  5 cut   Ideal          0.703         0.54
#>  6 color D              0.658         0.53
#>  7 color E              0.658         0.53
#>  8 color F              0.737         0.7
#>  9 color G              0.771         0.7
#> 10 color H              0.912         0.9
#> 11 color I              1.03          1
#> 12 color J              1.16          1.11
``````

Created on 2020-04-13 by the reprex package (v0.3.0)

My second, more flexible approach avoids copy + paste: it uses `map` with a formula that calls `group_by` and `summarize` for each factor, followed by `bind_rows`:

``````r
library(tidyverse)

list("cut", "color") %>%
  # .x is the name of the grouping factor as a string
  map(~ diamonds %>%
    group_by_at(.x) %>%
    summarize(
      name = .x,
      carat_mean = mean(carat),
      carat_median = median(carat)
    ) %>%
    select(name, value = !!.x, everything()) %>%
    mutate(value = as.character(value))) %>%
  bind_rows()
#> # A tibble: 12 x 4
#>    name  value     carat_mean carat_median
#>    <chr> <chr>          <dbl>        <dbl>
#>  1 cut   Fair           1.05          1
#>  2 cut   Good           0.849         0.82
#>  3 cut   Very Good      0.806         0.71
#>  4 cut   Premium        0.892         0.86
#>  5 cut   Ideal          0.703         0.54
#>  6 color D              0.658         0.53
#>  7 color E              0.658         0.53
#>  8 color F              0.737         0.7
#>  9 color G              0.771         0.7
#> 10 color H              0.912         0.9
#> 11 color I              1.03          1
#> 12 color J              1.16          1.11
``````

Created on 2020-04-13 by the reprex package (v0.3.0)
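
A variation on the second approach, as a minimal sketch: wrapping the pipeline in a small helper function and referring to the grouping column through the `.data` pronoun avoids both `!!` and `group_by_at()`. The helper name `summarize_by` and its arguments are made up for illustration, and `carat` is hard-coded as the metric variable.

```r
library(tidyverse)

# Hypothetical helper: summarize carat for each level of one factor.
# `var` is the name of the grouping column, given as a string.
summarize_by <- function(data, var) {
  data %>%
    group_by(value = .data[[var]]) %>%  # group while the factor order is intact
    summarize(
      carat_mean = mean(carat),
      carat_median = median(carat)
    ) %>%
    mutate(name = var, value = as.character(value)) %>%
    select(name, value, everything())
}

# Apply the helper to each factor name and stack the results
summary_all <- c("cut", "color") %>%
  map(summarize_by, data = diamonds) %>%
  bind_rows()

summary_all
```

This should give the same four columns as the two reprexes above; the `map()` call is still there, but the part a beginner has to read is an ordinary function.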

However, are there better ways to do this? For beginners the second approach might be too complicated, because it requires knowledge of `purrr::map()` and the `!!` operator.
Or is there a well-maintained package for this type of analysis (e.g. something like `skimr`, but able to summarize one variable across the levels of several factors)?
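
One possible direction that needs neither `map()` nor `!!`, sketched here under assumptions: reshape the factors into long format with `pivot_longer()` and then use a single `group_by()` + `summarize()`. This assumes tidyr >= 1.0.0 (for `pivot_longer()`) and dplyr >= 1.0.0 (for `across()` and the `.groups` argument). The `carat_iqr` column is added only to illustrate the IQR mentioned at the top.

```r
library(tidyverse)

summary_long <- diamonds %>%
  select(cut, color, carat) %>%
  # cut and color are factors with different levels, so convert them
  # to character before stacking them into one column
  mutate(across(c(cut, color), as.character)) %>%
  pivot_longer(c(cut, color), names_to = "name", values_to = "value") %>%
  group_by(name, value) %>%
  summarize(
    carat_mean = mean(carat),
    carat_median = median(carat),
    carat_iqr = IQR(carat),
    .groups = "drop"
  )

summary_long
```

Because the factors are converted to character before grouping, the rows are not returned in the original level order (e.g. Fair, Good, Very Good, ...), which may or may not matter for a descriptive table.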
