Descriptives for levels of multiple categorical variables (factors) versus a metric variable - best practice?

joel.gautschi · April 13, 2020, 9:15am

To explore the relationship between multiple categorical variables (factors) and a metric variable, I would like to calculate various descriptive statistics (mean, median, IQR etc.) of the metric variable for each factor level of each variable (e.g. multiple factors of an experiment vs. the outcome variable).

My question: what is the best (most straightforward, elegant) way to do this in (tidyverse) R? Or is there a well-maintained package for this?

Two approaches with a reprex (I use the diamonds data, with cut and color as categorical variables and carat as the metric variable):

My first approach uses group_by() and summarize(), copied and pasted for each factor, followed by bind_rows():

library(tidyverse)

summary_cut <- diamonds %>% 
  group_by(cut) %>% 
  summarize(
    name = "cut",
    carat_mean = mean(carat), 
    carat_median = median(carat)) %>%
  select(name, value = cut, everything()) %>%
  mutate(value = as.character(value))

summary_color <- diamonds %>% 
  group_by(color) %>% 
  summarize(
    name = "color",
    carat_mean = mean(carat), 
    carat_median = median(carat)) %>%
  select(name, value = color, everything()) %>%
  mutate(value = as.character(value))

bind_rows(summary_cut, summary_color)
#> # A tibble: 12 x 4
#>    name  value     carat_mean carat_median
#>    <chr> <chr>          <dbl>        <dbl>
#>  1 cut   Fair           1.05          1   
#>  2 cut   Good           0.849         0.82
#>  3 cut   Very Good      0.806         0.71
#>  4 cut   Premium        0.892         0.86
#>  5 cut   Ideal          0.703         0.54
#>  6 color D              0.658         0.53
#>  7 color E              0.658         0.53
#>  8 color F              0.737         0.7 
#>  9 color G              0.771         0.7 
#> 10 color H              0.912         0.9 
#> 11 color I              1.03          1   
#> 12 color J              1.16          1.11

Created on 2020-04-13 by the reprex package (v0.3.0)

My second, more flexible approach avoids the copy + paste: it uses map() with a formula that calls group_by_at() and summarize(), followed by bind_rows():

library(tidyverse)

list("cut", "color") %>%
  map(~ diamonds %>% 
        group_by_at(.x) %>%
        summarize(name = .x,
                  carat_mean = mean(carat), 
                  carat_median = median(carat)) %>%
        select(name, value = !!.x, everything()) %>%
        mutate(value = as.character(value))) %>%
  bind_rows()
#> # A tibble: 12 x 4
#>    name  value     carat_mean carat_median
#>    <chr> <chr>          <dbl>        <dbl>
#>  1 cut   Fair           1.05          1   
#>  2 cut   Good           0.849         0.82
#>  3 cut   Very Good      0.806         0.71
#>  4 cut   Premium        0.892         0.86
#>  5 cut   Ideal          0.703         0.54
#>  6 color D              0.658         0.53
#>  7 color E              0.658         0.53
#>  8 color F              0.737         0.7 
#>  9 color G              0.771         0.7 
#> 10 color H              0.912         0.9 
#> 11 color I              1.03          1   
#> 12 color J              1.16          1.11

Created on 2020-04-13 by the reprex package (v0.3.0)
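
A possible variation on the map-based approach, offered only as a sketch: the per-factor steps can be wrapped in a small helper so that the map() call itself stays short, and group_by(across(all_of(...))) can stand in for group_by_at() and the !! unquoting. The helper name summarise_by and its argument group_col are made up for illustration, and the sketch assumes dplyr 1.0.0 or later (for across() and the .groups argument).

```r
library(dplyr)
library(purrr)

# Hypothetical helper (not part of any package): summarise `carat` for each
# level of one grouping column, whose name is passed as a string.
summarise_by <- function(data, group_col) {
  data %>%
    group_by(across(all_of(group_col))) %>%
    summarize(
      carat_mean = mean(carat),
      carat_median = median(carat),
      .groups = "drop"
    ) %>%
    # Give the grouping column a common name so the pieces can be stacked
    rename(value = all_of(group_col)) %>%
    mutate(name = group_col, value = as.character(value)) %>%
    select(name, value, everything())
}

# Apply the helper to each factor name and stack the results
summary_tbl <- c("cut", "color") %>%
  map(~ summarise_by(ggplot2::diamonds, .x)) %>%
  bind_rows()

summary_tbl
```

This should give the same 12 x 4 layout as the two reprexes above, with the levels of each factor in their original order.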

However, are there better ways to do this? For beginners, the second approach might be too complicated because it requires knowledge of purrr::map() and the !! operator.
Or is there a well-maintained package for this type of analysis (e.g. something like skimr, but able to summarise a metric variable across the levels of several factors)?
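
One candidate that needs neither copy + paste nor map(), sketched here under some assumptions: reshape the factors into long format with tidyr::pivot_longer() first, then run a single group_by() and summarize(). The sketch assumes dplyr 1.0.0 or later and a tidyr version that provides pivot_longer(); the carat_iqr column is added only to illustrate a further statistic. Because cut and color are ordered factors with different levels, they are converted to character before being stacked into one column.

```r
library(dplyr)
library(tidyr)

summary_long <- ggplot2::diamonds %>%
  select(cut, color, carat) %>%
  # cut and color have different factor levels, so convert both to
  # character before combining them into a single `value` column
  mutate(across(c(cut, color), as.character)) %>%
  # One row per (diamond, factor) pair: `name` holds the factor's name,
  # `value` holds the level observed for that diamond
  pivot_longer(c(cut, color), names_to = "name", values_to = "value") %>%
  group_by(name, value) %>%
  summarize(
    carat_mean = mean(carat),
    carat_median = median(carat),
    carat_iqr = IQR(carat),
    .groups = "drop"
  )

summary_long
```

One trade-off of this design: the rows come back sorted by name and value as character strings, not in the original factor-level order, and every additional factor multiplies the number of rows in the intermediate long table.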
