Summarise multiple columns using multiple functions in a tidy way?

markdly · May 21, 2018, 12:38am

Hi tidyverse community, I am wondering if there is a recommended tidyverse workflow when you want to summarise multiple columns in a tibble using multiple arbitrary summary functions. In addition, the results should be contained in a 'tidy' tibble.

For example, I can summarise one column multiple ways (e.g. using min() and anyNA()):

library(tidyverse)
iris %>% summarise_at("Petal.Width", funs(min, anyNA))
#>   min anyNA
#> 1 0.1 FALSE

I can then extend the previous example to summarise multiple columns:

iris %>% summarise_at(vars(Sepal.Length:Petal.Width), funs(min, anyNA)) 
#>   Sepal.Length_min Sepal.Width_min Petal.Length_min Petal.Width_min
#> 1              4.3               2                1             0.1
#>   Sepal.Length_anyNA Sepal.Width_anyNA Petal.Length_anyNA
#> 1              FALSE             FALSE              FALSE
#>   Petal.Width_anyNA
#> 1             FALSE

Unfortunately, the above result isn't tidy. Perhaps I should use gather and spread to get the desired output:

iris %>% 
  summarise_at(vars(Sepal.Length:Petal.Width), funs(min, anyNA)) %>% 
  gather(key = "key", value = "value") %>%
  separate(key, c("variable", "stat"), sep = "_") %>%
  spread(stat, value) 
#>       variable anyNA min
#> 1 Petal.Length     0 1.0
#> 2  Petal.Width     0 0.1
#> 3 Sepal.Length     0 4.3
#> 4  Sepal.Width     0 2.0

This is where I wonder if I'm heading in the wrong direction. Is there another tidyverse way I should do this? One downfall of this approach is the logical result for anyNA() is now coerced to numeric.

My current workaround is to ditch summarise_at() completely and define a function which returns a one row tibble. The first column returned is the original tibble column name. Each remaining column relates to an arbitrary summary function.

report <- function(x, name) { 
  tibble(
    name  = name, 
    min   = min(x), 
    anyNA = anyNA(x)
  )}

Then I use purrr::imap_dfr() to get the result:

iris %>% select(Sepal.Length:Petal.Width) %>% imap_dfr(report)
#> # A tibble: 4 x 3
#>   name           min anyNA
#>   <chr>        <dbl> <lgl>
#> 1 Sepal.Length   4.3 FALSE
#> 2 Sepal.Width    2   FALSE
#> 3 Petal.Length   1   FALSE
#> 4 Petal.Width    0.1 FALSE

This seems OK to me and was the approach I suggested as an answer to a recent question. But I'm not sure if the workaround is necessary and I've missed an easy step somewhere.

Is the workaround a good way to go or am I in danger of getting into some bad habits?
Are there other tidyverse approaches recommended in this situation?

rensa · May 21, 2018, 3:00am

Some of the examples in the scoped summarise docs use summarise_all to apply multiple functions to multiple columns. For example:

by_species %>% summarise_all(funs(min, max))
#> # A tibble: 3 x 9
#>   Species    Sepal.Length_min Sepal.Width_min Petal.Length_min Petal.Width_min
#>   <fct>                 <dbl>           <dbl>            <dbl>           <dbl>
#> 1 setosa                  4.3             2.3              1               0.1
#> 2 versicolor              4.9             2                3               1  
#> 3 virginica               4.9             2.2              4.5             1.4
#> # ... with 4 more variables: Sepal.Length_max <dbl>, Sepal.Width_max <dbl>,
#> #   Petal.Length_max <dbl>, Petal.Width_max <dbl>
# Note that output variable name now includes the function name, in order to
# keep things distinct.

summarise_all operates on all columns except the grouping ones, so you don't get the control of using select helpers like vars(matches("blah")). But since the summarise_* functions aggregate the output, you can't have any other columns left anyway—so you could just use select_* to drop unwanted columns before hand.

For example, you could select the grouping variable(s) and columns of interest with select_at, then group_by the grouping variable(s), then summarise_all:

iris %>%
  select_at(vars("Species", starts_with("Petal"))) %>%
  group_by(Species) %>%
  summarise_all(c("min", "max", "mean"))
#> # A tibble: 3 x 7
#>   Species    Petal.Length_min Petal.Width_min Petal.Length_max Petal.Width_max
#>   <fct>                 <dbl>           <dbl>            <dbl>           <dbl>
#> 1 setosa                  1               0.1              1.9             0.6
#> 2 versicolor              3               1                5.1             1.8
#> 3 virginica               4.5             1.4              6.9             2.5
#>  ... with 2 more variables: Petal.Length_mean <dbl>, Petal.Width_mean <dbl>

I hope that helps!

EDIT: also, you can give summarise_all a named list of arguments using the funs wrapper, if you'd rather avoid the pain of renaming all those aggregated columns (although you'll still get a concatenation of the column name and the renamed aggregation function)

markdly · May 21, 2018, 3:32am

Thanks rensa,

I think your last comment really nails the key question I'm trying to find a good solution for:

Is there a recommended way to avoid this scenario with the column names repeated as many times as there are summary functions? I guess I consider this result 'untidy' and makes it difficult if I want to use the results downstream.

For example, I can get a summary by group using my workaround but it also involves ditching the usual use of group_by as well as summarise_at or summarise_all. Hence, this is why I'm wondering if I've missed something along the way! E.g. to summarise by group using the workaround to get a 'tidy' tibble as a result you could do:

library(tidyverse)

report <- function(x, name) { 
  tibble(
    name  = name, 
    min   = min(x), 
    anyNA = anyNA(x)
  )}

iris %>% 
  nest(-Species) %>% 
  mutate(data = map(data, ~ imap_dfr(., report))) %>% 
  unnest()
#> # A tibble: 12 x 4
#>    Species    name           min anyNA
#>    <fct>      <chr>        <dbl> <lgl>
#>  1 setosa     Sepal.Length   4.3 FALSE
#>  2 setosa     Sepal.Width    2.3 FALSE
#>  3 setosa     Petal.Length   1   FALSE
#>  4 setosa     Petal.Width    0.1 FALSE
#>  5 versicolor Sepal.Length   4.9 FALSE
#>  6 versicolor Sepal.Width    2   FALSE
#>  7 versicolor Petal.Length   3   FALSE
#>  8 versicolor Petal.Width    1   FALSE
#>  9 virginica  Sepal.Length   4.9 FALSE
#> 10 virginica  Sepal.Width    2.2 FALSE
#> 11 virginica  Petal.Length   4.5 FALSE
#> 12 virginica  Petal.Width    1.4 FALSE

In this case I nest() the variables I want to summarise with each remaining row after nesting representing the results I'd like to "group by".

rensa · May 21, 2018, 3:45am

I'd say it's less that one solution is tidier than the other and more that the output you're looking for is longer (whereas mine is wider). Making data wider or longer is a pretty common task that the tidyr package tackles well; if you wanted to convert my solution to a longer format, you could gather all those columns up and then separate the concatenated column names into a pair of columns:

iris %>%
  select_at(vars("Species", starts_with("Petal"))) %>%
  group_by(Species) %>%
  summarise_all(c("min", "max", "mean")) %>%
  gather(key = "key", value = "value", -Species) %>%
  separate(key, into = c("measurement", "statistic"), sep = "[_]")
#> # A tibble: 18 x 4
#>    Species    measurement  statistic value
#>    <fct>      <chr>        <chr>     <dbl>
#>  1 setosa     Petal.Length min       1    
#>  2 versicolor Petal.Length min       3    
#>  3 virginica  Petal.Length min       4.5  
#>  4 setosa     Petal.Width  min       0.1  
#>  5 versicolor Petal.Width  min       1    
#>  6 virginica  Petal.Width  min       1.4

#> (and a bunch of other rows)

(Keep in mind the sep argument for separate, which gives it a pattern to use to break the values. If your column values have underscores in them already, you might need to a bit of fiddling!)

Does that look more like the sort of thing you'd like to see?

markdly · May 21, 2018, 4:54am

Thanks rensa, I appreciate the feedback!

I'd use this example if all the summary functions returned the same type. One possible downside is if I use arbitrary summary functions returning different types (e.g. character (typeof()), logical (anyNA()), and numeric (mean())). All values in the statistic column get coerced into a common type (in this case character):

library(tidyverse)
iris %>%
  select_at(vars("Species", starts_with("Petal"))) %>%
  group_by(Species) %>%
  summarise_all(c("typeof", "anyNA", "mean")) %>%
  gather(key = "key", value = "value", -Species) %>%
  separate(key, into = c("measurement", "statistic"), sep = "[_]")
#> # A tibble: 18 x 4
#>    Species    measurement  statistic value 
#>    <fct>      <chr>        <chr>     <chr> 
#>  1 setosa     Petal.Length typeof    double
#>  2 versicolor Petal.Length typeof    double
#>  3 virginica  Petal.Length typeof    double
#>  4 setosa     Petal.Width  typeof    double
#>  5 versicolor Petal.Width  typeof    double
#>  6 virginica  Petal.Width  typeof    double
#>  7 setosa     Petal.Length anyNA     FALSE 
#>  8 versicolor Petal.Length anyNA     FALSE 
#>  9 virginica  Petal.Length anyNA     FALSE 
#> 10 setosa     Petal.Width  anyNA     FALSE 
#> 11 versicolor Petal.Width  anyNA     FALSE 
#> 12 virginica  Petal.Width  anyNA     FALSE 
#> 13 setosa     Petal.Length mean      1.462 
#> 14 versicolor Petal.Length mean      4.26  
#> 15 virginica  Petal.Length mean      5.552 
#> 16 setosa     Petal.Width  mean      0.246 
#> 17 versicolor Petal.Width  mean      1.326 
#> 18 virginica  Petal.Width  mean      2.026

So ideally the end result also needs to be able to return the original value type for each summary function...

rensa · May 21, 2018, 5:23am

Ah, good catch: you might lose some important data that way. Okay, if you want columns for the aggregation functions but not the original columns, let's do the gathering first!

iris %>%
  select_at(vars("Species", starts_with("Petal"))) %>%
  gather(key = 'key', value = 'value', -Species) %>%
  group_by(Species, key) %>%
  summarise(mean = mean(value), min = min(value), anyNA = anyNA(value))
#> # A tibble: 6 x 5
#> # Groups:   Species [?]
#>   Species    key           mean   min anyNA
#>   <fct>      <chr>        <dbl> <dbl> <lgl>
#> 1 setosa     Petal.Length 1.46    1   FALSE
#> 2 setosa     Petal.Width  0.246   0.1 FALSE
#> 3 versicolor Petal.Length 4.26    3   FALSE
#> 4 versicolor Petal.Width  1.33    1   FALSE
#> 5 virginica  Petal.Length 5.55    4.5 FALSE
#> 6 virginica  Petal.Width  2.03    1.4 FALSE

EDIT: whoops, forgot to add the key column (which has the original measurement) to group_by. That's better!

EDIT: in hindsight, grouping multiple times and then just gathering by the aggregation functions probably would've been a more natural/readable way to do the approaches above anyway

markdly · May 21, 2018, 7:50am

Thanks! Yes I think this works for the data posed in my original question.

So now I think my current workaround is only needed when both the columns being summarised and the resulting summary tibble contain columns of multiple types.

For example, I don't think the existing gather and spread approaches will work in this situation without coercing values to a common type at some stage:

library(tidyverse)
df <- tibble::tribble(
  ~ints, ~char,  ~lgl,
     1L,   "a",  TRUE,
     2L,   "b",  FALSE,
     3L,   "c",  NA)

report <- function(x, name) {
  tibble(
    column = name,
    typeof = typeof(x),
    mean   = ifelse(is.numeric(x) | is.logical(x), mean(x, na.rm = TRUE), NA),
    any_na = anyNA(x)
  )}

df %>% imap_dfr(report)
#> # A tibble: 3 x 4
#>   column typeof     mean any_na
#>   <chr>  <chr>     <dbl> <lgl> 
#> 1 ints   integer     2   FALSE 
#> 2 char   character  NA   FALSE 
#> 3 lgl    logical     0.5 TRUE

(Moderators: I realise this is 'moving the goal posts' of the original question so please let me know if this should be opened as a new topic)

mishabalyasin · May 21, 2018, 3:17pm

Did you come across skimr package? I think, it has a very similar approach to what @rensa suggested - creating named lists and then using those named lists to summarize all necessary columns at once. Perhaps, you can use similar approach for your problem as well.

markdly · May 22, 2018, 4:40am

Thanks @mishabalyasin and @rensa, I'm marking rensa's answer as the solution since it solves the specific example posted in the original question. And thanks mishabalyasin for the skimr tip. The skim_with() function looks like it does the very thing I was after. I'm looking forward to trying it out.

The feedback has really clarified for me about which summarise approach to use when. I think the following flowchart says it best. Perhaps others might find it useful:

Side note: Flowchart made using mermaidjs. Here's the source url

rensa · May 22, 2018, 5:12am

Honestly, I didn't realise that skimr had skim_with. That's probably going to be the go-to solution in 99% of cases (though working through this problem was a nice exercise for me ).