Multi-column-type Summarise

I recently asked a similar question so it's good to know I'm not the only one puzzling over this. :slight_smile:

I think we can avoid using a join by slightly tweaking your existing code

Expected output

df1 <- full_join(colors, others) 
#> Joining, by = c("Species", "year")

Solution

df2 <- iris %>%
  group_by(Species, year) %>%
  summarise_all(funs(if(is.numeric(.)) mean(., na.rm = TRUE) else paste(., collapse="; "))) %>% 
  select(Species, year, color, everything())

identical(ungroup(df1), ungroup(df2))
#> [1] TRUE

I can avoid using the join in this case as I don't think the first function is actually ever used. This is because group_by variables are not included as variables to summarise when using summarise_all.

Is the real use case more complex though? If so, I don't know of an 'off-the-shelf' tidyverse solution for this. One workaround I've been using is to replace the use of group_by with nest and work on the nested list column.

For example, to avoid using joins you could nest everything except the groups you'd like to summarise and then bind columns for each type of summary needed. E.g.

df3 <- iris %>% 
  as_tibble() %>% 
  nest(-Species, -year) %>% 
  mutate(data = map(data, ~ bind_cols(
      color = paste(.$color, collapse = "; "),
      summarise_if(., is.numeric, funs(mean(., na.rm = TRUE)))))) %>% 
  unnest(data)

identical(ungroup(df1), df3)
#> [1] TRUE

The trick here is each type of summary either needs to return a named single value or a one row tibble otherwise bind_cols will complain. This approach avoids using a join and gives access to variable names if needed. You can also include as many summarise_* calls as you like. But TBH, I don't find this as readable as your original code using a join.

Jenny Bryan's row-oriented workflows repository might also be another place to look - there could be some unearthed pearls of wisdom in there related to this...

3 Likes