Summarise across multiple columns, each with specific function

pieterjanvc · February 18, 2021, 2:57pm

Hi all,

I'm writing a package and when trying to evaluate potential user's input I run into an issue I can't seem to solve with dplyr.

The user provides a data frame, of which I only know for sure that the first n columns are present (in example column x and y). There can be any other number of columns too, which I don't know beforehand. What I need to do in the code is make a summary of data based on the columns I know, but I would like to user to be able to specify summary functions for the columns that I don't know, so I don't have to discard them during the process.

REPREX - simplified version of actual problem with more columns

Overview
We only know for sure that column x and y will be present

  x         y a b
1 1 0.2002145 1 a
2 2 0.6852186 2 b
3 1 0.9168758 3 c
4 2 0.2843995 4 d
5 1 0.1046501 5 e
6 2 0.7010575 6 f

The user then provides a list of summary functions for unknown columns

funList = c(a = sum, b = function(x) {paste0(x, collapse = "")})

And out should come this:

# A tibble: 2 x 4
# Groups:   x [2]
      x     y     a b    
  <int> <dbl> <int> <chr>
1     1 0.917     9 ace  
2     2 0.701    12 bdf

I figured out a way to to this using the following code

library(tidyverse)
set.seed(5)

#Data provided by user (only x and y are known for sure)
myData = data.frame(x = 1:2, y = runif(6), a = 1:6, b = letters[1:6])

#Summarise functions per column 
funList = list(a = sum, b = function(x) {paste0(x, collapse = "")})

#Summarise the data using the user's functions
#----------------------------------------------

#First update all columns
for(name in names(funList)){
  if(name %in% colnames(myData)){
    myData = myData %>% group_by(x) %>% 
      mutate({{name}} := funList[[{{name}}]](!!sym(name)))
  } else{
    warning("No column with name '", name, "' exists")
  }
}

#Get the summary value for y, then create the final table
myData %>% mutate(y = max(y)) %>% distinct()

But this is cheating as I would love to use the summary function from dplyr instead, but I can only provide it with a list of functions that will be applied to all columns which will fail as not all have the same type of summary

myData %>% group_by(x) %>% summarise(
  y = max(y),
  across(.cols = contains(names(funList)), .fns = funList)
)
Error: Problem with `summarise()` input `..2`.
x invalid 'type' (character) of argument
i Input `..2` is `(function (.cols = everything(), .fns = NULL, ..., .names = NULL) ...`.
i The error occurred in group 1: x = 1.
Run `rlang::last_error()` to see where the error occurred.

Any ideas??
PJ

mhakanda · March 6, 2021, 4:25am

Hi,
I am trying to understand your problem. Seeing your ideal input and output as you provided, how do you summarise y-column. User did not provide any function for y (you say x and y are present), then how do you summarise y-column in your output. Thanks

woodward · March 6, 2021, 8:43pm

It's trying to apply each function to each column, and this is throwing the error. If you use max it works.

funList <- list(a = max, b = max)

Not sure if you can prevent this. You could perhaps write a custom function to figure out which of funList to apply.

mhakanda · March 6, 2021, 8:58pm

I see y-column is summarised by "max" function. I am trying to solve it using cur_column

myData
#>   x         y a b
#> 1 1 0.2002145 1 a
#> 2 2 0.6852186 2 b
#> 3 1 0.9168758 3 c
#> 4 2 0.2843995 4 d
#> 5 1 0.1046501 5 e
#> 6 2 0.7010575 6 f
funList = c(a = sum, b = function(x) {paste0(x, collapse = "")})
funList = c(y=max, funList)

myData %>%group_by(x) %>%
  summarise(across(all_of(names(funList)),  ~ funList[[cur_column()]](.x)))
#> # A tibble: 2 x 4
#>       x     y     a b    
#> * <int> <dbl> <int> <chr>
#> 1     1 0.917     9 ace  
#> 2     2 0.701    12 bdf

pieterjanvc · March 7, 2021, 1:17pm

Hi,

I'm sorry for the confusion! The y-column is known as well. I added it in later because in the real thing, I have one column I group by (in example x), then a few I know must be there (in example y), and then unknown ones (in example a, b). Updated the reprex text.

pieterjanvc · March 7, 2021, 1:21pm

HI,

Amazing! That's exactly what I need!
I knew there must be an elegant solution

Thanks again,
PJ

library(tidyverse)
set.seed(5)

myData = data.frame(x = 1:2, y = runif(6), a = 1:6, b = letters[1:6])

funList = c(a = sum, b = function(x) {paste0(x, collapse = "")})


myData %>%group_by(x) %>%
  summarise(y = max(y), 
            across(all_of(names(funList)),  
                   ~ funList[[cur_column()]](.x)))
#> # A tibble: 2 x 4
#>       x     y     a b    
#> * <int> <dbl> <int> <chr>
#> 1     1 0.917     9 ace  
#> 2     2 0.701    12 bdf

^{Created on 2021-03-07 by the reprex package (v1.0.0)}

system · March 14, 2021, 1:21pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.