How to pass 2 groups of columns to a function

I have written a function to "deduplicate" a dataframe.
A unique row is based on having a unique combination of one of more ID (ie key) fields
Deciding which duplicate to keep is based on taking the row with the maximum of a single field.
Example below.

But what what if the maximum also needed to be a "composite" - so I can specify the row to keep as the one with a maximum based on multiple fields?

I can see that the new "by" functionality is by the joins in dplyr 1.1.0 might be a way in. But can't see how I would use them.

library(tidyverse)

DeDupByMax = function(dfin , maxfield, ... ) {
  
  if (missing(...)) {
    cat("DeDupByMax unique IDs missing\n")
    stop()
  }
  quoID <- enquos(...)
  maxfieldquo <- enquo(maxfield)
  
  dfout <- dfin %>%
    arrange(!!!quoID, desc(!!maxfieldquo)) %>%
    distinct(!!!quoID, .keep_all = TRUE)
}

TodayDate = Sys.Date()

df = tibble(unique = c(1,1,2,3), other = 1:4, datefield = c(TodayDate, TodayDate - 1, TodayDate, TodayDate))

print(df)
#> # A tibble: 4 × 3
#>   unique other datefield 
#>    <dbl> <int> <date>    
#> 1      1     1 2023-02-24
#> 2      1     2 2023-02-23
#> 3      2     3 2023-02-24
#> 4      3     4 2023-02-24

dfdedup = DeDupByMax(df, datefield, unique)

print(dfdedup)
#> # A tibble: 3 × 3
#>   unique other datefield 
#>    <dbl> <int> <date>    
#> 1      1     1 2023-02-24
#> 2      2     3 2023-02-24
#> 3      3     4 2023-02-24

Created on 2023-02-24 with reprex v2.0.2

Hello,

The way you might want to think about this differently is as a group_by() problem.
Here is an example

library(tidyverse)

set.seed(1) #Only needed for reproducibility 
df = data.frame(
  key1 = c(1,2),
  key2 = c(1,2,3,4),
  val1 = runif(12),
  val2 = runif(12)
)
df
#>    key1 key2       val1      val2
#> 1     1    1 0.26550866 0.6870228
#> 2     2    2 0.37212390 0.3841037
#> 3     1    3 0.57285336 0.7698414
#> 4     2    4 0.90820779 0.4976992
#> 5     1    1 0.20168193 0.7176185
#> 6     2    2 0.89838968 0.9919061
#> 7     1    3 0.94467527 0.3800352
#> 8     2    4 0.66079779 0.7774452
#> 9     1    1 0.62911404 0.9347052
#> 10    2    2 0.06178627 0.2121425
#> 11    1    3 0.20597457 0.6516738
#> 12    2    4 0.17655675 0.1255551

#Filter by multiple values sequentially
df %>% group_by(key1, key2) %>% 
  filter(val1 == max(val1)) %>% 
  filter(val2 == max(val2)) %>% 
  ungroup() %>% distinct()
#> # A tibble: 4 × 4
#>    key1  key2  val1  val2
#>   <dbl> <dbl> <dbl> <dbl>
#> 1     2     4 0.908 0.498
#> 2     2     2 0.898 0.992
#> 3     1     3 0.945 0.380
#> 4     1     1 0.629 0.935

#Filter by composite value (no keys removed)
df %>% group_by(key1, key2) %>% 
  filter((val1 + val2) == max(val1 + val2)) %>% 
  ungroup() %>% distinct()
#> # A tibble: 4 × 4
#>    key1  key2  val1  val2
#>   <dbl> <dbl> <dbl> <dbl>
#> 1     1     3 0.573 0.770
#> 2     2     2 0.898 0.992
#> 3     2     4 0.661 0.777
#> 4     1     1 0.629 0.935

#Filter by composite value (keys removed)
df %>% group_by(key1, key2) %>% 
  filter(val1 == max(val1) & val2 == max(val2)) %>% 
  ungroup() %>% distinct()
#> # A tibble: 2 × 4
#>    key1  key2  val1  val2
#>   <dbl> <dbl> <dbl> <dbl>
#> 1     2     2 0.898 0.992
#> 2     1     1 0.629 0.935

Created on 2023-02-24 by the reprex package (v2.0.1)

As shown in the example above, when you try to design a filtering system that will always return a row for each unique key, you have to be careful what composite logic you use (maximum based on multiple fields as you give as an example).

Hope this helps,
PJ

Funnily enough, my first version of this function was written in terms of group_by and filter. But:
arrange/distinct was much faster
I think it struggled with NAs in a way that arrange/distinct does not

But actually my question is less about the method so much as how to pass 2 groups of columns into a function. inner_join etc have join_by() and reframe has .by. I suspect I need to be able to call

DeDupByMax(dfin , maxfields = .by(c1,c2,c3), IDfields = .by(c4,c5,c6))

But how do I unravel the .by arguments in the function>

Mike

Hi,

I suggest you look more into the dplyr data masking as this will help you using more recent syntax.

Here is an example of passing multiple variables at once:

library(tidyverse)

set.seed(1) #Only needed for reproducibility 
df = data.frame(
  key1 = c(1,2),
  key2 = c(1,2,3,4),
  val1 = runif(12),
  val2 = runif(12)
)

myFun = function(df, gb, ar){
  df %>% group_by(across({{gb}})) %>% 
    summarise(n = n(), .groups = "drop") %>% 
    arrange(across({{ar}}))
}

myFun(df, c(key1, key2), c(key2, key1))
#> # A tibble: 4 × 3
#>    key1  key2     n
#>   <dbl> <dbl> <int>
#> 1     1     1     3
#> 2     2     2     3
#> 3     1     3     3
#> 4     2     4     3

Created on 2023-02-24 by the reprex package (v2.0.1)

Note that this is only possible if the vector of col names does not contain any functions. For example, if you wanted to sort by desc(key2), key1 you would have to do:

myFun = function(df, gb, ...){
  df %>% group_by(across({{gb}})) %>% 
    summarise(n = n(), .groups = "drop") %>% 
    arrange(...)
}

myFun(df, c(key1, key2), desc(key2), key1)
#> # A tibble: 4 × 3
#>    key1  key2     n
#>   <dbl> <dbl> <int>
#> 1     2     4     3
#> 2     1     3     3
#> 3     2     2     3
#> 4     1     1     3

Created on 2023-02-24 by the reprex package (v2.0.1)

I'm still not good at this myself, but at least it might point you in the right direction :slight_smile:

Hope this helps,
PJ

Just to close the loop - I found and documented the answer here Pass orderby to dplyr arrange - #5 by mikecrobp

1 Like

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.