Manipulating groups when min/max etc. are not the right function

I have a piece of hobby code which has thrown up a problem I can't solve.
It relates to fantasy football and manipulating data about which (fantasy) managers have transferred players.
The naïve assumption would be that a given player (element) does not get transferred more than once in a week (event), but that proves not to be the case.
So if there is a sequence of trades in and out in the same week, what is the net transfer?

transfersnet below has a partial solution in that it tells me which players per event have been traded in more than out, or out more than in. So I know the net trades. What I don't know is who was traded for whom.

In order to get who traded for whom, I reckon I need to group transfers by event, then process each group's rows by storing the first in/out pair. Then for each subsequent pair:

  • if the in is already in a stored pair as an out, replace the out of the stored pair with the new out
  • likewise, if the out is already in a stored pair as an in, replace the in of the stored pair with the new in
  • else add the new pair
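For example, with hypothetical player IDs, a single event's trades would be condensed like this (in is a reserved word in R, hence in_):

pairs <- list(c(in_ = 10, out = 20))                       # store the first pair
# next trade (in 20, out 30): 20 matches a stored out,
# so that pair's out becomes 30
pairs <- list(c(in_ = 10, out = 30))
# next trade (in 40, out 10): 10 matches a stored in,
# so that pair's in becomes 40
pairs <- list(c(in_ = 40, out = 30))
# next trade (in 50, out 60): no match, so add the new pair
pairs <- list(c(in_ = 40, out = 30), c(in_ = 50, out = 60))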

I am guessing I can implement functions to do the above operating on a list.
But I have no idea how to manipulate groups and condense the rows without using summarise.

Sorry if this is not well-explained. Writing this up has helped my thinking about it, but I am not sure I know what to do with a group_by if not using summarise!

Input data looks like

> transfers
# A tibble: 40 x 4
   element_in element_out event time                       
        <int>       <int> <int> <chr>                      
 1        491          67     3 2021-08-22T21:27:08.843876Z
 2        196         254     3 2021-08-27T23:40:43.015581Z
 3        399         277     4 2021-09-11T09:58:39.486116Z
 4        579         348     4 2021-09-11T09:58:39.493032Z
 5        430          78     5 2021-09-14T20:48:56.023986Z
 6        559          69     7 2021-09-28T13:18:44.050136Z
 7         30         448     7 2021-09-28T13:18:44.054258Z
 8        121         185     7 2021-09-28T13:18:44.054805Z
 9        256         237     7 2021-09-28T13:18:44.055313Z
10        127         275     7 2021-09-28T13:18:44.055828Z
# ... with 30 more rows
library(dplyr)

transfers = 
  structure(list(element_in = c(399L, 579L, 430L, 559L, 30L, 121L, 
                                256L, 127L, 44L, 262L, 482L, 173L, 154L, 22L, 125L, 205L, 307L, 
                                78L, 475L, 270L, 249L, 169L, 23L, 432L, 122L, 399L, 64L, 261L, 
                                419L, 21L, 4L, 63L, 579L, 559L, 30L, 121L, 256L, 127L, 44L, 262L
  ), element_out = c(277L, 348L, 78L, 69L, 448L, 185L, 237L, 275L, 
                     290L, 491L, 196L, 233L, 240L, 399L, 419L, 413L, 430L, 579L, 30L, 
                     559L, 44L, 121L, 127L, 256L, 262L, 22L, 125L, 154L, 173L, 482L, 
                     78L, 205L, 307L, 270L, 475L, 23L, 122L, 169L, 249L, 432L), 
  event = c(4L, 
4L, 5L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 
7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 
7L, 7L, 7L, 7L, 7L, 7L, 7L), 
time = c("2021-09-11T09:58:39.486116Z", 
 "2021-09-11T09:58:39.493032Z", "2021-09-14T20:48:56.023986Z", 
 "2021-09-28T13:18:44.050136Z", "2021-09-28T13:18:44.054258Z", 
 "2021-09-28T13:18:44.054805Z", "2021-09-28T13:18:44.055313Z", 
 "2021-09-28T13:18:44.055828Z", "2021-09-28T13:18:44.056324Z", 
 "2021-09-28T13:18:44.056840Z", "2021-09-28T13:18:44.057337Z", 
 "2021-09-28T13:18:44.057836Z", "2021-09-28T13:18:44.058331Z", 
 "2021-09-28T13:18:44.058824Z", "2021-09-28T13:18:44.059313Z", 
 "2021-09-28T13:18:44.059819Z", "2021-09-28T13:18:44.060318Z", 
 "2021-09-28T13:18:44.060872Z", "2021-09-28T13:19:49.644860Z", 
 "2021-09-28T13:19:49.650382Z", "2021-09-28T13:19:49.651567Z", 
 "2021-09-28T13:19:49.652626Z", "2021-09-28T13:19:49.653605Z", 
 "2021-09-28T13:19:49.654344Z", "2021-09-28T13:19:49.655445Z", 
 "2021-09-28T13:19:49.656552Z", "2021-09-28T13:19:49.657677Z", 
 "2021-09-28T13:19:49.658723Z", "2021-09-28T13:19:49.659701Z", 
 "2021-09-28T13:19:49.660526Z", "2021-09-28T13:19:49.661612Z", 
 "2021-09-28T13:19:49.662672Z", "2021-09-28T13:19:49.663489Z", 
 "2021-09-28T13:20:23.170741Z", "2021-09-28T13:20:23.179240Z", 
 "2021-09-28T13:20:23.180441Z", "2021-09-28T13:20:23.181475Z", 
 "2021-09-28T13:20:23.182479Z", "2021-09-28T13:20:23.183104Z", 
 "2021-09-28T13:20:23.183661Z")), row.names = c(NA, -40L), class = c("tbl_df", 
                "tbl", "data.frame"))


# count transfers in per player (id) and event, keeping the latest time
transfersin = transfers %>%
  group_by(id = element_in, event) %>%
  summarise(transfersin = n(), timein = max(time), .groups = "drop")

# count transfers out per player (id) and event, keeping the latest time
transfersout = transfers %>%
  group_by(id = element_out, event) %>%
  summarise(transfersout = n(), timeout = max(time), .groups = "drop")

# net transfers per player and event
transfersall = transfersin %>%
  full_join(transfersout, by = c("id", "event")) %>%
  mutate(nettransfer = coalesce(transfersin, 0) - coalesce(transfersout, 0)) %>%
  arrange(event)

# players who were traded in or out on net
transfersnet = transfersall %>%
  filter(nettransfer != 0)

If I understand correctly, your goal is to transform this:

library(dplyr)

transfer <- tribble(~element_in, ~element_out, ~event,
                              1,            2,      1,
                              3,            4,      1,
                              4,            5,      1,
                              5,            6,      1,
                              1,            2,      2,
                              3,            4,      2)

Into this:

result <- tribble(~element_in, ~element_out, ~event,
                             1,            2,      1,
                             3,            6,      1,
                             1,            2,      2,
                             3,            4,      2)

The "likewise" part doesn't matter: if you sort your data beforehand, you are always in the first case, never in the second. That's easier.

So, trying to put what you wrote into an algorithm, the idea is:

  • read from the in and out columns. This is a loop, so we need an index to know which element we are reading
  • for each new element of in, compare to existing elements of out
  • build a resulting in and out, and for each new element we either modify a previous element or add it to the end. This is the same loop, but the output vector does not necessarily have the same length as the input: we need to keep a separate index for writing

So, a basic structure would look like this (calling xin and xout the vectors we read, and yin and yout the vectors we write):

while (TRUE) {
  read_index <- read_index + 1

  # look the current in up among the outs stored so far
  existing_out <- which(xin[read_index] == yout[1:write_index])

  if (length(existing_out) == 1) {
    # the current in already exists in a previous out: overwrite that out
    yout[existing_out] <- xout[read_index]
    next
  } else {
    # the current in is new: append a fresh pair
    write_index <- write_index + 1
    yin[write_index] <- xin[read_index]
    yout[write_index] <- xout[read_index]
  }
}

We need to add some boilerplate, for example a break to exit the loop when we're done. We also want to initialize our output vectors before we start, and truncate them when we're finished. Finally, we can put all that in a function. One aspect of R functions is that they only return a single object, so we put our two columns in a data frame.

collapse_rows <- function(tr){
  xin <- tr[[1]]
  xout <- tr[[2]]
  n <- length(xin)
  stopifnot(n == length(xout))
  stopifnot(n >= 1)

  # store the first pair, then read from the second element onwards;
  # write_index always points at the last row written
  read_index <- 1
  write_index <- 1
  yin <- integer(length = n)
  yin[1] <- xin[1]

  yout <- integer(length = n)
  yout[1] <- xout[1]

  while (TRUE) {
    read_index <- read_index + 1
    if (read_index == n + 1) break

    existing_out <- which(xin[read_index] == yout[1:write_index])
    if (length(existing_out) == 1) {
      # the current in already exists in a previous out: overwrite that out
      yout[existing_out] <- xout[read_index]
      next
    } else {
      # the current in is new: append a fresh pair
      write_index <- write_index + 1
      yin[write_index] <- xin[read_index]
      yout[write_index] <- xout[read_index]
    }
  }

  # truncate to the rows actually written
  tibble(element_in = yin[1:write_index],
         element_out = yout[1:write_index])
}

And finally, we can check it all. Since the function works on data frames, the best is to use nest(): that way we can change the number of rows per group without using summarize().

library(tidyverse)

transfer <- tribble(~element_in, ~element_out, ~event,
                              1,            2,      1,
                              3,            4,      1,
                              4,            5,      1,
                              5,            6,      1,
                              1,            2,      2,
                              3,            4,      2)

result <- tribble(~element_in, ~element_out, ~event,
                             1,            2,      1,
                             3,            6,      1,
                             1,            2,      2,
                             3,            4,      2)
result
#> # A tibble: 4 x 3
#>   element_in element_out event
#>        <dbl>       <dbl> <dbl>
#> 1          1           2     1
#> 2          3           6     1
#> 3          1           2     2
#> 4          3           4     2


collapse_rows <- function(tr){
  xin <- tr[[1]]
  xout <- tr[[2]]
  n <- length(xin)
  stopifnot(n == length(xout))
  stopifnot(n >= 1)

  # store the first pair, then read from the second element onwards;
  # write_index always points at the last row written
  read_index <- 1
  write_index <- 1
  yin <- integer(length = n)
  yin[1] <- xin[1]

  yout <- integer(length = n)
  yout[1] <- xout[1]

  while (TRUE) {
    read_index <- read_index + 1
    if (read_index == n + 1) break

    existing_out <- which(xin[read_index] == yout[1:write_index])
    if (length(existing_out) == 1) {
      # the current in already exists in a previous out: overwrite that out
      yout[existing_out] <- xout[read_index]
      next
    } else {
      # the current in is new: append a fresh pair
      write_index <- write_index + 1
      yin[write_index] <- xin[read_index]
      yout[write_index] <- xout[read_index]
    }
  }

  # truncate to the rows actually written
  tibble(element_in = yin[1:write_index],
         element_out = yout[1:write_index])
}

result2 <- transfer |>
  group_by(event) |>
  arrange(element_in, element_out) |>
  nest() |>
  transmute(processed_data = map(data, collapse_rows)) |>
  unnest(processed_data) |>
  ungroup() |>
  relocate(element_in, element_out)

waldo::compare(result, result2)
#> v No differences

Created on 2022-02-24 by the reprex package (v2.0.1)
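As an aside: if juggling nested list-columns feels heavy, group_modify() should get to the same place. It applies a data-frame-in, data-frame-out function to each group, so the row count can change without summarise(). A sketch reusing the same collapse_rows():

result3 <- transfer |>
  arrange(element_in, element_out) |>
  group_by(event) |>
  group_modify(~ collapse_rows(.x)) |>
  ungroup() |>
  relocate(element_in, element_out)

group_modify() passes each group's rows without the grouping column, which is what collapse_rows() expects, and binds event back on afterwards, so result3 should match result2.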

If the function looks ugly, that's normal: it's pretty much C-style code. I think we could do something more R-style using match(), but I can't figure out a good way, and I feel it would be even harder to read. It's easy for the in:

in_match <- match(xin, xout)
yin <- xin[is.na(in_match)]

But I can't figure out a good way to compute the out.
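One possibility, sketched under the assumption that a chain of trades never loops back on itself, is to follow each chain with match() until the out no longer reappears as an in (though it arguably just hides the loop):

# continuing from yin above
follow_out <- function(i, xin, xout) {
  o <- xout[match(i, xin)]         # the out initially paired with this in
  while (!is.na(match(o, xin))) {  # that out was itself traded back in...
    o <- xout[match(o, xin)]       # ...so hop to the out of the later pair
  }
  o
}
yout <- sapply(yin, follow_out, xin = xin, xout = xout)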

You understood my problem well.
I will test out your solution.
I think plain dplyr was ruled out because the sequence of trades within an event matters - hence the usefulness of your loop.

In the meantime, I had constructed a solution: within each game week (event):

1. determine the number of ins and outs for each id
2. net ins mean in, net outs mean out; ins = outs means no change
3. rank the net ins by cost and the net outs by cost (and also by position played, but I didn't include that info in the problem statement), then pair the most expensive in with the most expensive out, and so on; see the sketch below
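Assuming a lookup table players with id and cost columns (neither is part of the data I shared above), the pairing step might look roughly like this:

net_in <- transfersnet %>%
  filter(nettransfer > 0) %>%
  left_join(players, by = "id") %>%   # hypothetical players table
  group_by(event) %>%
  arrange(desc(cost), .by_group = TRUE) %>%
  mutate(rank = row_number())

net_out <- transfersnet %>%
  filter(nettransfer < 0) %>%
  left_join(players, by = "id") %>%
  group_by(event) %>%
  arrange(desc(cost), .by_group = TRUE) %>%
  mutate(rank = row_number())

# pair the most expensive net in with the most expensive net out, and so on
paired <- inner_join(net_in, net_out,
                     by = c("event", "rank"),
                     suffix = c("_in", "_out"))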
