Identification of rows / calculation in large datasets

Dear community,

I am quite new with R and I am stuck with the following issue (which I hope some of you may help me with). I have a huge database (which I cannot disclose) but is it is structured as follows:

This is a database looking at relationships between buyers and sellers, allowing us to observe contracts between several buyers and sellers. What I wish to do is a) find everytime a relationship between a Buyer and a new seller is established (what the dummy "new" does) and b)for all month of all relationships, calculate the distance in months between the month of the line and the last new relationship started.

I have 25 million observations and 3 variables of of interest:
Code_ID_Buy Code_ID_Sell Month

The New variable states that there was no transaction before between the buyer and the seller, this is the first occurrence in line i so the variable new takes the value 1

The Distance : the distance in months between the two last occurences of new = 1 provided Code_ID_Buy is the same for the two rows. But this is not exactly what I want (see column Distancelastr1 where I would like to calculate the distance for each line bewteen the relationship and the last relationship entered by the buyer, so basically if new=1, this works, but not if new=0...)

A reduced sample would look like this:

library(data.table)
set.seed(1)
Data <- data.frame(
  Month = c(1,1,2,2,3,3,3,4,4,4,5,5,5,5,6,6,6,6,3,4,5),
  Amount = rnorm(21,mean=100,sd=20),
  Code_ID_Buy = c("100D","100D","100D","100D","100D","100D","100D","100D","100D","100D","100D","100D","100D","100D","100D","100D","100D","100D","102D","102D","102D"),
  Code_ID_Sell = c("98C","99C","98C","99C","98C","99C","96V","98C","99C","96V","98C","99C","96V","94D","98C","99C","96V","94D","25A","25A","25A")
)
  

Data$new<-0

setDT(Data)[order(Month, Code_ID_Buy, Code_ID_Sell), new := {
  r <- rowid(Code_ID_Buy, Code_ID_Sell)
  +(r==1L)
}]

Data[Month==1L, new:=0L]

Data[new==1L, distance := .SD[.SD, on=.(Code_ID_Buy, Month<Month), mult="last",
                              by=.EACHI, i.Month - x.Month]$V1]

Data[new==0L, distance := .SD[.SD, on=.(Code_ID_Buy, Month<Month), mult="last",
                              by=.EACHI, i.Month - x.Month]$V1]

Data$Distancewithlastr1 = c(NA,NA,NA,NA,NA,NA,NA,1,1,1,2,2,2,2,1,1,1,1,0,1,2)
Data$LastRelationshipseller = c("98C 99C","98C 99C","98C 99C","98C 99C","98C 99C","98C 99C","98C 99C","96V","96V","96V","96V","96V","96V","96V","94D","94D","94D","94D",NA,"25A","25A")
View(Data)

Here Distancelastr1 should take value "NA" if month = 1 and no "new = 1" has been recorded since month 1. If there is a new = 1 for a given buyer let us say for month x, then Distancewithlastr1 in month i calculates month i - month x. The column LastRelationshipseller is meant to find the last seller which entered in a relationship with this buyer prior to the month of line i.

Thank you in advance,

Hi @RifiFi, a couple of questions and requests: If you could you edit your code so that it's copied from a file rather than from the console, that would make it easier to copy and paste for folks who'd like to help, and in the fifth line you should change <- to =.

At this point, my main question is: Is your table meant to be in chronological order? I ask since there are multiple lines per month, so I wasn't sure how to determine the last seller.

Thanks, @RifiFi -- a small edit for you: Could you remove the triple backtick just before 'A reduced sample would look like this'?

I agree that the requirements seem a little on the hazy side, anyhow I found it interesting example to practice and see how dtplyr works as a data.table backend for the dplyr verbs that I'm most familiar and comfortable with.
I came up with this data.table code:

library(data.table)
library(dtplyr)
library(dplyr, warn.conflicts = FALSE)
set.seed(1)
Data <- data.frame(
  Month = c(1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6, 3, 4, 5),
  Amount = rnorm(21, mean = 100, sd = 20),
  Code_ID_Buy = c("100D", "100D", "100D", "100D", "100D", "100D", "100D", "100D", "100D", "100D", "100D", "100D", "100D", "100D", "100D", "100D", "100D", "100D", "102D", "102D", "102D"),
  Code_ID_Sell = c("98C", "99C", "98C", "99C", "98C", "99C", "96V", "98C", "99C", "96V", "98C", "99C", "96V", "94D", "98C", "99C", "96V", "94D", "25A", "25A", "25A")
)

Data$new <- 0

setDT(Data)[order(Month, Code_ID_Buy, Code_ID_Sell), new := {
  r <- rowid(Code_ID_Buy, Code_ID_Sell)
  +(r == 1L)
}]


Data[, `:=`(Month_at_new = case_when(
  new == 1 ~ Month,
  TRUE ~ -1
))][, `:=`(Month_at_new = max(Month_at_new)), keyby = .(
  Code_ID_Buy,
  Code_ID_Sell
)][, `:=`(dist_since_new = Month - Month_at_new),
  keyby = .(Code_ID_Buy, Code_ID_Sell)
][order(
  Month, Code_ID_Buy,
  Code_ID_Sell
)]

The last whacky part of which, I had gottten from peeking at the data.table syntax generated by the lazy_dt pipeline :

 my_df <- lazy_dt(Data) %>%
  mutate(Month_at_new = case_when(
    new == 1 ~ Month,
    TRUE ~ -1
  )) %>%
  group_by(Code_ID_Buy, Code_ID_Sell) %>%
  mutate(Month_at_new = max(Month_at_new)) %>%
  mutate(
    dist_since_new =
      Month - Month_at_new
  ) %>%
  ungroup() %>%
  arrange(Month, Code_ID_Buy, Code_ID_Sell)
1 Like

Thank you a lot for your answer and sorry for the delay of mine (I had some reorganization issues with the coronavirus). I actually cannot run the second part because Rstudio cannot find lazy dt.

On another forum, somebody came up with the following solution which seems to work :

library(data.table)
set.seed(1)
Data <- data.frame(
Month = c(1,1,2,2,3,3,3,4,4,4,5,5,5,5,6,6,6,6,3,4,5),
Amount = rnorm(21,mean=100,sd=20),
Code_ID_Buy = c("100D","100D","100D","100D","100D","100D","100D","100D","100D","100D","100D","100D","100D","100D","100D","100D","100D","100D","102D","102D","102D"),
Code_ID_Sell = c("98C","99C","98C","99C","98C","99C","96V","98C","99C","96V","98C","99C","96V","94D","98C","99C","96V","94D","25A","25A","25A")
)

Data$new<-0

setDT(Data)[order(Month, Code_ID_Buy, Code_ID_Sell), new := {
r <- rowid(Code_ID_Buy, Code_ID_Sell)
+(r==1L)
}]

Data[Month==1L, new:=0L]

Data[, dlr := if (k>0L) rleid(Month) - 1L, .(Code_ID_Buy, k=cumsum(new))][,
dlr := fifelse(new==1L, shift(dlr), dlr), Code_ID_Buy]

View(Data)

What is the difference with your solution ? (sorry I am trying to understand R better)

Thank you in advance,

this package is necessary for lazy_dt()

Yes I only had to restart Rstudio, sorry for that.

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

Hello @dromano thank you for your message, I changed the formatting right now.

Regarding the last seller, this is more like a control but please find the updated code above I think it is clearer. Basically, if Buyer A deals with Seller B for the first time, new = 1. So I ordered my table in terms of Buyers Sellers and Chronologically in order to be able to use my formula with .SD mentioned above to build the variable "new"

No worries -- it was mainly so it's easy to use in case other folks would like to help, too. I can't help now, but will post again once I can...

Done, @dromano sorry for that !

Hi @RifiFi,

After looking at your questions and code, I have a couple of suggestions: Since you're trying to treat each buyer in the same way, it may help to simplify the problem by thinking about the case of a single buyer, and then manually create a table that is exactly what you're looking for -- right now it's not clear to me how your own solution falls short of what you're hoping for, which means I'm not sure what that is, exactly.

Here's some code I think boils down the context, and allows you to create realistic scenarios to your taste, and then use to craft your ideal final table for the data:

library(tidyverse)

n.sellers <- 10
n.months <- 5
sellers <- paste0('s', 1:n.sellers)
months <- sample(n.months, 20, replace = TRUE) %>% sort()
sales <- sample(sellers, 20, replace = TRUE)

data <- 
  tibble(month = months, seller = sales)
data 
#> # A tibble: 20 x 2
#>    month seller
#>    <int> <chr> 
#>  1     1 s7    
#>  2     1 s1    
#>  3     1 s9    
#>  4     2 s10   
#>  5     2 s3    
#>  6     2 s7    
#>  7     2 s10   
#>  8     2 s5    
#>  9     3 s8    
#> 10     3 s5    
#> 11     3 s8    
#> 12     3 s7    
#> 13     3 s3    
#> 14     3 s9    
#> 15     4 s9    
#> 16     4 s3    
#> 17     4 s8    
#> 18     4 s6    
#> 19     5 s2    
#> 20     5 s6

Created on 2020-03-10 by the reprex package (v0.3.0)