Dear community,
I am quite new with R and I am stuck with the following issue (which I hope some of you may help me with). I have a huge database (which I cannot disclose) but is it is structured as follows:
This is a database looking at relationships between buyers and sellers, allowing us to observe contracts between several buyers and sellers. What I wish to do is a) find everytime a relationship between a Buyer and a new seller is established (what the dummy "new" does) and b)for all month of all relationships, calculate the distance in months between the month of the line and the last new relationship started.
I have 25 million observations and 3 variables of of interest:
Code_ID_Buy Code_ID_Sell Month
The New variable states that there was no transaction before between the buyer and the seller, this is the first occurrence in line i so the variable new takes the value 1
The Distance : the distance in months between the two last occurences of new = 1 provided Code_ID_Buy is the same for the two rows. But this is not exactly what I want (see column Distancelastr1 where I would like to calculate the distance for each line bewteen the relationship and the last relationship entered by the buyer, so basically if new=1, this works, but not if new=0...)
A reduced sample would look like this:
library(data.table)
set.seed(1)
Data <- data.frame(
Month = c(1,1,2,2,3,3,3,4,4,4,5,5,5,5,6,6,6,6,3,4,5),
Amount = rnorm(21,mean=100,sd=20),
Code_ID_Buy = c("100D","100D","100D","100D","100D","100D","100D","100D","100D","100D","100D","100D","100D","100D","100D","100D","100D","100D","102D","102D","102D"),
Code_ID_Sell = c("98C","99C","98C","99C","98C","99C","96V","98C","99C","96V","98C","99C","96V","94D","98C","99C","96V","94D","25A","25A","25A")
)
Data$new<-0
setDT(Data)[order(Month, Code_ID_Buy, Code_ID_Sell), new := {
r <- rowid(Code_ID_Buy, Code_ID_Sell)
+(r==1L)
}]
Data[Month==1L, new:=0L]
Data[new==1L, distance := .SD[.SD, on=.(Code_ID_Buy, Month<Month), mult="last",
by=.EACHI, i.Month - x.Month]$V1]
Data[new==0L, distance := .SD[.SD, on=.(Code_ID_Buy, Month<Month), mult="last",
by=.EACHI, i.Month - x.Month]$V1]
Data$Distancewithlastr1 = c(NA,NA,NA,NA,NA,NA,NA,1,1,1,2,2,2,2,1,1,1,1,0,1,2)
Data$LastRelationshipseller = c("98C 99C","98C 99C","98C 99C","98C 99C","98C 99C","98C 99C","98C 99C","96V","96V","96V","96V","96V","96V","96V","94D","94D","94D","94D",NA,"25A","25A")
View(Data)
Here Distancelastr1 should take value "NA" if month = 1 and no "new = 1" has been recorded since month 1. If there is a new = 1 for a given buyer let us say for month x, then Distancewithlastr1 in month i calculates month i - month x. The column LastRelationshipseller is meant to find the last seller which entered in a relationship with this buyer prior to the month of line i.
Thank you in advance,