Gap between dates for duplicated records

Hi, I have this simple df:

data.source <- 
  data.frame(stringsAsFactors=FALSE,
             VIN = c("aaa", "aaa", "bbb", "bbb", "ccc", "ccc", "ccc", "ccc",
                              "ddd", "eee", "fff", "fff", "ggg", "ggg",
                              "ggg", "ggg", "hhh", "hhh", "hhh", "hhh", "hhh",
                              "iii", "jjj"),
             EventDate = c("2019-09-05 16:41:28", "2019-12-10 10:40:44",
                           "2019-11-07 17:31:45",
                           "2019-12-10 10:40:13", "2019-08-05 16:41:28",
                           "2019-12-10 10:40:44", "2019-11-07 17:31:45",
                           "2019-12-10 10:40:13", "2019-11-07 17:31:45",
                           "2019-12-10 10:40:13", "2019-11-07 17:31:45",
                           "2019-12-11 16:01:33", "2019-10-11 16:01:32",
                           "2020-01-11 16:01:33", "2020-01-07 12:36:08",
                           "2020-01-01 12:36:08", "2020-01-07 12:36:08",
                           "2019-12-13 10:17:29", "2019-12-13 10:17:29",
                           "2019-12-13 10:17:29", "2019-11-28 17:37:13",
                           "2019-11-28 17:37:13", "2019-11-28 17:37:13")
  )

data.source

How can I create a new column showing a gap to previous date for duplicated VINs? Can I also create a variable with the previous date?

dplyr has lead() and lag() that work for groups (i.e. VIN)
to get the previous value lag(EventDate,n=1)

Thank you. I have this solution:

library(tidyverse)
library(lubridate)

# Step 1 = duplicates

step1 <- data.source %>% 
  group_by(VIN) %>% 
  add_count(VIN) %>%
  ungroup() 
step1 <- step1 %>% 
  rename(Nr.of.Dups = n)

step1

# Step 2 = selecting duplicated VINs only

step2 <- subset(step1, Nr.of.Dups>1)

step2


# Step 3 = finding gaps between duplicates

step3 <- step2 %>%
  mutate(EventDate = ymd_hms(EventDate)) %>%
  group_by(VIN) %>%
  summarise(Gap = diff(x = range(EventDate)), Dups=mean(Nr.of.Dups)) %>%
  mutate(Flag = Gap < duration(num = 90,
                               units = "days"))

but it shows ranges instead of gaps between each duplicated VIN.

Also, my solution requires a few stages and merging step3 back with the data.source.
What I really need is EventDate gaps between the same VINs ordered by EventDate

Hurray, I think I've got that!

library(tidyverse)
library(lubridate)

result <- data.source %>%
  group_by(VIN) %>%
  mutate(PreviousDate = lag(EventDate, n=1, order_by=EventDate))

result

Thank You :slight_smile:

1 Like

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.