Duplicates within time frames

Hi,
I have a challenge which I cannot deal with.
I have a df with car registrations (RegNo) and date they visited service (LastWortshopDate).
I know how to flag registrations if they include visiting a dealer within 12 months. I simply merge RegNo with a visit Month and then check repeated visits. If at least two visits are within 12 month time frame, the RegNo is flagged as Retained.Reg

data.source <- data.frame(
  stringsAsFactors = FALSE,
             RegNo = c("GGG","GGG","AAA","AAA",
                       "AAA","AAA","BBB","CCC","DDD","DDD","EEE","EEE",
                       "EEE","EEE","EEE","EEE","FFF","FFF","HHH","HHH","III",
                       "III","III","III","III","III","III","III","III"),
  LastWorkshopDate = c("2018-07-04","2019-06-19",
                       "2020-08-03","2021-07-14","2022-04-13","2022-06-21",
                       "2018-01-12","2021-02-16","2021-05-09","2022-09-06",
                       "2018-06-13","2019-07-26","2020-07-06","2021-08-05",
                       "2022-07-09","2022-08-15","2018-01-09","2019-05-26",
                       "2018-01-09","2019-03-26","2018-06-18","2018-10-22",
                       "2019-04-01","2019-06-21","2019-10-21","2021-06-17",
                       "2022-01-12","2022-06-17","2022-10-21"),
              Year = c(2018,2019,2020,2021,2022,
                       2022,2018,2021,2021,2022,2018,2019,2020,2021,2022,
                       2022,2018,2019,2018,2019,2018,2018,2019,2019,
                       2019,2021,2022,2022,2022),
             Month = c(7,6,8,7,4,6,1,2,5,9,6,
                       7,7,8,7,8,1,5,1,3,6,10,4,6,10,6,1,6,10)
)

data.source

library(dplyr)
result <- data.source %>% 
  mutate(Month_reg = paste0(Month, RegNo)) %>% 
  group_by(RegNo) %>% 
  arrange((Month_reg))%>% 
  mutate(Number.of.Month_reg = row_number()) %>% 
  ungroup() %>% 
  add_count(Month_reg, name = "Same.Month.Visits") %>%
  mutate(Same.Month.Visits = Same.Month.Visits) %>% mutate(
    Retained.Reg = case_when(
      Same.Month.Visits >1 ~ 1)) %>%
  group_by(RegNo) %>% 
  mutate(Retained.Reg = as.numeric(any(as.logical(Retained.Reg)))) %>% 
  arrange(RegNo, LastWorkshopDate)

result

Now my task is to flag reg visiting not at the same month but within 9-15 months (so AAA, GGG and HHH meet this criteria) and I have no idea if that is possible in R.
It does not matter if there are multiple visits a year or not. If there are at least two visits within 9-12 months a flag should be added.
My brain is boiling and I cannot find a solution. Maybe you can?

Here is a method for calculating the time between each visit and the previous visit. You can then filter for whatever date difference you want.
LastWorkshopDate

data.source <- data.frame(
  stringsAsFactors = FALSE,
  RegNo = c("GGG","GGG","AAA","AAA",
            "AAA","AAA","BBB","CCC","DDD","DDD","EEE","EEE",
            "EEE","EEE","EEE","EEE","FFF","FFF","HHH","HHH","III",
            "III","III","III","III","III","III","III","III"),
  LastWorkshopDate = c("2018-07-04","2019-06-19",
                       "2020-08-03","2021-07-14","2022-04-13","2022-06-21",
                       "2018-01-12","2021-02-16","2021-05-09","2022-09-06",
                       "2018-06-13","2019-07-26","2020-07-06","2021-08-05",
                       "2022-07-09","2022-08-15","2018-01-09","2019-05-26",
                       "2018-01-09","2019-03-26","2018-06-18","2018-10-22",
                       "2019-04-01","2019-06-21","2019-10-21","2021-06-17",
                       "2022-01-12","2022-06-17","2022-10-21"),
  Year = c(2018,2019,2020,2021,2022,
           2022,2018,2021,2021,2022,2018,2019,2020,2021,2022,
           2022,2018,2019,2018,2019,2018,2018,2019,2019,
           2019,2021,2022,2022,2022),
  Month = c(7,6,8,7,4,6,1,2,5,9,6,
            7,7,8,7,8,1,5,1,3,6,10,4,6,10,6,1,6,10))
library(dplyr)  
Visit9_15_Mon <- data.source |> 
  mutate(LastWorkshopDate = as.Date(LastWorkshopDate)) |> 
  arrange(RegNo, LastWorkshopDate) |> 
  group_by(RegNo) |> 
  mutate(LagDate = lag(LastWorkshopDate),
         DateDiff = LastWorkshopDate - LagDate) |> 
  filter(DateDiff < 450, DateDiff > 270)
Visit9_15_Mon
#> # A tibble: 8 × 6
#> # Groups:   RegNo [4]
#>   RegNo LastWorkshopDate  Year Month LagDate    DateDiff
#>   <chr> <date>           <dbl> <dbl> <date>     <drtn>  
#> 1 AAA   2021-07-14        2021     7 2020-08-03 345 days
#> 2 AAA   2022-04-13        2022     4 2021-07-14 273 days
#> 3 EEE   2019-07-26        2019     7 2018-06-13 408 days
#> 4 EEE   2020-07-06        2020     7 2019-07-26 346 days
#> 5 EEE   2021-08-05        2021     8 2020-07-06 395 days
#> 6 EEE   2022-07-09        2022     7 2021-08-05 338 days
#> 7 GGG   2019-06-19        2019     6 2018-07-04 350 days
#> 8 HHH   2019-03-26        2019     3 2018-01-09 441 days

Created on 2023-03-06 with reprex v2.0.2

Thank you but this is a challenge. I've added one RegNo with multiple visits within a few years which ruins your idea.

data.source <- data.frame(
  stringsAsFactors = FALSE,
  RegNo = c("GGG","GGG","AAA","AAA",
            "AAA","AAA","BBB","CCC","DDD","DDD","EEE","EEE",
            "EEE","EEE","EEE","EEE","FFF","FFF","HHH","HHH","III",
            "III","III","III","III","III","III","III","III",
            "AJ","AJ","AJ","AJ","AJ","AJ","AJ","AJ","AJ"),
  LastWorkshopDate = c("2018-07-04","2019-06-19",
                       "2020-08-03","2021-07-14","2022-04-13","2022-06-21",
                       "2018-01-12","2021-02-16","2021-05-09","2022-09-06",
                       "2018-06-13","2019-07-26","2020-07-06","2021-08-05",
                       "2022-07-09","2022-08-15","2018-01-09","2019-05-26",
                       "2018-01-09","2019-03-26","2018-06-18","2018-10-22",
                       "2019-04-01","2019-06-21","2019-10-21","2021-06-17",
                       "2022-01-12","2022-06-17","2022-10-21","2018-06-18",
                       "2018-10-22","2019-04-01","2019-06-21","2019-10-21",
                       "2021-06-17","2022-01-12","2022-06-17","2022-10-21"),
  Year = c(2018,2019,2020,2021,2022,
           2022,2018,2021,2021,2022,2018,2019,2020,2021,2022,
           2022,2018,2019,2018,2019,2018,2018,2019,2019,
           2019,2021,2022,2022,2022,2018,2018,2019,2019,2019,
           2021,2022,2022,2022),
  Month = c(7,6,8,7,4,6,1,2,5,9,6,
            7,7,8,7,8,1,5,1,3,6,10,4,6,10,6,1,6,10,
            6,10,4,6,10,6,1,6,10)
)
data.source

If you run this code:

library(dplyr)  
Visit9_15_Mon <- data.source |> 
  mutate(LastWorkshopDate = as.Date(LastWorkshopDate)) |> 
  arrange(RegNo, LastWorkshopDate) |> 
  group_by(RegNo) |> 
  mutate(LagDate = lag(LastWorkshopDate),
         DateDiff = LastWorkshopDate - LagDate)  %>% mutate(
           Retained.Reg = case_when(
             DateDiff < 450 & DateDiff > 270 ~ 1
           ))


Visit9_15_Mon

you will see that AJ contains visits within the date brackets but some visits in between ruin the idea. Even if you take into account just two dates:
18/06/2018
21/06/2019
you can see this dealer should be flagged as Retained.Reg

How can I fix it?

Maybe we could check if there are two dates within +/-3 months withing a range of all dates for each RegNo?

library(tidyverse)
library(lubridate)


data.source <- data.frame(
  stringsAsFactors = FALSE,
  RegNo = c(
    "GGG", "GGG", "AAA", "AAA",
    "AAA", "AAA", "BBB", "CCC", "DDD", "DDD", "EEE", "EEE",
    "EEE", "EEE", "EEE", "EEE", "FFF", "FFF", "HHH", "HHH", "III",
    "III", "III", "III", "III", "III", "III", "III", "III",
    "AJ", "AJ", "AJ", "AJ", "AJ", "AJ", "AJ", "AJ", "AJ"
  ),
  LastWorkshopDate = c(
    "2018-07-04", "2019-06-19",
    "2020-08-03", "2021-07-14", "2022-04-13", "2022-06-21",
    "2018-01-12", "2021-02-16", "2021-05-09", "2022-09-06",
    "2018-06-13", "2019-07-26", "2020-07-06", "2021-08-05",
    "2022-07-09", "2022-08-15", "2018-01-09", "2019-05-26",
    "2018-01-09", "2019-03-26", "2018-06-18", "2018-10-22",
    "2019-04-01", "2019-06-21", "2019-10-21", "2021-06-17",
    "2022-01-12", "2022-06-17", "2022-10-21", "2018-06-18",
    "2018-10-22", "2019-04-01", "2019-06-21", "2019-10-21",
    "2021-06-17", "2022-01-12", "2022-06-17", "2022-10-21"
  ),
  Year = c(
    2018, 2019, 2020, 2021, 2022,
    2022, 2018, 2021, 2021, 2022, 2018, 2019, 2020, 2021, 2022,
    2022, 2018, 2019, 2018, 2019, 2018, 2018, 2019, 2019,
    2019, 2021, 2022, 2022, 2022, 2018, 2018, 2019, 2019, 2019,
    2021, 2022, 2022, 2022
  ),
  Month = c(
    7, 6, 8, 7, 4, 6, 1, 2, 5, 9, 6,
    7, 7, 8, 7, 8, 1, 5, 1, 3, 6, 10, 4, 6, 10, 6, 1, 6, 10,
    6, 10, 4, 6, 10, 6, 1, 6, 10
  )
)

get_close_date <- function(dates,window_size){
  map(dates,\(d){
    otherdates <- as.Date(setdiff(dates,d),origin="1970-01-01")
    matchers <- otherdates[between(x=otherdates,
                                   left = d-window_size,
                                   right=d+window_size)]
    list(benchmark_date = d,
         matching_dates = list(matchers))
  })
}

(summary_data <- group_by(
  data.source,
  RegNo) |> reframe(get_dates = get_close_date(as.Date(LastWorkshopDate ),
                                          window_size = 90))|> 
    unnest_wider(col="get_dates") |> 
    unnest_longer(col="matching_dates"))

# A tibble: 38 × 3
   RegNo benchmark_date matching_dates
   <chr> <date>         <date>        
 1 AAA   2020-08-03     NA            
 2 AAA   2021-07-14     NA            
 3 AAA   2022-04-13     2022-06-21    
 4 AAA   2022-06-21     2022-04-13    
 5 AJ    2018-06-18     NA            
 6 AJ    2018-10-22     NA            
 7 AJ    2019-04-01     2019-06-21    
 8 AJ    2019-06-21     2019-04-01    
 9 AJ    2019-10-21     NA            
10 AJ    2021-06-17     NA            
11 AJ    2022-01-12     NA            
12 AJ    2022-06-17     NA            
13 AJ    2022-10-21     NA            
14 BBB   2018-01-12     NA            
15 CCC   2021-02-16     NA            
16 DDD   2021-05-09     NA            
17 DDD   2022-09-06     NA            
18 EEE   2018-06-13     NA            
19 EEE   2019-07-26     NA            
20 EEE   2020-07-06     NA            
21 EEE   2021-08-05     NA            
22 EEE   2022-07-09     2022-08-15    
23 EEE   2022-08-15     2022-07-09    
24 FFF   2018-01-09     NA            
25 FFF   2019-05-26     NA            
26 GGG   2018-07-04     NA            
27 GGG   2019-06-19     NA            
28 HHH   2018-01-09     NA            
29 HHH   2019-03-26     NA            
30 III   2018-06-18     NA            
31 III   2018-10-22     NA            
32 III   2019-04-01     2019-06-21    
33 III   2019-06-21     2019-04-01    
34 III   2019-10-21     NA            
35 III   2021-06-17     NA            
36 III   2022-01-12     NA            
37 III   2022-06-17     NA            
38 III   2022-10-21     NA

thank you, I think, we are getting there. I think I need to:

  1. use this code to solve the first part of the task
library(dplyr)  
Visit9_15_Mon <- data.source |> 
    mutate(LastWorkshopDate = as.Date(LastWorkshopDate)) |> 
  arrange(RegNo, LastWorkshopDate) |> 
  group_by(RegNo) |> 
  mutate(LagDate = lag(LastWorkshopDate),
         DateDiff = LastWorkshopDate - LagDate)  %>% mutate(
           Retained.Reg = case_when(
             DateDiff < 450 & DateDiff > 270 ~ 1
           )) %>%
  group_by(RegNo) %>% 
  mutate(Retained.Reg = as.numeric(any(as.logical(Retained.Reg))))
         
Visit9_15_Mon
  1. change window size to window_size < 450 & window_size > 270 to find dates within 9-15 months for each RegNo and add matching_dates to the Visit9_15_Mon - I cannot do that :frowning:
  2. run below code if Retained.Reg is NA (so RegNo not tesolved in the previous stage), not sure which of two dates is earlier or later so negative DateDiff should be converted to positive I think
mutate(DateDiff = matching_dates - LastWorkshopDate)  %>% mutate(
           Retained.Reg = case_when(
             DateDiff < 450 & DateDiff > 270 ~ 1
           )) %>%
  group_by(RegNo) %>% 
  mutate(Retained.Reg = as.numeric(any(as.logical(Retained.Reg))))

as a result AJ and III should have the Retained.Reg=1 (I hope)

If the above is not working (I am not sure about window_size < 450 & window_size > 270 part), I have another idea.

I can see that problematic RegNo (AJ and III) have odd dates and even dates matching original LagDate and DateDiff idea. Is it possible to try:

3a. run below code

library(dplyr)  
Visit9_15_Mon <- Visit9_15_Mon |> 
group_by(RegNo) %>% 
  arrange((LastWorkshopDate))%>% 
  mutate(VisitNumber = row_number()) %>% 
  ungroup() %>% 
  arrange((RegNo))

3b. and then if Retained.Reg is NA (so RegNo not tesolved in the previous stage) run modified LagDate and DateDiff idea for odd VisitNumber (1,3,5,7,9 etc.) and even VisitNumber (2,4,6,8, etc) .

What do you think R masters?

I think, I solved that!


library(dplyr)  
Visit9_15_Mon <- data.source |> 
  group_by(RegNo) %>% 
  arrange((LastWorkshopDate))%>% 
  mutate(VisitNumber = row_number()) %>% 
  ungroup() %>% 
  arrange((RegNo))|> 
  add_count(RegNo, name = "reg.visits") %>% 
  mutate(reg.visits = reg.visits) %>%
  mutate(LastWorkshopDate = as.Date(LastWorkshopDate)) |> 
  arrange(RegNo, LastWorkshopDate) |> 
  group_by(RegNo) |> 
  mutate(LagDate = lag(LastWorkshopDate),
         DateDiff = LastWorkshopDate - LagDate)  %>% mutate(
           Retained.Reg = case_when(
             DateDiff < 450 & DateDiff > 270 ~ 1
           )) %>%
  group_by(RegNo) %>% 
  mutate(Retained.Reg = as.numeric(any(as.logical(Retained.Reg))))
         
Visit9_15_Mon


library(dplyr)  
Visit9_15_Mon_done <- Visit9_15_Mon |> 
  filter(Retained.Reg==1) 
Visit9_15_Mon_done


library(dplyr)  
Visit9_15_Mon_odd <- Visit9_15_Mon |> 
  filter(is.na(Retained.Reg) & (VisitNumber==1 |VisitNumber==3 |VisitNumber==5 |VisitNumber==7 |VisitNumber==9 |VisitNumber==11 |VisitNumber==13 |VisitNumber==15)) |> 
  arrange(RegNo, LastWorkshopDate) |> 
  group_by(RegNo) |> 
  mutate(LagDate = lag(LastWorkshopDate),
         DateDiff = LastWorkshopDate - LagDate)  %>% mutate(
           Retained.Reg = case_when(
             DateDiff < 450 & DateDiff > 270 ~ 1
           )) %>%
  group_by(RegNo) %>% 
  mutate(Retained.Reg = as.numeric(any(as.logical(Retained.Reg))))

Visit9_15_Mon_odd

library(dplyr)  
Visit9_15_Mon_even <- Visit9_15_Mon |> 
  filter(is.na(Retained.Reg) & (VisitNumber==2 |VisitNumber==4 |VisitNumber==6 |VisitNumber==8 |VisitNumber==10 |VisitNumber==12 |VisitNumber==14 |VisitNumber==16)) |> 
  arrange(RegNo, LastWorkshopDate) |> 
  group_by(RegNo) |> 
  mutate(LagDate = lag(LastWorkshopDate),
         DateDiff = LastWorkshopDate - LagDate)  %>% mutate(
           Retained.Reg = case_when(
             DateDiff < 450 & DateDiff > 270 ~ 1
           )) %>%
  group_by(RegNo) %>% 
  mutate(Retained.Reg = as.numeric(any(as.logical(Retained.Reg))))

Visit9_15_Mon_even

Visit9_15_Mon <- rbind(Visit9_15_Mon_done, Visit9_15_Mon_odd, Visit9_15_Mon_even)

Visit9_15_Mon

This topic was automatically closed 42 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.