Duplicates if gap between dates higher than x

Hi R Masters,
I have this task I cannot finish myself.
I have a small dummy df with some duplicated emails. Now, I would like to calculate number of days between the first and last response and flag records with more than 90 days. This it what I've done so far:

data.source <- 
  data.frame(stringsAsFactors=FALSE,
             EmailAddress = c("aaa", "aaa", "bbb", "bbb", "ccc", "ccc", "ccc", "ccc",
                              "ddd", "eee", "fff"),
             CreatedOn = c("2019-09-05 16:41:28", "2019-12-10 10:40:44",
                           "2019-11-07 17:31:45", "2019-12-10 10:40:13", "2019-08-05 16:41:28",
                           "2019-12-10 10:40:44", "2019-11-07 17:31:45", "2019-12-10 10:40:13",
                           "2019-11-07 17:31:45", "2019-12-10 10:40:13", "2019-11-07 17:31:45")
  )

data.source

library(tidyverse)
library(lubridate)

test <- data.source %>% 
  group_by(EmailAddress) %>% 
  arrange(CreatedOn)%>% 
  mutate(Sequence = row_number()) %>% 
  ungroup()

test

Now I think I should calculate days for records duplicates and for multiple duplicates, this gap should be between the first and the last response.
As a result I should get:
95.7 days for "aaa", 32.7 days for "bbb" and 126.7 day for "ccc".
"aaa" and "ccc" should get a flag they were duplicated beyond 90 days period.

Is this tack easy?

I think you want something like this. I ignored last 3 rows because I didn't understand what is the expected output for only one observation per email address.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following object is masked from 'package:base':
#> 
#>     date

data.source <- data.frame(stringsAsFactors=FALSE,
                          EmailAddress = c("aaa", "aaa", "bbb", "bbb", "ccc", "ccc", "ccc", "ccc"),
                          CreatedOn = c("2019-09-05 16:41:28", "2019-12-10 10:40:44", "2019-11-07 17:31:45", "2019-12-10 10:40:13", "2019-08-05 16:41:28", "2019-12-10 10:40:44", "2019-11-07 17:31:45", "2019-12-10 10:40:13"))

data.source %>%
    mutate(CreatedOn = ymd_hms(CreatedOn)) %>%
    group_by(EmailAddress) %>%
    summarise(Gap = diff(x = range(CreatedOn))) %>%
    mutate(Flag = Gap > duration(num = 90,
                                 units = "days"))
#> # A tibble: 3 x 3
#>   EmailAddress Gap            Flag 
#>   <chr>        <drtn>         <lgl>
#> 1 aaa           95.74949 days TRUE 
#> 2 bbb           32.71421 days FALSE
#> 3 ccc          126.74949 days TRUE

Created on 2020-02-06 by the reprex package (v0.3.0)

3 Likes

Wow! This is exactly that I need! Thank you :smiley:

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.