How to filter repeated date by id if it is less than a month

Hi R Studio community, I have below dataset where date is repeated by identifiers. I would like to fiter observations if the interval between two dates is less than a month.

# I would like to fiter obs if date is repeated less than a month by a identifier
    data <- data.frame(
        stringsAsFactors = FALSE,
        identifer=c("a","a","a","b","b","b","c","c", "d", "d", "d", "d"),
        id = c(1L, 2L, 3L,1L, 2L, 3L,1L, 2L,1L, 2L, 3L,4L ),
        date = c("2020-01-25","2020-02-20","2020-03-25", "2021-02-10","2021-03-09","2021-03-11", "2021-03-15","2021-04-16","2021-03-17","2021-04-17","2021-04-30","2021-05-18")
      
    )

In this code, I keep rows if they are within 30 days of the previous day with that identifer value. I am not sure if that is what you want.

DF <- data.frame(
  stringsAsFactors = FALSE,
  identifer=c("a","a","a","b","b","b","c","c", "d", "d", "d", "d"),
  id = c(1L, 2L, 3L,1L, 2L, 3L,1L, 2L,1L, 2L, 3L,4L ),
  date = c("2020-01-25","2020-02-20","2020-03-25", "2021-02-10","2021-03-09","2021-03-11", "2021-03-15","2021-04-16","2021-03-17","2021-04-17","2021-04-30","2021-05-18")
  
)
library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#> 
#>     date, intersect, setdiff, union
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
DF2 <- DF |> group_by(identifer) |> 
  mutate(date = ymd(date),
         LagDate = lag(date),
         DayDiff = as.numeric(date - LagDate))
DF2
#> # A tibble: 12 x 5
#> # Groups:   identifer [4]
#>    identifer    id date       LagDate    DayDiff
#>    <chr>     <int> <date>     <date>       <dbl>
#>  1 a             1 2020-01-25 NA              NA
#>  2 a             2 2020-02-20 2020-01-25      26
#>  3 a             3 2020-03-25 2020-02-20      34
#>  4 b             1 2021-02-10 NA              NA
#>  5 b             2 2021-03-09 2021-02-10      27
#>  6 b             3 2021-03-11 2021-03-09       2
#>  7 c             1 2021-03-15 NA              NA
#>  8 c             2 2021-04-16 2021-03-15      32
#>  9 d             1 2021-03-17 NA              NA
#> 10 d             2 2021-04-17 2021-03-17      31
#> 11 d             3 2021-04-30 2021-04-17      13
#> 12 d             4 2021-05-18 2021-04-30      18
DF2 <- DF2 |> filter(is.na(DayDiff) | DayDiff < 30)
DF2
#> # A tibble: 9 x 5
#> # Groups:   identifer [4]
#>   identifer    id date       LagDate    DayDiff
#>   <chr>     <int> <date>     <date>       <dbl>
#> 1 a             1 2020-01-25 NA              NA
#> 2 a             2 2020-02-20 2020-01-25      26
#> 3 b             1 2021-02-10 NA              NA
#> 4 b             2 2021-03-09 2021-02-10      27
#> 5 b             3 2021-03-11 2021-03-09       2
#> 6 c             1 2021-03-15 NA              NA
#> 7 d             1 2021-03-17 NA              NA
#> 8 d             3 2021-04-30 2021-04-17      13
#> 9 d             4 2021-05-18 2021-04-30      18

Created on 2022-03-28 by the reprex package (v2.0.1)

1 Like

I'm assuming you want to test the date in a given row against the date immediately prior; If I'm wrong in that assumption, the rest of my reply is not relevant.

First you need to put all of your dates in order. Then you need to somehow determine what interval is acceptable - I used lubridate::%m-% to subtract a month from a date, but that may not fit your needs (e.g. if you define a month as 30 days, etc.). So essentially, you calculate the interval, then use dplyr::lag to test the interval you just calculated against the month in the prior row.

Hope this helps.

library(dplyr)
library(lubridate)
# Your code ##################
# I would like to fiter obs if date is repeated less than a month by a identifier
data <- data.frame(
    stringsAsFactors = FALSE,
    identifer=c("a","a","a","b","b","b","c","c", "d", "d", "d", "d"),
    id = c(1L, 2L, 3L,1L, 2L, 3L,1L, 2L,1L, 2L, 3L,4L ),
    date = c("2020-01-25","2020-02-20","2020-03-25", "2021-02-10","2021-03-09","2021-03-11", "2021-03-15","2021-04-16","2021-03-17","2021-04-17","2021-04-30","2021-05-18")
)

# 
data_intervals <- data %>% 
    arrange(date) %>% 
    mutate(
        date = as.Date(date),
        # Test whether a date minus a month is greater than or equal to the prior date
        interval_less_than_one_month = date %m-% months(1) >= lag(date)
    ) %>% 
    filter(
        interval_less_than_one_month
    )
data_intervals
#>   identifer id       date interval_less_than_one_month
#> 1         a  3 2020-03-25                         TRUE
#> 2         b  1 2021-02-10                         TRUE

Created on 2022-03-28 by the reprex package (v1.0.0)

1 Like

Thank you FJCC and dvetsch 75 for immediate response and your help. Those tips helped.

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.