I have a data set that is structured like so:
id begin end
1 2020-01-01 2020-01-02
1 2020-01-03 2020-01-04
2 2020-01-07 2020-01-08
2 2020-01-10 2020-01-12
3 2020-01-15 2020-01-19
3 2020-01-20 2020-01-25
4 2020-01-23 2020-01-24
4 2020-01-26 2020-01-27
Code to generate data:
library(dplyr)
id <- 1:4
begin <- as.Date(c('2020-01-01', '2020-01-07', '2020-01-15', '2020-01-23',
'2020-01-03', '2020-01-10', '2020-01-20', '2020-01-26'))
end <- as.Date(c('2020-01-02', '2020-01-08', '2020-01-19', '2020-01-24',
'2020-01-04', '2020-01-12', '2020-01-25', '2020-01-27'))
df <- data.frame(id=rep(id,2), begin, end)
df <- arrange(df, id)
I am trying to count events in the data that are defined as consecutive date sequences for unique IDs. The issue is that these sequences are spread across multiple rows.
For example, id #1 above had one countable event-the date sequence that spans 2020-01-01 to 2020-01-04.
ID #2 had two countable events, the date sequence from 2020-01-07 to 2020-01-08 and the second sequence from 2020-01-10 to 2020-01-12.
If there is a gap in the sequence of one or more calendar days, that is the end of one event and the beginning of another. E.g. hotel stays, if you check in on Friday and out on Sunday, then check-in again on Wednesday and out on the following Friday, those are two separate stays. If you'd remained in the hotel from Friday-Friday, that would have been a single stay.
I can't quite work out how to do this, given the structure of this data. I started off by doing this, which gives the earliest and latest dates for each unique ID # but does not take gaps into account, so doesn't really do what I need.
df %>% group_by(id) %>%
mutate(bgn=min(begin), end=max(end)) %>%
ungroup() %>%
distinct(id, bgn, end)