Counting Events Based on Gaps in Date Sequences

liberry · October 16, 2020, 6:11pm

I have a data set that is structured like so:

 id      begin        end
  1 2020-01-01 2020-01-02
  1 2020-01-03 2020-01-04
  2 2020-01-07 2020-01-08
  2 2020-01-10 2020-01-12
  3 2020-01-15 2020-01-19
  3 2020-01-20 2020-01-25
  4 2020-01-23 2020-01-24
  4 2020-01-26 2020-01-27

Code to generate data:

library(dplyr)

id <- 1:4

begin <- as.Date(c('2020-01-01', '2020-01-07', '2020-01-15', '2020-01-23', 
                   '2020-01-03', '2020-01-10', '2020-01-20', '2020-01-26'))

end  <- as.Date(c('2020-01-02', '2020-01-08', '2020-01-19', '2020-01-24',
                  '2020-01-04', '2020-01-12', '2020-01-25', '2020-01-27'))

df <- data.frame(id=rep(id,2), begin, end)

df <- arrange(df, id)

I am trying to count events in the data that are defined as consecutive date sequences for unique IDs. The issue is that these sequences are spread across multiple rows.

For example, id #1 above had one countable event-the date sequence that spans 2020-01-01 to 2020-01-04.

ID #2 had two countable events, the date sequence from 2020-01-07 to 2020-01-08 and the second sequence from 2020-01-10 to 2020-01-12.

If there is a gap in the sequence of one or more calendar days, that is the end of one event and the beginning of another. E.g. hotel stays, if you check in on Friday and out on Sunday, then check-in again on Wednesday and out on the following Friday, those are two separate stays. If you'd remained in the hotel from Friday-Friday, that would have been a single stay.

I can't quite work out how to do this, given the structure of this data. I started off by doing this, which gives the earliest and latest dates for each unique ID # but does not take gaps into account, so doesn't really do what I need.

df %>% group_by(id) %>%
       mutate(bgn=min(begin), end=max(end)) %>%
       ungroup() %>%
       distinct(id, bgn, end)

jmcvw · October 16, 2020, 8:34pm

What is your goal here? Is it only to count the number of events, or to compute the length of events, or find the longest event, or ...?

jmcvw · October 16, 2020, 9:16pm

Edit 2

So, my first attempt I misunderstood the question and provided a clumsy solution. This attempt still feels a bit off, but it does give a count of continuous events for each id.
Maybe you can adapt this if it doesn't do what you need

df %>% 
  group_by(id) %>% 
  mutate(event = sum(begin - lag(end) > 1, na.rm = T)) %>%
  summarize(n_events = max(1, sum(event)))

#  id n_event
#  <int>   <dbl>
#      1           1
#      2           2
#      3           1
#      4           2

liberry · October 16, 2020, 9:44pm

Intent is just to count the number of events.

liberry · October 16, 2020, 9:59pm

Thanks, that works, now I just need to think through why it works.

jmcvw · October 17, 2020, 1:23pm

Good, glad it worked out for you.

I'll give a quick breakdown of the code to clarify.

begin - lag(end) > 1
- subtract the previous end date from the current beginning date, and check if it is more than 1 day.
sum(begin - lag(end) > 1, na.rm = T)
- The first instance in each group has no previous end date, and so will have an NA value, so I add the TRUE/FALSE's together while removing the NAs.
- If there are no breaks greater than 1 day, meaning there is only 1 event, a 0 will be recorded. This is important in the next step
max(1, sum(event))
- The values from the event column created in the previous step are then summarized by suming the number of events.
- If there is more than 1 event, all is fine, the sum will be correct.
- However, because a single continuous event is recorded as 0, we return the either the correct count, or else 1, whatever is the max.
- An alternative way to achieve the same would be to use an extra mutate statement containing an ifelse.

liberry · October 22, 2020, 3:00pm

Thanks, great explanation. I come from a professional background that tends to encourage thinking of leads and lags in terms of n time periods and always forget that you can use those in R to lag positionally(i.e. based on vector indices). And I didn't realize that you could supply a second argument to max! Definitely possible to handle using ifelse or case_when as you note, but neat trick that will come in handy nonetheless.

system · October 29, 2020, 3:00pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.