Counting overlapping records per day using dplyr?

I have a table (~1M rows) of memberships, where each line reflects one membership cycle, with a start date and an end date, generally 12-13 months apart. I'd like to determine how many memberships and how many distinct members are active on a given day, but I don't know how to approach the calculation.


library(tidyverse); library(lubridate)
#> -- Attaching packages -------------------------------------------------------------------------------------------------------------- tidyverse 1.2.1 --
#> v ggplot2 2.2.1     v purrr   0.2.4
#> v tibble  1.3.4     v dplyr   0.7.4
#> v tidyr   0.7.2     v stringr 1.2.0
#> v readr   1.1.1     v forcats 0.2.0
#> -- Conflicts ----------------------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
#> x dplyr::filter() masks stats::filter()
#> x dplyr::lag()    masks stats::lag()
#> 
#> Attaching package: 'lubridate'
#> The following object is masked from 'package:base':
#> 
#>     date

memberships <- tibble(
  memberID     = c("A", "A", "A", "B"),
  membershipID = 1:4 %>% as.factor,
  start_date   = ymd(c(20100101, 20101220, 20120415, 20110605)),
  end_date     = ymd(c(20101231, 20111231, 20130430, 20120531)),
  mo_dur       = interval(start_date, end_date) %>% 
    as.duration() / dyears() * 12)
  
ggplot(memberships) + 
  geom_segment(aes(x = start_date, xend = end_date,
                   y = membershipID, yend = membershipID)) +
  geom_text(vjust = -0.5, hjust=0, size = 3,
            aes(x = start_date, y = membershipID, 
                label = paste(round(mo_dur, 2), "months")))

# Desired output
# Date        active_num   active_distinct_members
# 2010-01-01  1            1
# ...
# 2010-12-20  2            1
# ...
# 2011-04-15  2            2

I think this gets you the information you want.

Note that one of the issues here is that at its core each row in memberships has to be compared to every other row to determine if there is an overlap. That can result in a lot of operations.

This solution takes a shortcut and just checks each row against the ones that follow it, but is it is all the information that I think you need. It also assumes the tibble is ordered which might not be the case in the future...

There may be a way to partition a large data set so that not all rows need be checked against all others, but that depends on the data you have.

In any case this is a non-intuitive solution that just methodically goes through all the comparisons.

It produces this:

suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(lubridate))

memberships <- tibble::tibble(
    memberID     = c("A", "A", "A", "B"),
    membershipID = 1:4 %>% as.factor,
    start_date   = ymd(c(20100101, 20101220, 20120415, 20110605)),
    end_date     = ymd(c(20101231, 20111231, 20130430, 20120531)),
    mo_dur       = interval(start_date, end_date) %>%
        as.duration() / dyears() * 12
)

memberships <- tibble::rowid_to_column(memberships)

overlaps <- purrr::map(memberships$rowid, function(id) {
    if (id == nrow(memberships)) {
        NA
    } else {
        row <- memberships[memberships[["rowid"]] == id, ]
        intv <- lubridate::interval(row$start_date, row$end_date)
        # these are the id's of the rows following id
        other_ids <- (id):(nrow(memberships))
        ol <- purrr::map_int(other_ids, function(other_id) {
            other_row <- memberships[memberships[["rowid"]] == other_id, ]
            # either on end is inside of interval or start and end span it
            if (other_row$start_date %within% intv |
                    other_row$end_date %within% intv |
                    (other_row$start_date <= row$start_date &
                     other_row$end_date >= row$end_date)) {
                as.integer(other_row$rowid)
            } else {
                NA_integer_
            }
        })
        # drop the NA's
        ol <- ol[!is.na(ol)]
        # if nothing overlapped return NA
        if (length(ol > 0)) {
            ol
        } else {
            NA
        }
    }
})

# make it a tibbleso youcan bind it
overlaps <- tibble::tibble(following_overlaps = overlaps)
# add as column
memberships <- dplyr::bind_cols(memberships, overlaps)
1 Like

Thank you! It's gonna take me some time to digest what you've done here, but I really appreciate it. I'm not there yet about how to wrangle the table with overlaps into the ultimate format I want, with counts across all memberships by date, but it feels like the ingredients are there.

This makes me think of another approach that might sidestep the challenge of counting overlaps between lines. Perhaps something like this pseudocode could work:

  1. Group by memberID
  2. gather(status, date, start_date, end_date) #This gives each status change (start or end) a row
  3. mutate(count_chg = +1 if status is start, -1 if status is end
  4. arrange by date
  5. mutate(active_mem_count = cumsum(count_chg)) # to count how many active memberships for that member as of that date
  6. mutate(capped_count = min(active_mem_count, 1) # this is for the distinct count -- it limits a member to having one membership counted at any one date
  7. mutate(capped_count_chg = capped_count - lag(capped_count))
  8. Finally, ungroup and take the cumulative sum of count_chg and capped_count_chg. Then I think I'd have a list of the instantaneous member counts on each day where it changes.

Will give that a shot. Thanks again for digging into this!

I haven't thought this one through, but foverlaps from data.table came to mind. Maybe the data could be joined on itself before aggregating the counts.

2 Likes

I think this worked, and I think it should scale nicely once I expand to the full database of ~1M rows.
(BTW, is there a better way to calculate capped_count_chg in one step?)

[Code edited to fix earlier error in capped_count.]


suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(lubridate))

# Small sample
memberships <- tibble(
  memberID     = c("A", "A", "A", "B", "C"),
  membershipID = 1:5 %>% as.factor,
  start_date   = ymd(c(20100101, 20101120, 20120415, 20110605, 20100210)),
  end_date     = ymd(c(20101231, 20111231, 20130430, 20120531, 20110228)),
  mo_dur       = interval(start_date, end_date) %>% 
    as.duration() / dyears() * 12)

# # Large randomized sample
# n <- 100
# memberships <- tibble(
#   memberID     = sample(LETTERS[1:26], n, replace=TRUE),
#   membershipID = 1:n %>% as.factor,
#   start_date   = sample(seq(ymd(20100101), ymd(20171231), by="day"), n, replace=TRUE),
#   end_date     = start_date + runif(min = 350, max = 380, n=n),
#   mo_dur       = interval(start_date, end_date) %>%
#     as.duration() / dyears() * 12)

memberships_capped <- memberships %>% 
  group_by(memberID) %>%
  gather(status, date, start_date, end_date) %>%
  mutate(count_chg = if_else(status == "start_date", 1, -1)) %>%
  arrange(date) %>%
  mutate(count = cumsum(count_chg),
         capped_count = if_else(count > 1, 1, count),
         capped_count_chg = capped_count - lag(capped_count),
         capped_count_chg = if_else(is.na(capped_count_chg), 1, capped_count_chg)) %>%
  ungroup()
  

memberships_timeline <- memberships_capped %>%
  arrange(date) %>%
  group_by(date) %>%
  summarize(count_chg = sum(count_chg),
            capped_count_chg = sum(capped_count_chg)) %>%
  mutate(count = cumsum(count_chg),
         capped_count = cumsum(capped_count_chg))


ggplot(memberships) + 
  geom_segment(aes(x = start_date, xend = end_date,
                   y = memberID, yend = memberID ), 
               arrow = arrow(length = unit(0.03, "npc"))) +
  geom_point(aes(start_date, memberID)) +
  geom_text(vjust = -0.5, hjust=0, size = 3,
            aes(x = start_date, y = memberID, 
                label = paste(round(mo_dur, 2), "months")))

ggplot(memberships_timeline, aes(date)) +
  geom_step(aes(y=count), lty="dashed") +
  geom_step(aes(y=capped_count)) +
  labs(subtitle="Membership count: dashed line\nUnique member count: solid line", y="", x="")

2 Likes