I have a table (~1M rows) of memberships, where each line reflects one membership cycle, with a start date and an end date, generally 12-13 months apart. I'd like to determine how many memberships and how many distinct members are active on a given day, but I don't know how to approach the calculation.
library(tidyverse); library(lubridate)
#> -- Attaching packages -------------------------------------------------------------------------------------------------------------- tidyverse 1.2.1 --
#> v ggplot2 2.2.1 v purrr 0.2.4
#> v tibble 1.3.4 v dplyr 0.7.4
#> v tidyr 0.7.2 v stringr 1.2.0
#> v readr 1.1.1 v forcats 0.2.0
#> -- Conflicts ----------------------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
#> x dplyr::filter() masks stats::filter()
#> x dplyr::lag() masks stats::lag()
#>
#> Attaching package: 'lubridate'
#> The following object is masked from 'package:base':
#>
#> date
memberships <- tibble(
memberID = c("A", "A", "A", "B"),
membershipID = 1:4 %>% as.factor,
start_date = ymd(c(20100101, 20101220, 20120415, 20110605)),
end_date = ymd(c(20101231, 20111231, 20130430, 20120531)),
mo_dur = interval(start_date, end_date) %>%
as.duration() / dyears() * 12)
ggplot(memberships) +
geom_segment(aes(x = start_date, xend = end_date,
y = membershipID, yend = membershipID)) +
geom_text(vjust = -0.5, hjust=0, size = 3,
aes(x = start_date, y = membershipID,
label = paste(round(mo_dur, 2), "months")))
# Desired output
# Date active_num active_distinct_members
# 2010-01-01 1 1
# ...
# 2010-12-20 2 1
# ...
# 2011-04-15 2 2