Turn arrival and departure timestamps into on-the-hour headcounts

Hello there. I'm trying to turn a dataset of months worth of entrance and exit timestamps into an every-hour-on-the-hour headcount of everybody in the building. The full dataset covers months worth of entries and exits.

I'm partway there -- I figured out how to find the count of people who would be in the building at one particular time... I'm just not able to think how I would create the hour-by-hour loop to tie the headcounts to.

Sample csv (p)

rowid,arrival,departure
1,2020-07-01 00:04:23,2020-07-01 01:19:00
2,2020-07-01 00:12:57,2020-07-01 02:38:00
3,2020-07-01 00:17:35,2020-07-01 08:50:00
4,2020-07-01 00:46:30,2020-07-01 02:58:00
5,2020-07-01 00:51:07,2020-07-01 06:10:00
6,2020-07-01 01:05:03,2020-07-01 02:35:00
7,2020-07-01 01:10:22,2020-07-01 03:26:00
8,2020-07-01 02:39:24,2020-07-01 14:25:00
9,2020-07-01 02:41:53,2020-07-01 03:41:00
10,2020-07-01 02:43:34,2020-07-01 05:10:00
stayCountByHour <- p %>%
  filter(mdy_hms("07-01-2020 05:00:00") %within% as.interval(arrival, departure))

length(stayCountByHour$id)

Thank you for any help

Just to clarify. Hour by hour on the hour means:

Arrive 08:05 leave 08:55 then I'm not counted?

What do you want as an output?

Date | Hour | Count?
2020-07-01 | 01 | 25

Do you want a mega efficient solution or a run once and it doesn't matter if it takes 5 minutes?

Here is a simple version. @CALUM_POLWART makes an excellent point about people who are not ever present on the hour.

library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#> 
#>     date, intersect, setdiff, union
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(purrr)
DF <- read.csv("~/R/Play/Dummy.csv")
DF <- mutate(DF, arrival = ymd_hms(arrival), departure = ymd_hms(departure))

HourSeq <- seq.POSIXt(from = ymd_hms("2020-07-01 00:00:00"),
                    to = ymd_hms("2020-07-01 15:00:00"), by = "hour")

CountDf <- data.frame(Hour = HourSeq)

CountOcc <- function (TIME) {
  sum(TIME %within% as.interval(DF$arrival, DF$departure))
}

counts <- CountDf %>% group_by(Hour) %>% summarize(HourOcc = CountOcc(Hour))
#> `summarise()` ungrouping output (override with `.groups` argument)
counts
#> # A tibble: 16 x 2
#>    Hour                HourOcc
#>    <dttm>                <int>
#>  1 2020-07-01 00:00:00       0
#>  2 2020-07-01 01:00:00       5
#>  3 2020-07-01 02:00:00       6
#>  4 2020-07-01 03:00:00       6
#>  5 2020-07-01 04:00:00       4
#>  6 2020-07-01 05:00:00       4
#>  7 2020-07-01 06:00:00       3
#>  8 2020-07-01 07:00:00       2
#>  9 2020-07-01 08:00:00       2
#> 10 2020-07-01 09:00:00       1
#> 11 2020-07-01 10:00:00       1
#> 12 2020-07-01 11:00:00       1
#> 13 2020-07-01 12:00:00       1
#> 14 2020-07-01 13:00:00       1
#> 15 2020-07-01 14:00:00       1
#> 16 2020-07-01 15:00:00       0

Created on 2021-03-15 by the reprex package (v0.3.0)

Thank you, this is precisely what I was envisioning.

I agree that @CALUM_POLWART's point is worth bringing up. I'll discuss see if they want a headcount every 15 or 30 min to address that. Seems like they're mostly looking for mileposts for now however.

Thank you again!

Headcount per hour, but not on the hour might achieve what they want. Depends on the job what the slot size would be

You can try the code below. It may seem a bit length. However, I found it is better to understand a few months down the line.

If you are reading in a csv file, you can coerce the date values as.character through col_types call in the readr::read_csv. Converting date character values to datetime is much easier than other types.

library(magrittr)
library(dplyr)
library(lubridate)
Arrivals <-  data.frame(arrival = c("2020-07-01 00:04:23","2020-07-01 01:19:00",
                       "2020-07-01 00:12:57","2020-07-01 02:38:00",
                       "2020-07-01 00:17:35","2020-07-01 08:50:00",
                       "2020-07-01 00:46:30","2020-07-01 02:58:00",
                       "2020-07-01 00:51:07","2020-07-01 06:10:00",
                       "2020-07-01 01:05:03","2020-07-01 02:35:00",
                       "2020-07-01 01:10:22","2020-07-01 03:26:00",
                       "2020-07-01 02:39:24","2020-07-01 14:25:00",
                       "2020-07-01 02:41:53","2020-07-01 03:41:00",
                       "2020-07-01 02:43:34","2020-07-01 05:10:00"))

Arrivals <- Arrivals %>% 
  mutate(arrival = lubridate::ymd_hms(arrival),
         arrival_hour = lubridate::hour(arrival),
         arrival_date = lubridate::date(arrival))

Hourly_Arrivals <- Arrivals %>% 
  group_by(arrival_date,arrival_hour) %>% 
  summarise(customers = n(),
            .groups="drop")

Hourly_Arrivals

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.