Is it possible to count # of occurrences within x distance?


#1

Hi,

Does anyone know of a “tidy” way (or even a non-tidy, yet efficient way) to count the number of occurrences within a specified distance?

For example, imagine 15 people entered my store over a period of 10 minutes and I tracked the minute each one entered. I could have a table looking like this:

person_id minute
1 1
2 1
3 2
4 3
5 3
6 4
7 5
8 6
9 7
10 7
11 7
12 8
13 8
14 9
15 10

Now, say I wanted to add a column for each person, counting how many people (themselves included) entered the store within 2 minutes of them. Is there a straightforward/tidy way to do it?

I can think of a complex way to calculate this, but it doesn't seem scalable to larger data sets…

    library(tidyverse)
    #> -- Attaching packages ---------------------------------------------------------------------- tidyverse 1.2.1 --
    #> v ggplot2 2.2.1.9000     v purrr   0.2.4     
    #> v tibble  1.4.2          v dplyr   0.7.4     
    #> v tidyr   0.8.0          v stringr 1.3.0     
    #> v readr   1.1.1          v forcats 0.3.0
    #> -- Conflicts ------------------------------------------------------------------------- tidyverse_conflicts() --
    #> x dplyr::filter() masks stats::filter()
    #> x dplyr::lag()    masks stats::lag()

    df1 <- tibble::tribble(
      ~person_id, ~minute,
              1L,      1L,
              2L,      1L,
              3L,      2L,
              4L,      3L,
              5L,      3L,
              6L,      4L,
              7L,      5L,
              8L,      6L,
              9L,      7L,
             10L,      7L,
             11L,      7L,
             12L,      8L,
             13L,      8L,
             14L,      9L,
             15L,     10L
      )

    crossing(df1, df1) %>%
      mutate(time_diff = abs(minute - minute1)) %>%
      group_by(person_id, minute) %>%
      summarize(ppl_wi_2_min = sum(time_diff <= 2)) %>%
      ungroup()
    #> # A tibble: 15 x 3
    #>    person_id minute ppl_wi_2_min
    #>        <int>  <int>        <int>
    #>  1         1      1            5
    #>  2         2      1            5
    #>  3         3      2            6
    #>  4         4      3            7
    #>  5         5      3            7
    #>  6         6      4            6
    #>  7         7      5            8
    #>  8         8      6            8
    #>  9         9      7            8
    #> 10        10      7            8
    #> 11        11      7            8
    #> 12        12      8            8
    #> 13        13      8            8
    #> 14        14      9            7
    #> 15        15     10            4

Does anyone know of a better way (i.e. one that would not involve crossing or expand.grid or something like that)?

Thanks!


#2

My approach is not tidy for sure, but it is ~9 times faster:

library(tidyverse)

df1 <- tibble::tribble(
  ~person_id, ~minute,
  1L,      1L,
  2L,      1L,
  3L,      2L,
  4L,      3L,
  5L,      3L,
  6L,      4L,
  7L,      5L,
  8L,      6L,
  9L,      7L,
  10L,      7L,
  11L,      7L,
  12L,      8L,
  13L,      8L,
  14L,      9L,
  15L,     10L
)

start <- function() {
  crossing(df1, df1) %>%
    mutate(time_diff = abs(minute - minute1)) %>%
    group_by(person_id, minute) %>%
    summarize(ppl_wi_2_min = sum(time_diff <= 2)) %>%
    ungroup()
}

suggestion <- function(df){
  res <- purrr::map_int(df[["person_id"]], function(person){
    # NB: indexing by person_id only works because ids happen to be 1..n here
    minute <- df[["minute"]][[person]]
    minutes <- df[["minute"]]
    # count entries (including this person's own) within 2 minutes
    length(minutes[abs(minutes - minute) <= 2])
  })
  df[["ppl_wi_2_min"]] <- res
  df
}

microbenchmark::microbenchmark(start(), suggestion(df1), times = 500)
#> Unit: microseconds
#>             expr      min        lq      mean   median        uq       max
#>          start() 2543.284 2844.6940 3519.3019 2955.034 3200.8235 69524.556
#>  suggestion(df1)  315.082  347.1835  415.9332  361.658  382.0615  8717.945
#>  neval
#>    500
#>    500

Created on 2018-03-09 by the reprex package (v0.2.0).

If you can make your example much bigger (say, 10k-100k rows), you can shave off even more time by benchmarking to see where my suggestion is slow and optimizing it further, if you need to.
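One direction such an optimization could take (a sketch, not from the thread; `suggestion4` is a hypothetical name): if the `minute` column is sorted, base R's `findInterval` can count each window with two binary searches instead of scanning the whole vector, so it scales roughly O(n log n) rather than O(n²).

```r
# Sketch: window counts via findInterval on the sorted minutes.
# findInterval(x, m) returns, for each x, how many elements of the
# sorted vector m are <= x, so the count for minute t is
# (# entries <= t + 2) - (# entries <= t - 3).
suggestion4 <- function(df) {
  m <- sort(df[["minute"]])
  df[["ppl_wi_2_min"]] <- findInterval(df[["minute"]] + 2, m) -
    findInterval(df[["minute"]] - 3, m)
  df
}
```

On the sample data this returns the same `ppl_wi_2_min` column as the approaches above; the difference would only show up at the 10k-100k-row scale mentioned.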


#3

Thanks! This is certainly better than mine… I'll try playing with optimizing it further, but this is really helpful!


#4

In case someone finds this post later -- I improved the speed a bit beyond @mishabalyasin's excellent suggestion:


    suggestion2 <- function(df){
      minutes <- df[["minute"]]
      res <- lapply(minutes, function(minute)
        sum(abs(minutes - minute) <= 2))
      df[["ppl_wi_2_min"]] <- flatten_int(res)
      df
    }

    suggestion3 <- function(df){
      minutes <- df[["minute"]]
      res <- integer(nrow(df))
      for (i in seq_len(nrow(df))) {
        res[i] <- sum(abs(minutes - minutes[i]) <= 2)
      }
      df[["ppl_wi_2_min"]] <- res
      df
    }

    microbenchmark::microbenchmark(start(), suggestion(df1), suggestion2(df1), suggestion3(df1), times = 1000)
    #> Unit: microseconds
    #>              expr      min       lq       mean   median       uq       max
    #>           start() 3919.328 4071.927 5043.04297 4125.645 5268.255 76707.970
    #>   suggestion(df1)  299.382  318.542  387.19039  331.201  356.007  2609.920
    #>  suggestion2(df1)   69.115   76.983   94.47256   81.774   92.038  1574.232
    #>  suggestion3(df1)   63.640   72.193  101.49175   79.379   88.617 11036.390
    #>  neval cld
    #>   1000   c
    #>   1000  b 
    #>   1000 a  
    #>   1000 a

#5

If you had a much bigger list of people, it might be worth breaking this into two parts:

  1. What's the rolling count of customers within a given window at minute X? You could pad out the missing minutes (padr::pad, or tidyr::complete for plain integer minutes), then cumsum to get the running total cuml, then lead(cuml, 2) - lag(cuml, 3) (with a default of 0) to get the rolling +/- 2 min total. Note the lag of 3, not 2: you subtract the cumulative total at minute X - 3 so that minute X - 2 itself is still included.
  2. Then join that table back to your first table.
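A minimal sketch of those two steps (using tidyr::complete/full_seq for the padding, on the assumption that `minute` is a plain integer index; `rolling_window_counts` and `ppl_in_window` are illustrative names, not from the thread):

```r
library(tidyverse)

rolling_window_counts <- function(df, window = 2) {
  per_minute <- df %>%
    count(minute) %>%                          # customers per minute (column n)
    complete(minute = full_seq(minute, 1),     # pad minutes with no arrivals
             fill = list(n = 0L)) %>%
    mutate(
      cuml = cumsum(n),                        # running total of arrivals
      # entries at minutes <= X + window, minus entries at minutes <= X - window - 1
      ppl_in_window = lead(cuml, window, default = last(cuml)) -
        lag(cuml, window + 1, default = 0L)
    )
  # step 2: join the per-minute window totals back onto the person-level table
  df %>%
    left_join(select(per_minute, minute, ppl_in_window), by = "minute")
}
```

Since the per-minute table has at most one row per minute of opening hours, this should stay cheap even when the list of people grows large.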