# Is it possible to count # of occurrences within x distance?

#1

Hi,

Does anyone know of a "tidy" way (or even a non-tidy, yet efficient way) to count the number of occurrences within a specified distance?

For example, imagine 15 people entered my store over a period of 10 minutes, and I tracked which minute they entered, I could have a table looking like

person_id minute
1 1
2 1
3 2
4 3
5 3
6 4
7 5
8 6
9 7
10 7
11 7
12 8
13 8
14 9
15 10

Now, say I wanted to add a column for each person, to count how many other people entered the store within 2 minutes of them. Is there a straightforward/tidy way to do it?

I can think of a complex way to calculate this, but it seems to not be scalable to larger data sets...

``````   library(tidyverse)
#> -- Attaching packages ---------------------------------------------------------------------- tidyverse 1.2.1 --
#> v ggplot2 2.2.1.9000     v purrr   0.2.4
#> v tibble  1.4.2          v dplyr   0.7.4
#> v tidyr   0.8.0          v stringr 1.3.0
#> v readr   1.1.1          v forcats 0.3.0
#> -- Conflicts ------------------------------------------------------------------------- tidyverse_conflicts() --
#> x dplyr::filter() masks stats::filter()
#> x dplyr::lag()    masks stats::lag()

df1 <- tibble::tribble(
~person_id, ~minute,
1L,      1L,
2L,      1L,
3L,      2L,
4L,      3L,
5L,      3L,
6L,      4L,
7L,      5L,
8L,      6L,
9L,      7L,
10L,      7L,
11L,      7L,
12L,      8L,
13L,      8L,
14L,      9L,
15L,     10L
)

crossing(df1, df1) %>%
mutate(time_diff = abs(minute - minute1)) %>%
group_by(person_id, minute) %>%
summarize(ppl_wi_2_min = sum(time_diff <= 2)) %>%
ungroup()
#> # A tibble: 15 x 3
#>    person_id minute ppl_wi_2_min
#>        <int>  <int>        <int>
#>  1         1      1            5
#>  2         2      1            5
#>  3         3      2            6
#>  4         4      3            7
#>  5         5      3            7
#>  6         6      4            6
#>  7         7      5            8
#>  8         8      6            8
#>  9         9      7            8
#> 10        10      7            8
#> 11        11      7            8
#> 12        12      8            8
#> 13        13      8            8
#> 14        14      9            7
#> 15        15     10            4
``````

Does anyone know of a better way (i.e. one that would not involve `crossing` or `expand.grid` or something like that)?

Thanks!

#2

My approach is not tidy for sure, but it is ~9 times faster:

``````library(tidyverse)

df1 <- tibble::tribble(
~person_id, ~minute,
1L,      1L,
2L,      1L,
3L,      2L,
4L,      3L,
5L,      3L,
6L,      4L,
7L,      5L,
8L,      6L,
9L,      7L,
10L,      7L,
11L,      7L,
12L,      8L,
13L,      8L,
14L,      9L,
15L,     10L
)

start <- function(){crossing(df1, df1) %>%
mutate(time_diff = abs(minute - minute1)) %>%
group_by(person_id, minute) %>%
summarize(ppl_wi_2_min = sum(time_diff <= 2)) %>%
ungroup()}

suggestion <- function(df){
res <- purrr::map_int(df[["person_id"]], function(person){
minute <- df[["minute"]][[person]]
minutes <- df[["minute"]]
length(minutes[abs(minutes - minute) <= 2])
})
df[["ppl_wi_2_min"]] <- res
df
}

microbenchmark::microbenchmark(start(), suggestion(df1), times = 500)
#> Unit: microseconds
#>             expr      min        lq      mean   median        uq       max
#>          start() 2543.284 2844.6940 3519.3019 2955.034 3200.8235 69524.556
#>  suggestion(df1)  315.082  347.1835  415.9332  361.658  382.0615  8717.945
#>  neval
#>    500
#>    500
``````

Created on 2018-03-09 by the reprex package (v0.2.0).

If you can make your example much bigger (say, 10k-100k rows), you can shave even more time by benchmarking where my suggestion is slow and trying to optimize it even further, if you need.

#3

Thanks! This is certainly better than mine...I'll try to play with optimizing it further, but this is really helpful!

#4

In case someone finds this post later -- I improved the speed a bit beyond @mishabalyasin's excellent suggestion

``````
suggestion2 <- function(df){
minutes <- df[["minute"]]
res <- lapply(minutes, function(minute)
sum(abs(minutes - minute) <= 2))
df[["ppl_wi_2_min"]] <- flatten_int(res)
df
}

suggestion3 <- function(df){
minutes <- df[["minute"]]
res <- integer(nrow(df))
for (i in 1:nrow(df)) {
res[i] <- sum(abs(minutes - minutes[i]) <= 2)
}
df[["ppl_wi_2_min"]] <- res
df
}

microbenchmark::microbenchmark(start(), suggestion(df1), suggestion2(df1), suggestion3(df1), times = 1000)
#> Unit: microseconds
#>              expr      min       lq       mean   median       uq       max
#>           start() 3919.328 4071.927 5043.04297 4125.645 5268.255 76707.970
#>   suggestion(df1)  299.382  318.542  387.19039  331.201  356.007  2609.920
#>  suggestion2(df1)   69.115   76.983   94.47256   81.774   92.038  1574.232
#>  suggestion3(df1)   63.640   72.193  101.49175   79.379   88.617 11036.390
#>  neval cld
#>   1000   c
#>   1000  b
#>   1000 a
#>   1000 a
``````

#5

If you had a much bigger list of people, it might be worth breaking this into two parts:

1. What's the rolling count of customers with a given window at minute X? You could use padr::pad, then cumsum to get the running total, then lead(cuml, 2) - lag(cuml,2) to get the rolling +/- 2min total.
2. Then join that table to your first table.