Finding n spatially-nearest neighbours in an efficient way

Dear all,

I'm quite new to spatial analyses in R using the sf package, so I could use some help.

I have a data frame with point coordinates. For each row (observation) in the data frame, I want to find how many other observations are within x meters. My attempts so far have led me to use st_join with st_is_within_distance as the type of join, and to join the data frame with itself. The following code gives an example of this approach for a data frame in spData, with x set to 100 meters:

library(tidyverse)
#> Warning: package 'tibble' was built under R version 4.0.4
#> Warning: package 'tidyr' was built under R version 4.0.4
#> Warning: package 'dplyr' was built under R version 4.0.4
#> Warning: package 'forcats' was built under R version 4.0.4
library(sf)
#> Warning: package 'sf' was built under R version 4.0.4
#> Linking to GEOS 3.9.0, GDAL 3.2.1, PROJ 7.2.1
library(spData)
#> Warning: package 'spData' was built under R version 4.0.5
#> To access larger datasets in this package, install the spDataLarge
#> package with: `install.packages('spDataLarge',
#> repos='https://nowosad.github.io/drat/', type='source')`

cycle_hire_P <-  st_transform(cycle_hire, 27700)


cycle_hire_neigbours <-  st_join(cycle_hire_P, 
                                 cycle_hire_P %>% select(geometry),
                   join = st_is_within_distance, dist = 100) %>% 
  st_drop_geometry() %>% 
  group_by(id, name, area) %>% 
  summarise(no_neigbours = sum(n()))
#> `summarise()` has grouped output by 'id', 'name'. You can override using the `.groups` argument.

head(cycle_hire_neigbours)
#> # A tibble: 6 x 4
#> # Groups:   id, name [6]
#>      id name               area             no_neigbours
#>   <int> <fct>              <fct>                   <int>
#> 1     1 River Street       Clerkenwell                 1
#> 2     2 Phillimore Gardens Kensington                  1
#> 3     3 Christopher Street Liverpool Street            2
#> 4     4 St. Chad's Street  King's Cross                1
#> 5     5 Sedding Street     Sloane Square               1
#> 6     6 Broadcasting House Marylebone                  1

Created on 2021-05-23 by the reprex package (v1.0.0)

As far as I know, this seems to work. But I have to make this type of calculation for a very, very big data frame, so I'm wondering if this is the best approach, or if I should try some other way. As mentioned, the sf package is really new to me, so I'm not familiar with how to best use it.

Any help would be greatly appreciated!

Best, Richard

The technique you describe is not wrong - in the sense that it will not give incorrect results. The CRS choice looks fine too, assuming you are working on UK data (EPSG:27700 is the British National Grid / Ordnance Survey).

The most typical workflow for the problem you describe - take N locations and find how many other points are within a certain distance - would be based on sf::st_buffer() and go like this:

1) transform your data (likely lat-lon points, but I am guessing) to a projected CRS, e.g. EPSG:27700
2) create a new spatial object by buffering your points by 100 meters
3) use sf::st_join() with the default join type to get the IDs of the cycle hire stations within each buffer
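In code, a minimal sketch of those three steps might look like the following (reusing the cycle_hire data from your reprex purely as an illustration - your real data will differ):

library(sf)
library(dplyr)
library(spData)

# 1) project to a metric CRS
pts <- st_transform(cycle_hire, 27700)

# 2) 100 m buffers, 3) join the points falling inside each buffer, then count
pts %>%
  st_buffer(100) %>%
  st_join(pts %>% select(geometry)) %>%
  st_drop_geometry() %>%
  group_by(id) %>%
  tally(name = "no_neighbours")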

The technique I describe should in principle give you identical results to the one you are using, but it might be slightly more efficient (i.e. faster). Of course your mileage may vary, and a test on your data is necessary - but depending on how big 'very, very big' is, this should in principle be doable: it is a point-in-polygon operation, which is a bread-and-butter task for spatial analysis, and our tools have been optimized with exactly this kind of thing in mind...


Thank you for the reply!

I'm not working on UK data, but on similar data from European countries. And while not everybody might consider the data frames big, they include at least hundreds of millions of observations.

Just so I understand, your suggestion is to do something along the lines of the following?

library(tidyverse)
#> Warning: package 'tibble' was built under R version 4.0.4
#> Warning: package 'tidyr' was built under R version 4.0.4
#> Warning: package 'dplyr' was built under R version 4.0.4
#> Warning: package 'forcats' was built under R version 4.0.4
library(sf)
#> Warning: package 'sf' was built under R version 4.0.4
#> Linking to GEOS 3.9.0, GDAL 3.2.1, PROJ 7.2.1
library(spData)
#> Warning: package 'spData' was built under R version 4.0.5
#> To access larger datasets in this package, install the spDataLarge
#> package with: `install.packages('spDataLarge',
#> repos='https://nowosad.github.io/drat/', type='source')`

cycle_hire_P <-  st_transform(cycle_hire, 27700)

cycle_hire_neigbours <- st_buffer(cycle_hire_P, 100) %>% 
  st_join(cycle_hire_P %>% select(geometry)) %>% 
  st_drop_geometry() %>% 
  group_by(id, name, area) %>% 
  summarise(no_neigbours = sum(n()))
#> `summarise()` has grouped output by 'id', 'name'. You can override using the `.groups` argument.

head(cycle_hire_neigbours)
#> # A tibble: 6 x 4
#> # Groups:   id, name [6]
#>      id name               area             no_neigbours
#>   <int> <fct>              <fct>                   <int>
#> 1     1 River Street       Clerkenwell                 1
#> 2     2 Phillimore Gardens Kensington                  1
#> 3     3 Christopher Street Liverpool Street            2
#> 4     4 St. Chad's Street  King's Cross                1
#> 5     5 Sedding Street     Sloane Square               1
#> 6     6 Broadcasting House Marylebone                  1

Created on 2021-05-23 by the reprex package (v1.0.0)

Best, Richard

Hundreds of millions of points... You have stoked my imagination... :)

These are the lines I would start from - plain vanilla stuff, the spatial-analysis equivalent of "nobody got fired for buying Coca-Cola" in investment management (a safe choice, but unlikely to lead to anything exciting).

library(sf)
library(dplyr)
library(giscoR)

# have ET call home
print(paste(Sys.time(), "work started"))

# a polygon of a UK statistical unit
london <- gisco_get_nuts(nuts_id = 'UKI',
                         resolution = '01') %>% 
  st_transform(27700)

# have ET call home
print(paste(Sys.time(), "randomizing"))

# some random points 
points <- st_sample(london,
                    size = 1e5) %>% 
  st_sf() %>% 
  mutate(id = 1:n()) # id of the point / necessary to sum by

# have ET call home
print(paste(Sys.time(), "buffering"))

# buffers around points
buffers <- points %>% 
  st_buffer(100) # size of buffer in units of your object / meters in EPSG:27700

# have ET call home
print(paste(Sys.time(), "counting"))

# count of points within buffers
result <- buffers %>% 
  st_join(points) %>% 
  st_drop_geometry() %>% # from this point it is not necessary; just counts
  group_by(id.x) %>%  # use id from the x side (buffers)
  tally()

# have ET call home
print(paste(Sys.time(), "all clear!"))

[1] "2021-05-24 08:45:51 work started"
[1] "2021-05-24 08:45:51 randomizing"
[1] "2021-05-24 08:45:59 buffering"
[1] "2021-05-24 08:46:04 counting"
[1] "2021-05-24 08:46:14 all clear!"

In my toy example of 100,000 points - three orders of magnitude less than your dataset - the buffering took 5 seconds and the counting 10 seconds (or so). Around a gigabyte of memory was used.

So seemingly fine, but it is a toy case - and I am not certain it would scale well. I would be worried mostly about memory; you may want to experiment a bit and consider offloading your computation to a spatial database - Postgres & PostGIS come to my mind - so that the problem gets transformed from an available-memory one into an available-disk-space one. It will not run as fast, but disk is cheap & plentiful.
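If you do go the PostGIS route, a rough sketch of the counting step could look like this - assuming a Postgres/PostGIS database you can write to; the connection details, table name (points) and geometry column name are all placeholders:

library(sf)
library(DBI)
library(RPostgres)

# connection details below are placeholders - substitute your own
con <- dbConnect(Postgres(), dbname = "spatial_db",
                 host = "localhost", user = "me", password = "secret")

# push the points into the database (geometry column assumed to be named 'geometry')
st_write(points, con, "points", delete_layer = TRUE)

# a spatial index makes ST_DWithin usable at scale
dbExecute(con, "CREATE INDEX IF NOT EXISTS points_geom_idx
                ON points USING gist (geometry);")

# count, for every point, how many points lie within 100 m (itself included,
# which matches the buffer + join behaviour above)
result <- dbGetQuery(con, "
  SELECT a.id, count(*) AS no_neighbours
  FROM points a
  JOIN points b ON ST_DWithin(a.geometry, b.geometry, 100)
  GROUP BY a.id;
")

dbDisconnect(con)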


Thank you so much for your example, it's very helpful. I will offload the computing to a server for batch processing, but I want to find a fairly efficient solution for all the tasks first (I will also calculate distances to the nearest neighbours).

The data consists of positions of individuals over time, and I want to calculate a measure of concentration by finding out how many other individuals there are within a certain distance.

Best, Richard

For the use case you describe, you may find this piece of the PostGIS documentation an interesting read: 27. Nearest-Neighbour Searching — Introduction to PostGIS - I believe it matches your issue closely.

One problem that remains is that it requires a somewhat different syntax - not R, but more SQL-like. There have been discussions of extending {sf} to handle out-of-memory operations via PostGIS more efficiently, but I don't recall the current status.

You may find it necessary to use SQL - but note that the logic remains the same, and even the names of the functions (e.g. ST_Distance) are close. That is by design.
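To give a flavour, a sketch of the nearest-neighbour query from that chapter, run from R against a hypothetical points table like the one in my earlier sketch (the connection con, the table and the column names are assumptions):

library(DBI)

# distance from each point to its nearest *other* point;
# <-> is the PostGIS KNN index operator
nn_dist <- dbGetQuery(con, "
  SELECT a.id,
         (SELECT ST_Distance(a.geometry, b.geometry)
            FROM points b
           WHERE b.id <> a.id
           ORDER BY a.geometry <-> b.geometry
           LIMIT 1) AS nn_dist
  FROM points a;
")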

Thank you for the link and the suggestion! I will look into it. It's a different question from the one I described here, but I also have to find the distance to the nearest neighbour, and for that the nngeo package seems to work in a convenient way - roughly along the lines of the sketch below.
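A minimal sketch of that approach (reusing the projected cycle_hire_P object from my earlier reprex, purely as an illustration):

library(sf)
library(nngeo)

# k = 2 because, when a layer is matched against itself, the first
# nearest neighbour of every point is the point itself (distance 0)
nn <- st_nn(cycle_hire_P, cycle_hire_P, k = 2, returnDist = TRUE)

# distance to the nearest *other* point, in meters (units of EPSG:27700)
cycle_hire_P$nn_dist <- sapply(nn$dist, `[`, 2)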

Best, Richard
