Help me speed up finding which polygons points lie in using `sf`

I have a dataset of crimes that have only x and y coordinates. I want to find which neighborhood a crime lies in. My current approach is slow and I am looking ideas for making it faster.

Here is a minimized version of my current approach

match_points_to_polygons = function(points, polygons) {
    # A helper function to get the first element of a list, where the list
    # can also be an empty array (something `st_within` creates)
    get_first_element <- function(x){
        if(length(x) == 0) {
            return(NA)
        }
        else if(is.na(x)) {
            return(NA)
        }
        else {
            return(x[[1]])
        }
    }

    # For each point, determine which polygon it lies in. 
    # This is the meat of the computation. As explained above, `st_within`
    # returns a list of arrays. We use `sapply` to apply `get_first_element`
    # to each item in that list, and the results are collected into an array.
    points_list = points %>% 
        st_within(polygons) %>% # The key function!
        sapply(get_first_element) %>%
        return()
}

fname = system.file("shape/nc.shp", package = "sf")
nc = read_sf(fname)
nc_points = st_centroid(nc)
match_points_to_polygons(nc_points, nc) # returns an array

I think there is a better function to use than st_within. It returns a list of arrays, where each element of an array is the row number of the polygon is lies in. This only makes sense if polygons are non-overlapping. I get the feeling its checking every points against every polygon in my dataset. In my dataset, and in this toy example, my polygons are not overlapping. Once a point lies in one polygon, we can stop looking!

Is there a function like st_within that does this less expensive behavior? Or a keyword argument that lets it know polygons are not overlapping?

I think the function gContains in the rgeos library might be better here. It needs sp objects though instead of sf. You can convert sf to sp using as(nc, 'Spatial') for example.

Have you tried doing a spatial join in sf? Not sure how it will compare to your function with regards to speed, but when I've used it in the past to do point-in-polygon operations, it has been pretty speedy for me. It would look something like this where the resulting data frame is each element of nc_points with the corresponding polygon data from nc appended to it.

library(tidyverse)
library(sf)
#> Linking to GEOS 3.7.1, GDAL 2.4.0, PROJ 5.2.0

fname = system.file("shape/nc.shp", package = "sf")
nc = read_sf(fname)
nc_points = st_centroid(nc) %>% 
  transmute(point_id = row_number())

nc_point_in_poly <- st_join(nc_points, nc, join = st_within)
#> although coordinates are longitude/latitude, st_within assumes that they are planar

head(nc_point_in_poly)
#> Simple feature collection with 6 features and 15 fields
#> geometry type:  POINT
#> dimension:      XY
#> bbox:           xmin: -81.49826 ymin: 36.36145 xmax: -76.0275 ymax: 36.49101
#> epsg (SRID):    4267
#> proj4string:    +proj=longlat +datum=NAD27 +no_defs
#> # A tibble: 6 x 16
#>   point_id  AREA PERIMETER CNTY_ CNTY_ID NAME  FIPS  FIPSNO CRESS_ID BIR74
#>      <int> <dbl>     <dbl> <dbl>   <dbl> <chr> <chr>  <dbl>    <int> <dbl>
#> 1        1 0.114      1.44  1825    1825 Ashe  37009  37009        5  1091
#> 2        2 0.061      1.23  1827    1827 Alle… 37005  37005        3   487
#> 3        3 0.143      1.63  1828    1828 Surry 37171  37171       86  3188
#> 4        4 0.07       2.97  1831    1831 Curr… 37053  37053       27   508
#> 5        5 0.153      2.21  1832    1832 Nort… 37131  37131       66  1421
#> 6        6 0.097      1.67  1833    1833 Hert… 37091  37091       46  1452
#> # … with 6 more variables: SID74 <dbl>, NWBIR74 <dbl>, BIR79 <dbl>,
#> #   SID79 <dbl>, NWBIR79 <dbl>, geometry <POINT [°]>

Created on 2019-03-13 by the reprex package (v0.2.1)

3 Likes

I support @mfherman suggestion of using sf::st_join() - it is fast.

To give an example have a look at this case study, in which I analyze point data (locations of bars in Prague, but it applies to any point data - the bars are just to sex it up a bit) in context of grid cells.

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.