I have a dataset of crimes that have only x and y coordinates. I want to find which neighborhood a crime lies in. My current approach is slow and I am looking ideas for making it faster.
Here is a minimized version of my current approach
match_points_to_polygons = function(points, polygons) {
# A helper function to get the first element of a list, where the list
# can also be an empty array (something `st_within` creates)
get_first_element <- function(x){
if(length(x) == 0) {
return(NA)
}
else if(is.na(x)) {
return(NA)
}
else {
return(x[[1]])
}
}
# For each point, determine which polygon it lies in.
# This is the meat of the computation. As explained above, `st_within`
# returns a list of arrays. We use `sapply` to apply `get_first_element`
# to each item in that list, and the results are collected into an array.
points_list = points %>%
st_within(polygons) %>% # The key function!
sapply(get_first_element) %>%
return()
}
fname = system.file("shape/nc.shp", package = "sf")
nc = read_sf(fname)
nc_points = st_centroid(nc)
match_points_to_polygons(nc_points, nc) # returns an array
I think there is a better function to use than st_within. It returns a list of arrays, where each element of an array is the row number of the polygon is lies in. This only makes sense if polygons are non-overlapping. I get the feeling its checking every points against every polygon in my dataset. In my dataset, and in this toy example, my polygons are not overlapping. Once a point lies in one polygon, we can stop looking!
Is there a function like st_within that does this less expensive behavior? Or a keyword argument that lets it know polygons are not overlapping?
I think the function gContains in the rgeos library might be better here. It needs sp objects though instead of sf. You can convert sf to sp using as(nc, 'Spatial') for example.
Have you tried doing a spatial join in sf? Not sure how it will compare to your function with regards to speed, but when I've used it in the past to do point-in-polygon operations, it has been pretty speedy for me. It would look something like this where the resulting data frame is each element of nc_points with the corresponding polygon data from nc appended to it.
I support @mfherman suggestion of using sf::st_join() - it is fast.
To give an example have a look at this case study, in which I analyze point data (locations of bars in Prague, but it applies to any point data - the bars are just to sex it up a bit) in context of grid cells.