I have a dataset of crimes that have only x and y coordinates. I want to find which neighborhood a crime lies in. My current approach is slow and I am looking ideas for making it faster.
Here is a minimized version of my current approach
match_points_to_polygons = function(points, polygons) {
# A helper function to get the first element of a list, where the list
# can also be an empty array (something `st_within` creates)
get_first_element <- function(x){
if(length(x) == 0) {
return(NA)
}
else if(is.na(x)) {
return(NA)
}
else {
return(x[[1]])
}
}
# For each point, determine which polygon it lies in.
# This is the meat of the computation. As explained above, `st_within`
# returns a list of arrays. We use `sapply` to apply `get_first_element`
# to each item in that list, and the results are collected into an array.
points_list = points %>%
st_within(polygons) %>% # The key function!
sapply(get_first_element) %>%
return()
}
fname = system.file("shape/nc.shp", package = "sf")
nc = read_sf(fname)
nc_points = st_centroid(nc)
match_points_to_polygons(nc_points, nc) # returns an array
I think there is a better function to use than st_within
. It returns a list of arrays, where each element of an array is the row number of the polygon is lies in. This only makes sense if polygons are non-overlapping. I get the feeling its checking every points against every polygon in my dataset. In my dataset, and in this toy example, my polygons are not overlapping. Once a point lies in one polygon, we can stop looking!
Is there a function like st_within
that does this less expensive behavior? Or a keyword argument that lets it know polygons are not overlapping?