More efficient code with purrr or vectorized operations

I am trying to determine whether a given latitude/longitude (lat/long) lands in a geofenced area (these 2 pieces of information, at the moment, are contained within 2 separate dataframes. The lat/long data is contained below in a dataframe called checks. The geofenced areas are contained in a dataframe called container.

library(dplyr)
library(purrr)
library(stringr)
checks <- structure(list(name = c("a", "b", "c"), latitude = c(43.0988803401934, 
                                                               42.1276251670733, 40.2629180055808), longitude = c(74.5229195309954, 
                                                                                                                  75.6848258586636, 73.2005188397627)), row.names = c(NA, -3L), class = c("spec_tbl_df", 
                                                                                                                                                                                          "tbl_df", "tbl", "data.frame"))

container <- structure(list(id = 1:3, name = c("location_1", "location_2", 
                                               "location_3"), lat = c(40.7130015437452, 40.8207479655434, 42.6486260163842
                                               ), long = c(74.1955991671793, 75.0798275814, 74.8702938787559
                                               ), box_lat_north = c(41.7630015437452, 41.8707479655434, 43.6986260163842
                                               ), box_lat_south = c(39.6630015437452, 39.7707479655434, 41.5986260163842
                                               ), box_long_west = c(75.2455991671793, 76.1298275814, 75.9202938787559
                                               ), box_long_east = c(73.1455991671793, 74.0298275814, 73.8202938787559
                                               )), class = c("spec_tbl_df", "tbl_df", "tbl", "data.frame"), row.names = c(NA, -3L))
                                   

For each row in checks, I want to know whether the lat/long appears within any of the geofenced areas as specified in container. If they do, I want to log this information. At the moment, here is how I am approaching this problem:

single_lookup <- function(the_name, the_lat, the_long, id){
  
  print(id)
  
  one_row <- container %>% 
    filter(str_detect(name, the_name), lat == the_lat, long == the_long)
  
  found <- checks %>% 
    filter(latitude < one_row$box_lat_north, latitude > one_row$box_lat_south,
           longitude < one_row$box_long_west, longitude > one_row$box_long_east) %>% 
    mutate(hot_zone = the_name) %>% 
    select(name, hot_zone)
}

results <- container %>% 
  mutate(res = pmap(list(name, lat, long, id), single_lookup))  

results

tabled <- map_dfr(results$res, bind_rows)

tabled

This solution works, but it is rather slow when running this function on millions of observations in both checks and container. I would like to speed up this process, if possible.

One idea is to use furrr to run this analysis in parallel. Is this among the best options? Is it possible to vectorize these operations using list columns or otherwise? I am open to any and all suggestions!

Thanks in advance for your time and help!

Hope

Hi,

Sounds like an interesting problem!

  • first question, is the box-area always a square? Or can it be a rectangle too?
  • second, can the areas overlap, meaning a point can have multiple boxes it falls in
  • third, do you have an idea how likely it is for a point to fall into a square? I assume most of the tests would return a negative result

Grtz

I'm sure parallel-izing will help some and probably be quicker to check.
If you only need a 2-3x, this is probably the way to go.

If that's not enough, you might consider using GeoJSON libraries and encoding your data in that way. Particularly packages where the processing is done in C/C++.
If you need orders of magnitude speedup, that'd be the way to go.
Plus using GeoJSON opens the door for a lot of other packages, both R and spatial analysis in general.

I have done exactly this operation but in a database (MongoDB to be exact), not in R.
Hopefully someone with more experience will chime in.
What I found for R was this SO post.

To answer your questions:

  1. We've coded it to always be a box. But if we adjust a few numbers, the fenced in area could be a rectangle.

  2. Yes, that is possible but shouldn't be that frequent.

  3. Based on our sample, it's about 20% likely that a point will fall into a square.

Thanks for the interest!

We have parallelized the code using furrr. The speed-up is rather marked in our situation (approximately 6x; we're throwing a lot of cores at it). At the moment, this seems tolerable, but if it's not, we will be sure to look into the geoJSON package. Thank you so much for the suggestion!

I'm not sure how this will compare for speed, but you might want to try converting your points and polygon boxes to sf objects and then use a spatial join to determine if each point falls in each box. Below is an example with your sample data. Apologies for the tortured reshaping to create the polygons -- I'm assuming you could do this much more simply as you initially set up your data. But the spatial join is just a one liner once you have your two sf objects!

st_join(points, polygons, join = st_within)

I describe this method in a little more detail in a blog post here: https://mattherman.info/blog/point-in-poly/

library(tidyverse)
library(sf)
#> Linking to GEOS 3.6.1, GDAL 2.1.3, PROJ 4.9.3

checks <- structure(list(name = c("a", "b", "c"), latitude = c(43.0988803401934, 
                                                               42.1276251670733, 40.2629180055808), longitude = c(74.5229195309954, 
                                                                                                                  75.6848258586636, 73.2005188397627)), row.names = c(NA, -3L), class = c("spec_tbl_df", 
                                                                                                                                                                                          "tbl_df", "tbl", "data.frame"))

container <- structure(list(id = 1:3, name = c("location_1", "location_2", 
                                               "location_3"), lat = c(40.7130015437452, 40.8207479655434, 42.6486260163842
                                               ), long = c(74.1955991671793, 75.0798275814, 74.8702938787559
                                               ), box_lat_north = c(41.7630015437452, 41.8707479655434, 43.6986260163842
                                               ), box_lat_south = c(39.6630015437452, 39.7707479655434, 41.5986260163842
                                               ), box_long_west = c(75.2455991671793, 76.1298275814, 75.9202938787559
                                               ), box_long_east = c(73.1455991671793, 74.0298275814, 73.8202938787559
                                               )), class = c("spec_tbl_df", "tbl_df", "tbl", "data.frame"), row.names = c(NA, -3L))


# convert points to sf object
checks_sf <- checks %>% 
  st_as_sf(
    coords = c("longitude", "latitude"),
    agr = "constant",
    crs = 4326,    
    stringsAsFactors = FALSE
  )

# reshape box lat/long, convert to sf points, then sf polygons
container_sf <- container %>% 
  select(-id, -lat, -long, location = name) %>% 
  pivot_longer(
    cols = contains("long"),
    values_to = "long"
    ) %>% 
  pivot_longer(
    cols = contains("lat"),
    values_to = "lat"
    ) %>% 
  select(location, lat, long) %>% 
  group_by(location) %>% 
  mutate(
    id = as.factor(row_number()),
    id = fct_relevel(id, "1", "2", "4", "3")) %>% 
  arrange(location, id) %>%
  st_as_sf(coords = c("long", "lat"), crs = 4326) %>% 
  group_by(location) %>% 
  summarise(geometry = st_combine(geometry)) %>%
  st_cast("POLYGON")
#> New names:
#> * name -> name...2
#> * name -> name...4

# find points in polygons
st_join(checks_sf, container_sf, join = st_within)
#> although coordinates are longitude/latitude, st_within assumes that they are planar
#> Simple feature collection with 3 features and 2 fields
#> geometry type:  POINT
#> dimension:      XY
#> bbox:           xmin: 73.20052 ymin: 40.26292 xmax: 75.68483 ymax: 43.09888
#> epsg (SRID):    4326
#> proj4string:    +proj=longlat +datum=WGS84 +no_defs
#> # A tibble: 3 x 3
#>   name             geometry location  
#>   <chr>         <POINT [°]> <chr>     
#> 1 a     (74.52292 43.09888) location_3
#> 2 b     (75.68483 42.12763) location_3
#> 3 c     (73.20052 40.26292) location_1

Created on 2019-07-15 by the reprex package (v0.3.0)