Parsing Ranges of Values

chris.prener · March 9, 2019, 2:19am

TL;DR:
I have a data problem where I need to cross-walk data about streets (the low and high address ranges on each side of a block) with address-level incidents. For a few different reasons, a spatial solution based on proximity doesn't work. I have a working solution (reprex below), but am wondering if it could be more efficient. Feedback would be very much appreciated!

Problem:
We need a way to match incidents (that have address-level data) to the blockface they occur on. By blockface, I mean the houses on either side of a street between two cross streets (i.e. the 100-block of Main Street between 1st and 2nd Avenues).

Typically, census block and city block shapefiles do not represent a blockface. Instead, they have parts of up to four different streets, typically representing half of a blockface for each of the included streets. So a block bounded by Main Street on the south, 1st Avenue on the west, Washington Street on the north, and 2nd Avenue on the east would contain addresses from each of the four streets that form the block's boundaries. Joining addresses to traditional blocks therefore produces entities that are distinct from the way we think about a blockface.

We need to be able to do two things - identify incidents that occur on "marked" blocks, and then also produce counts of incidents for all blocks.

The census bureau pubishes something that could be helpful - line data with block segements. These include the low and high address numbers for both sides of the street (i.e. the blockface). However, I need a quick way to apply the blockface identification numbers to individual addresses in the incident data.

Sample Raw Data:

> blocks
# A tibble: 4 x 7
   bfId rightLow rightHigh leftLow leftHigh street  marked
  <dbl>    <dbl>     <dbl>   <dbl>    <dbl> <chr>   <lgl> 
1     1      400       498     401      499 Main St TRUE  
2     2      500       598     501      599 Main St FALSE 
3     3      600       698     601      699 Main St FALSE 
4     4      700       798     701      799 Main St FALSE 
> 
> incidents
# A tibble: 5 x 4
  callId address     date    call           
   <dbl> <chr>       <chr>   <chr>          
1    101 424 Main St 1/1/14  Graffiti       
2    102 447 Main St 3/6/14  Graffiti       
3    103 504 Main St 5/12/14 Pothole        
4    104 667 Main St 4/19/14 Lights Out     
5    105 773 Main St 2/12/14 Vacant Building

Sample Output Data:

> incidentsWithBlock
# A tibble: 5 x 6
  callId address     date    call             bfId marked
   <dbl> <chr>       <chr>   <chr>           <dbl> <lgl> 
1    101 424 Main St 1/1/14  Graffiti            1 TRUE  
2    102 447 Main St 3/6/14  Graffiti            1 TRUE  
3    103 504 Main St 5/12/14 Pothole             2 FALSE 
4    104 667 Main St 4/19/14 Lights Out          3 FALSE 
5    105 773 Main St 2/12/14 Vacant Building     4 FALSE 
> 
> countsByBlock
# A tibble: 4 x 8
   bfId rightLow rightHigh leftLow leftHigh street  marked count
  <dbl>    <dbl>     <dbl>   <dbl>    <dbl> <chr>   <lgl>  <int>
1     1      400       498     401      499 Main St TRUE       2
2     2      500       598     501      599 Main St FALSE      1
3     3      600       698     601      699 Main St FALSE      1
4     4      700       798     701      799 Main St FALSE      1

reprex:
(this is also posted as a gist)

# dependencies
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(purrr)
library(stringr)
library(tidyr)

# custom function to create a list-column with every other integer value between the low 
# and high values for a given side of the blockface
parse_range <- function(x){
  
  # convert item to numeric
  vector <- as.numeric(x)
  
  # expand vector to include every other integer between low and high values
  out <- seq.int(from = vector[1], to = vector[2], by = 2)
  
  # return output
  return(out)
  
}

# sample block data
blocks <- tibble(
  bfId = c(1,2,3,4),
  rightLow = c(400,500,600,700),
  rightHigh = c(498,598,698,798),
  leftLow = c(401,501,601,701),
  leftHigh = c(499,599,699,799),
  street = c("Main St", "Main St", "Main St", "Main St"),
  marked = c(TRUE, FALSE, FALSE, FALSE)
)

# sample incident data
incidents <- tibble(
  callId = c(101, 102, 103, 104, 105),
  address = c("424 Main St", "447 Main St", "504 Main St", "667 Main St", "773 Main St"),
  date = c("1/1/14", "3/6/14", "5/12/14", "4/19/14", "2/12/14"),
  call = c("Graffiti", "Graffiti", "Pothole", "Lights Out", "Vacant Building")
)

# convert ranges into individual records, right side of street
blocks %>%
  select(-c(leftLow, leftHigh)) %>%
  mutate(
    rightRange = str_split(string = str_c(as.character(rightLow), "-", as.character(rightHigh)), pattern = "-")
  ) %>%
  mutate(rightRange = map(.x = rightRange, .f = parse_range)) %>% 
  unnest() %>%
  select(-c(rightLow, rightHigh)) %>%
  rename(house = rightRange) %>%
  select(bfId, house, street, marked) -> right

# convert ranges into individual records, left side of street
blocks %>%
  select(-c(rightLow, rightHigh)) %>%
  mutate(
    leftRange = str_split(string = str_c(as.character(leftLow), "-", as.character(leftHigh)),  pattern = "-")
  ) %>%
  mutate(leftRange = map(.x = leftRange, .f = parse_range)) %>% 
  unnest() %>%
  select(-c(leftLow, leftHigh)) %>%
  rename(house = leftRange) %>%
  select(bfId, house, street, marked) -> left

# combine left and rigt side of street data
bind_rows(right, left) %>%
  arrange(bfId, house) %>%
  mutate(address = str_c(house, street, sep = " ")) -> master

# combine master and incident data to apply blockface ids and 
# the logical indicator of a "marked" block to each incident
master %>%
  select(bfId, address, marked) %>%
  left_join(incidents, ., by = "address") -> incidentsWithBlock

# calculate counts per blockface
incidentsWithBlock %>%
  group_by(bfId) %>%
  summarise(count = n()) %>%
  left_join(blocks, ., by = "bfId") -> countsByBlock

^{Created on 2019-03-08 by the reprex package (v0.2.1)}

andresrcs · March 9, 2019, 3:06am

How about fuzzy joins?

library(dplyr)
library(tibble)
library(stringr)
library(fuzzyjoin)

blocks <- tibble(
    bfId = c(1,2,3,4),
    rightLow = c(400,500,600,700),
    rightHigh = c(498,598,698,798),
    leftLow = c(401,501,601,701),
    leftHigh = c(499,599,699,799),
    street = c("Main St", "Main St", "Main St", "Main St"),
    marked = c(TRUE, FALSE, FALSE, FALSE)
)

incidents <- tibble(
    callId = c(101, 102, 103, 104, 105),
    address = c("424 Main St", "447 Main St", "504 Main St", "667 Main St", "773 Main St"),
    date = c("1/1/14", "3/6/14", "5/12/14", "4/19/14", "2/12/14"),
    call = c("Graffiti", "Graffiti", "Pothole", "Lights Out", "Vacant Building")
)

joined_table <- incidents %>% 
    mutate(number = as.numeric(str_extract(address, pattern = "^\\d+(?=\\s)")),
           street = str_extract(address, pattern = "(?<=\\d\\s).*")) %>% 
    fuzzy_left_join(blocks,
                    by = c(
                        "number" = "rightLow",
                        "number" = "leftHigh",
                        "street" = "street"
                    ),
                    match_fun = list(`>=`, `<=`, `==`) )
joined_table %>% 
    select(callId, address, date, call, bfId, marked)
#> # A tibble: 5 x 6
#>   callId address     date    call             bfId marked
#>    <dbl> <chr>       <chr>   <chr>           <dbl> <lgl> 
#> 1    101 424 Main St 1/1/14  Graffiti            1 TRUE  
#> 2    102 447 Main St 3/6/14  Graffiti            1 TRUE  
#> 3    103 504 Main St 5/12/14 Pothole             2 FALSE 
#> 4    104 667 Main St 4/19/14 Lights Out          3 FALSE 
#> 5    105 773 Main St 2/12/14 Vacant Building     4 FALSE

joined_table %>%
    select(bfId, rightLow, rightHigh, leftLow, leftHigh, street = street.x,  marked) %>% 
    group_by_all() %>% 
    summarise(count = n())
#> # A tibble: 4 x 8
#> # Groups:   bfId, rightLow, rightHigh, leftLow, leftHigh, street [4]
#>    bfId rightLow rightHigh leftLow leftHigh street  marked count
#>   <dbl>    <dbl>     <dbl>   <dbl>    <dbl> <chr>   <lgl>  <int>
#> 1     1      400       498     401      499 Main St TRUE       2
#> 2     2      500       598     501      599 Main St FALSE      1
#> 3     3      600       698     601      699 Main St FALSE      1
#> 4     4      700       798     701      799 Main St FALSE      1

^{Created on 2019-03-09 by the reprex package (v0.2.1)}

chris.prener · March 13, 2019, 3:27pm

This seems far more straight forward - I'll have to take a minute to digest the regex but otherwise seems to be a much better solution. Thanks @andresrcs!

andresrcs · March 13, 2019, 4:40pm

Your welcome!
In case it helps
pattern = "^\\d+(?=\\s)
This means
^ = The beginning of the string
\\d+ = One or more digits
(?=\\s) = Followed by an empty space (positive look-ahead but not extract)

pattern = "(?<=\\d\\s).*")
(?<=\\d\\s) = Preceded by a digit and an empty space (positive look-behind but not extract)
.* = Any character 0 or more times

chris.prener · March 13, 2019, 4:43pm

Regex is my achilles heel - I know enough to be dangerous but I really appreciate the breakdown. Just posted this thread to my research team's message board - going to see how well this process scales to 303k addresses!

andresrcs · March 13, 2019, 4:47pm

That could be a little embarrassing, I'm not familiar with U.S. addresses so I haven't polished the regexes enough to be generalizable.

chris.prener · March 13, 2019, 4:48pm

I'm not too worried about the regex part - we've got a ton of code already (and a whole package) for parsing street addresses. The fuzzjoin piece is the crux for us.

chris.prener · April 3, 2019, 4:48pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.