X and Y Coordinate Label Matching

aaronbaggett · February 14, 2018, 9:47pm

Greetings RStudio Community:

I have a data frame of x and y coordinates representing baseball pitch locations (df_2). I also have a reference data frame containing a region label as well as the corresponding xmin, xmax, ymin, and ymax region parameters (df_1). I'm trying to apply the value of df_1$region to df_2$region when df_2$x and df_2$y are between df_1$xmin & df_1$xmax AND df_1$ymin & df_1$ymax.

I can get the code to run using a nasty series of nested ifelse statements, but ideally the solution would be much faster and more elegant. I’ve tried using purrr and a for loop to no avail.

# Objective:
# Match x and y in df_2 with corresponding region number in df_1

library(tidyverse)

# df_1: region labels and coordinates
load(url("http://aaronbaggett.com/data/df_1.Rda"))

# df_2: x and y coordinates
load(url("http://aaronbaggett.com/data/df_2.Rda"))

# Attempt 1: Using purrr
df_2 %>% 
  mutate(region = map2_dbl(x, y,
    ~df_1$region[.x >= df_1$xmin & 
        .x <= df_1$xmax &
        .y >= df_1$ymin & 
        .y <= df_1$ymax]))
#> Error in mutate_impl(.data, dots): Evaluation error: Result 52 is not a length 1 atomic vector.

df_2[52, ]
#> # A tibble: 1 x 3
#>   region       x     y
#>    <dbl>   <dbl> <dbl>
#> 1      0 -0.0200  1.83

One potential problem with the df_1 region parameters is that when a pitch falls directly on one of the borders (see the blue lines in the figure below), it satisfies the conditions for more than one region, so the lambda returns more than one value and map2_dbl() fails. For example, df_2[52, ] could be in either region 27 or 21. The output snippet below is what df_2 should look like after the iteration.

df_2
#> # A tibble: 100 x 3
#>    region      x     y
#>     <dbl>  <dbl> <dbl>
#>  1     25 -1.37  1.42
#>  2     28  0.405 1.21
#>  3     31 -1.37  0.682
#>  4     36  1.58  0.912
#>  5     10  0.304 3.50
#>  6     14 -0.906 3.03
#>  7     23  0.620 2.41
#>  8      9 -0.202 3.38
#>  9     14 -0.987 2.93
#> 10      8 -1.02  3.77
#> # ... with 90 more rows

Any help is appreciated.
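
The border ambiguity described above can be reproduced on toy data. The sketch below is only an illustration: `regions`, `pitches`, and `first_region()` are hypothetical names and values, not the real df_1 and df_2, and keeping the first matching region is an assumed tie-breaking rule rather than anything specified in the post.

```r
# Toy stand-ins for df_1 and df_2 (hypothetical values, not the real data)
regions <- data.frame(
  region = c(1, 2),
  xmin = c(-1, 0), xmax = c(0, 1),
  ymin = c(0, 0),  ymax = c(1, 1)
)
# The second pitch sits exactly on the border shared by both regions (x = 0)
pitches <- data.frame(x = c(-0.5, 0, 0.5), y = c(0.5, 0.5, 0.5))

# Return the first region whose box contains (x, y), or NA if none does.
# Taking hit[1] is the assumed tie-break for pitches on a shared border.
first_region <- function(x, y, ref) {
  hit <- which(x >= ref$xmin & x <= ref$xmax &
               y >= ref$ymin & y <= ref$ymax)
  if (length(hit) == 0) NA_real_ else ref$region[hit[1]]
}

# One call per pitch; the result is always length 1, so nothing errors
pitches$region <- mapply(first_region, pitches$x, pitches$y,
                         MoreArgs = list(ref = regions))
pitches
```

Because the helper always returns exactly one value, the same function could in principle be passed to map2_dbl() in place of the logical subset used in Attempt 1.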

EconomiCurtis · February 15, 2018, 9:57am

Maybe something using cut()?

load(url("http://aaronbaggett.com/data/df_1.Rda"))
load(url("http://aaronbaggett.com/data/df_2.Rda"))

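library(dplyr)

# Bin each region by its midpoint.
# The breaks assume the regions form a regular 6 x 6 grid
# over x in [-2, 2] and y in [0.5, 4.5].
df_1 <- df_1 %>% 
  mutate(
    x_breaks = cut((xmin + xmax) / 2, breaks = seq(-2, 2, length.out = 7)),
    y_breaks = cut((ymin + ymax) / 2, breaks = seq(0.5, 4.5, length.out = 7))
  )

# Bin each pitch with the same breaks, then join on the shared bin labels
df_2 <- df_2 %>% 
  mutate(
    x_breaks = cut(x, breaks = seq(-2, 2, length.out = 7)),
    y_breaks = cut(y, breaks = seq(0.5, 4.5, length.out = 7))
  ) %>% 
  select(-region) %>% 
  left_join(
    df_1 %>% select(region, x_breaks, y_breaks)
  ) %>% 
  mutate(
    region = as.integer(region)
  ) %>% 
  select(-y_breaks, -x_breaks)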
#> Joining, by = c("x_breaks", "y_breaks")
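
A note on how cut() treats pitches that land exactly on a break, since that is the case that tripped up the purrr attempt. By default the intervals are closed on the right, so a pitch on an interior border goes to the lower bin, while a pitch exactly on the lowest break gets no bin at all and would likely come out of the join with an NA region. The sketch below uses made-up x values (`x`, `bins_default`, and `bins_closed` are illustrative names) to show the difference that include.lowest = TRUE makes.

```r
# Same x breaks as the reply above: six equal-width bins on [-2, 2]
x_breaks <- seq(-2, 2, length.out = 7)

# Hypothetical pitch x coordinates; the first sits exactly on the left edge
x <- c(-2, -1.37, 0.405, 1.58)

# Default: intervals are (a, b], so the lowest break itself is not covered
bins_default <- as.integer(cut(x, breaks = x_breaks))

# include.lowest = TRUE closes the first interval on the left as well
bins_closed <- as.integer(cut(x, breaks = x_breaks, include.lowest = TRUE))

is.na(bins_default[1])  # the edge pitch gets no bin by default
bins_closed             # bin index for every pitch
```

Whether the outer edges should count as inside the grid depends on how df_1 defines its outermost regions, so treat include.lowest as an option to check rather than a required fix.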

aaronbaggett · February 15, 2018, 3:57pm

Excellent! Thanks, @EconomiCurtis.